Objective To develop an algorithm (sCOVID) to predict the risk of severe complications of COVID-19 in a community-dwelling population to optimise vaccination scenarios.
Design Population-based cohort study.
Setting 264 Dutch general practices contributing to the NL-COVID database.
Participants 6074 people aged 0–99 diagnosed with COVID-19.
Main outcomes Severe complications (hospitalisation, institutionalisation, death). The algorithm was developed from a training data set comprising 70% of the patients and validated in the remaining 30%. Potential predictor variables included age, sex, chronic comorbidity score (CCS) based on risk factors for COVID-19 complications, obesity, neighbourhood deprivation score (NDS), first or second COVID-19 wave and confirmation test. Six population vaccination scenarios were explored: (1) random (naive), (2) random for persons above 60 years (60plus), (3) oldest patients first in age band of 5 years (oldest first), (4) target population of the annual influenza vaccination programme (influenza), (5) those 25–65 years of age first (worker), and (6) risk based using the prediction algorithm (sCOVID).
Results Severe complications were reported in 291 (4.8%) people, with 59 (20.3%) nursing home admissions, 181 (62.2%) hospitalisations and 51 (17.5%) deaths. The algorithm included age, sex, CCS, NDS, wave and confirmation test (c-statistic=0.91, 95% CI 0.88 to 0.94 in the validation set). Applied to different vaccination scenarios, the proportion of people who needed to be vaccinated to reach a 50% reduction of severe complications was 67.5%, 50.0%, 26.1%, 16.0%, 10.0% and 8.4% for the worker, naive, influenza, 60plus, oldest first and sCOVID scenarios, respectively.
Conclusion The sCOVID algorithm performed well to predict the risk of severe complications of COVID-19 in the first and second waves of COVID-19 infections in this Dutch population. The regression estimates can and need to be adjusted for future predictions. The algorithm can be applied to identify persons with highest risks from data in the electronic health records of general practitioners (GPs).
- public health
- primary care
Data availability statement
Data are available upon reasonable request (email@example.com).
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
- This large population-based cohort study used electronic health record data of n=6074 patients with COVID-19 in n=264 Dutch general practitioner (GP) practices.
- The routine electronic data of each patient with COVID-19 were enriched with a brief COVID-19 questionnaire, linked to the information and communication technology (ICT) system and filled in by the GP.
- Least absolute shrinkage and selection operator (LASSO) regression was used for prediction modelling in a training data set and a validation data set, including data up to January 2021.
- Although the LASSO regression accounted for shrinkage of coefficients, the split sample method did not allow further study of model optimism.
- The data collection relies on registration by the GP and may over-represent patients with manifest complaints.
In the Netherlands, as in many other countries, the SARS-CoV-2 outbreak had severe consequences from March 2020 onwards. The rapid spread of the infection and the unexpectedly severe complications required, in the absence of treatment, hospitalisation for many days in intensive care units (ICU), thereby occupying all available ICU beds in Dutch hospitals. This prompted the Dutch government to impose social distancing measures, including a lockdown. Although the number of hospitalisations dropped quickly in the summer, a sudden increase started in August 2020, leading to a second lockdown on 15 December 2020 and a curfew on 23 January 2021. The still limited capacity of available ICU beds, the unpredictable course of COVID-19 infections, the limited knowledge of how these infections spread among the population, the absence of proper treatments, and flare-ups of infections that were unexpected in both time and location paralysed the Dutch care system and economy.
To prioritise high-risk individuals for vaccination or shielding from COVID-19 infection, or to start treatment in primary care as soon as possible, accurate identification of patients at risk for severe COVID-19 is of utmost importance. This requires living, accurate risk prediction algorithms that are easy to apply in general practice, as suggested by Clift et al.1 Initially, prediction algorithms for mortality or progression to severe disease were mainly developed for hospitalised patients.2–4 In the meantime, several prediction algorithms for patients infected with COVID-19 in the general population have been developed.1 5 6 Although the performance of these algorithms is fairly good, they have to deal with bias due to country-specific policy measures that change over time. This is in part because these studies were based on the first wave of infections, when testing was scarce and policy measures were still in their infancy.
By now, vaccines have become available and vaccination campaigns are ongoing, but the shortage of vaccines limits the roll-out of these campaigns.7 Efforts to prioritise risk groups for vaccination are ongoing, focusing on populations with the highest risk of COVID-19 complications.8 Our algorithm was developed to provide predictions for subpopulations at risk for severe COVID-19 infections leading to hospitalisation, institutionalisation or death. The prediction algorithms are based on data from the Dutch NL-COVID database, containing nationwide geodemographical and medical data.9 Building on this algorithm, we estimated the effectiveness of six different scenarios for vaccination of high-risk persons in order to prevent severe COVID-19 complications.
This cohort study was performed by using data from an extensive and representative general practice population database in the Netherlands.
Data were obtained from general practitioner (GP) practices who reported information of the diagnoses and comorbidities of patients suffering from COVID-19 in the NL-COVID database. This database was set up in April 2020 as a collaborative initiative of general practitioners, public health specialists, virologists, epidemiologists, data scientists, data specialists, privacy specialists and information and communication technology (ICT) companies providing electronic health records (EHR). Together, the ICT companies cover about 95% of all GP practices in the Netherlands. GPs were asked to complete a brief questionnaire protocol for patients suffering from COVID-19 in their ICT systems. From a total of 264 practices (~5% of all Dutch GP practices), both questionnaire data and EHR with information regarding selected comorbidities were included in this study. Data until 21 January 2021 were used. Vaccines were not yet available during the study period.
The selected comorbidities (online supplemental appendix A) were those indicated by the National Institute for Public Health and the Environment to be relevant for the prognosis of severe outcomes of COVID-19 infections.10 The following information was collected on a daily basis: a diagnosis of COVID-19 and whether the diagnosis was confirmed with a PCR test, the severity of the infection defined as treated at home, treated in a hospital or special care institution, or death from COVID-19. Updates of the patient’s status were recorded using the same form. For this paper, we used the last status report. In addition, age, gender, body mass index (BMI), a chronic comorbidity score (CCS) and postal code were collected from the electronic registries of the GP. The neighbourhood deprivation score (NDS) was based on the quartile distribution of relative wealth of the neighbourhood as derived by Statistics Netherlands.11 There were no missing data in the NL-COVID database: questionnaire data were complete and the registration of comorbidities in the EHR was considered to be complete as well.
A cohort study was performed among patients with COVID-19 symptoms, as certified by their GP, who were registered in the NL-COVID database.
The primary outcome was the occurrence of severe complicated COVID-19 disease defined as hospitalisation, institutionalisation or death, as collected by the questionnaire from patients’ GP.
Predictors included age and sex, the NDS, BMI ≥30 kg/m2, the period of registration (before or after August 2020) as first or second wave, whether the diagnosis was confirmed with a PCR test or CT scan, and a CCS. The CCS was based on the chronic diseases identified as predictors for complications of COVID-19 infection by the National Institute for Public Health and the Environment.10 The comorbidities were mapped to the international classification of primary care (ICPC) coding system used in Dutch GP practices and subsequently grouped into nine disease clusters (online supplemental appendix A). A patient was scored in each of these disease clusters and assigned one point per cluster. For example, a patient suffering from epilepsy and diabetes scored a point in the category neurological diseases and a point for diabetes, yielding a CCS of 2. The absence of registration was considered to be the absence of the disease/condition. This also applied to BMI, that is, if no record of BMI or obesity was found, the patient was assumed to have a normal weight.
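The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the mapping from ICPC codes to disease clusters shown here is a hypothetical placeholder (the study's actual mapping is listed in online supplemental appendix A), and the function names are assumptions introduced for this example.

```python
# Hypothetical ICPC-to-cluster mapping, for illustration only; the study's
# actual mapping into nine clusters is in online supplemental appendix A.
ICPC_TO_CLUSTER = {
    "T90": "diabetes",
    "N88": "neurological",
    "N87": "neurological",
    "K86": "cardiovascular",
}


def chronic_comorbidity_score(icpc_codes):
    """Count distinct disease clusters; unregistered conditions count as absent."""
    clusters = {ICPC_TO_CLUSTER[code] for code in icpc_codes if code in ICPC_TO_CLUSTER}
    return len(clusters)


def is_obese(bmi):
    """A missing BMI record is treated as normal weight, as described in the text."""
    return bmi is not None and bmi >= 30


# Epilepsy (N88) plus diabetes (T90): one point per cluster.
print(chronic_comorbidity_score(["N88", "T90"]))
```

Two conditions in the same cluster add only one point, which is why the score counts clusters rather than diagnoses.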
Risk mitigation scenarios
The prediction models yield a probability that a patient with COVID-19 develops a severe complication. In a single normalised Dutch GP practice (n=2090 patients), the summed sCOVID-predicted probabilities amount to 85. Thus, if all patients were infected with COVID-19, an expected 85 patients would develop severe COVID-19 complications. We further assumed that this probability can be reset to (almost) zero by vaccination, or by shielding patients from contact with others. After vaccination or shielding of the 10 highest ranked patients, whose probabilities range from 0.64 to 0.45, 85−5=80 patients would be expected to develop severe COVID-19 complications, a decrease of 100×(1−80/85)=5.9%.
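The arithmetic of this example can be checked with a short Python sketch. The function name is an assumption made for illustration; the figures 85 and 5 are taken from the text.

```python
def percent_decrease(base_cases, remaining_cases):
    """Expected percentage decrease in severe complications."""
    return 100.0 * (1.0 - remaining_cases / base_cases)


base = 85.0        # summed predicted probabilities in the example practice
prevented = 5.0    # approximate summed risk of the 10 highest ranked patients
remaining = base - prevented

print(round(percent_decrease(base, remaining), 1))  # prints 5.9
```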
For 300 randomly selected, fully anonymised GP practices from the STIZON Database Network, including data from 1.2 million inhabitants, the predicted number of patients developing severe COVID-19 complications was estimated as the sum of the sCOVID probabilities and taken as the base (B). Depending on the vaccination coverage and the policy on whom to vaccinate (scenario), the number of patients developing severe complications can be estimated for different vaccination scenarios.
The impact of the vaccination strategy can be followed over time by dividing the summed probabilities at time t (Pt) by B, with 100%×(1−Pt/B) yielding the expected percentage decrease in severe complications at a given percentage of the population vaccinated. The vaccination coverage needed for a 50% decrease of hospitalisation was defined as VC50, a measure of the efficiency of a particular hypothetical vaccination or shielding scenario. We explored and compared six different hypothetical vaccination scenarios. The first scenario was defined as a naive scenario, that is, a scenario in the absence of any policy, in which inhabitants are randomly vaccinated. The second scenario was defined as a 60plus scenario, in which all inhabitants 60 years of age or older are randomly vaccinated, followed by random vaccination of those under 60 years of age. The third scenario (oldest first) prioritised vaccination from the oldest down, from 100 to 60 years of age, in age bands of 5 years. Within each age band, allocation is random. The fourth scenario was defined as the influenza scenario. Here, patients with an indication for influenza vaccination are prioritised for vaccination. The fifth scenario (worker) prioritised random vaccination of inhabitants 25–65 years of age. The sixth and last scenario, the sCOVID scenario, was based on the sCOVID risk-ranking algorithm. Here, vaccination starts with the patient with the highest absolute risk, followed by the second patient in line, and so on.
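How VC50 follows from a vaccination order can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the population is synthetic, the risk formula is made up for the example, vaccination is assumed to reset an individual's risk to zero, and only three of the six scenarios are shown. The names `vc50`, `ages` and `risks` are hypothetical.

```python
import random


def vc50(risks, order):
    """Percentage of people that must be vaccinated, in the given order,
    before expected severe cases fall to half of the unvaccinated base."""
    base = sum(risks)
    remaining = base
    for n_vaccinated, idx in enumerate(order, start=1):
        remaining -= risks[idx]  # assumption: vaccination resets risk to zero
        if remaining <= 0.5 * base:
            return 100.0 * n_vaccinated / len(risks)
    return 100.0


rng = random.Random(0)

# Synthetic population (not study data): age plus a made-up risk rising with age.
ages = [rng.randint(0, 99) for _ in range(2000)]
risks = [0.0005 * 1.07 ** age * rng.uniform(0.5, 1.5) for age in ages]

# Naive scenario: random order.
naive = list(range(len(risks)))
rng.shuffle(naive)

# Oldest first: 5-year age bands; the stable sort keeps the random order within a band.
oldest_first = sorted(naive, key=lambda i: -(ages[i] // 5))

# Risk-ranked (sCOVID-style): highest predicted risk first.
risk_ranked = sorted(naive, key=lambda i: -risks[i])

for name, order in [("naive", naive), ("oldest first", oldest_first), ("risk ranked", risk_ranked)]:
    print(name, round(vc50(risks, order), 1))
```

Because the risk-ranked order removes the largest remaining risk at every step, it cannot need more vaccinations than any other order to reach the 50% mark under these assumptions.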
Least absolute shrinkage and selection operator (LASSO) regression analysis was used to select predictors in the model and to estimate and shrink regression coefficients. Tenfold cross-validation was used to estimate the optimal shrinkage factor (λ) used in the LASSO regression, such that the sum of the squared residuals was minimised. Age was included as a quadratic function. The final regression formula allowed calculation of predicted probabilities for each patient registered with their GP. We randomly allocated 70% of the patients to a training data set to develop the model. The other 30% of the patients were allocated to a validation data set. We assessed the model performance in terms of discrimination and calibration in the validation set. Discrimination was assessed using the c-statistic. The c-statistic indicates the extent to which the model can distinguish between patients with and without the outcome and varies between 0.5 and 1. Calibration was assessed using calibration plots showing the predicted risk against the observed frequency of the outcome in the study population, using 10 risk groups. Goodness of fit was assessed with the Brier Score, which quantifies the difference between the observed and fitted probabilities and ranges from 0 to 1, with a score of 0 representing the best model.12 With an outcome proportion of 0.05 and eight candidate predictors, at least 891 patients would be needed.13 R V.4.0.2 with the GLMNET package (4.0-2) was used for statistical analyses and constructing figures. We adhered to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement.14
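The two performance measures named above can be written out directly. The sketch below is in plain Python rather than the R/GLMNET setup used in the study, and the toy outcomes and probabilities are invented for illustration; the function names are assumptions.

```python
def c_statistic(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted risk; ties count as one half."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data, not study data: 1 = severe complication, 0 = none.
outcomes = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]

print(c_statistic(outcomes, predicted))            # 3 of 4 pairs concordant: 0.75
print(round(brier_score(outcomes, predicted), 4))  # 0.1581
```

The pairwise loop is quadratic in the number of patients, which is fine for a validation set of this size but would be replaced by a rank-based calculation for much larger data.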
Patient and public involvement
Patients were not involved in the design and conduct of the study. General practitioners were consulted to reflect on their ideas about different vaccination scenarios and their practicality.
Overall study population
A total of 264 GP practices (~5% of all Dutch GP practices) reported 6074 patients with a diagnosis of COVID-19 in the period 10 April 2020 until 21 January 2021. Severe complications were reported for 291 (4.8%) patients, of whom 59 (20.3%) were treated in a nursing home, 181 (62.2%) were hospitalised and 51 (17.5%) died. The training and validation sets included 4251 and 1823 persons, respectively.
The characteristics of patients with COVID-19 recorded in the first and second time periods differed in age, baseline risk, frequency of testing and region. The percentage of people developing severe complications dropped from 8.5% in the first period in spring 2020 to 2.5% in the second period in autumn 2020, which was reflected in institutionalisation, hospitalisation and death. In the first wave, infected patients were from older age groups, whereas relatively more adolescents, 12–19 years of age, were reported in the second wave. The proportion of patients recorded with a positive COVID-19 test increased from 63% in the first wave to 95% in the second wave. The general characteristics are presented in table 1. Most of the patients with severe complications suffered from cardiovascular conditions (64.6%) and other chronic conditions such as diabetes, neurological diseases (ie, dementia, Parkinson’s disease) and lung disease. Almost 80% of patients with severe complications suffered from at least one chronic disease. More than 62% had multiple chronic conditions. The characteristics of the training and validation sets are shown in table 2.
The predictor variables retained in the final COVID-19 model were age, sex, positive test result, period (first or second wave), NDS and the CCS (table 3). The strongest predictors were age, NDS, the time period, a positive PCR test and male sex. Obesity was eliminated by the LASSO predictor selection. Figure 1 illustrates the receiver operating characteristic curve from the validation set, with a c-index of 0.91 (95% CI 0.88 to 0.94). The model showed good calibration and fit (Brier Score=0.034) (figure 2).
Risk prediction in practice
Examples of individual risk ranking
As an example, the risk of developing severe complications can be estimated for a 60-year-old man with a positive PCR test, living in a neighbourhood with a low NDS, who suffers from diabetes, hypertension and kidney failure. His comorbidities comprise three different classes (online supplemental appendix A). Summing the coefficients (Cf) from the column LASSO regression in table 3, the equation yields a total score of Cf(intercept)+60*Cf (age times age)+Cf(man)+Cf (after July 2020)+Cf (positive COVID-19 test)+Cf (low NDS)+3*Cf (CCS)=−3.456. His risk of developing severe complications is subsequently calculated with the logistic function as 1/(1+exp(3.456))≈0.031, or about 3.1%. The risk equals that of a 73-year-old woman without any chronic condition living in a neighbourhood with a high socioeconomic status.
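The step from the total score to a probability can be sketched as follows, assuming the usual logistic link for a LASSO model with a binary outcome (the coefficients themselves are in table 3 and are not reproduced here). The function name is an assumption for illustration.

```python
import math


def risk_from_score(linear_score):
    """Inverse logit: convert a total regression score into a probability."""
    return 1.0 / (1.0 + math.exp(-linear_score))


# Total score of the 60-year-old man in the example above.
risk = risk_from_score(-3.456)
print(f"{risk:.3f}")  # prints 0.031
```

Ranking patients by the total score or by the resulting probability gives the same order, since the logistic function is strictly increasing.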
Practice risk ranking
The results of the six prioritisation scenario analyses were obtained by applying the different algorithms to 300 randomly selected, fully anonymised GP practices, including data from 1.2 million inhabitants. The results for the six scenarios are plotted in figure 3 and summarised in table 4. A 50% reduction in the number of patients with severe complications was already reached at a vaccination coverage of 8% if all high-risk persons according to the sCOVID algorithm were vaccinated first. This scenario was superior to all other scenarios. The scheme in which the oldest are vaccinated consecutively in age bands of 5 years was second best. The worst scenario was the worker scenario, prioritising patients 25–65 years of age, followed by the naive scenario, in which patients are randomly vaccinated.
Using data from the NL-COVID database, an algorithm was developed to predict the probability of patients developing severe complications once infected with COVID-19 using EHR from general practices. This sCOVID algorithm, which can be deployed in all Dutch GP practices, showed very good performance in terms of discrimination (c-index: 0.91) and calibration, and can be used to rank the most susceptible patients for prioritisation of vaccinations. Our vaccination scenarios showed that ranking and vaccinating patients based on their complication risk (sCOVID scenario) would be the most efficient vaccination scenario to reduce hospitalisation and deaths. The second-best scenario was to vaccinate the oldest people first in consecutive order. With a shortage of vaccines, the most vulnerable patients rather than the oldest patients should be prioritised.
Comparison with other studies
The sCOVID risk models yield a high discrimination rate (c-statistic=0.91). Calibration plots show a good fit in all risk categories, although the lowest risks were most challenging to estimate due to the limited numbers of patients developing severe complications. These results are similar to those of other prediction algorithms. Two earlier studies developed prediction algorithms for hospitalisation and/or death due to COVID-19 infection and showed similar prognostic performance.1 5
The major predictors, selected by the LASSO procedure, were higher age, male gender and the number of chronic comorbidities, but also a positive test result and neighbourhood deprivation status. These selected predictors resemble the predictors reported in earlier studies by Clift et al, Jehi et al and Williamson et al.1 5 6 The most obvious differences were the use of a summary score of comorbidities (CCS) rather than separate conditions, and the inclusion of symptoms and laboratory measures in the study by Jehi et al. Most predictors found in this study relate to poor health and a complex of comorbidities. More than 60% of the patients with severe complications suffered from more than one chronic condition, against less than 20% of those without severe complications. Therefore, we preferred to include a chronic disease summary score to arrive at a more comprehensive and practical algorithm. Moreover, from a clinical perspective, our sample size was relatively small and would exclude rare but clinically relevant outcomes.
Complexity of modelling
Estimating the risk of severe COVID-19 complications is permanently subject to changing policy measures and interventions to shield high-risk people by vaccinations.7 8 The time biases caused by these measures and interventions are complex and difficult to unravel. Initial analyses confirmed suggestions from Clift et al that these time biases are indeed present,1 showing an age-adjusted and sex-adjusted complication rate in the second wave that was three to five times lower than in the patients in the first wave. Estimates needed to predict hospitalisation and/or death therefore require permanent recalibration of the prediction algorithms. Such recalibration is necessary to monitor the effect of policy interventions on managing care capacity. The infrastructure of the NL-COVID database permits recalibration on a regional and daily basis.
Strengths and limitations
The NL-COVID database has both limitations and strengths. First, we have substantial under-reporting of positive cases, since our 264 registration practices, comprising about 5% of all GP practices, reported only 0.7% of the registered cases. This is explained by several factors. First, practices enrolled into the programme over time and some practices only joined the programme at the end of 2021. Second, COVID-19 testing was done by the regional health authorities, whose administrations were not linked with the GP administration. Therefore, our registration relies on whether the patient contacted the GP and whether the GP registered the patient. This makes it likely that we have a selection bias towards the more severe disease manifestations of the COVID-19 infection. Also, our prediction partly relied on the judgement of the GP as to whether a patient was COVID-19 positive (when test results were lacking). It should therefore be stressed that absolute risk estimates of severe complications should be interpreted with care, and only by healthcare professionals, for prioritisation strategies. A weakness of the sCOVID scenario is that we did not perform an external validation and that the model was not retrained in a random sample of the general population. The large number of GP practices from all over the country and the good test characteristics in the validation set make it likely that the accuracy of the scenarios is adequate. For the comparison of the different scenarios this is of no importance, since they were compared in the same sample.
A first strength was the coverage and representativeness of the practices most strongly confronted with the pandemic. The first wave of COVID-19 hit hard in the southern part of the country and most participating practices were situated there. Second, we used training and validation samples to estimate the accuracy of the algorithms. Third, by law and regulation, almost every Dutch citizen has a designated GP and therefore we were able to study the general population. Fourth, this study is the first to demonstrate the potential impact and efficiency of more and less targeted vaccination scenarios. Fifth, the prediction algorithm can be adapted, updated and validated on a daily basis and learn from new insights and policy measures.
Our study showed that, within the framework of privacy regulations, COVID-19 infections and their consequences can be monitored quickly, efficiently and safely at a very detailed local level and on a day-to-day basis using country-wide data from currently available ICT systems in GP practices. The costs of such a database are relatively low. Insights can be generated that help GPs and the regional and local health authorities involved to shield patients from infection and to reduce hospitalisation and death very efficiently in a selected group of persons with the highest risks. Moreover, such a database may demonstrate and underpin the effectiveness and efficiency of policy measures to plan and manage care facilities. Furthermore, the prediction accuracy could be improved with flexible access to the complete GP patient dossier and linkage to hospital admission data, under strict compliance with the General Data Protection Regulation, to adapt and improve the algorithms if new insights become available. The vaccination scenarios did not fully address the complexity of the real world. For instance, the scenarios assumed that vaccination is always effective, and did not consider the time needed to vaccinate the population. Furthermore, implementation requires embedding in guidelines and acceptance by general practitioners to be used in current practice, and needs to be weighed against social and political measures. Therefore, the vaccination coverages needed for a 50% reduction may be underestimated. However, the scenarios showed that in case of a remaining shortage of vaccines, vaccination based on the sCOVID scenario performs best, with a consecutive age-based scenario as second best. Hybrid scenarios that do not follow the risk of COVID-19 complications perform worse, for example, the influenza scenario, in which age and influenza risk are combined.
Currently, vaccination in the Netherlands is organised from a practical perspective, based on factors not solely related to the risk of COVID-19 complications. This makes it likely that more efficient scenarios are conceivable.
In conclusion, the sCOVID algorithm has been developed to predict which patients are at high risk of developing severe complications due to COVID-19 and showed good model performance. While the shortage of vaccines remains, prioritising vaccination of patients based on their sCOVID complication risk is the most efficient way to reduce hospitalisations, institutionalisations and death.
Patient consent for publication
Patients and GPs were asked to consent to data sharing, that is, to sending data to the NL-COVID database and to extracting information at the four-digit postal code level in an anonymised format for public decision-making. The procedure was approved and tested for compliance with the General Data Protection Regulation by the Institutional Review Board of ‘Stichting Informatievoorziening voor Zorg en Onderzoek’ (STIZON, ID 10042020).
The authors would like to thank Guus Vaassen and Johan Ruiter (Medworq), Marjoleine van der Zwan, Piet-Hein Knoop, Arjan den Ouden (PharmaPartners), Mark van Vliet, Chris Tromp (Health Base), Eric Grosveld, Meefa Hogenes (ExpertDoc), Frank Carlebur (ZONH), Ernst de Graag, Michiel Meulendijk (STIZON) and Theo Peters (CGM) for their unconditional support, as well as the more than 450 Dutch general practitioners who contribute time, cash or in-kind support to the COVID database to fight COVID-19.
Contributors RMCH and EGH were involved in the development of the database. RMCH, KMAS, BAMvdZ, AAvdH, KvdV, HvH, JWJB, GN and PJME contributed to the development of the research question and study design. RMCH and RARH conducted the statistical analyses. MWH was involved in advanced statistical aspects. RMCH, KMAS, BAMvdZ, AAvdH, KvdV, EGH, MWH, RARH, HPJvH, JWJB, GN and PJME contributed to the interpretation of the results. RMCH wrote the first draft of the manuscript. RMCH, KMAS, BAMvdZ, AAvdH, KvdV, EGH, MWH, RARH, HPJvH, JWJB, GN and PJME contributed to the critical revision of the manuscript for important intellectual content and approved the final version of the manuscript. RMCH is the guarantor of the study.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests RMCH is the director of STIZON, an organisation that is the processor of the NL-COVID data on behalf of the participating general practitioners. EGH is an employee of Health Base, an independent multidisciplinary foundation active in developing content for medical and pharmaceutical decision support systems, which was also applied for the NL-COVID database. KMAS is an employee of the PHARMO Institute for Drug Outcomes Research. This independent research institute performs financially supported pharmacoepidemiological studies for the government, healthcare authorities and pharmaceutical companies.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.