The performance of seven QPrediction risk scores in an independent external sample of patients from general practice: a validation study
  Julia Hippisley-Cox (1), Carol Coupland (1), Peter Brindle (2)
  1. Division of Primary Care, Nottingham, UK
  2. Avon Primary Care Research Collaborative, Bristol Clinical Commissioning Group, Bristol, UK
  Correspondence to Professor Julia Hippisley-Cox; Julia.hippisley-cox{at}nottingham.ac.uk

Abstract

Objectives To validate the performance of a set of risk prediction algorithms developed using the QResearch database, in an independent sample from general practices contributing to the Clinical Research Data Link (CPRD).

Setting Prospective open cohort study using practices contributing to the CPRD database and practices contributing to the QResearch database.

Participants The CPRD validation cohort consisted of 3.3 million patients, aged 25–99 years, registered at 357 general practices between 1 January 1998 and 31 July 2012. The validation statistics for QResearch were obtained from the original published papers, which used a one-third sample of practices separate from those used to derive the score. A cohort from QResearch was used to compare incidence rates and baseline characteristics and consisted of 6.8 million patients from 753 practices registered between 1 January 1998 and 31 July 2013.

Outcome measures Incident events relating to seven different risk prediction scores: QRISK2 (cardiovascular disease); QStroke (ischaemic stroke); QDiabetes (type 2 diabetes); QFracture (osteoporotic fracture and hip fracture); QKidney (moderate and severe kidney failure); QThrombosis (venous thromboembolism); QBleed (intracranial bleed and upper gastrointestinal haemorrhage). Measures of discrimination and calibration were calculated.

Results Overall, the baseline characteristics of the CPRD and QResearch cohorts were similar, though QResearch had higher recording levels for ethnicity and family history. The validation statistics for each of the risk prediction scores were very similar in the CPRD cohort compared with the published results from QResearch validation cohorts. For example, in women, the QDiabetes algorithm explained 50% of the variation within CPRD compared with 51% on QResearch, and the area under the receiver operating characteristic (ROC) curve was 0.85 on both databases. The scores were well calibrated in CPRD.

Conclusions Each of the algorithms performed practically as well in the external independent CPRD validation cohorts as they had in the original published QResearch validation cohorts.

  • QResearch
  • CPRD
  • QRISK2
  • Prognosis
  • Validation

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/

Strengths and limitations of this study

  • This is the first external validation of a set of QPrediction scores on the Clinical Research Data Link (CPRD). It is important since CPRD represents a fully independent sample of patients registered with general practices using a different clinical computer system from that used to derive the algorithms.

  • The discrimination and calibration statistics for each score were very similar in CPRD to those published from validation cohorts from QResearch. This supports their potential utility in the general population of patients in primary care.

  • A strength of using CPRD for risk score validation is that the risk score can be assessed using data collected in a similar manner to the data that would be used when the risk score is used in clinical practice.

  • The difficulty of obtaining a comprehensive code list for any given outcome or exposure is a limitation common to all research in primary care databases. We mitigated this by matching our code lists for the CPRD primary analysis to the code lists in the QResearch derivation data set wherever possible.

  • Further research is needed to evaluate the clinical outcomes and cost-effectiveness of using these algorithms in primary care.

Introduction

In the past 7 years, we have developed a series of risk prediction algorithms using the QResearch database. QResearch is a large research database containing pseudonymised individual level data from over 700 general practices using the Egton Medical Information Systems (EMIS) clinical system. The QResearch database consists of data collected from primary care (coded information on sociodemographic characteristics, diagnoses, symptoms, smoking/alcohol, clinical measurements, laboratory values, prescriptions and referrals) which has been linked to cause of death, hospital episodes and cancer registrations at individual patient level.

The algorithms predict outcomes such as cardiovascular disease (http://www.qrisk.org),1 stroke (http://www.qstroke.org),2 type 2 diabetes (http://www.qdiabetes.org),3 osteoporotic fracture (http://www.qfracture.org),4 moderate or severe kidney disease (http://www.qkidney.org),5 venous thromboembolism (VTE; http://www.qthrombosis.org)6 and emergency hospital admission (http://www.qadmissions.org).7 Generally, the ‘QPrediction’ algorithms have been designed to systematically identify patients in primary care at high risk of a serious clinical outcome for whom further intervention to lower risk of that outcome might be possible. They are also designed to quantify absolute risk of serious outcomes in a way which patients can understand and which might help guide lifestyle and management decisions. A number of these algorithms are now integrated into general practitioner (GP) clinical computer systems, included in national guidelines1,4 and are in daily use across the National Health Service (NHS).1,3,8

The algorithms were originally developed using a random two-thirds sample of practices contributing to the QResearch database and validated on the remaining third. While this represents a physically discrete population of patients and practices for validation, the practices all use the same clinical computer system (EMIS), which is in use in 53% of UK practices. A more stringent test of performance is to validate the algorithms on a fully external database derived from practices using a different but commonly used primary care computer system. This would help determine whether the predictions from the algorithms are likely to generalise to the whole population in England. While some of the algorithms have been validated by an independent team using the Health Improvement Network (THIN) primary care database,9–12 there are currently no published validations of the algorithms using a primary care database which is routinely linked to mortality data in the same way as QResearch.

We therefore decided to validate the various QPrediction scores using another database known as the Clinical Research Data Link (CPRD). The General Practice Research Database (GPRD) was originally set up in 1988 and is of similar nature to QResearch although it is derived from practices using a different clinical computer system (Vision, which is used by 20% of GPs). It was extended to include linked mortality data and data from secondary care and was renamed the CPRD in 2012. Our secondary objective was to compare the ascertainment of incident clinical events recorded in GP data alone with that recorded in either GP data or the linked mortality data in the CPRD and QResearch.

Methods

CPRD study population

For the validation using CPRD, we identified an open cohort of patients aged 25–99 years at entry to the cohort and followed this cohort up until 31 July 2012 (the latest date for which linked data were available at the time of analysis). We restricted the CPRD cohort to 357 practices in England which had linked Office for National Statistics (ONS) mortality and hospital admissions data. For each patient we determined an entry date to the cohort, which was the latest of the following dates: 25th birthday, date of registration with the practice plus 1 year, date on which the practice computer system was installed plus 1 year and the beginning of the study period (1 January 1998). Patients were censored at the earliest date of the relevant outcome, de-registration with the practice, last upload of computerised data or the study end date (31 July 2012).
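The entry and censoring rules above can be expressed compactly. The following is an illustrative sketch only, not the code used in the study; the function names, the treatment of 29 February birthdays and the example patient are assumptions made for illustration.

```python
from datetime import date

STUDY_START = date(1998, 1, 1)
STUDY_END = date(2012, 7, 31)


def add_years(d, years):
    """Shift a date by whole years (assumption: 29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def entry_date(birth, registered, system_installed):
    """Latest of: 25th birthday, registration + 1 year,
    computer system installation + 1 year, and the study start."""
    return max(
        add_years(birth, 25),
        add_years(registered, 1),
        add_years(system_installed, 1),
        STUDY_START,
    )


def exit_date(outcome, deregistered, last_upload):
    """Earliest applicable date; None means the event did not occur."""
    candidates = [outcome, deregistered, last_upload, STUDY_END]
    return min(d for d in candidates if d is not None)


# Hypothetical patient, for illustration only.
start = entry_date(date(1970, 6, 15), date(1996, 3, 1), date(1995, 1, 1))
stop = exit_date(None, date(2010, 9, 30), date(2012, 7, 1))
print(start, stop)
```

In this made-up case the study start date is the latest of the four entry candidates, and de-registration is the earliest censoring date.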

For the assessment of the two QBleed outcomes (intracranial bleed and upper gastrointestinal haemorrhage) we used a later cohort entry date of 1 January 2007 for comparability with the equivalent study period for the derivation of the algorithm on QResearch.13

QResearch study population

For comparison of the validation statistics (receiver operator curve (ROC), D and R2 statistics), we extracted the original published values from the papers which had been calculated using a one-third sample of practices from QResearch which were independent from the two-thirds of practices used to derive the scores.

For comparison of the baseline characteristics, incidence rates and ascertainment rates, we used the latest version of the QResearch database which is currently available (QResearch 38, 31 December 2013). We identified an open cohort in the same way as for CPRD, using all of the QResearch practices in England, and with follow-up until 31 July 2013.

Inclusion and exclusion criteria

For both databases, we excluded patients without a Townsend score (an area-based measure of material deprivation derived from the post code) and temporary residents. For each score we then identified patients who were eligible to have the score calculated according to the relevant inclusion and exclusion criteria as summarised in table 4.

Risk scores included in validation

We validated the following risk prediction scores on CPRD:

  1. QDiabetes—10-year risk of type 2 diabetes3;

  2. QRISK2—10-year risk of cardiovascular disease1;

  3. QStroke—10-year risk of stroke or transient ischaemic attack (TIA)2;

  4. QFracture—10-year risk of hip or osteoporotic fracture4;

  5. QThrombosis—5-year risk of VTE6;

  6. QBleed—5-year risk of upper gastrointestinal haemorrhage and intracranial haemorrhage13;

  7. QKidney—5-year risk of moderate-severe kidney disease.5

Clinical outcomes

We identified the relevant clinical outcome using the same definition as had been applied in the original derivation of the risk scores using QResearch. The data sources used to identify the clinical outcomes had varied over the 6 years during which the original studies had been undertaken due to the changing availability of linked hospital and mortality data over that time. In 2008, the QResearch database was linked to mortality records for 1997 onwards. In 2013, the QResearch database was linked to hospital admissions records with data for patients from 1998 onwards. For the latest updated version of QRISK2 (QRISK2, 2014), the outcome was identified by the presence of the relevant Read code on the GP record or an International Classification of Diseases (ICD)10 code recorded on the linked mortality record or on the linked hospital admissions record. For QStroke, QDiabetes, QFracture and QThrombosis, the outcome was identified either by the presence of the relevant Read code recorded on the GP record or an ICD10 code recorded on the linked mortality record. For QKidney, the outcome was identified solely from information recorded in the GP record as in the original study as it required blood test values which were only present in the GP record. For QBleed, the outcome was identified in CPRD from events recorded either on the linked hospital admissions database or the linked mortality record in order to identify the events most likely to have serious clinical consequences for the patient.

We determined case ascertainment for each clinical outcome on both databases by calculating the proportion of cases recorded on the GP record out of the total number of cases recorded on either the GP record or the linked mortality record. We calculated the age standardised incidence rates of each outcome based on outcomes recorded on (1) the GP record alone; (2) the GP record or linked mortality record; and (3) the GP record, linked mortality record or linked hospital record. We standardised CPRD rates to the age distribution of the QResearch population in 5-year bands to ensure comparability.
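As a hedged illustration of standardising rates to a reference age distribution, the sketch below applies direct standardisation: each age band's crude rate is weighted by the reference population's share of that band. The text does not state whether the weights were based on patients or person-years, so the weighting here is an assumption, and all numbers are invented.

```python
def directly_standardised_rate(events, person_years, standard_pop, per=1000):
    """Weight each age band's crude rate by the standard population's
    share of that band, then sum. All three dicts share the same keys."""
    total_standard = sum(standard_pop.values())
    rate = 0.0
    for band, std_count in standard_pop.items():
        band_rate = events[band] / person_years[band]
        rate += band_rate * std_count / total_standard
    return rate * per


# Made-up numbers for three 5-year bands (not taken from the study).
events = {"25-29": 10, "30-34": 20, "35-39": 60}
person_years = {"25-29": 10000, "30-34": 10000, "35-39": 20000}
standard_pop = {"25-29": 500, "30-34": 300, "35-39": 200}

rate = directly_standardised_rate(events, person_years, standard_pop)
print(f"{rate:.2f} per 1000 person years")
```

With these invented inputs the band rates are 1, 2 and 3 per 1000 person years and the weights are 0.5, 0.3 and 0.2, giving a standardised rate of 1.7 per 1000 person years.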

Risk factors and missing values

We extracted data from CPRD for all the predictor variables included in one or more of the different algorithms using the same definitions as those used in the original QResearch studies to enable a direct comparison of the results. We developed a mapping between the Read and medication reference tables to identify the equivalent code in each database. This included the following variables recorded at entry to the cohort:

  • Demographics—age (continuous), sex, ethnicity (9 categories—Caucasian, Indian, Pakistani, Bangladeshi, Other Asian, Black Caribbean, Black African, Chinese, other ethnic group), resident in care home, material deprivation (as measured by the Townsend score).

  • Clinical values—smoking status (non-smoker, ex-smoker, light smoker (1–9 cigarettes/day), moderate smoker (10–19 cigarettes/day), heavy smoker (20+ cigarettes/day)); body mass index (BMI); systolic blood pressure; alcohol consumption (non-drinker, trivial (<1 u/day), light (1–2 u/day), moderate (3–6 u/day), heavy (7–9 u/day), very heavy (>9 u/day)).

  • Laboratory results—cholesterol/high-density lipoprotein (HDL) ratio, platelets.

  • Family history—family history of osteoporosis or hip fracture in a first degree relative, coronary heart disease in first degree relative under the age of 60 years, diabetes in a first degree relative.

  • Chronic diseases—congestive cardiac failure, atrial fibrillation, coronary heart disease, cardiovascular disease, peripheral vascular disease, VTE, diabetes, rheumatoid arthritis, systemic lupus erythematosus (SLE), hypertension, renal disease, renal stones, inflammatory bowel disease, dementia, Parkinson's disease, epilepsy, cancer, chronic liver disease or pancreatitis, oesophageal varices, prior haemorrhage, malabsorption, endocrine diseases, asthma or chronic obstructive pulmonary disease, history of falls, prior osteoporotic fracture, varicose vein surgery, emergency admissions or hip surgery in past 6 months.

  • Prescribed medication—anticoagulants, antidepressants, antipsychotics, antiplatelets, oral non-steroidal anti-inflammatory drugs, tamoxifen, oestrogen containing hormone replacement therapy (British National Formulary, chapter 6.4.1.1), systemic corticosteroids, combined oral contraceptive.

The combination of predictor variables required for each risk score varied with the score being validated as shown in table 1. We used the clinical value recorded closest to the date on which the patient entered the study for BMI, systolic blood pressure, smoking status, platelets, and total and HDL cholesterol. Patients were considered to be exposed to medication at entry to the cohort if they had at least two prescriptions for the relevant medication prescribed prior to the study entry date with the most recent one occurring within 28 days of the study entry date.
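The medication exposure rule can be sketched as a small predicate. This is an illustration under stated assumptions rather than the study's code: the function name is hypothetical, and "prior to the study entry date" is read as strictly before that date.

```python
from datetime import date, timedelta


def exposed_at_entry(prescription_dates, entry, window_days=28):
    """True if at least two prescriptions precede entry and the most
    recent of them falls within window_days of the entry date."""
    prior = [d for d in prescription_dates if d < entry]  # assumption: strictly before
    if len(prior) < 2:
        return False
    return entry - max(prior) <= timedelta(days=window_days)


# Hypothetical prescription histories, for illustration only.
entry = date(2000, 1, 1)
print(exposed_at_entry([date(1999, 6, 1), date(1999, 12, 10)], entry))
print(exposed_at_entry([date(1999, 1, 1), date(1999, 6, 1)], entry))
```

The first history counts as exposed (two prior prescriptions, the latest 22 days before entry); the second does not, because the latest prescription is more than 28 days before entry.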

Table 1

Summary of QPrediction scores including outcome and predictor variables

Townsend scores

We used the Townsend score evaluated at output area as a proxy for material deprivation. The CPRD data set differs from the QResearch data set in that each patient in the CPRD data set is allocated to a 10th of deprivation (as measured by the Townsend score) and only the category number is provided. In contrast, each patient in the QResearch data set is allocated the individual Townsend score corresponding to their output area of residence (ie, continuous data). In order to calculate risk scores in the CPRD cohort, we used the median value for each 10th as supplied by CPRD. Patients with missing Townsend scores were excluded from the cohorts.

Discrimination and calibration statistics

We used multiple imputation by chained equations, implemented with the ice procedure in Stata,14 to replace missing values for BMI, systolic blood pressure, smoking status, alcohol, and total and HDL cholesterol. We created five multiply imputed data sets and used Rubin's rules to combine effect estimates and SEs to allow for the uncertainty due to imputing missing data.15,16
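For readers unfamiliar with Rubin's rules, the following minimal sketch shows how an estimate and its SE are pooled across imputed data sets: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The five estimates below are invented, and the text does not specify which quantities were pooled in this way.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Combine per-imputation estimates and SEs using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Invented estimates from five imputed data sets.
estimates = [0.50, 0.52, 0.48, 0.51, 0.49]
standard_errors = [0.02, 0.02, 0.02, 0.02, 0.02]

pooled, pooled_se = pool_rubin(estimates, standard_errors)
print(round(pooled, 3), round(pooled_se, 4))
```

The pooled SE (about 0.026 here) is larger than any single-data-set SE of 0.02 because it carries the extra uncertainty from imputation.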

We applied the algorithm for each score to eligible patients in the CPRD study cohort to obtain predicted risks for each of the relevant clinical outcomes. We calculated the estimated risk for eligible patients in the CPRD validation data set over 5 or 10 years depending on which score was used. We then tested the performance of each score in the CPRD cohort and compared it with the published results from the original QResearch validation cohorts.

In order to assess calibration (ie, degree of similarity between predicted and observed risks), we calculated the mean predicted risk and the observed risk17 obtained using the Kaplan-Meier estimate, and compared the ratio of the mean predicted risk to the observed risk for patients in the validation cohort in each decile of predicted risk. We calculated the area under the receiver operating characteristic (ROC) curve to assess discrimination (ie, ability of a risk prediction equation to distinguish between those who do and do not have an event during the follow-up period). We also calculated the D statistic18 and an R2 statistic derived from the D statistic,19 which are measures of discrimination and explained variation appropriate for survival models. The D statistic was developed as a measure of discrimination specifically for censored survival data. Higher values indicate better discrimination, and an increase in the D statistic of at least 0.1 indicates an important difference in prognostic separation between different risk classification schemes. The R2 statistic derived from the D statistic is also specific to censored survival data: it measures explained variation in time to the outcome event, and higher values indicate that more variation is explained.20 We also repeated the assessment of discrimination after restricting the analysis for each score to patients without missing data for the relevant clinical or laboratory measures used in the risk score (ie, those with complete data for all predictor variables in the risk score).
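The calibration comparison can be sketched as follows. This is a simplified illustration, not the Stata code used in the study: the function names and toy data are assumptions, the Kaplan-Meier estimator is written for clarity rather than speed, and patients are split into tenths by rank of predicted risk.

```python
def kaplan_meier_risk(times, events, horizon):
    """Observed risk by `horizon`: 1 - S(horizon), where S is the
    Kaplan-Meier survival estimate. events[i] is 1 for an outcome
    and 0 for censoring at times[i]."""
    survival = 1.0
    for t in sorted(set(times)):
        if t > horizon:
            break
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - failed / at_risk
    return 1 - survival


def calibration_by_tenth(predicted, times, events, horizon):
    """Ratio of mean predicted risk to Kaplan-Meier observed risk in each
    tenth of predicted risk (lowest tenth first). Needs >= 10 patients."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    ratios = []
    for k in range(10):
        idx = order[k * n // 10:(k + 1) * n // 10]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = kaplan_meier_risk(
            [times[i] for i in idx], [events[i] for i in idx], horizon)
        ratios.append(mean_pred / observed if observed > 0 else None)
    return ratios


# Toy data: five patients followed for up to 10 years.
toy_risk = kaplan_meier_risk([1, 2, 3, 4, 5], [1, 0, 1, 0, 0], horizon=10)
print(round(toy_risk, 3))
```

A ratio close to 1 in every tenth indicates good calibration; ratios above 1 indicate over prediction and ratios below 1 indicate under prediction.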

We identified the proportion of patients in the CPRD validation cohort who were in the top decile of predicted risk and used this to calculate the sensitivity, specificity and observed risk at this threshold. We used the top decile for comparability across the scores and with previous studies though the choice of threshold for use in clinical practice will depend on the context and cost-effectiveness of relevant interventions. Analyses were conducted using Stata (V.13.1).
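A minimal sketch of the top-decile calculation is given below. It treats the outcome as a known yes/no value for every patient and therefore ignores censoring, which is a simplifying assumption; the text does not describe how censoring was handled at this step. The function name and data are hypothetical.

```python
def top_decile_performance(predicted, had_event):
    """Label the highest tenth of predicted risks as 'high risk' and
    return (cutoff, sensitivity, specificity). Needs >= 10 patients."""
    n = len(predicted)
    ranked = sorted(predicted, reverse=True)
    cutoff = ranked[n // 10 - 1]  # lowest predicted risk still in the top tenth
    pairs = list(zip(predicted, had_event))
    tp = sum(1 for p, e in pairs if p >= cutoff and e)
    fn = sum(1 for p, e in pairs if p < cutoff and e)
    tn = sum(1 for p, e in pairs if p < cutoff and not e)
    fp = sum(1 for p, e in pairs if p >= cutoff and not e)
    return cutoff, tp / (tp + fn), tn / (tn + fp)


# Invented cohort of 20 patients; three have the outcome.
predicted = [(i + 1) / 20 for i in range(20)]
had_event = [False] * 20
for i in (5, 10, 19):
    had_event[i] = True

cutoff, sensitivity, specificity = top_decile_performance(predicted, had_event)
print(cutoff, round(sensitivity, 3), round(specificity, 3))
```

Here the top tenth contains two patients, one of whom has the outcome, so sensitivity is 1/3 and specificity is 16/17.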

Sample size estimation

There is currently no clear guidance on sample size requirements for studies evaluating the performance (validation) of a multivariable risk score, but a commonly used rule-of-thumb is that it is desirable to seek a data set with at least 100 patients with the outcome of interest. We used all the available data on the CPRD to maximise the power of the study.

Results

Study populations

The CPRD validation cohort consisted of 3.3 million patients, aged 25–99 years registered at 357 general practices with linked data between 1 January 1998 and 31 July 2012. The QResearch cohort consisted of 6.8 million patients from 753 practices with linked data, registered between 1 January 1998 and until 31 July 2013. The numbers of patients in each geographical region are shown in web extra table 1.

Baseline characteristics

Table 2 shows a comparison of the demographic characteristics for the CPRD and QResearch cohorts.

Table 2

Comparison of baseline characteristics of patients in CPRD validation cohort and QResearch comparison cohort

The QResearch population was marginally younger, with 34.2% of women and 32.8% of men aged 25–34 years compared with 27.8% and 26.9% for CPRD.

Recording of ethnicity

QResearch had a higher proportion of patients with self-assigned ethnicity recorded compared with CPRD both overall (58.2% vs 38.1%) and in each of the 10 geographical areas within England (web extra table 2). We repeated the analysis restricting information on QResearch to that recorded prior to 31 July 2012 (for comparability with the calendar time available on CPRD). Of the 6 758 649 patients in the QResearch cohort, 3 856 244 (57.1%) had ethnicity recorded prior to this date.

Recording of family history

Recording of a positive family history of coronary heart disease and diabetes was more than twice as high in QResearch compared with CPRD. For example, 11% of patients in QResearch had a family history of coronary heart disease recorded compared with 4.6% in CPRD (web extra table 2). When information for QResearch was restricted to that recorded prior to July 2012, 10.7% of the 6 758 649 patients had a positive family history of coronary heart disease recorded.

Recording of alcohol and smoking levels

Recording of alcohol levels was very similar in QResearch and CPRD. For example, 82.1% of women had alcohol level recorded in both databases. Recording of smoking status was marginally higher in women compared with men in QResearch (93.2% vs 89.1%) and also CPRD (94.8% vs 90.8%).

Recording of clinical values

Recording of cholesterol/HDL ratio was marginally higher on QResearch compared with CPRD (40.1% vs 36%). Recording of BMI and systolic blood pressure tended to be marginally higher on CPRD than QResearch. However, the mean values for the various clinical values (BMI, systolic blood pressure, serum creatinine and cholesterol/HDL ratio) were extremely similar.

Table 3 shows prescribed medication and clinical diagnoses recorded in patients on or prior to entry to the study cohort. Overall, the prevalence of clinical diagnoses was similar on the two databases with CPRD having marginally higher prescribing rates.

Table 3

Prescribed medication and clinical diagnoses recorded at baseline in CPRD validation cohort and QResearch comparison cohort

The inclusion and exclusion criteria for each risk score are shown in table 4 along with the numbers of patients eligible for each analysis on CPRD. For example, there were 3 177 192 patients aged 25–84 years. Of these, 99 189 had existing diabetes at baseline leaving 3 078 003 for the validation of QDiabetes. Table 4 also shows the numbers and percentage out of those eligible for inclusion with complete data for risk factors necessary for calculation of the score which would otherwise need to be imputed (ie, laboratory or clinical values). The amount of missing data varies substantially between the scores with scores requiring multiple laboratory or clinical values (such as QRISK2) having the lowest levels of completeness.

Table 4

Numbers of patients eligible for each score in the Clinical Research Data Link validation cohort and number of patients with complete risk factor recording not requiring multiple imputation

Comparison between CPRD linked and unlinked data

Web extra table 3 compares the characteristics of the CPRD cohort with linked data with those of the CPRD cohort without linked data. The CPRD cohort with linked data tended to have higher recording of ethnicity than the CPRD cohort without linked data (38.1% vs 28.4%). Recording of smoking, alcohol, BMI, systolic blood pressure, cholesterol and platelets was also higher in the CPRD cohort with linked data than in the cohort without linked data.

Incidence rates of clinical outcomes

Table 5 shows the number of incident events for each clinical outcome in women recorded on GP data and those recorded on either GP data or cause-specific mortality data for both the CPRD and QResearch cohorts. It also shows the age standardised incidence rates per 1000 person years. Table 6 shows the comparable information for men.

Table 5

Comparison of age standardised incidence rates (95% CI) per 1000 person years for outcomes on CPRD versus QResearch database in women

Table 6

Comparison of age standardised incidence rates (95% CI) per 1000 person years for outcomes on CPRD versus QResearch database in men

For example, there were 35 617 incident ischaemic stroke or TIA events for women on CPRD. Of these, 32 283 had been identified on the GP record with an additional 3334 events identified on the linked ONS mortality record. The ascertainment of events on the GP record was therefore 32 283/35 617, that is, 90.6%. For QResearch, there were 70 477 incident stroke events recorded on either the GP or linked ONS mortality record of which 63 572 had been identified on the GP record. The ascertainment was therefore 90.2%.

For thromboembolism in women, 91.1% of events recorded on either the GP or linked ONS mortality record on CPRD were identified on the GP record compared with 90.6% for QResearch. Similar results were obtained for men with levels of ascertainment between the two databases being extremely close suggesting similar recording patterns between the two groups of GP practices contributing to each database.
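The ascertainment figures quoted above for ischaemic stroke or TIA in women follow directly from the event counts in the text, as this short check shows (the function name is illustrative only).

```python
def ascertainment(gp_events, total_events):
    """Percentage of all identified events that appear on the GP record."""
    return 100 * gp_events / total_events


# Counts quoted in the text for stroke or TIA in women.
cprd = ascertainment(32283, 35617)
qresearch = ascertainment(63572, 70477)
print(f"CPRD {cprd:.1f}%, QResearch {qresearch:.1f}%")
```

The CPRD total is also internally consistent: 32 283 events on the GP record plus 3334 additional events on the mortality record gives 35 617.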

The age standardised incidence rates of events on CPRD tended to be marginally lower than those on QResearch as shown by the ratio of the CPRD rates to those in QResearch (table 5). For example, the rate ratio for fractured neck of femur in women was 0.94 indicating that CPRD had a 6% lower incidence rate compared with QResearch. The effect was more marked for moderate or severe kidney failure where the incidence rates for CPRD were approximately 25% lower than those for QResearch in women and 16% lower in men.

The age standardised incidence rates of upper gastrointestinal haemorrhage and intracranial haemorrhage among patients prescribed anticoagulants and those not prescribed anticoagulants are shown in web extra table 4. The rates are similar for CPRD and QResearch.

Validation statistics

Table 7 shows the discrimination statistics for each score in CPRD in men and women and also the published values from previous validations using QResearch. The validation statistics for each of the risk prediction scores were very similar in the CPRD cohort compared with results from QResearch validation cohorts. For example, in women, the QDiabetes algorithm explained 50% of the variation within CPRD compared with 51% on QResearch. The D statistic for women was 2.03 within CPRD compared with 2.08 for QResearch. The ROC value for women was 0.85 on both databases.

Table 7

Performance of QPrediction scores on the CPRD validation cohort compared with published results for the QResearch validation cohort

Of all the scores, QFracture (fractured neck of femur) had the best performance in men in CPRD with a ROC value of 0.89, R2 value of 71% and D statistic of 3.17. The corresponding figures for QResearch in men were 0.89, 72% and 3.26.

QThrombosis had the lowest values for men in CPRD with an ROC value of 0.77, R2 of 34.5% and D statistic of 1.49. The corresponding figures for men in QResearch were 0.75, 33.5% and 1.45.
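The reported R2 and D values can be cross-checked against each other. The sketch below assumes the usual Royston and Sauerbrei relationship for proportional hazards models, R2 = (D²/κ²) / (σ² + D²/κ²) with κ² = 8/π and σ² = π²/6; the paper cites this approach but does not restate the formula, so its use here is an assumption.

```python
from math import pi


def r2_from_d(d):
    """Explained variation implied by a D statistic (assumed
    Royston-Sauerbrei form for proportional hazards models)."""
    kappa_sq = 8 / pi
    sigma_sq = pi ** 2 / 6
    scaled = d ** 2 / kappa_sq
    return scaled / (sigma_sq + scaled)


# D statistics quoted in the text.
for d in (2.03, 2.08, 3.17, 1.49):
    print(d, round(100 * r2_from_d(d), 1))
```

Under this assumption, D statistics of 2.03 and 2.08 correspond to R2 values of roughly 50% and 51%, and a D statistic of 3.17 corresponds to roughly 71%, in line with the figures reported for QDiabetes and QFracture.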

Figure 1A–J compares the mean predicted risks and observed risks for each score across each 10th of predicted risk (1 representing the lowest risk and 10 the highest risk) and demonstrates that the models are generally well calibrated for patients on CPRD.

Figure 1

Calibration of each QPrediction score comparing the mean predicted risks with the observed risks in the CPRD cohort. (A) QThrombosis (venous thromboembolism). (B) QFracture (hip). (C) QFracture (hip, colles, spine, shoulder). (D) QStroke (ischaemic stroke). (E) QDiabetes (type 2 diabetes). (F) QBleed (upper gastrointestinal haemorrhage). (G) QBleed (intracranial haemorrhage). (H) QKidney (moderate or severe kidney failure). (I) QKidney (severe kidney failure). (J) QRisk2 (cardiovascular disease). CPRD, Clinical Research Data Link; CKD, chronic kidney disease; CVD, cardiovascular disease.

The QKidney score (moderate or severe kidney failure) showed the observed risk was lower than the predicted risk. This might indicate a degree of over prediction of the score. Alternatively, it could be related to the lower incidence rate of kidney failure observed among women on the CPRD compared with QResearch.

Web extra table 5 presents the ROC, D and R2 statistics for each score restricted to patients from CPRD with complete recording of laboratory and risk factor data for each score. The results were very similar to those obtained using the multiply imputed data sets for the majority of scores, except for QRISK2 and QStroke, where values were lower. For example, the results for QFracture (hip fracture) in women on CPRD using multiply imputed data were ROC of 0.89, R2 of 70.6% and D statistic of 3.17. The corresponding results restricted to women on CPRD with complete data were 0.9, 70.4% and 3.16. For QRISK2, the results for women for imputed data on CPRD were ROC of 0.88, R2 of 56.4% and D statistic of 2.33. The corresponding results for complete data were 0.79, 40.9% and 1.7.

Performance for the top decile of risk

Table 8 shows the sensitivity, specificity and observed risk for patients in the top decile of each score on CPRD. The observed risk is higher than the risk threshold value since this represents the observed risk within the top decile of predicted risk. For example, the cut-off for the top 10th of risk for QFracture (fractured neck of femur) was a 10-year risk of 3.7%. At this threshold the sensitivity was 66.5%, specificity 90.4% and observed risk 9.4%. The results are similar to those obtained from QResearch (not shown).

Table 8

Performance of each score for predicting the relevant outcome in the CPRD validation cohort. The cut-off is the threshold of predicted risk for the top decile in the CPRD cohort

Discussion

Summary of key findings

This is the first external validation of a set of QPrediction scores on the CPRD. It is important since CPRD represents a fully independent sample of patients registered with general practices using a different clinical computer system (Vision system supplied by In Practice Systems) from the QResearch database (which is based on practices using EMIS clinical systems). Practices using the Vision system together with practices using EMIS make up approximately 75% of all the English general practices. The discrimination and calibration statistics for each score were remarkably similar in CPRD to those published from validation cohorts from QResearch. Our paper also provides updated information on a direct comparison between two of the world's largest general practice databases, both of which have been linked to mortality and secondary care data.

Before a clinical risk score can be reliably used in clinical practice, evidence is needed that it can successfully predict the intended outcome in groups of patients other than the ones used to develop the score but similar to the ones in whom the score might be used. Not all risk scores perform well in external samples. This can be due to deficiencies in the design or modelling methods used to derive the algorithm, for example, if the model is over fitted or if an important predictor is absent.21 Other reasons for poor performance include differences between the setting of patients in the new and derivation samples, differences in how information is recorded and differences in patient characteristics.21 It is for these reasons that we have meticulously assembled the CPRD cohort using the same inclusion/exclusion criteria and the same definitions of predictor and outcome variables as in the original derivation studies. Any differences observed are therefore more likely to be due to capture of information and underlying population characteristics. In this study, we have found marginal differences in incidence rates between QResearch and CPRD and higher rates of recording of family history and ethnicity in QResearch, though these have not been large enough to materially affect our results.

Strengths and limitations

One strength of using CPRD for risk score validation is that the risk score can be assessed using data collected in a similar manner to the data that would be used when the risk score is applied in clinical practice. CPRD had linked Townsend scores for patients in only approximately half of its practices (unlike QResearch, where the Townsend score is included for all practices), so we had to limit the CPRD validation cohort for this analysis to practices with linked Townsend scores. We compared patients registered with CPRD practices with and without linked data. We found marginally higher recording of ethnicity, smoking, alcohol and clinical values in the CPRD cohort with linked data than in the unlinked data, but similar characteristics for demographics, comorbidities, medication and clinical values (results not shown), so we have no reason to believe this would have biased our results.

Another strength of general practice databases is the large volume of patients, who tend to be representative of the general population. A limitation of routinely collected data is that not all patients will have all clinical and laboratory data recorded, leading to missing values in some of the parameters needed to calculate the risk scores. We have reported performance both in all patients, using multiple imputation to replace missing values, and in analyses restricted to patients without missing values, and found very similar results for the majority of algorithms tested. There was some degradation of performance for algorithms with large amounts of missing data, particularly QRISK2 and QStroke. However, in clinical practice, the risk scores can be calculated using information recorded during the consultation, reducing the amount of missing data. Alternatively, the software which implements QPrediction scores includes algorithms which estimate BMI, systolic blood pressure and cholesterol/HDL ratio. The estimated values can be used where the relevant data are not recorded in order to generate an estimated risk score. In effect, the software emulates the multiple imputation used in our validation, which gives the results based on multiply imputed data reasonable face validity.
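
For readers unfamiliar with how results from multiply imputed data sets are usually combined, the sketch below shows Rubin's rules in Python. This discussion does not state how the estimates were pooled, so the use of Rubin's rules here is an assumption about standard practice rather than a description of our code. The function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one estimate per imputed data set using Rubin's rules.

    estimates: the statistic of interest from each imputed data set.
    variances: the squared standard error of that statistic in each data set.
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least two imputations")
    pooled = mean(estimates)          # average of the per-imputation estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Made-up values for three imputed data sets (illustrative only).
estimate, standard_error = pool_rubin([0.84, 0.85, 0.86], [0.0001, 0.0001, 0.0001])
print(estimate, standard_error)
```

The between-imputation term widens the standard error to reflect the extra uncertainty introduced by the missing values, which is why a complete-case analysis and a multiply imputed analysis can give similar point estimates but different precision.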

The difficulty of obtaining a comprehensive code list for any given outcome or exposure is a limitation common to all research in primary care databases. We mitigated this by matching our code lists for the CPRD primary analysis to the code lists in the QResearch derivation data set wherever possible. The CPRD database uses the same clinical coding system as QResearch for clinical values (Read V.2). However, there is a third clinical system in use in England (SystmOne) which uses a different coding system known as Clinical Terms V.3 (CTV3). While there is a mapping between Read codes and CTV3, we have not tested the algorithms on a database using CTV3 in this study, so we are unable to draw conclusions about the generalisability of the validation results to practices using this system.

The quality of information on CPRD is likely to be good, since previous studies validating similar outcomes and exposures have found good levels of completeness and accuracy.22,23

Comparison with other studies

The aim of this study was to validate a collection of QPrediction tools. The details of the derivation and first validation of each prediction tool have been separately published in the peer-reviewed literature including information on definitions of predictor variables with supplementary information available on the relevant websites. We have not duplicated information in the present paper but have provided the relevant links and references.

Our validation results confirm earlier studies undertaken on the THIN database (another general practice database which is derived from the Vision system but which is not linked to mortality data). These earlier studies include external validations of QRISK2,10,11,24 QDiabetes,12 QFracture9 and QKidney25 by an independent team who were not involved in the development of the algorithms. These independent validations demonstrated similar performance to the validations performed by the study authors using the QResearch database. This study builds on previous validations by providing new information on the performance of scores not previously validated on an external database (QBleed and QThrombosis) and by utilising linked data which were not available on the THIN database. Taken together with the present study, these earlier results provide consistent evidence that the QPrediction scores are likely to provide appropriate estimates of disease risk in contemporary primary care populations in England and to discriminate between patients at different levels of risk with reasonable reliability.

Comparison of QResearch and CPRD baseline characteristics

Overall, our results show a striking similarity between the CPRD and QResearch cohorts for nearly all baseline characteristics. There are two notable exceptions. First, recording of ethnicity was higher in QResearch than in CPRD. Second, fewer patients in the CPRD cohort had a recorded family history of diabetes and coronary heart disease in a first degree relative under the age of 60 years. The differences in recording of ethnicity and family history were not explained by geographic differences or by differences in the data capture period between the two databases. Given the similarity for the other risk factors and treatments, it is likely that the difference in ethnicity and family history recording reflects a difference in recording patterns between the two clinical computer systems rather than a true difference between the two cohorts. A similar pattern for recording of ethnicity and family history was also reported in the validation of QRISK on the THIN database.11,26 This was thought to be due to different usage of clinical templates in the clinical system, with EMIS practices having ethnicity and family history included more often, thereby prompting the user to enter this information in a more systematic fashion.

Comparison of QResearch and CPRD incidence rates

The age-standardised incidence rates for each condition were generally marginally higher on QResearch than on CPRD, although the proportions of events identified on GP data (out of all events recorded on either GP or linked mortality data) were very close. This suggests that patterns of recording of major clinical events are very similar between QResearch and CPRD, although the absolute value varies by clinical condition. For example, 91% of ischaemic stroke events recorded on either GP or linked mortality data are identified on the GP record, compared with 99% of hip fractures. We also note the lower levels of total cardiovascular events in the GP clinical record, which were between 13% and 15% lower than the total recorded on either the GP record, the linked mortality record or the linked hospital admissions record. Some of this will reflect new sudden events where the first presentation was a hospital admission or death, while some may reflect under-representation of existing cases not recorded in the GP record. Our study is unable to distinguish between these two scenarios, though the latter potentially has clinical consequences: a patient who is not identified as having cardiovascular disease may not be offered secondary prevention.
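
The proportion of events identified on the GP record can be expressed as a simple set calculation. The sketch below is illustrative only, assuming each data source is available as a collection of patient identifiers with the event of interest; the function name and the identifiers are hypothetical and this is not our analysis code.

```python
def capture_proportion(primary_ids, *linked_sources):
    """Proportion of all patients with an event in any source who are
    identified in the primary source (for example, the GP record).

    primary_ids: identifiers of patients with the event in the primary source.
    linked_sources: one or more collections of identifiers from linked data
    (for example, mortality records or hospital admissions).
    """
    primary = set(primary_ids)
    all_events = set(primary)
    for source in linked_sources:
        all_events |= set(source)
    if not all_events:
        raise ValueError("no events recorded in any source")
    return len(primary) / len(all_events)


# Made-up identifiers (illustrative only).
gp_events = {1, 2, 3}
mortality_events = {3, 4}
print(capture_proportion(gp_events, mortality_events))
```

Passing a second linked source, such as hospital admissions, enlarges the denominator, which is why the shortfall in the GP record is larger when hospital data are included than when only mortality data are linked.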

We think that the information on baseline characteristics and incidence rates will have utility beyond the present study, since it suggests that the two databases are fundamentally similar in many respects and likely to generate similar results for a range of epidemiological studies.27

Summary

In summary, we have tested a set of QPrediction scores using an external independent cohort of practices contributing to the CPRD. The results demonstrate good performance, comparable to the results obtained from QResearch, meaning that the findings of studies performed in either database are likely to be applicable in England.

Acknowledgments

We thank EMIS and EMIS practices for their contribution to the QResearch database. We thank CPRD and Vision Practices for allowing access to the CPRD for this study.

Footnotes

  • Contributors JH-C initiated the study, undertook the literature review, data extraction, data manipulation and primary data analysis and wrote the first draft of the paper. JH-C is the guarantor. CC contributed to the design, analysis, interpretation and drafting of the paper. PB contributed to the development of core ideas, the analysis plan, interpretation of the results and drafting of the paper.

  • Funding The validation of the QPrediction scores is funded by the National Institute for Health Research's School for Primary Care Research (project reference number 094).

  • Competing interests JH-C is professor of clinical epidemiology at the University of Nottingham and codirector of QResearch—a not-for-profit organisation which is a joint partnership between the University of Nottingham and EMIS (leading commercial supplier of IT for 60% of general practices in the UK). JH-C is also director of ClinRisk Ltd which produces open and closed source software to ensure the reliable and updatable implementation of clinical risk algorithms within clinical computer systems to help improve patient care. CC is associate professor of Medical Statistics at the University of Nottingham and a consultant statistician for ClinRisk Ltd. PB has received financial support for undertaking the validation work from the National School for Primary Care Research.

  • Ethics approval The project was approved in accordance with the QResearch agreement with Trent Research Ethics Committee (ref 03/04/021) and approved by the ISAC committee of the CPRD (ref 13_079).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.