
Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker
  1. Karel G M Moons (1)
  2. Andre Pascal Kengne (1, 2, 3)
  3. Mark Woodward (2, 4)
  4. Patrick Royston (5)
  5. Yvonne Vergouwe (1)
  6. Douglas G Altman (6)
  7. Diederick E Grobbee (1)

  Affiliations: (1) Julius Centre for Health Sciences and Primary Care, UMC Utrecht, Utrecht, The Netherlands; (2) Cardiovascular Division, The George Institute for Global Health, University of Sydney, Sydney, Australia; (3) NCRP for Cardiovascular and Metabolic Diseases, South African Medical Research Council, University of Cape Town, Cape Town, South Africa; (4) Department of Epidemiology, Johns Hopkins University, Baltimore, USA; (5) MRC Clinical Trials Unit, London, UK; (6) Centre for Statistics in Medicine, University of Oxford, Oxford, UK

  Correspondence to: Professor Karel Moons, Julius Center for Health Sciences and Primary Care, UMC Utrecht, PO Box 85500, 3508 GA Utrecht, The Netherlands; k.g.m.moons{at}umcutrecht.nl

Abstract

Prediction models are increasingly used to complement clinical reasoning and decision making in modern medicine in general, and in the cardiovascular domain in particular. Developed models first and foremost need to provide accurate and (internally and externally) validated estimates of probabilities of specific health conditions or outcomes in targeted patients. The adoption of such models must subsequently guide physicians' decision making and individuals' behaviour, and consequently improve individual outcomes and the cost-effectiveness of care. In a series of two articles we review the consecutive steps generally advocated for risk prediction model research. This first article focuses on the different aspects of model development studies, from design to reporting, how to estimate a model's predictive performance and the potential optimism in these estimates using internal validation techniques, and how to quantify the added or incremental value of new predictors or biomarkers (of whatever type) beyond existing predictors. Each step is illustrated with empirical examples from the cardiovascular field.

  • Prediction model
  • risk prediction
  • model development
  • internal validation
  • model improvement
  • reclassification
  • added value
  • biomarkers
  • obesity
  • clinical hypertension
  • prevention
  • diabetes
  • general practice
  • epidemiology


Introduction

Risk prediction models use predictors (covariates) to estimate the absolute probability or risk that a certain outcome is present (diagnostic prediction model) or will occur within a specific time period (prognostic prediction model) in an individual with a particular predictor profile.1–5 A model refers to the (mathematical) function which relates the presence or occurrence of the outcome of interest to a set of predictors. Predictors may range from subject characteristics (eg, age and sex), history and physical examination results, to imaging, electrophysiology, blood, urine, coronary plaque or even genetic markers. In the cardiovascular domain, well known prediction models are the Framingham,6 SCORE,7 ASSIGN,8 EUROSCORE,9 PROCAM10 and Wells' scores.11 12 Prediction models are developed, in most cases, to guide healthcare professionals and individuals in their decision making regarding further management—including additional testing and initiating or withholding treatment(s) and lifestyle changes—and to inform individuals about their risks of having (diagnosis) or developing (prognosis) a particular disease or outcome.13 They are not meant to replace the qualitative reasoning of healthcare professionals or to take over their job, but rather to supplement their reasoning and decision making by providing more objectively estimated probabilities.13–17

In the current era of risk-tailored and personalised cardiovascular care, studies on prediction models are abundant.18–20 This will only further increase with the ever-increasing interest in searching for novel (cardiovascular) biomarkers, varying from simple blood markers, such as C-reactive protein, to more invasively measured markers in atherosclerotic plaque material,21 cellular markers22 and genetic or proteomic markers.23 The recent statement of the American Heart Association on criteria for the (phased) evaluation of markers of cardiovascular risk24 underlines this increase. A key term in this statement was ‘multivariable prediction model’; cardiovascular markers should not (simply) be evaluated in isolation for their predictive abilities but rather on their added predictive contribution beyond existing or established predictors requiring a multivariable approach in design, conduct, analyses and reporting.25 Thus, if there are known predictors or even existing prediction models for the outcome under study, researchers should quantify whether they may be usefully extended with the new marker, or whether existing predictors may even be replaced by it. This recommendation24 25 is of utmost importance in the ‘omics’ setting where huge numbers of markers are usually studied in high-throughput studies, and frequently each genetic marker is tested separately for its association with the outcome. Besides the considerable danger of false positive findings,26–28 the predictive ability of a marker in isolation is no guarantee of a true predictive role beyond established predictors.29

This increased attention to multivariable prediction models does not automatically imply that prediction model research is well conducted and reported. Various reviews have discussed the poor reporting and conduct in the field of clinical prediction modelling.30–35 This has also led to the recent Genetic Risk Prediction Studies (GRIPS) statement to strengthen the reporting and, indirectly, conduct of risk prediction studies with genetic predictors.36 37 Moreover, the number of published prediction models, even for the same disorder or clinical domain, has sharply increased in the last decade.2 Currently, it is often hard for practitioners to determine whether and when to use which particular model, to support their decision making. A consequence may be lack of confidence in the modelling approach to prediction and, instead, a reliance on personal judgement alone.

Prompted by these reviews, in a series of two articles we provide an overview of the three successive steps in prediction model research. As we recently described,13 38–40 these steps involve studies aimed at developing and internally validating a prediction model; testing, and if necessary, adjusting or updating the model for other individuals (external validation); assessing the model's impact on therapeutic management and individual outcomes. This first article of the series focuses on the first step, including the assessment of the incremental value of a new predictor, while the other two steps are covered in the second article.41 For each step we discuss the main issues of design, analysis and interpretation, and illustrate these with empirical examples from the cardiovascular domain. Our aim is to better guide healthcare professionals to interpret the massive number of papers in the field of cardiovascular prediction modelling.

Focus of the series

We focus on prognostic prediction models in cardiovascular medicine, but the issues addressed can be applied to diagnostic prediction models. Diagnostic models typically focus on prediction of the presence or absence of disease (binary outcome). Prognostic models use similar reasoning and similar binary outcomes if the follow-up period is relatively short, although frequently prognostic outcomes are captured in the form of times from a well defined time origin until their occurrence much later (eg, years) in time (ie, time to event outcome). Moreover, prognostic cardiovascular prediction models can be developed for primary or secondary prevention. Many cardiovascular disease (CVD) scores have been developed from individuals selected from the general population to predict future CVD events, such as the above-mentioned Framingham risk score,6 or from more specific population subgroups, such as the ADVANCE CVD risk score,42 which was developed from individuals with diabetes. Finally, we focus on models developed to predict the risk of developing (‘hard’) outcome events, such as myocardial infarction, stroke or cardiovascular-related death, rather than on the prediction of continuous outcomes (such as blood pressure, haemoglobin A1c levels, coronary artery calcium scores, or quality of life), simply because most prediction models, by far, are risk prediction models.

Developing a prediction model

The development of a multivariable prediction model generally requires identification of the important predictors out of a set of preselected candidate predictors; assigning the relative weights for each predictor in a combined risk score; estimating the model's predictive performance, including its calibration, discrimination and (re)classification properties; assessing its potential for optimism using so-called internal validation techniques; and, if necessary, adjusting the model for overfitting.13

Box 1 provides an overview of the most important issues in studies aimed at developing a prediction model. Below we highlight a few specific issues.

Box 1

Guide on the main design and analysis issues for studies aimed at developing a prediction model, including estimating the added value of a new predictor or (bio)marker

Design

  • Objective: to develop a model/tool to enable objective estimation of outcome probabilities (risks) according to different combinations of predictor values.

  • Study participants: individuals with the same characteristic, for example, individuals with a particular symptom or sign suspected of a particular disease or with a particular diagnosis, at risk of having (diagnostic prediction model) or developing (prognostic prediction model) a specific health outcome.

  • Sampling design: cohort, preferably prospective to allow for optimal documentation of predictors and outcomes, including a cohort of individuals that participated in a randomised therapeutic trial. Case–control studies are not suitable, except nested case–control or case–cohort studies.

  • Outcomes: relevant to individuals, and preferably measured without knowledge of the measured predictor values. Methods for outcome ascertainment, blinding for the studied predictors and duration of follow-up (if applicable) should be clearly defined.

  • Candidate predictors: theoretically, all potential and not necessarily causal correlates of the outcome of interest. Commonly, however, pre-selection based on subject matter knowledge is recommended. As with the outcomes, candidate predictors should be clearly defined and measured in a standardised and reproducible way.

Analysis

  • Missing values: analysing only the individuals with completely observed data may lead to biased results. Imputation of missing values, preferably multiple imputation, often yields less biased results.

  • Continuous predictors: should not be turned into dichotomies, and linearity should not be assumed. Simple predictor transformations can be used to detect and model non-linearity, increasing the predictive accuracy of the prediction model.

  • Predictor selection in the multivariable modelling: selection based on univariable analysis (single predictor–outcome associations) is discouraged. Preferably, if needed, backwards selection or a full model approach should be used, depending on a priori knowledge.

  • Model performance measures: discrimination (eg, c-index), calibration (plots), and (re)classification measures.

  • Internal validation: bootstrapping techniques can quantify the model's potential for overfitting, its optimism in estimated model performance measures and a shrinkage factor to adjust for this optimism.

  • Added value of predictor/test/marker: should be assessed for any subsequent (or new) predictor, certainly if its measurement is burdensome or costly. Since overall performance measures (eg, c-index) are often insensitive to small improvements, reclassification measures may be used for this purpose.

Source of data

The data for developing a prognostic prediction model would ideally come from a prospective cohort, or cohorts. Randomised trials are a special form of prospective cohort study,5 hence trial data are also suitable for developing a prognostic prediction model. However, the predictive effects of the randomised treatments should be tested in the model. Prognostic models obtained from randomised trial data may be less generalisable due to, for example, strict eligibility criteria, increasing the need for testing such models in a non-randomised setting.13 Retrospective cohort studies, using existing subject data usually documented for other reasons such as routine care hospital records, can address longer follow-up times but usually at the expense of poorer, less systematically obtained data.13 Unfortunately, the prognostic literature is dominated by retrospective studies. Finally, typical case–control studies, in which cases and controls are sampled from a source population of unknown size, are efficient for studies aimed at finding the independent predictors of an outcome out of a larger set, but not for developing a prediction model. This is because this design does not allow for estimation of absolute risks, as the correct baseline risk or hazard cannot be retrieved from the data. This is only possible when using a nested case–control or case–cohort design.25 43 The latter designs are particularly cost-effective if predictor measurements are relatively expensive (eg, for imaging markers) or burdensome, if many predictors need to be measured (eg, in proteomic and genomic marker studies), when the outcome is rare and for reanalysis of human material stored in biobanks.25

Outcomes

Outcomes for (prognostic) prediction studies would preferably be those that matter to individuals or patients. These could include death, disease recurrence or remission of disease. The duration of follow-up for outcome data collection and the methods for outcome measurement and ascertainment should be clearly defined. Ideally, to avoid possible bias, outcome measurement should be blinded to, or independent of, any knowledge of the predictors under consideration.

Candidate predictors

Candidate predictors are variables that are chosen to be studied for their predictive performance. We stress that candidate predictors are all those considered in the multivariable analysis, and certainly not only those included in the final model after applying predictor selection methods (see below). Candidate predictors can include subject demographics, clinical history, physical examination, disease characteristics, test results and, as discussed above, previous treatments. Theoretically, all variables suspected of being associated with the outcome of interest could be considered as candidate predictors, but this association does not need to be causal. Examples of highly predictive, but non-causal, factors in prediction models are skin colour in the Apgar score and tumour markers as predictors of cancer progression or recurrence.13

Researchers frequently measure more predictors than can reasonably be analysed, let alone be included, in the ensuing model. To reduce the risk of false positive findings (predictors), the so-called ‘EPV (events per variable) 1 to 10 rule of thumb’ is often applied. This rule, which is not based on convincing scientific reasoning, suggests that at least 10 individuals having (developed) the event of interest are needed per candidate variable/predictor to allow for reliable prediction modelling.44 45 Accordingly, some a priori predictor reduction is often needed. For example, one may combine similar predictors to a single one (eg, define ‘cardiovascular disease history’ as one predictor including all different types of cardiovascular disorders) or exclude predictors that are highly correlated with others.46
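
As a rough sketch of this heuristic (it is a rule of thumb, not a formal requirement), the events per variable can be written as:

```latex
\text{EPV} = \frac{\text{number of outcome events}}{\text{number of candidate predictor parameters}} \;\gtrsim\; 10
```

For instance, in the ADVANCE example below, 473 events and 26 candidate predictors correspond to an EPV of about 18.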

Finally, predictors should be clearly defined, and measured in a standardised and reproducible way to improve the applicability and predictive stability of the ensuing model by others in new individuals.47 If one wants to formally quantify whether a specific predictor (eg, some imaging test result) may replace existing predictors (eg, some specific metabolomic or genetic marker), the observer of the former should be blinded to the results of the latter, and vice versa, to prevent so-called incorporation bias.48

Data quality

There is unfortunately no consensus on how to evaluate the quality of the data. Investigators must use their judgement. If possible, measurements of the candidate predictors and outcomes should be standardised across participating centres and/or professionals. Predictors for which there is evidence of considerable measurement error or inter-observer variability may be less suitable because these will very likely yield a different predictive ability of the model when tested or applied in other or future individuals.

Missing data

Missing values are common in medical research, including prediction research.49 50 The potential influence of missing values on study results increases with the percentage of data that are missing. Missing data are usually related, directly or indirectly, to other subject information or variables, including the outcomes under investigation. Hence, missing data are usually selectively missing. Therefore, simply excluding the participants with missing values from the analysis reduces the effective sample size and may also lead to inaccurate estimates of the predictor–outcome associations and the predictive performance of the final model because the individuals with completely observed data are then not a random subsample of the original study sample.51–57 Imputation techniques, especially multiple imputation, have increasingly been advocated to address the issue of missing values.51–55 57

In multiple imputation, for any variable (candidate predictor or outcome) with missing values, a so-called multivariable imputation model is developed using the individuals with observed data. This imputation model is then applied to each individual with a missing value, using that individual's other, observed variables to estimate and fill in the missing value, resulting in a complete dataset called an 'imputed dataset'. This process is performed multiple times (eg, 10 times), yielding different imputation models and thus different imputed datasets. The prediction modelling analyses (see below) can then be applied to each imputed dataset to estimate the predictor effects (for example, HRs or ORs) and the other predictive performance statistics. Finally, using standard procedures, these multiple analysis results are simply averaged to produce one overall result, with accompanying standard errors or CIs that account for the fact that not all data were actually observed but were partly estimated.51–55 57
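
As a minimal sketch of this workflow, the snippet below uses scikit-learn's IterativeImputer (an approximation to multiple imputation by chained equations) to create several imputed datasets, fits a logistic model to each and pools the coefficients with Rubin's rules. All variable names are hypothetical, the outcome is assumed to be fully observed and the predictors numeric; this is an illustration, not the procedure used in any particular study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def fit_with_multiple_imputation(data, outcome_col, m=10, seed=0):
    """Impute missing predictor values m times, fit a logistic model on each
    imputed dataset and pool results with Rubin's rules (illustrative sketch)."""
    predictors = data.drop(columns=[outcome_col])  # outcome assumed fully observed
    coefs, variances = [], []
    for i in range(m):
        # sample_posterior=True draws imputations rather than point predictions,
        # so repeated runs give genuinely different imputed datasets
        imputer = IterativeImputer(sample_posterior=True, random_state=seed + i)
        X_imp = pd.DataFrame(imputer.fit_transform(predictors),
                             columns=predictors.columns, index=predictors.index)
        fit = sm.Logit(data[outcome_col], sm.add_constant(X_imp)).fit(disp=0)
        coefs.append(fit.params)
        variances.append(fit.bse ** 2)
    coefs, variances = pd.DataFrame(coefs), pd.DataFrame(variances)
    pooled = coefs.mean()                                # pooled coefficients
    within = variances.mean()                            # within-imputation variance
    between = coefs.var(ddof=1)                          # between-imputation variance
    pooled_se = np.sqrt(within + (1 + 1 / m) * between)  # Rubin's rules total SE
    return pooled, pooled_se
```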

Another consideration with missing data is whether a variable that is used to develop the score and is frequently missing in the study may also be unavailable in populations to which the score will later be applied. If so, it is sensible to omit it from consideration in the prediction model.

Modelling continuous predictors

The temptation to convert a continuous variable into categories should be resisted, largely because information is lost compared with using the variable in its continuous form.46 58–61 However, linearity of the continuous predictor–outcome association should not automatically be assumed—to do so can lead to incorrect interpretation of the effects of the predictor and inaccurate predictions when the model is applied to new individuals. Simple predictor transformations should be systematically tested to explore non-linearity. Such transformations include fractional polynomials and restricted cubic splines.46 58–61
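
The sketch below illustrates one way to check for non-linearity along these lines: a linear term is compared with a restricted (natural) cubic spline using a likelihood ratio test, via statsmodels and patsy's cr() transform. The data are synthetic and the variable names (cvd_event, age, sex) are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic data purely for illustration: a U-shaped age effect on a binary outcome.
rng = np.random.default_rng(42)
n = 2000
age = rng.uniform(30, 80, n)
sex = rng.integers(0, 2, n)
true_logit = -4 + 0.002 * (age - 55) ** 2 + 0.3 * sex
df = pd.DataFrame({"cvd_event": rng.binomial(1, 1 / (1 + np.exp(-true_logit))),
                   "age": age, "sex": sex})

# Keep age continuous: linear term versus a restricted (natural) cubic spline.
linear = smf.logit("cvd_event ~ age + sex", data=df).fit(disp=0)
spline = smf.logit("cvd_event ~ cr(age, df=4) + sex", data=df).fit(disp=0)

# Likelihood ratio test: do the extra spline terms improve on the linear term?
lr_stat = 2 * (spline.llf - linear.llf)
df_diff = spline.df_model - linear.df_model
print(f"LR test for non-linearity of age: p = {stats.chi2.sf(lr_stat, df_diff):.4f}")
```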

Developing the final model (predictor selection)

There is no consensus about the best method of arriving at the final model; that is, how candidate predictors are to be selected for inclusion in the multivariable analyses and subsequently how predictors are selected for inclusion in the final prediction model. Two broad strategies, each with variants, are common in the literature: the full model approach and the predictor selection approach.

In the full model approach, all a priori selected candidate predictors are included in the multivariable analyses and no further predictor selection is used: all candidate predictors are included in the final prediction model. Proponents argue that this avoids so-called predictor selection bias (eg, incorrectly including spurious predictors in the final model) and overfitting.2 46 The full model, however, is often not easy to define as it requires prior knowledge about the most promising candidate predictors, certainly when the number of events is limited and studying too many candidate predictors must be avoided.46 62

The other main approach is the use of predictor selection in the multivariable analyses. Here, candidate predictors that do not contribute usefully in the multivariable model are removed. Backward elimination starts with all candidate predictors in the multivariable model and runs a sequence of tests to remove or keep variables in the model based on a predefined nominal significance level for variable exclusion, for example, using the log likelihood ratio test for comparing two models. Conversely, in the less preferable forward selection approach, the model is built up in steps from the best candidate predictors.63 Compared with backward elimination, forward selection does not provide for a simultaneous assessment of the effects of all candidate variables.29 In addition, correlated variables may remain in the model using backward elimination, while none of them might enter the model using forward selection.64
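
A minimal sketch of backward elimination with log likelihood ratio tests is shown below, using statsmodels for a binary outcome. The significance level, the data frame and the predictor names are assumptions for illustration; in practice the same procedure would also be repeated within each bootstrap sample during internal validation (see below).

```python
import statsmodels.formula.api as smf
from scipy import stats


def backward_eliminate(data, outcome, candidates, alpha=0.05):
    """Backward elimination for a logistic model using likelihood ratio tests
    (illustrative sketch). `alpha` is the nominal significance level for exclusion."""
    current = list(candidates)
    while len(current) > 1:
        full = smf.logit(f"{outcome} ~ " + " + ".join(current), data=data).fit(disp=0)
        worst, worst_p = None, -1.0
        for var in current:
            reduced = smf.logit(f"{outcome} ~ " + " + ".join(v for v in current if v != var),
                                data=data).fit(disp=0)
            lr = 2 * (full.llf - reduced.llf)                        # log likelihood ratio
            p = stats.chi2.sf(lr, full.df_model - reduced.df_model)
            if p > worst_p:
                worst, worst_p = var, p
        if worst_p > alpha:
            current.remove(worst)   # drop the least informative predictor and refit
        else:
            break                   # every remaining predictor meets the criterion
    return current
```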

The choice of a relatively small significance level (eg, p<0.05 or even p<0.01) generates models with fewer predictors, though at the risk of missing potentially important predictors, while larger levels (eg, p<0.20 or p<0.25) increase the risk of selecting less important predictors. In both cases, so-called overfitted models may arise, specifically in small datasets. Selection additionally leads to unstable models because the selected predictors will vary depending on the specifics of the dataset at hand. Therefore, regardless of which type of variable selection and p value is used, subsequent internal validation of the models—using, for example, bootstrapping techniques in which this predictor selection process is repeated in every bootstrap sample (see below)—is recommended to gain insight into the likelihood of the model missing important variables, being overfitted or unstable.46 62

Regardless of which predictor selection method is used in the multivariable analyses, in line with previous studies, we suggest not excluding predictors for multivariable analyses on the sole consideration that a predictor is not statistically significantly related to the outcome (eg, not having a p value<0.05) in the univariable analysis.38 46 62 65–67 Univariable analyses estimate each individual predictor–outcome association.

Assigning the relative weight per predictor

The multivariable analysis estimates regression coefficients (eg, log odds ratios or log HRs) of each predictor included in the final model, which are mutually adjusted for the other predictors in the model. The coefficients thus quantify the contribution of each predictor to the outcome probability or risk estimation. More technically, a regression coefficient indicates the effect of a one-unit (or one-step in the case of categorical variables) increase in the level of the relevant predictor on the estimated outcome risk when other predictors in the model are kept constant. Another important parameter from a regression analysis in prediction modelling research is an estimate of the baseline probability or risk (or hazard)—the estimated risk (hazard) for an individual with all predictor values being zero. For logistic regression the baseline risk is indicated by the model's intercept. For Cox survival models, which have no intercept, the baseline event risk can be estimated separately.2 Accordingly, predicted probabilities for developing the event within a certain time period can be calculated for individuals by combining the intercept or estimated baseline hazard, the observed values of the predictors and the corresponding regression coefficients in mathematical functions that are specific to the statistical methods used to develop the model. Below we provide various examples, with a logistic model and a Cox model.
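
In compact form (a sketch of the standard formulas, with β0 the intercept, βj the regression coefficients, xj the observed predictor values and S0(t) the estimated baseline survival at time t; Cox linear predictors are often centred on the mean predictor values):

```latex
\text{Logistic model:}\quad \hat{p} \;=\; \frac{1}{1 + \exp\!\left\{-\left(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k\right)\right\}}

\text{Cox model:}\quad \hat{p}(t) \;=\; 1 - S_0(t)^{\exp\left(\beta_1 x_1 + \dots + \beta_k x_k\right)}
```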

Assessing the predictive performance

Discrimination and calibration are key aspects of the predictive performance of prediction models. Calibration is the agreement between the probability of developing the outcome of interest within a certain time period as estimated by the model and the observed outcome frequencies. It is ideally assessed graphically by plotting the observed outcome frequencies against the mean predicted outcome probabilities or risks, within subgroups of participants that are ranked by increasing estimated probability.46 62 The plot can be supplemented with formal statistical testing for goodness of fit. This is generally done by using the Hosmer and Lemeshow test suitable for logistic68 or survival69 models, and equivalents,70 71 although these tests often indicate good fit simply because they lack the statistical power to detect miscalibration.
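
A minimal sketch of such a calibration plot, grouping individuals into tenths of predicted risk and plotting observed outcome frequencies against mean predicted risks (function and argument names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt


def calibration_plot(y_observed, p_predicted, n_groups=10):
    """Observed outcome frequency versus mean predicted risk within groups of
    individuals ranked by predicted risk (illustrative sketch)."""
    df = pd.DataFrame({"y": y_observed, "p": p_predicted})
    df["group"] = pd.qcut(df["p"], q=n_groups, labels=False, duplicates="drop")
    grouped = df.groupby("group").agg(mean_pred=("p", "mean"), obs_freq=("y", "mean"))
    plt.plot(grouped["mean_pred"], grouped["obs_freq"], "o-", label="model")
    plt.plot([0, 1], [0, 1], "--", color="grey", label="perfect calibration")
    plt.xlabel("Mean predicted risk")
    plt.ylabel("Observed outcome frequency")
    plt.legend()
    plt.show()
```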

Discrimination is the ability of a model to distinguish individuals who experienced the outcome from those who remained event free, and can be estimated both for logistic models and survival models. Several statistics are available to summarise discrimination, though the c-index (equal to the area under the receiver operating characteristic curve for logistic models) seems the most widely used. Generalised versions of the c-index for survival analysis, allowing for censoring, have been developed.69 72–74 For a prognostic model, the c-index is the chance that given two individuals, one who will develop the event of interest and one who will remain event free, the prediction model will assign a higher probability of an event to the former.
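
The sketch below shows how the c-index might be computed for a binary outcome (where it equals the area under the ROC curve, via scikit-learn) and for censored time-to-event data (via the lifelines package, an assumed dependency); the data are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)

# Binary outcome: the c-index equals the area under the ROC curve.
y_binary = rng.integers(0, 2, 500)
predicted_risk = np.clip(0.2 * y_binary + rng.uniform(0, 0.8, 500), 0, 1)
print("c-index (logistic):", roc_auc_score(y_binary, predicted_risk))

# Time-to-event outcome: a censoring-aware generalisation of the c-index.
linear_predictor = rng.normal(size=500)
time = rng.exponential(scale=np.exp(-linear_predictor))  # higher risk -> earlier events
event = rng.integers(0, 2, 500)                          # 1 = event observed, 0 = censored
# concordance_index expects scores where higher means longer survival, hence the minus sign.
print("c-index (Cox):", concordance_index(time, -linear_predictor, event_observed=event))
```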

Internal validation

Prediction models can be expected to perform optimistically in the data sample from which they were developed compared with the performance found when tested in new but comparable individuals. This is simply because the model was fitted to optimally describe the development sample and therefore becomes less accurate when tested in new but similar individuals (overfitting). The potential for optimism in model performance increases as the number of outcome events in the development sample decreases and as the number of candidate predictors (relative to the number of events) increases. Furthermore, optimism is generally larger when predictor selection strategies are used.46 62 67 75

To estimate the potential for overfitting and optimism in model performance, internal validation techniques are advocated. Internal validation means that no other data than the study sample are being used. Although this is commonly done by randomly splitting the dataset into two subsets, a development sample (eg, two-thirds of the original dataset) and a validation sample, this approach is statistically inefficient (ie, it ‘wastes data’) because not all available data are used to produce the prediction model. Also, there is an issue of ‘replication instability’ in that different random splits give different results. Bootstrapping is therefore the preferred method for internal validation, certainly when the development sample is relatively small and/or a high number of candidate predictors is studied.38 46

In medical research, a study sample is supposed to be a random draw from a larger (theoretical) target or source population. When a new sample is randomly drawn, it will be similar but not identical, and the estimated predictor–outcome associations and model performance (eg, c-index) may differ due to this sampling effect. The larger the study sample, the more it reflects the source population, and the less the performance in the study sample deviates from the performance that would theoretically be found in the source population. Bootstrapping is a statistical method that aims to mimic this sampling process using only the data at hand, by sampling with replacement a study sample of the same size (to preserve the precision) from the original study sample in which the prediction model was developed. Drawing with replacement mimics this random component, making bootstrap samples similar but not identical to the original study sample. In each bootstrap sample (often 100 or 500 samples), the data are analysed as in the original study sample, repeating each step of the model development including any applied predictor selection strategies. This may yield a different model, with a corresponding c-index, from each bootstrap sample. Subsequently, each bootstrap model is applied to the original study sample (mimicking the source population), yielding a difference between the c-index in the bootstrap sample and the c-index in the original sample. The average of all these 'c-index differences' indicates the optimism in the apparent c-index of the prediction model that was initially developed in the original study sample.2 46 76

With bootstrapping, all data are thus used for model development, and it provides insight into the extent to which the developed model (in the original study sample) is overfitted and too optimistic. Moreover, bootstrapping techniques can account for the influences of all predictor selection steps taken in the analyses by repeating the entire selection process in every bootstrap sample. Finally, bootstrapping provides two key estimates: an estimate of the optimism in predictive performance of the developed model, with which the c-index can be adjusted (lowered) to better approximate the expected model performance in new samples; and a so-called shrinkage factor which can be used to adjust the estimated regression coefficients in the final model for overfitting.2 46 76 Thus, the bootstrap-adjusted performance (eg, c-index) better reflects what can be expected when the model is tested or applied in new individuals from the same theoretical source population. Obviously, the larger the development dataset, the smaller the optimism that bootstrapping will detect and correct. This is not a failing of the bootstrap; it is simply that model building is more reliable and less data driven in larger study samples. We emphasise that no internal validation method can be a substitute for external validation.41
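
A minimal sketch of this optimism correction for a logistic model and the c-index is given below (the formula, column names and number of bootstrap samples are illustrative; ideally the entire model-building strategy, including any predictor selection, would be repeated inside the loop):

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score


def optimism_corrected_cindex(data, formula, outcome, n_boot=200, seed=0):
    """Bootstrap estimate of the optimism in the apparent c-index and the
    optimism-corrected c-index (illustrative sketch)."""
    original = smf.logit(formula, data=data).fit(disp=0)
    apparent = roc_auc_score(data[outcome], original.predict(data))
    optimism = []
    for i in range(n_boot):
        boot = data.sample(n=len(data), replace=True, random_state=seed + i)
        model_b = smf.logit(formula, data=boot).fit(disp=0)           # redevelop the model
        c_boot = roc_auc_score(boot[outcome], model_b.predict(boot))  # in bootstrap sample
        c_orig = roc_auc_score(data[outcome], model_b.predict(data))  # in original sample
        optimism.append(c_boot - c_orig)
    return apparent, apparent - float(np.mean(optimism))
```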

Presentation of the model

The final prediction model should always be presented as the original regression model equation, that is, regression coefficients (including the intercept for a logistic model). Accordingly, future researchers and users can apply the model to new individuals to obtain predicted risk. Such a model can also be made available as an online calculator (see eg, http://ASSIGN-score.com) or as a nomogram.46

A model can also be presented as a simplified but approximate model or scoring rule when the original predictor weights (regression coefficients) are converted and rounded to numbers that are easy to add, which are then related to absolute outcome probabilities.77 How this approximation is done has implications for the impact of the model in practice (as discussed in the next paper of this series). Moreover, such rounding and simplification typically leads to loss of information conveyed by the predictors, and thus to a reduction in predictive performance (eg, c-index). Hence, it is recommended to also provide the c-index of the simplified model to enable readers to compare it with the original, untransformed model performance.
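
One common simplification, sketched below with entirely hypothetical coefficients, divides every regression coefficient by a chosen reference increment and rounds to the nearest integer number of points; as noted above, the c-index of the resulting score should be re-estimated and reported.

```python
# Hypothetical regression coefficients per unit (or per category) increase.
coefficients = {"age_per_10_years": 0.42, "current_smoker": 0.65,
                "sbp_per_20_mmHg": 0.23, "diabetes": 0.51}

# Express each coefficient relative to a reference increment and round to whole points.
reference = coefficients["age_per_10_years"]
points = {name: round(beta / reference) for name, beta in coefficients.items()}
print(points)  # rounding loses information, so the simplified score performs slightly worse
```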

Empirical example 1: development of the ADVANCE CVD prediction model

The ADVANCE study was a randomised trial of the efficacy of blood pressure and glucose control on macrovascular and microvascular events among high-risk individuals with type 2 diabetes.78 These data were subsequently used to develop a prediction model for the risk of developing major CVD in individuals without a history of the disease: 66% (7550) of the total study population.42 The model development included the following steps:

  • Cox proportional hazards (PH) analysis was used for model development, accounting for the observed follow-up time for all eligible participants, up to the 75th percentile (4.5 years) of the total study duration. This choice was made to avoid the likely influence on model estimates of the longer follow-up duration of a small number of participants. The model was then fitted over the first 4 years of follow-up, as all participants had been followed for at least this duration, such that the 4-year baseline event or survival probability could be accurately derived.

  • The authors a priori selected 26 candidate predictors, including randomised treatment allocation, on the basis of prior (clinical and literature) knowledge of their association with the outcome and their ease of availability in routine diabetes practice.

  • A total of 473 major cardiovascular events were recorded during the follow-up, yielding 18 events per candidate predictor.44

  • 5% of the 7550 participants had missing data. The authors excluded these participants because their characteristics, including outcome incidences, were largely similar to those of participants with completely observed data.

  • Because of their skewed distributions, urinary albumin/creatinine ratio, serum creatinine and triglycerides were transformed to approximate normality by taking logarithms. All other continuous variables were modelled as linear after testing for the assumption of linearity.

  • The multivariable model was fitted using backwards selection by eliminating candidate predictors one by one using the 5% significance level. Predictors eliminated were re-entered in the eventual final multivariable model to ensure that no omitted predictor significantly reduced the log likelihood (χ2) of the model.79

  • The authors estimated the c-index and assessed calibration using the modified Hosmer and Lemeshow test for survival data69 in the cohort that was used to build the model (apparent performance). The c-index was then adjusted for optimism using bootstrapping techniques (internal validation). The Hosmer and Lemeshow test was computed by comparing the estimated probability of CVD from the model to the observed probability of CVD from Kaplan–Meier estimates within subgroups of participants ranked by increasing estimated probabilities.69

Table 1 shows the final ADVANCE model. Positive regression coefficients indicate an increased risk of CVD. The apparent c-statistic was 0.702 (0.676–0.728); after bootstrapping this fell to 0.699 (0.695–0.702). The p value for the Hosmer and Lemeshow test was 0.76, indicating good agreement between observed and predicted probability overall and within subgroups of participants ranked by increasing predicted probabilities.

Table 1

Regression coefficients (95% CI) and SE for predictors in the final ADVANCE cardiovascular disease prediction model42

Assessing the added value of a new predictor or (bio)marker

Every new predictor, test or (bio)marker differs in predictive accuracy, invasiveness and cost of measurement. Accordingly, tests or markers, especially those whose collection requires more burdensome and costly measurement, should not be evaluated on their individual predictive abilities but rather on their incremental predictive value beyond established, and easy to obtain, predictors.24 25 80 Measures of discrimination such as the c-statistic are insensitive to (small) improvements in model performance when a new marker is added to a model that already includes important predictors.65 81–83 Prompted by this, there has been a recent trend towards new metrics which estimate the added value of predictors. These quantify the extent to which an extended model (with the addition of a subsequent predictor or marker) improves the classification of participants with and without the outcome compared with the basic model without that predictor.82–86 For example, the net reclassification improvement (NRI) does this by quantifying the number of individuals that are correctly reclassified into clinically meaningful higher or lower risk categories with the addition of a new predictor, using pre-specified risk groups.84 Correct reclassifications are shifts to a higher risk category in cases and shifts to a lower risk category in non-cases. Definition of these risk groups, however, is often arbitrary and differs across studies, which may compromise comparisons of NRIs from different studies. To circumvent this problem, a version of the NRI that does not require stratification of the population into risk groups may be used.87 Alternatively, the integrated discrimination improvement (IDI) may be useful. In contrast to the NRI, the IDI does not require subjectively predefined risk thresholds. The IDI is the estimated improvement in the average sensitivity of the basic model with the addition of the new predictor, minus the estimated decrease in the mean specificity, summarised over all possible risk thresholds.84
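
The sketch below implements the categorical NRI and the IDI directly from their definitions; the risk cut-offs are hypothetical and would in practice be replaced by clinically meaningful, pre-specified categories.

```python
import numpy as np


def categorical_nri(y, p_old, p_new, cutoffs=(0.05, 0.10)):
    """Net reclassification improvement for pre-specified risk categories
    (illustrative sketch; `cutoffs` are hypothetical risk thresholds)."""
    y, p_old, p_new = map(np.asarray, (y, p_old, p_new))
    cat_old, cat_new = np.digitize(p_old, cutoffs), np.digitize(p_new, cutoffs)
    up, down = cat_new > cat_old, cat_new < cat_old
    events, nonevents = y == 1, y == 0
    nri_events = up[events].mean() - down[events].mean()           # net correct shifts in cases
    nri_nonevents = down[nonevents].mean() - up[nonevents].mean()  # net correct shifts in non-cases
    return nri_events + nri_nonevents


def idi(y, p_old, p_new):
    """Integrated discrimination improvement: change in mean predicted risk among
    events minus the change among non-events (illustrative sketch)."""
    y, p_old, p_new = map(np.asarray, (y, p_old, p_new))
    events, nonevents = y == 1, y == 0
    return ((p_new[events].mean() - p_old[events].mean())
            - (p_new[nonevents].mean() - p_old[nonevents].mean()))
```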

Empirical example 2: added value of eGFR to the ADVANCE CVD prediction model

The ADVANCE investigators tested whether an extended model that additionally included the estimated glomerular filtration rate (eGFR) could improve predictions. They calculated the eGFR using the modification of diet in renal disease (MDRD) formula.88 The apparent c-statistic for this extended model (0.702) was similar to that of the basic model in table 1. They further compared the two models by assessing the NRI for three predefined risk categories (table 2).84 Of the 429 participants who developed a major CVD event during follow-up, the extended model reclassified only two participants correctly upward (improvement) and three other participants incorrectly downward (deterioration), giving a net loss in reclassification among the cases of 0.23%. Of the 6739 participants who remained event free during follow-up, the extended model reclassified 82 correctly downward (improvement) and 67 incorrectly upward (deterioration), giving a net gain in reclassification among the non-cases of 0.24%. The overall NRI was thus 0.01% (p=0.99), again indicating that the two models performed similarly, with no evidence that eGFR added prognostic value over and above the variables shown in table 1.

Table 2

Cardiovascular disease (CVD) risk classification comparing the ADVANCE CVD prediction model with an extended model which added estimated glomerular filtration rate

Concluding remarks

Modern medicine, in general, and cardiovascular medicine, in particular, increasingly rely upon diagnostic and prognostic prediction models to inform individuals and their healthcare professionals about the risks of having or developing a particular disease or outcome, and to guide decision making aimed at mitigating such risks. To be useful for these purposes, a prediction model must provide validated and accurate estimates of the risks, and the uptake of those estimates should improve individual management and, in turn, (relevant) individuals' outcomes and the cost-effectiveness of care. This is reflected in the major steps of prediction modelling research: prediction model development (including so-called internal validation); (external) validation of the prediction model; and analysis of the impact of using a prediction model on individual outcomes. This article has focused on the first step, and the second article of the series41 will address the other two steps. Model development should follow a rigorous methodology. Reporting of a newly developed prediction model should include enough detail on the actual development, and provide all parameters, notably all estimated regression coefficients with accompanying (im)precision estimates, to allow future researchers to comprehensively validate the model using their own participants, and practitioners to actually apply the model to their patients. New potentially predictive (bio)markers should be assessed on their added value to existing prediction models or predictors, rather than simply being tested on their predictive ability alone.


Footnotes

  • Linked article 301247.

  • KGMM and APK contributed equally.

  • Funding Karel GM Moons receives funding from the Netherlands Organisation for Scientific Research (project 9120.8004 and 918.10.615).

  • Competing interests None.

  • Provenance and peer review Commissioned; externally peer reviewed.
