Strengths and limitations of this study

The Cardiovascular Disease Population Risk Tool (CVDPoRT) will use data on major health behavioural risk factors from large population-based community health surveys individually linked to routinely collected health administrative data in Ontario, Canada, to develop and validate a population-based risk algorithm for CVD.

CVDPoRT will improve the ability to answer key policy questions with respect to the future burden of CVD in Canada, the contribution of major health behaviours to the population burden of CVD, the preventive benefit of achieving health behaviour goals and strategies to reduce inequities through improvements in health behaviours.

The analysis plan adheres to published recommendations for the development of valid risk prediction models to limit the risk of overfitting and improve the quality of predictions.

Although a rigorous approach will be used to develop the model, including internal and external validation, stronger forms of validation may be required: future validation studies should include application in different geographic locations, and fully independent validation by independent investigators using alternative measurement of these risk factors in different population settings.

The model development will focus on maximising predictive accuracy and as such, will not consider the causal and mediator effects of the predictive variables.
Introduction
Disease risk algorithms for the population setting
Numerous prognostic models have been developed to predict the risk of future disease for individual patients in clinical settings. Population-based prognostic models are less common, but are essential for population health planning and policy decision-making. Unlike clinical models, they are usually derived using population data and may utilise self-reported risk factors that do not require laboratory or clinical measurement. The Cardiovascular Disease Population Risk Tool (CVDPoRT) will use data on major health behavioural risk factors (smoking, diet, physical activity, and alcohol use) from a large population-based community health survey individually linked to routinely collected health administrative data in Ontario, Canada, to develop and validate a population-based risk algorithm for CVD. Once validated, CVDPoRT will improve the ability to answer key policy questions with respect to the future burden of CVD in Canada, the contribution of major health behaviours to the population burden of CVD, the preventive benefit of achieving health behaviour goals, and strategies to reduce inequities through improvements in health behaviours.
Countries with clinical guidelines for CVD generally recommend patient risk assessment and stratification using multivariable risk algorithms, such as the Framingham risk tool.1–3 Improving population health risk assessment has been identified as a priority in Canada.4,5 Clinical risk algorithms, such as Framingham, are challenging to adapt for population health planning because they require clinical measures such as blood pressure and lipid levels. Although it can be more challenging to develop prediction models that have acceptable discrimination and calibration without the use of clinical measures, previous studies suggest that approximately 50% of CVD may be related to health behaviours6,7; moreover, we have previously demonstrated that disease risks can be accurately assessed for population uses using only self-reported risk factors.8–10 There are several advantages to developing population-based prediction models without clinical measures. First, surveys that assess only self-reported risk factors are usually much larger than those that include clinical measures. Second, since population-based health surveys are now being conducted in over 100 countries, such algorithms have a broader scope of potential application. Third, population-based algorithms allow for estimation of population-level disease risk. Fourth, the inclusion of self-reported risk factors and health behaviours complements existing clinical risk algorithms whose focus is biophysical measures such as lipids and hypertension. Finally, self-reported risk factors are easily ascertained by individuals, which facilitates implementation of internet-based risk calculators in community settings.
It is recognised that self-reported risk factors may introduce greater measurement error than biophysical or clinical measures, and that this may adversely affect the performance of prognostic models. Whereas demographic characteristics such as education and some behaviours such as smoking are usually measured with little error, other risk factors such as diet, physical activity, body mass index (BMI) and especially alcohol consumption may have more substantial bias.11 These errors may be minimised through the use of a comprehensive set of sociodemographic and behavioural risk factors that are commonly ascertained in population health surveys, using reliable methods with well-established exposure questions and limited rates of missing data. Moreover, the effect of omitting biophysical measures (eg, measured blood pressure) may be minimised by including variables that are correlated with such measures (eg, blood pressure medication).
Methodological issues in prediction model research
Prediction models are more likely to be reliable and useful in practice when they are developed using a large, high-quality data set; based on a study protocol with a sound statistical analysis plan; and validated in independent data sets.12 Among thousands of clinical prediction rules published in the past decades, many have been shown to have serious methodological shortcomings.13 A review of 83 clinical prediction models in acute stroke, for example, found serious deficiencies in statistical methods in almost all of the studies; in addition, none of the studies had been adequately validated.14 A series of recent publications have called for substantial improvements in the design, conduct, analysis and reporting of prognostic studies.12,15–17 Several threats to validity have been identified, including inadequate sample sizes, data-driven or arbitrary categorisation of continuous predictors, inadequate statistical modelling of nonlinear relationships, inappropriate handling of missing data, and failure to check model assumptions. Statistical overfitting is a particular concern in the development of prognostic models; it results when a model is fitted with too many parameters given the amount of information in the data. In such circumstances, the predictive ability of the model will be overstated and it is likely to perform poorly in different settings.18 When overfitting is present, some of the associations in the model may be spurious, reflecting increased type I error. The use of tests of association for selecting predictor variables, data-driven categorisation or specification of the functional form of association with predictors, and stepwise variable selection procedures can all increase the risk of type I error.
Before any prognostic model might be adopted in practice, it is necessary to show that it provides valid predictions outside the specific context of the sample used to derive the model.19 A recent review of 71 published clinical prediction models in high-impact journals found that only one study included external validation during development, and two recalibrated algorithms after publication.13 Furthermore, none of the 71 algorithms examined calibration for target populations beyond arbitrary risk categories such as deciles of predicted risk. Most focused on discrimination, rather than calibration. This is likely a reflection of the emphasis on identifying high-risk patients during clinical decision-making20; for population uses, however, several authors have emphasised the importance of examining calibration in important target populations, particularly when these populations and exposures were not included during algorithm development.21–24 This reflects the intended application of population-based risk algorithms, which includes assessing resource allocation, equity issues, the impact of population-wide prevention strategies and disease burden for different levels of exposure.
Transparency in prediction model research
There are several benefits to publishing study protocols: it may improve study quality through peer review; it allows readers to compare what was originally intended with what was actually done, thus helping to prevent 'data dredging', post hoc revisions of study aims and selective reporting; it enables funders and researchers to see what studies are underway and hence reduces duplication of research effort; it enhances the credibility of the research by allowing others to replicate the study; and it allows easier identification of and access to details of the study. Peat et al25 have stressed the importance of predefining the key aspects of a prognostic study; yet, it seems that most prognostic studies are conducted without a study protocol, with analysis plans being developed during or after data collection. Although it is recognised that a prognosis research protocol cannot be a rigid blueprint and that it is neither possible nor desirable to prespecify all analyses,25 the development of CVDPoRT is especially amenable to prespecification because risk factors for CVD have been well studied and many have known relationships to CVD. Given the goal of generalising to other population-based settings, it is particularly important to avoid overfitting. We are presenting our study protocol to improve transparency and protect against bias. Our protocol adheres to a recommended checklist of items to include in protocols for prognostic studies.25
Objectives
The objective of this study is to develop and validate CVD risk prediction models for the population setting using self-reported risk factors, with a focus on major health behaviours. We will use the Ontario sample of the Canadian Community Health Survey (CCHS) individually linked to routinely collected data to ascertain incident CVD events. Separate models will be derived for men and women.
Methods and analysis
Outcomes
The primary outcome of interest is a major CVD event, ascertained using validated diagnostic codes and criteria as presented in table 1. Additional prediction models will be derived for secondary outcomes of interest, defined in table 2.
Design
CVDPoRT will be derived and validated using secondary data. The derivation cohort will be eligible respondents to the combined 2001, 2003 and 2005 Canadian Community Health Surveys (CCHS cycles 1.1, 2.1 and 3.1), conducted by Statistics Canada. The CCHS surveys use a multistage stratified cluster design that represents approximately 98% of the Canadian population aged 12 years and above, and attains an average response rate of 80.5%. The surveys are conducted through telephone and in-person interviews and all responses are self-reported. The details of the survey methodology have been previously published.26 All self-reported risk factors of interest will be obtained from the CCHS. To ascertain CVD events, the survey respondents will be individually linked to two population-based databases: hospitalisation records from the Canadian Institute for Health Information Discharge Abstract Database, and vital statistics. Secondary outcomes will also require linkage to the Ontario Health Insurance Plan database. Respondents will be followed until the earliest of: incident event, death (defined as a competing risk), loss to follow-up (defined as loss of healthcare eligibility), or end of study (31 December 2011 or most recent year available). The validation cohort will consist of respondents to the 2007 and 2009 surveys, similarly linked to ascertain outcomes. Owing to the known challenges of using survey weights in regression models,27 including difficulties in obtaining correct SE estimates and the complexity of modelling procedures and interpretation of results, no survey weights will be incorporated in the development of CVDPoRT.
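The censoring rule above reduces to taking the earliest of several candidate dates. A minimal sketch (our own helper; the variable names are illustrative, not fields from the linked databases):

```python
from datetime import date

def follow_up_end(event=None, death=None, loss_of_eligibility=None,
                  study_end=date(2011, 12, 31)):
    """Follow a respondent until the earliest of: incident event, death,
    loss of healthcare eligibility, or the end of study."""
    candidates = [d for d in (event, death, loss_of_eligibility, study_end)
                  if d is not None]
    return min(candidates)

# A respondent with no event who lost eligibility in mid-2009 is censored then.
print(follow_up_end(loss_of_eligibility=date(2009, 6, 1)))  # 2009-06-01
```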
Eligibility criteria
Respondents will be excluded if they were not eligible for Ontario's universal health insurance programme, were pregnant, selfreported a history of heart disease or stroke, or were younger than age 20 at the time of survey administration. If a respondent was included in more than one CCHS cycle, only their earliest survey response will be used. The same exclusion criteria will be applied to the respondents in the validation cohort.
Sample size
The derivation cohort consists of 77 251 respondents and 619 886 person-years of follow-up until 31 December 2011; the validation cohort will consist of approximately 50 000 respondents and 150 000 person-years of follow-up. The number of events until 31 December 2011 in the derivation cohort is 1131 for men and 1102 for women; in the validation cohort we expect approximately 250 events for men and 250 for women. Harrell18 describes sample size requirements for prediction models. For time-to-event outcomes, the number of participants experiencing the event must exceed 10 times the number of degrees of freedom, where the number of degrees of freedom includes the number of predictors screened for association with the outcome, all dummy variables, nonlinear terms and interactions. For CVDPoRT, the target number of total regression degrees of freedom is less than 110. The minimum sample size requirement for external validation studies is 100 events and 100 non-events.
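The events-per-degree-of-freedom rule is simple arithmetic; a sketch using the event counts quoted above (the helper function is ours, for illustration):

```python
def max_model_df(n_events, events_per_df=10):
    """Harrell's rule of thumb for time-to-event models: the number of
    events should be at least 10 times the regression degrees of freedom."""
    return n_events // events_per_df

# Derivation-cohort event counts to 31 December 2011:
print(max_model_df(1131))  # men: 113 df available, so a <110 df budget fits
print(max_model_df(1102))  # women: 110 df available
```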
Analysis plan
We closely followed guidelines by Harrell18 and Steyerberg28 in the development of our analysis plan, which was constructed after accessing the derivation data set, but prior to any model fitting or any descriptive analyses involving the exposure-outcome associations. Key considerations in our approach were fully prespecifying the predictor variables, use of flexible functions for continuous predictors, and preserving statistical properties by avoiding data-driven variable selection procedures. Analyses will be conducted using Harrell's Hmisc29 package of functions in R30 as well as SAS V.9.3.
Identification of predictors
Identification of predictor variables was based on reviewing the available data collected across all cycles in the CCHS together with subject-matter expertise, and was informed by our previous work in developing models for diabetes, stroke and life expectancy.9,10 Questionnaires used in the CCHS are available elsewhere.31 The following categories of predictors were considered: sociodemographic, health behaviours and morbidities. Some variables needed to be constructed from multiple items in the survey questionnaire. Variables with more than 20% missing values were excluded from consideration, as were variables with narrow distributions or insufficient variation. Obvious cases of redundancy (eg, alternative definitions of the same underlying behaviour) were ruled out. A formal check of multicollinearity was carried out using a variable clustering algorithm.18 A total of 22 predictor variables were finally identified: 7 sociodemographic, 11 behavioural and 3 disease risk factors, with 1 design variable. Education, rather than individual income, was selected as a predictor owing to several concerns associated with income, including lack of generalisability, measurement error, instability over time and substantial missingness. An indicator variable for immigration status, together with the fraction of life lived in Canada, was used to account for recent and non-recent immigrants. Indicator variables for smoking status were created to allow inclusion of smoking pack-years as a continuous predictor. The model will additionally include interactions between age and each of smoking, alcohol, diet, physical activity, BMI, diabetes and hypertension, as the effects of these risk factors on CVD are expected to vary with age. Detailed definitions and measurement of these variables are presented in table 3.
Data cleaning and coding of predictors
Data cleaning and coding will proceed without examining outcome-risk factor associations. Coding of variables will focus on minimising the loss of predictive information by avoiding categorisation. Continuous variables will be inspected using boxplots and descriptive statistics to identify values outside a plausible range. Values that are clearly erroneous will be corrected, where possible, or otherwise set to missing. Truncation to the 99.5th centile, or to where the data density ends, will be considered for continuous risk factors with highly skewed distributions (eg, smoking pack-years, diet, alcohol consumption, physical activity) based on inspection of histograms and boxplots. To avoid instability in the regression analyses, frequency distributions for categorical predictors will be examined and categories with small numbers of respondents will be combined.
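Truncation at an upper centile can be sketched as follows (an illustrative numpy helper of our own; the simulated pack-years distribution is hypothetical):

```python
import numpy as np

def truncate_upper(x, centile=99.5):
    """Cap a highly skewed continuous risk factor (eg, smoking pack-years)
    at an upper centile to limit the influence of extreme values."""
    cap = np.percentile(x, centile)
    return np.minimum(x, cap)

rng = np.random.default_rng(0)
pack_years = rng.exponential(scale=10, size=10_000)  # simulated, right-skewed
truncated = truncate_upper(pack_years)
```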
Missing data
Traditional complete-case analyses suffer from inefficiency, selection bias and other limitations.28 We will use multiple imputation to impute missing values on all predictors, using the 'aregImpute' function in the Hmisc library. This procedure simultaneously imputes missing values while determining optimal transformations among all imputation variables. Predictive mean matching is used to replace missing values with random draws of observed values from participants with the nearest predicted values. The imputation model will consist of the full list of predictor variables, along with time-to-event and censoring variables, as well as auxiliary variables (variables that are not predictors but may nevertheless be useful in generating imputed values, eg, income and self-perceived health). We will generate five multiple imputation data sets. The final model will be estimated separately for each completed data set and the results combined using the rules developed by Rubin and Schenker32 to account for imputation uncertainty.
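The core of predictive mean matching can be illustrated in a few lines. This is a simplified numpy sketch of the idea, not the Hmisc `aregImpute` implementation (which adds bootstrapping and flexible transformations of the imputation variables):

```python
import numpy as np

def pmm_impute(y, X, k=3, seed=0):
    """One predictive-mean-matching draw: regress y on X using complete
    cases, then replace each missing y with the observed value of a donor
    whose predicted mean is among the k nearest."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta
    y_imp = y.copy()
    y_obs, pred_obs = y[obs], pred[obs]
    for i in np.flatnonzero(~obs):
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        y_imp[i] = rng.choice(y_obs[donors])
    return y_imp
```

Because imputed values are random draws of observed values, they always remain within the plausible range of the data, one practical advantage of matching over drawing from a fitted distribution.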
Model specification
Using the approach described by Harrell,18,29 we will fit an initial main effects model that includes an initial degree of freedom allocation for each predictor. We will then decide how to allocate final numbers of degrees of freedom to individual predictors based on a partial test of association with the outcome. Decisions on initial degree of freedom allocations will be informed by the anticipated importance of each predictor and any known dose–response relationships with CVD (eg, known 'U'- or 'J'-shaped relationships for alcohol and BMI). Continuous predictors will be flexibly modelled using restricted cubic splines, that is, piecewise cubic functions that are smooth at the knots and restricted to be linear in the tails. The knots will be placed at fixed quantiles of the distribution: in particular, at the 5th, 27.5th, 50th, 72.5th and 95th centiles. Ordinal variables with few categories will be specified as either linear terms, or as categorical if the expected association is more complex than linear. Interactions will be restricted to linear terms. The initial model specification, presented in table 3, includes a total of 61 degrees of freedom (47 main, 14 interaction), compared with a possible maximum of 110. Partial association χ² statistics for each predictor variable, minus their degrees of freedom (to level the playing field among predictors with varying degrees of freedom), will be plotted in descending order. Variables with higher predictive potential will be allocated more degrees of freedom, whereas predictors with lower predictive potential will be modelled as simple linear terms or recoded by combining infrequent categories.
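A restricted cubic spline basis with knots at fixed centiles can be sketched as follows. This is our own numpy rendering of Harrell's truncated power basis, for illustration; the analysis itself will use the R Hmisc/rms functions:

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell's truncated power form):
    one linear column plus k-2 nonlinear columns, constrained to be
    linear beyond the boundary knots."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    norm = (t[-1] - t[0]) ** 2            # scaling as used by rms::rcs
    cols = [x]
    for j in range(len(t) - 2):
        term = (np.maximum(x - t[j], 0) ** 3
                - np.maximum(x - t[-2], 0) ** 3 * (t[-1] - t[j]) / (t[-1] - t[-2])
                + np.maximum(x - t[-1], 0) ** 3 * (t[-2] - t[j]) / (t[-1] - t[-2]))
        cols.append(term / norm)
    return np.column_stack(cols)

# Knots at the protocol's fixed quantiles; 5 knots give 4 regression df.
age = np.random.default_rng(1).normal(50, 15, 5000)   # simulated predictor
knots = np.percentile(age, [5, 27.5, 50, 72.5, 95])
basis = rcs_basis(age, knots)                         # shape (5000, 4)
```

The choice of five knots spends 4 degrees of freedom per continuous predictor, which is how flexible dose-response shapes fit within the overall 110 df budget.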
As described by Harrell,18,29 this process of model specification does not increase the type I error: predictors will be retained in the full model regardless of their strength of association, tests of nonlinearity will not be revealed to the analyst, and the analyst will likewise be blinded to the observed event rates per category when combining categories (which may include collapsing the most disparate categories).
Model estimation
The initial model will be estimated using competing-risks Cox proportional hazards regression, with death from a non-CVD cause considered a competing risk; alternative model specifications may need to be considered after assessing the validity of model assumptions. All continuous predictors will be centred about their means. A key assumption of the Cox model, that the effect of predictors is constant in time, will be assessed using plots of raw and smoothed scaled Schoenfeld residuals versus time for each predictor. To address serious violations of this assumption, interaction terms between the predictor and log-time will be considered. Influence will be assessed by plotting scaled dfbeta residuals for each covariate. Although the risk of overfitting will be minimal owing to prespecification of our model and the large sample size, we will nevertheless assess the need to adjust for overfitting. The degree of overfitting (shrinkage) in the model will be estimated using the heuristic shrinkage estimator (based on the log-likelihood ratio χ² statistic for the full model).33 If shrinkage is <0.90, adjustment for overfitting will be required, as described below.
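The heuristic shrinkage estimator reduces to simple arithmetic on the fitted model's likelihood-ratio statistic. A sketch with hypothetical inputs (the χ² value below is invented for illustration; the real value comes from the fitted Cox model):

```python
def heuristic_shrinkage(lr_chi2, model_df):
    """van Houwelingen-Le Cessie heuristic shrinkage: the expected factor
    by which the linear predictor should be scaled to correct overfitting."""
    return (lr_chi2 - model_df) / lr_chi2

# Hypothetical: 61 model df and a likelihood-ratio chi-square of 800.
s = heuristic_shrinkage(800.0, 61)   # 0.924: above 0.90, so no adjustment
needs_adjustment = s < 0.90
```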
Assessment of model performance
Steyerberg28 distinguishes between apparent, internally validated and externally validated model performance. 'Internally validated' performance corrects for optimism in the apparent performance to yield approximately unbiased estimates of future model performance. Performance in the derivation and validation cohorts will be assessed and reported using overall measures of predictive accuracy, discrimination (ability to differentiate between high-risk and low-risk individuals) and calibration (agreement between predicted and observed risk). All model performance measures will be calculated using the first of the multiply imputed data sets. Nagelkerke's R² and the Brier score will be calculated as overall measures of accuracy. Discrimination will be assessed using Harrell's overall concordance statistic, with 95% CIs estimated using bootstrap samples. Internally validated performance measures will be obtained using 200 bootstrap samples, using the procedure described by Steyerberg.28
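Harrell's concordance statistic for censored survival data can be computed from pairwise comparisons. A plain-Python O(n²) sketch of the definition, for illustration only (the analysis will use the Hmisc implementation):

```python
def harrells_c(times, events, risk_scores):
    """Harrell's concordance: among usable pairs (those where the earlier
    time is an observed event), the fraction in which the earlier failure
    has the higher predicted risk; ties in risk count one-half."""
    concordant = tied = usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is usable if subject i fails strictly before time j
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / usable
```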
Steyerberg28 and Cook21,22 suggest that calibration should receive more attention when evaluating prediction models, and that recalibration tests and calibration slopes should be assessed routinely. We will emphasise visualisation of model performance using plots, rather than formal statistical testing: significance of traditional Hosmer-Lemeshow goodness-of-fit tests, for example, may reflect large sample sizes rather than true miscalibration. Thus, we will create calibration plots at fixed time points by comparing mean predicted probabilities with Kaplan-Meier estimates of observed rates, stratified by intervals of predicted risk. The calibration slope will be estimated by including the linear predictor as a single term in the model fitted to the validation cohort. Deviation from a slope of one will be tested using a Wald or likelihood ratio test. The calibration slope reflects the combined effect of overfitting to the derivation data as well as true differences in the effects of predictors.
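A calibration plot amounts to comparing mean predicted risk with observed rates within intervals of predicted risk. A minimal binary-outcome sketch of our own (the actual plots will substitute Kaplan-Meier estimates at fixed time points for the simple observed proportions used here):

```python
import numpy as np

def calibration_table(pred, observed, n_bins=10):
    """Mean predicted risk vs observed event rate within equal-size
    intervals of predicted risk; each row feeds one point on the plot."""
    order = np.argsort(pred, kind="stable")
    groups = np.array_split(order, n_bins)
    return [(float(pred[g].mean()), float(observed[g].mean())) for g in groups]
```

For a well-calibrated model, the resulting points lie close to the diagonal.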
Subgroup validation will be implemented as a conceptually straightforward check of calibration. This entails comparing observed and predicted risks within predefined subgroups of importance to clinicians and policymakers, for example, those defined by age group, behavioural risk exposure category, health region, sociodemographic group, hypertension status and diabetes status. Explicit criteria for clinically or policy-relevant standards of calibration will be established, for example, <20% difference between observed and predicted estimates for categories with prevalence higher than 5%.
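The example criterion above can be encoded directly. This sketch hard-codes the illustrative thresholds from the text; the actual standards will be set with the advisory committee:

```python
def meets_calibration_standard(observed, predicted, prevalence,
                               max_rel_diff=0.20, min_prevalence=0.05):
    """Example policy standard: within a subgroup whose prevalence exceeds
    5%, the relative difference between observed and predicted risk should
    be under 20%. Returns None when the criterion is not applied."""
    if prevalence <= min_prevalence:
        return None
    return abs(observed - predicted) / observed <= max_rel_diff

# eg, observed 10% risk vs predicted 11% in a subgroup covering 30% of the
# population: a 10% relative difference, within the example standard.
print(meets_calibration_standard(0.10, 0.11, 0.30))  # True
```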
Estimation of the final model
Prespecification of predictors has advantages in limiting the risks of overfitting and spurious statistical significance, but may result in a final model that is overly complex and difficult to interpret. It may be possible to derive a more parsimonious model that retains most of the prognostic information, and that performs as well as or better than the full model, without increasing the type I error rate.18,34 We will use the step-down procedure described by Ambler et al34 to identify a more parsimonious model. This procedure involves deleting variables to a desired degree of accuracy based on their contribution to model R². We will compare the reduced and full models using internal bootstrap validation, with appropriate penalisation for the variable selection. The final model (either the full model or its approximation) will be selected based on comparing calibration slopes, calibration in subgroups of policy importance, and overall measures of predictive accuracy.
To maximise the duration of follow-up, the final regression coefficients will be estimated using the combined data from the derivation and validation cohorts, with outcome events updated to reflect the most recent years available. If relevant differences are found between the derivation and validation cohorts, a cohort-specific intercept and/or interaction term may be included in the final model; otherwise, it will maintain the same risk factors and form as the derivation model.
Model presentation
Results will be presented for the derivation, validation and combined cohorts. Given the anticipated complexity of the final regression model, the usual presentation of a regression model showing estimated HRs and 95% CIs is less meaningful. To allow interpretation of the estimated effect of each predictor, the model will be summarised using plots of the shape of the effect of each predictor, as well as Wald χ² statistics penalised for degrees of freedom. As predictions are of primary interest, presentation will take the form of a regression formula, which will serve as the basis for web-based implementation.
Secondary outcome analyses
Prediction models for secondary outcomes will be derived separately, through a new development process but maintaining the same risk factors and model specification as for the primary outcome. Sensitivity analyses will be carried out using less commonly used diagnostic codes (see table 2).
Analyses beyond initial model development
We also plan to validate the algorithms using the national sample of the individually linked CCHS 1.1, 2.1 and 3.1, when these data become available. We will conduct further analyses exploring the predictive ability of novel risk factors that were not previously included (eg, food insecurity), as well as risk factors that were not ascertained in all CCHS cycles (eg, active transportation, workplace stress, depression and anxiety, cholesterol therapy). Risk factors that can be ascertained through linkage of additional data sources and similar cohorts (eg, area-based measures of the built environment, air pollution, detailed dietary consumption, lipid levels, glucose levels, measured blood pressure) will additionally be explored. Diet will be examined using a previously developed diet quality measure (Perez measure10) that showed good predictive performance for both all-cause mortality and all-cause hospitalisation; it uses six diet exposure measures with weighting for carrot, potato and fruit juice consumption. These exploratory risk factors will not be included in CVDPoRT, but will be considered in future updates of the model.
Ethics and dissemination
A project advisory committee has been created to ensure that the risk algorithm development meets the needs of knowledge users. The committee has been involved from the beginning of the study and worked with the study team to rank candidate predictors for inclusion based on policy and scientific importance. It will advise on the identification of important target populations, and establish minimal policy-important differences for calibration studies. Results from CVDPoRT will be submitted for publication in peer-reviewed journals and presentation at scientific meetings. If appropriate for individual use, we will create a web-based CVD calculator. Although CVDPoRT emphasises population risk prediction, our experience has shown that individual calculators are an effective engagement and translation tool for both the general public and knowledge users.
Conclusion
To the best of our knowledge, CVDPoRT will be the first population-based risk prediction algorithm for CVD. Although a rigorous approach will be used to develop the model, including internal and external validation, stronger forms of validation may be required. Future validation studies should include application in different geographic locations, and fully independent validation by independent investigators using alternative measurement of these risk factors in different population settings.
References