Article Text

A case–control study on predicting population risk of suicide using health administrative data: a research protocol
  1. JianLi Wang1,
  2. Fatemeh Gholi Zadeh Kharrat2,
  3. Jean-François Pelletier3,
  4. Louis Rochette4,
  5. Eric Pelletier4,
  6. Pascale Lévesque4,
  7. Victoria Massamba4,
  8. Camille Brousseau-Paradis3,
  9. Mada Mohammed1,
  10. Geneviève Gariépy5,6,
  11. Christian Gagné2,
  12. Alain Lesage7
  1. 1Department of Community Health and Epidemiology, Dalhousie University, Halifax, Nova Scotia, Canada
  2. 2Department of Electrical Engineering and Computer Engineering, Laval University, Quebec, Quebec, Canada
  3. 3Department of Psychiatry, University of Montreal, Montreal, Québec, Canada
  4. 4Institut national de sante publique du Quebec (INSPQ), Quebec City, Quebec, Canada
  5. 5Public Health Agency of Canada, Ottawa, Ontario, Canada
  6. 6Department of Social and Preventive Medicine, University of Montreal, Montreal, Québec, Canada
  7. 7Institut universitaire en sante mentale de Montreal, Montreal, Québec, Canada
  1. Correspondence to Dr JianLi Wang;{at}


Introduction Suicide has a complex aetiology and is a result of the interaction among the risk and protective factors at the individual, healthcare system and population levels. Therefore, policy and decision makers and mental health service planners can play an important role in suicide prevention. Although a number of suicide risk predictive tools have been developed, these tools were designed to be used by clinicians for assessing individual risk of suicide. There have been no risk predictive models to be used by policy and decision makers for predicting population risk of suicide at the national, provincial and regional levels. This paper aimed to describe the rationale and methodology for developing risk predictive models for population risk of suicide.

Methods and analysis A case–control study design will be used to develop sex-specific risk predictive models for population risk of suicide, using statistical regression and machine learning techniques. Routinely collected health administrative data in Quebec, Canada, and community-level social deprivation and marginalisation data will be used. The developed models will be transformed into the models that can be readily used by policy and decision makers. Two rounds of qualitative interviews with end-users and other stakeholders were proposed to understand their views about the developed models and potential systematic, social and ethical issues for implementation; the first round of qualitative interviews has been completed. We included 9440 suicide cases (7234 males and 2206 females) and 661 780 controls for model development. Three hundred and forty-seven variables at individual, healthcare system and community levels have been identified and will be included in least absolute shrinkage and selection operator regression for feature selection.

Ethics and dissemination This study is approved by the Health Research Ethnics Committee of Dalhousie University, Canada. This study takes an integrated knowledge translation approach, involving knowledge users from the beginning of the process.

  • Suicide & self-harm

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Statistics from

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.


  • This study will use routinely collected health administrative data, which are readily accessible to policy and decision makers.

  • The candidate predictors include variables at individual, healthcare system and community levels, which reflect the complex aetiology of suicide.

  • The methodology of model development and validation needs to be improved.

  • Some individuals in the control group might have suicide behaviours, which could not be ascertained by health administrative data.

  • Important factors such as education, employment and income are not routinely collected by health administrative databases, which is a limitation of this study.


Suicide is a major international public health problem. Each year, over 4500 Canadians take their own life,1 and more than 700 000 people die because of suicide worldwide,2 imposing enormous impacts on families, communities and societies. As such, suicide prevention has been a top priority of many countries.

Suicide has a complex aetiology and is a result of the interaction among the risk and protective factors at the individual, healthcare system and population levels.3–10 Therefore, policy and decision makers and mental health service planners can play an important role in suicide prevention. To facilitate suicide prevention planning, mechanisms should be in place that enable policy and decision makers to make informed decisions and mobilise resources to high-risk populations at the right places, before tragic events occur. This vision requires us to shift the paradigm from predicting individual risk to predicting population risk of suicide. However, the existing suicide risk assessment/predictive tools are not suitable for predicting population risk. Most of the existing risk assessment/risk predictive tools for suicide were designed to be used by clinicians; they were not designed for policy and decision makers.11 Clinicians often use these tools to determine if individual patients are at high risk of suicide presently or in short term (eg, next week). On the other hand, policy and decision makers are more concerned about the rate of suicide at the community level (eg, health regions, provinces/states) in the medium or long term (eg, in the next 5 or 10 years), driven partly by budgetary decisions that are often made on a yearly basis. Clinicians and policy/decision makers may have different emphases on risk predictive tools as well. For clinicians, an ideal suicide risk predictive tool should have high discriminative power (eg, a large C statistics), high sensitivity, specificity and positive predictive value. For policy and decision makers, a tool with excellent calibration (ie, how closely the predicted risk agrees with actual risk in the population) is more useful. To facilitate policy development in suicide prevention at the population level, risk predictive models specifically designed for policy and decision makers are needed.

Ideally, risk predictive models for population risk of suicide are based on large data from the target population. For example, Gradus et al developed sex-specific machine learning (ML) algorithms for suicide using data from eight Danish national health and social registries which cover more than 90% of the Danish population.12 Kessler et al’s ML algorithms targeted US Army soldiers who were hospitalised.13 Accordingly, these risk predictive algorithms may potentially be used for forecasting the risk of suicide in Danish general population and in the US Army population, respectively. Furthermore, predictive models for population risk may use not only individual data, but also health system-level (eg, quality of mental healthcare, mental health budget) and community-level data (eg, unemployment rate and social deprivation levels in the community). For instance, Marks et al developed a predictive model for identifying counties at high risk of overdose mortality, which included county-level education, poverty rate, unemployment rate, overdose gravity and other county-level indicators, among the 3106 counties in the USA.14 Given the complex aetiology of suicide, predicting population risk of suicide may benefit greatly from the integration of data at the individual, health system and community levels.

We undertook a project to develop and validate sex-specific risk predictive models to be used by policy and decision makers to forecast population risk of suicide at the health region level, using routinely collected health administrative data, and to identify the barriers and facilitators to implementation and explore the ethical and privacy issues of the prediction program. In this manuscript, we aimed to describe the methodology of the project, to inform methodological discussions and suicide prevention strategies.


This project encompasses the components of quantitative and qualitative investigations and an integrated knowledge translation (IKT). IKT is a model of research co-production, whereby knowledge users are integrated throughout the research process and who can use the research recommendations in practice or policy.15 IKT approaches are used to improve the relevance and impact of research. The quantitative research involved developing and validating risk prediction models for suicide using advanced ML and visualisation methods. The qualitative research is to understand the potential implementation, social, ethics and legal issues associated with the risk prediction program. In line with IKT principles, we involved policy and decision makers at the provincial and national levels, and people with lived experience of suicidality from the beginning of the project. The methodology of each component is described below.

Model development and validation

Target population

The general population residing in the province of Quebec, Canada. The province had a population of over 8.6 million people in 2021, and about 95% of the population reported being able to conduct a conversation in French. In Quebec, health services are planned and delivered through 18 health regions, 22 integrated health and social services centres and 166 Centres locaux de santé Communautaire. Budgetary decisions are made at the levels of province and health regions/integrated health and social services centres.

Data sources

We will develop the prediction tools by linking the suicide database, the Ministry of Health and Social Services public financial reports (Contour financier-Publications du ministère de la Santé et des Services sociaux (, which include the five health administrative databases below, and the Canadian Urban Environmental Health Research (CANUE) data. The suicide database gathers individual-level data annually based on residents’ health insurance number from five administrative databases: the vital statistics death database, the physician claims database, the hospital discharge database, the Insured Person Registration File and the public drug plan. The data of these databases (eg, billing and service procedures codes, service dates) are routinely submitted by clinics and hospitals for billing and administration purposes; no self-reported data were collected from patients. These databases cover up to 98% of the population in Quebec and contain data for over 20 000 death by suicide cases that occurred since 1996. Death by suicide cases were those ascertained by Quebec’s Coroner office after investigation. The decision is registered in the Quebec vital statistics database. The latter is linked with other health administrative databases of the Quebec Integrated Chronic Disease Surveillance System (QICDSS) managed by the Quebec’s Public Health Agency.5 With the suicide database and other linkable Ministry financial databases, individual (eg, sex, age), programme (eg, hospitalisation, emergency department visits) and system (eg, mental health and addiction budgets) level indicators can be identified.5

CANUE is a Canadian consortium aiming to build a unique repository of standardised metrics of urban, suburban and rural characteristics, as well as the tools used to produce them ( The CANUE data contain indicators for unemployment, social deprivation, access to health services and built environment at the community level, and can be linked with health administrative data by postal codes. The CANUE is open and free for research projects. The data linkage was performed at the Quebec Institute of Public Health (INSPQ) where the suicide data are kept. Linking the databases provides an unprecedented sample size and the capability of examining individual, neighbourhood, programmatic and systemic indicators of population suicide risk.

Because this study used existing de-identified health administrative data, informed consent from individual patients was waived. This study was approved by the Research Ethics Board of Dalhousie University.

Study design

Because the base rate of suicide in the population is low, we proposed using a case–control study design to develop sex-specific suicide risk predictive algorithms, using both logistic regression modelling and ML techniques. We selected all death by suicide cases that occurred from 1 January 2002 to 31 December 2010.16 The control group was a 1% random sample of living individuals in each year between 1 January 2002 and 31 December 2010 from the Quebec physician claim database. Controls are not allowed to be selected more than once across years. None of those in the control group died of suicide during this period. The cases and controls were not matched to allow for maximum variability in predictors.


Individual, programmatic, systemic and community factors (see online supplemental appendix 1) that happened 5 years prior to the suicide events will be used as candidate predictors to develop the risk predictive algorithms. For example, we extracted the data about the diagnosis of major depression (an individual-level factor) in the past 6, 12, 24, 48 and 60 months, as five separate candidate predictors. Similarly, we extracted mental health and addiction budget of each health region (a systemic-level factor) in the past 5 years as candidate predictors. The QICDSS17 provided all the variables drawn from health administrative databases. It covers 98% of the Quebec’s population since 1996. The security and continuous quality and maintenance are the responsibility of the INSPQ. Information is for administrative (ie, age, hospital or outpatient contact dates) and clinician reporting (ie, diagnoses) purposes. Validation of QICDSS physical diagnoses has been achieved by chart reviews17 and by outcomes for QICDSS psychiatric diagnoses.18 19 The QICDSS has been exploited over the past decade by a network of INSPQ officers and academic researchers, many are coauthors of publications on the characteristics of patients receiving rare psychiatric interventions,20 and on personality disorders, schizophrenia and substance use disorders in relation to mortality, including suicide.21 22 The quality of the data is also reflected by the minimal missing data associated with the variables, which range from 0.87% and 4.12% of the variables in the databases.

The initial selection of candidate predictors is determined by content knowledge (ie, known relationships between suicide or suicide behaviours and individual and local area-level variables), feasibility of routine data collection, clinical utility and policy relevance through team meetings. Therefore, the predetermination of candidate predictors was a joint effort between the team members, collaborators, health policy and decision makers and other stakeholders, with the expertise of clinical psychiatry, psychiatric epidemiology, mental health services research, health administrative data, computer science and mental health policy.

For the objective of this study, we will use both statistical (eg, logistic regression modelling) and ML approaches to develop the risk prediction models so that we may compare which approach performs better in predicting population suicide risk and is more feasible to implement. ML can produce complex estimations by searching data for relevant pieces of information and their complex interactions. Therefore, ML is best suited to tackle the combined challenges of high-dimensional data analysis associated with risk prediction for suicide. Some predictors that may change over time (eg, diagnoses, medications, service use, etc) will be dummy-coded to create time-varying predictors (ie, intervals of 0–3, 0–6, 0–12, 0–24, 0–36, 0–48 and 0–60 months before the first day of the suicide month). Because we included all suicide cases and a sample of controls, the proportion of suicide in the sample is different from that in the general population. This is a limitation of case–control study design which produces a biased sample because the proportion of cases in the sample is not the same as the population of interest.23 24 One method for addressing this limitation when developing predictive models using case–control data is weighting.24–27 Therefore, in logistic regression modelling, sampling weights (inverse probability of being selected) were assigned to the controls, while the weight of 1 was assigned to the cases, to ensure the models are applicable to the whole population.

Model development: ML

ML is a part of artificial intelligence that aims to construct systems that automatically improve through experience using advanced statistical and probabilistic techniques. ML has provided significant benefits to a range of fields. Recent research has shown a range of advantages of ML that can assist in detecting, diagnosing, predicting suicide and treating mental health problems.28 29 ML methods are divided into categories, that is, supervised, semisupervised, unsupervised and reinforcement.

Imbalanced classes are a common problem in ML classification, where each class has a disproportionate ratio of observations. To predict the population risk of suicide, dataset will be imbalanced because of rare cases of suicide as compared with a control group. To address the imbalanced dataset, we will oversample the minority class. We will ‘artificially’ duplicate samples from the minority class to oversample the minority class to correct imbalanced datasets, even though doing so does not provide the model with any new data. In the literature, this method was known as the Synthetic Minority Over-sampling Technique. Then, we will develop supervised learning models such as logistic regression, random forest, XGBoost and multilayer perceptron with an optimised model architecture. These models’ predictive capacity will be assessed by generating the receiver operating characteristic curves calculating its area under the curve and various operating characteristics, including sensitivity, specificity and positive predictive value for a variety of thresholds.

Interpretability is essential when we deal with healthcare data. It is significant because it is necessary to understand the casualty of learnt representations for decision support also helps to assess whether the model is considering the right features while making a specific prediction. Feature-based model explainability technique, such as Shapley Additive Explanations (SHAP), was derived from game theory; each player decides to contribute to a coalition of players to produce a total value that will be superior to the sum of their individual values. SHAP relies on the Shapley value of both local and global explanations. Shapley’s values are model-agnostic, and the marginal contribution of each feature can be calculated by using the input data and the predictions.30 31 SHAP will use the global explanation of how much the input features contribute to a model’s output.

Model development: logistic regression

As the first step of model development, we will include all preselected variables in penalised least absolute shrinkage and selection operator (LASSO) regression. The LASSO penalisation factor selects important predictors by shrinking coefficients for weaker predictors toward zero, excluding predictors with estimated zero coefficients from the final sparse prediction model. We will perform a correlation analysis among variables selected by the LASSO regression, and identify variables that are strongly correlated (eg, γ ≥0.60). Correlated variables will be discussed by team members, and the variables that have better policy implication and clinical utility will be kept and become the candidate predictors for model development.

We will use logistic regression to develop the sex-specific statistical models. After LASSO, there may still be a large number of candidate predictors. Backward selection method will be used to eliminate unpredictive variables and to identify the model with the best calibration and discrimination. The decisions of model selection will be initially based on the changes in the values of Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC).32 Since BIC penalises for the complexity of the model more than AIC, selection with BIC will generally lead to smaller models than selection with AIC.32 Once a model is developed, prediction accuracy will be assessed by the discrimination and calibration of the model. Discrimination is the ability of a prediction model to separate those who experienced the outcome events from those who did not. We will quantify this by calculating the C statistic, analogous to the area under a receiver operating characteristic curve. Calibration measures how closely predicted outcomes agree with actual outcomes. For this, we will use D'Agostino’s version of the Hosmer-Lemeshow Χ2 statistic. Discrimination and calibration compete with each other. Given that the program will be used to forecast population risk of suicide, we will prioritise calibration over discrimination. Stakeholders from different perspectives and scientific backgrounds will meet to determine the content and performance of the risk prediction models developed by statistical and ML techniques, the appropriate formats of data visualisation that are acceptable to policy and decision makers, and the feasibility of implementation, which will in turn inform the revision of the models.

The second step of the model development is to estimate the synthetic rates, consisting of two stages. First, for each predictor, the proportions of individuals within each category of that predictor in the initial modelling will be computed, separately by regions. For instance, if hospitalisation due to suicide attempt in the past 5 years is a predictor in the model, the proportion of individuals with this attribute in a specific health region is calculated. If age is a continuous variable in the model, the mean age of the population in a health region is estimated. A syntax program will then be prepared to apply the regression coefficients to the corresponding proportions and means in the dataset, and to calculate the logit estimates for each of health regions. The resulting logit values for each of the health region will then be converted into probabilities, giving the estimated risk of suicide in the health region. The region’s population counts from Statistics Canada Census data or the provincial health administrative database multiplied by the estimated risk will yield the estimated number of suicide in this health region.

The fitted logistic regression model described above estimates the proportion of suicide in the population at a given moment of time as a function of its risk factors in the past. This model is fundamentally aetiological, where the natural reference point is the moment of the outcome’s occurrence, corresponding to the zero time on the aetiological time scale. However, assessment of population risk of suicide over a particular span of time in the future involves a prognostic outlook, where the natural reference point is the time of prognostication, corresponding to the zero time on the prognostic time scale. Predictive models for individual risk are often developed using a cohort/closed study population and express the risk of future occurrence of the outcome as a function of current risk factors, and involve consideration of the values of the risk factors at issue at the prognostic time zero only. On the other hand, population risk models are applied in the context of a dynamic/open population and the estimated risk is a function of risk factors not only at the prognostic time zero but also throughout the time span at issue. For example, the risk of suicide in the next 5 years in a health region may not only depend on the proportions of people with major depression and of hospitalisation due to suicide attempt in the past, but also on whether there will be a reduction or increase in these parameters over the next 5 years, if so in which year. Thus, the population risk of suicide may be projected using the developed model to each future year over a predefined time interval. The cumulative incidence of suicide (CI0 to t) from time T=0 to T=t can be estimated as a function of time-specific and profile-specific risk operating over that time interval27:

Embedded Image

The estimated cumulative risk represents the estimated risk of suicide of a health region over the time period at issue conditionally on the health region’s risk profile.


For model validation, we will use the suicide data from 1 January 2011 to 31 December 2019. We will first calculate the yearly, 5-year and 10-year incidence of suicide death at the provincial and health regional levels in males and females (ie, observed risk). We will apply the developed synthetic models in the validation data to estimate the yearly, 5-year and 10-year incidence of suicide death at the provincial and health regional levels in males and females (ie, predicted risk). We will visually compare and calculate the differences between the predicted and observed risks; smaller differences indicate better calibration with the data and model accuracy. We will use four indicators for assessing model performance: mean average error (MAE), root mean square error (RMSE), Spearman’s r and proportion of correct identification of high-risk regions.14 The MAE is the average magnitude of the difference between the predicted and observed suicide death rate for each health region. The RMSE is the square root of the average magnitude of the difference squared, therefore is similar to MAE but penalises prediction errors with greater magnitude. More accurate predictions will result in smaller MAE and RMSE. Spearman’s r compares the predicted ranking of health regions by suicide death rate compared with the actual observed rankings; results closer to 1 indicate that the model was more effective at rank-ordering regions based on suicide death rate. To assess the extent to which high-risk regions are correctly identified, we will first disaggregate the predicted and observed suicide rates into quartile groups and categorised all health regions into their corresponding quartiles for both predicted and observed suicide rates. The proportion of health regions observed in the top quartile of observed suicide death rates that were rightly predicted to be in the top quartile will be calculated.

Qualitative study

The objective of the qualitative study is to investigate the end-users’ views about predicting population risk of suicide, and the potential social, legal, ethical, and privacy issues and mitigation strategies for implementing such a predictive system. Using snowballing techniques, we have invited policy and decision makers at the federal and provincial levels, mental health professionals, individuals who have extensive experience in working with policy and decision makers and who have expertise in suicide prevention, social and health policy, as well as health administrative data, people with lived experience and advocates for families bereaved by suicide. The qualitative study consists of two rounds of interviews. The first round of interviews was carried out after the general team meeting held in July 2021, at which the study design was finalised. The second round of interviews will be organised once the predictive models are developed. The first round of interviews was held through zoom meetings, and followed a series of semistructured interview questions related to the objectives (see online supplemental file 1). Qualitative data collected during the focus groups and qualitative interviews are audio recorded, transcribed and analysed with the support of QDA Miner (Provalis).33 The second round of interviews will be conducted once the prototype models are developed and presented at the second general team meeting which is to be held in late 2022. We will perform an inductive thematic analysis of the focus group and individual interview material, which will be fed by answers to the open questions regarding potential (1) perceptions about the developed prediction models, (2) social issues, (3) legal issues, (4) ethical and privacy issues, and (5) mitigation strategies for implementing such a system. Transcripts will be coded in order to demarcate segments within each of them. We will look for words or short phrases that demonstrate how the associated data segments inform our research objectives. Detailed results from the qualitative analysis of this material will be presented in a separate paper.

Patient and public involvement

Engagement with relevant stakeholders (eg, policy/decision makers and people with lived experience) through IKT is critical for developing equitable risk predictive algorithms and for maximising the potential for future implementation. For this project, we have identified and engaged policy/decision makers from the Public Health Agency of Canada and from the INSPQ, as well as eight people with lived experience. The representatives of INSPQ (EP, PL, VM, LR) were involved in study conceptualisation and grant application. PL has been facilitating data extraction and participated in the biweekly team meetings. As described above, we have engaged people with lived experience through the qualitative interviews. The next round of qualitative interviews will be held after the prototype of the risk predictive models is developed to have a better understanding about privacy, ethics and implementation issues.

Ethics and dissemination

This study will use routinely collected health administrative data. The analysis of secondary de-identified data at the INSPQ where the data are kept will not incur physical and psychological harms. The results of the study will be vetted by analysts at the INSPQ to ensure no privacy and confidentiality will be breached. The data used for this study will be kept at INSPQ for 15 years. The results will be presented in peer-reviewed journals, at academic conferences and shared with knowledge users who were engaged from the beginning.

Through this study, we aimed to develop risk prediction models to be used by policy and decision makers to forecast population risk of suicide at the provincial and health region levels, using routinely collected health administrative data and other publicly available area-level data. For example, policy and decision makers may use the models to project the proportion and number of suicide deaths in specific health regions/communities over the next 5 years, and decide how resources and community-level interventions may be mobilised to the high-risk regions/communities. Furthermore, the models can inform policy and decision makers about the potential impacts of these community-level interventions on suicide prevention. The potential utility of such predictive tools has been attested by the active involvement by the policy and decision makers at the federal and provincial levels and people with lived experience. Nevertheless, predicting population risk of suicide is new and has not been well studied. There are a number of methodological and implementation challenges to be addressed.

Routinely collected health administrative data and population health survey data represent a unique opportunity for population health projection because it covers a majority of the general population in catchment areas, and the data can be readily accessed by policy and decision makers. Many risk predictive models have been developed for physical and mental health problems in the general population. For example, individual data from population health surveys and health administrative databases have been used to develop risk predictive models for diabetes,34 heart disease35 and major depression.36 37 These models may be used to identify high-risk individuals in the community; they can also be used to forecast the population risk in the future. However, few models have integrated individual, healthcare system and community-level predictors in the same model. In this study, we proposed including data from these different levels in model development, and converting the models into synthetic estimation models. There may be different approaches for integrating data from different levels for population risk prediction. Future studies are needed to explore the best method for data integration.

The performance of a risk predictive model is commonly assessed by indicators of model discrimination and calibration.38 Whereas model discrimination is critical for individual risk predictive models, policy and decision makers’ focus is on the whole population rather than individuals. Therefore, model calibration plays a more important role in the performance of a population risk model. We proposed four indicators for assessing model performance. However, it is not clear how much error (the difference between predicted and observed risks) policy and decision makers may tolerate for population risk prediction, how they perceive the importance of model discrimination and whether other indicators exist for assessing population risk prediction models. We will explore these aspects through our qualitative study, and also encourage others to consider these in future studies. Similarly, we welcome discussions and debates about the methods for validating population risk predictive models. An individual risk predictive model is often developed using longitudinal cohort/closed population data and validated in a different but related cohort/closed population. This poses challenges for population risk predictive models because the population in a community/health region is open and dynamic. Appropriate methods for model validation and acceptability need to be developed and agreed by the research community and policy and decision makers.

This study relied on routinely collected health administrative data for model development and validation, rather than collecting primary data. Therefore, we have little information about suicide behaviours among the individuals in the control group, which are strongly associated with suicide deaths. In the model development, we included hospitalisation and emergency department visits due to suicide attempt, which may reduce the bias related to the lack of information about suicide behaviours. Nevertheless, this is a limitation of routinely collected health administrative data.

Despite the challenges for developing population risk predictive model for suicide, research is urgently needed to address this important population health issue. This study represents one of the early steps in building such risk predictive models and methodology development, as part of the collective efforts for moving the field forward.

Ethics statements

Patient consent for publication


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Contributors JLW drafted the manuscript. JLW, FGZK, J-FP, LR, EP, PL, GG, CG and AL were involved in study design, conceptualisation and funding application. JLW, FGZK, J-FP, LR, EP, PL, VM, CB-P, MM, GG, CG and AL were involved in manuscript review, discussion, revision and final approval.

  • Funding This study is supported by a New Frontiers for Research Funds grant (2019-00471) from Tri-Agency Institutional Programs Secretariat, Government of Canada and by a Tier I Canada Research Chair award to JLW.

  • Disclaimer The funders play no role in the design and operation of this study.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.