Abstract
Introduction While there are guidelines for reporting on observational studies (eg, Strengthening the Reporting of Observational Studies in Epidemiology, Reporting of Studies Conducted Using Observational Routinely Collected Health Data Statement), estimation of causal effects from both observational data and randomised experiments (eg, A Guideline for Reporting Mediation Analyses of Randomised Trials and Observational Studies, Consolidated Standards of Reporting Trials, PATH) and on prediction modelling (eg, Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis), none is purposely made for deriving and validating models from observational data to predict counterfactuals for individuals on one or more possible interventions, on the basis of given (or inferred) causal structures. This paper describes methods and processes that will be used to develop a Reporting Guideline for Causal and Counterfactual Prediction Models (PRECOG).
Methods and analysis PRECOG will be developed following published guidance from the Enhancing the Quality and Transparency of Health Research (EQUATOR) network and will comprise five stages. Stage 1 will be meetings of a working group every other week with rotating external advisors (active until stage 5). Stage 2 will comprise a systematic review of literature on counterfactual prediction modelling for biomedical sciences (registered in Prospective Register of Systematic Reviews). In stage 3, a computer-based, real-time Delphi survey will be performed to consolidate the PRECOG checklist, involving experts in causal inference, epidemiology, statistics, machine learning, informatics and protocols/standards. Stage 4 will involve the write-up of the PRECOG guideline based on the results from the prior stages. Stage 5 will seek the peer-reviewed publication of the guideline, the scoping/systematic review and dissemination.
Ethics and dissemination The study will follow the principles of the Declaration of Helsinki. The study has been registered in EQUATOR and approved by the University of Florida’s Institutional Review Board (#202200495). Informed consent will be obtained from the working groups and the Delphi survey participants. The dissemination of PRECOG and its products will be done through journal publications, conferences, websites and social media.
- Health informatics
- Protocols & guidelines
- Information technology
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
There are no guidelines for the reporting of data-learnt prediction models that have the specific intent to calculate alternative scenarios (counterfactuals) and identify individualised effects of interventions.
Prediction of Counterfactuals Guideline (PRECOG) will fill a gap in reporting standards for counterfactual prediction modelling and will capitalise on the systematisation and quality of the Enhancing the Quality and Transparency of Health Research network.
PRECOG will be built on consensus among experts from diverse fields (clinical research, computer science, epidemiology, statistics) across multiple development stages.
Even with rigorous study design, execution and reporting standards, causal claims based on observational data analyses might still be wrong because of incorrect assumptions or unmeasured, hidden bias.
Introduction
The increasing availability of large electronic health record data has led to an explosion in the development of prediction models (based on both traditional statistics and machine learning) for diagnostic, prognostic and treatment optimisation purposes. Despite the availability of reporting guidelines, for example, ‘Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis’ (TRIPOD),1 the quality of many studies and their adherence to reporting standards are low, and the models’ operating capabilities are often misinterpreted, with possible misuse and harm at the individual and/or population level.2 3 One of the most common mistakes4 5 is to consider a prediction model readily usable for interventions on individuals, by changing certain variables with the intent to improve outcomes, that is, calculating alternative scenarios or so-called counterfactuals. Since prediction models are often learnt from observational data, there is no guarantee that the strongest predictors are causing the outcome of interest and are not confounded, mediated by other variables, or concomitant causes of it. While such bias is not a problem for mere prediction in similar populations (since variables are not being changed with the intent to modify risk), it becomes problematic in new, out-of-distribution populations (even when cross-validation performance is high)6 and when trying to optimise outcomes.7
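The distinction between prediction and intervention can be illustrated with a toy simulation. The sketch below is not drawn from any cited study: the data-generating process, the variable names and the coefficient values are all assumptions chosen for illustration. A predictor `x` is strongly associated with the outcome `y` only because both depend on an unmeasured variable `u`, so changing `x` leaves `y` untouched even though a regression of `y` on `x` has a slope of about 1.

```python
import random


def outcome(u, x, noise):
    """Structural equation for y (assumed): depends on u only, x is ignored."""
    return 2.0 * u + noise


def simulate(n=20000, seed=0):
    """Draw (u, x, noise) triples from a hypothetical confounded process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)        # unmeasured common cause
        x = u + rng.gauss(0, 1)    # predictor driven by u
        noise = rng.gauss(0, 1)    # outcome noise
        rows.append((u, x, noise))
    return rows


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


rows = simulate()
xs = [x for _, x, _ in rows]
ys = [outcome(u, x, e) for u, x, e in rows]

# Association: x predicts y well (theoretical slope is 1.0 in this setup).
predictive_slope = ols_slope(xs, ys)

# Intervention: add 1 to everyone's x and recompute y with the same u and noise.
ys_do = [outcome(u, x + 1.0, e) for u, x, e in rows]
interventional_effect = sum(b - a for a, b in zip(ys, ys_do)) / len(ys)

print(f"predictive slope: {predictive_slope:.2f}")
print(f"effect of intervening on x: {interventional_effect:.2f}")
```

In this constructed example a model using `x` would predict well in similar populations, yet acting on `x` would not change the outcome at all.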
Thus, formal causal assessment is needed when developing prediction models on observational data to be used for alternative scenarios and interventions, that is, counterfactual prediction models. The approaches from traditional statistics, computational science and econometrics, including the potential outcomes framework,8 do-calculus and directed acyclic graphs,9 are often focused on estimating a population-level causal effect for a single interventional query (treatment or exposure) but can be used to calculate individualised treatment effects and counterfactuals.10–15 Machine learning has also been employed for counterfactual prediction.16 17 Several off-the-shelf methodologies have been revisited, including deep learning18–20 and random forests.21
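For readers less familiar with the potential outcomes framework, the quantities mentioned above can be written compactly. The notation below is our own assumption for illustration (binary treatment T, covariates X, potential outcomes Y(0) and Y(1)) and is not taken from the cited works. The last line holds only under the usual identifying assumptions of consistency, conditional exchangeability given X, and positivity.

```latex
% Assumed notation: T in {0,1} is the treatment, X the covariates,
% Y(t) the potential outcome under treatment t, Y the observed outcome.
\begin{align}
\tau_i  &= Y_i(1) - Y_i(0)
  && \text{(individual treatment effect, never observed directly)} \\
\tau(x) &= \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]
  && \text{(conditional average treatment effect)} \\
\tau(x) &= \mathbb{E}\left[\, Y \mid T = 1, X = x \,\right]
         - \mathbb{E}\left[\, Y \mid T = 0, X = x \,\right]
  && \text{(identified from observed data under the assumptions above)}
\end{align}
```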
Given the rise in counterfactual prediction modelling studies, there is a need for common ground on model reporting, to improve overall quality (although adhering to a protocol might be a necessary, yet not sufficient, condition for study quality), and specifically the transparency and reproducibility of results.
In the ‘Enhancing the Quality and Transparency of Health Research’ (EQUATOR) network (https://www.equator-network.org/), there are guidelines specifically designed for reporting causal effects in randomised clinical trials (RCTs), for example, ‘Consolidated Standards of Reporting Trials’22 and ‘A Guideline for Reporting Mediation Analyses of Randomised Trials and Observational Studies’.23 Reporting guidelines for observational studies also mention causal effects inference, for example, ‘Strengthening the Reporting of Observational Studies in Epidemiology Using Mendelian Randomisation’,24 ‘Reporting of Studies Conducted Using Observational Routinely Collected Health Data Statement for Pharmacoepidemiology’25 and the ‘Instrumental Variable Methods in Comparative Safety and Effectiveness Research’.26 Outside of EQUATOR, the Patient-Centered Outcomes Research Institute (PCORI) (https://www.pcori.org/) provides ‘Standards for Causal Inference Methods in Analyses of Data from Observational and Experimental Studies in Patient-Centered Outcomes Research’.27 Also, there are guidelines for estimating causal effects in pragmatic randomised trials.28 Worth noting is the ‘predictive approaches to treatment effect heterogeneity’ (PATH) statement,29 which, although focused on RCTs, examines treatment effect heterogeneity by considering as effect modifier(s) either the risk or the covariates, with both strategies aimed at guiding treatment decisions. PATH provides guidance for specific multivariable regression configurations and warns against more ‘aggressive’ approaches (eg, machine learning models with many degrees of freedom) that could lead to overfitting. Overall, existing guidelines are not well suited to causal and counterfactual prediction modelling for observational biomedical data (or a mixture of RCT and observational data), although a number of them contain elements that are directly related.
Consequently, we aim to develop a new reporting guideline, which we tentatively name PRECOG, an acronym for ‘Prediction of Counterfactuals Guideline’. The primary focus of PRECOG is to provide guidance on how to report causal assumptions as well as evaluate derivation/validation of models (involving at least one observational data source) that provide predictions of individualised treatment/intervention effects in the form of potential outcomes. On the one hand, the development of these models can follow both risk-and-effect-modelling approaches as in PATH, but it is intended to be more general, allowing any functional form and data generation process. On the other hand, the validation standard of these models falls within the TRIPOD scope, but it also evaluates how suitable they are for optimisation (eg, treatment decision, risk reduction) in addition to diagnosis and prognosis, relying on the counterfactuals backed up by the causal claims. PRECOG is also expected to provide guidance on software implementation and interoperability. As a quality evaluation instrument, PRECOG can help researchers (and general readers, peer reviewers, journal editors) as well as policy-makers to carry out and critically appraise causal and counterfactual prediction modelling studies. We anticipate further expansion of the guideline for specific areas, for example, pharmaceutical interventions. The primary use cases of PRECOG are expected to fall within biomedical sciences, but the guideline could be applied to other fields such as psychology or economics.
Methods and analysis
PRECOG will be developed following published guidance from the EQUATOR network.30 We will develop the guideline in five stages, as shown in figure 1: (1) meeting of a working group every other week; (2) scoping/systematic review of causal and counterfactual prediction modelling studies; (3) reporting checklist draft and real-time Delphi exercise; (4) development of the final guideline and (5) peer-review, publication and dissemination. These stages are drawn from prior, successful development studies, in primis the protocol used for the making of TRIPOD-Artificial Intelligence (AI) and Prediction model Risk Of Bias ASsessment Tool (PROBAST)-AI.31 The expected timeline for stages 1–4 is 1 year, using 6–9 months for stages 1–2, and 3–6 months for stages 3–4.
Stage 1: working group setup and meetings
The core working group is composed of the coauthors of this protocol description, who have met every other week (30–45 min) since 13 September 2021 to discuss the development of the protocol itself and to prepare documentation for the institutional review board and registration with EQUATOR, and who will eventually carry out the PRECOG development after approvals and publication of the protocol description.
Then, the working group will be expanded with external advisors with expertise in biomedical informatics, (bio)statistics, causal inference, computer science, epidemiology, health economics, health outcome research, standards and related areas. Each member of the core working group will identify one or more suitable external advisors, who will be invited to participate in the meetings and prompted to suggest further advisors, likely reaching 10–15 experts in total. The list of advisors will also be used for stage 3 (real-time Delphi exercise). The expanded working group will make its best efforts to ensure diversity in career stage, geography, gender and race, as well as multicultural representation. The expanded working group will also meet every other week, and each meeting will ideally be composed of 3–7 people, rotating participants, with at least one external advisor present (otherwise the meeting will be rescheduled). The rotation and size limit of participants in a single meeting is built on our prior experience with qualitative research, specifically focus groups, where compact size and diversified expertise help to reach data saturation.32 33 The working group will work on: (1) review of existing EQUATOR/PCORI reporting guidelines related to prediction modelling and treatment effect estimation; (2) evaluation of published scoping reviews of counterfactual prediction modelling studies for biomedical sciences and development of a new systematic review; (3) drafting of the initial reporting checklist for the Delphi survey; (4) review of the survey and development of the final guideline; (5) manuscript writing and (6) submission of the products to peer-review, publication and dissemination.
Stage 2: literature review of counterfactual prediction modelling studies
The purpose of the literature review is twofold: (1) to build a knowledge base on study design, methodological approaches, use cases and reporting commonalities among causal inference and counterfactual prediction studies in biomedical sciences; and (2) to help the development of reporting items for PRECOG. A subset of the working group members will concentrate on the review. Lin et al34 published a scoping review on causal methods for predictions under hypothetical interventions, screening nearly 5000 papers and focusing on 13 key articles, including traditional statistical as well as machine learning modelling. Most works used marginal structural models and g-computation. The authors concluded that ‘techniques for validating causal prediction models are still in their infancy’. Based on the results from the scoping review, and expanding the search strategy and the article sources, the team is going to move forward with a systematic review. The review will provide counts on methodology, review and applied papers, but then will focus on works that include at least one observational data source and an application use case, examining the validation strategies in greater depth. The planned reporting statement of choice is the ‘Preferred Reporting Items for Systematic Reviews and Meta-Analyses’,35 and the working group will register the work in the ‘Prospective Register of Systematic Reviews’.36
As part of the review, we foresee discussing how to assess the potential risk of bias (which can lead to misuse and patient harm), and whether current tools such as ‘PROBAST’ are appropriate.37
Stage 3: real-time Delphi exercise
We will conduct a real-time Delphi survey38 to review and refine the items of the PRECOG reporting checklist. Participants will be identified initially through the professional network of the core working group and of the external advisors, and further via literature search (including but not limited to the existing scoping review and the planned systematic review), social media screening and snowballing by the active participants. As for the expanded working group composition, participants will be invited from diverse and multicultural backgrounds and different countries. Invitees will include academics at various career stages, researchers and investigators from non-profit and for-profit organisations, programme officers from national/federal funding agencies, entrepreneurs, healthcare professionals, journal editors, policy-makers, healthcare regulators and end-users of predictive models. The participant selection will be based on area expertise grouping (computer science, biostatistics, biomedical informatics, statistics, epidemiology, standards, causal inference, ethics), used to determine the sample size (discussed below). We chose a computer-based, real-time Delphi,38 since it offers some operational advantages with respect to conventional multi-round Delphi techniques, for example, reduced responder attrition.39 In brief, real-time Delphi is a ‘roundless’ exercise based on an online survey platform. Participants can access and modify their responses at any time during the survey time frame, and they can view the survey summaries calculated among all responders. In this way, participants can see if/how their opinion is unpopular and add further comments to support their cases.
The working group will develop an initial reporting checklist for PRECOG, based on the EQUATOR development guidance and existing related guidelines/statements. We anticipate that PRECOG will draw substantially from the reporting items of TRIPOD as well as the recommendations of PATH; however, we expect major differences rather than a simple merge. For instance, performance evaluation as recommended in TRIPOD should be modified to include specific metrics such as the Precision in Estimation of Heterogeneous Effects (PEHE),40 and emphasise out-of-distribution validation. Another important aspect is the causal assumptions. PATH relies on RCTs, where randomisation supports the strong ignorability of treatment assignments, while PRECOG models might be exclusively built on observational data (or a mixture of observational and RCT data) and a justification for causal claims will need to be provided.
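As an illustration of the kind of metric mentioned above, the following minimal sketch computes the precision in estimation of heterogeneous effects as the mean squared difference between true and estimated individual treatment effects, together with its square root, which is also commonly reported. The function names and the toy numbers are hypothetical. Because true individual effects are never observed in real data, this calculation is only possible on simulated or semi-synthetic benchmarks.

```python
import math


def pehe(true_effects, estimated_effects):
    """Mean squared error between true and estimated individual effects."""
    if not true_effects or len(true_effects) != len(estimated_effects):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (est - true) ** 2
        for true, est in zip(true_effects, estimated_effects)
    ]
    return sum(squared_errors) / len(squared_errors)


def root_pehe(true_effects, estimated_effects):
    """Square root of PEHE, on the same scale as the outcome."""
    return math.sqrt(pehe(true_effects, estimated_effects))


# Hypothetical simulated effects for three individuals.
true_tau = [1.0, 2.0, 3.0]
estimated_tau = [1.0, 2.0, 5.0]

print(pehe(true_tau, estimated_tau))
print(root_pehe(true_tau, estimated_tau))
```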
An anonymous online survey will be created where each checklist item can be evaluated in relation to its importance and relevance for the guideline, using a five-point Likert scale, and a free text box for comments. Also, at the end of the survey, another text box will allow more generic comments and propositions, for example, new items to be added to the checklist. When a participant consents to participate and completes the survey for the first time, they can view the summary of all responses to date and can access the survey again within the next 6 weeks. The survey will be closed after the required sample size is reached, or when a maximum of 6 weeks have passed since the last recorded first response.
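The closing rule described above can be expressed as a short check. This is only a sketch of one possible reading: the function name is hypothetical, the default target of 60 responders is the minimum sample size stated in this protocol, and treating ‘6 weeks have passed’ as ‘6 weeks or more’ is our assumption.

```python
from datetime import datetime, timedelta


def survey_is_closed(first_response_times, now,
                     required_n=60, window=timedelta(weeks=6)):
    """Return True if the survey should be closed at time `now`.

    first_response_times: one timestamp per participant, recorded when
    that participant completed the survey for the first time.
    """
    # Rule 1: the required sample size has been reached.
    if len(first_response_times) >= required_n:
        return True
    # No responses yet: there is no "last first response" to count from.
    if not first_response_times:
        return False
    # Rule 2: the window has elapsed since the most recent first response.
    return now - max(first_response_times) >= window


# Hypothetical timestamps for two participants.
start = datetime(2022, 1, 1)
firsts = [start, start + timedelta(days=10)]
print(survey_is_closed(firsts, now=start + timedelta(days=20)))
```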
There is no consensus on the sample size of a Delphi panel but a minimum number of 10–18 panel members per area of expertise has been recommended.41 We will aim to reach a minimum sample size of 60 considering the aforementioned background expertise areas, compiling a list of 80–100 potential participants for the recruitment. At the end of the Delphi survey, the expanded working group will review the results and consolidate the checklist through a consensus meeting. The workgroup will also decide on the consensus rule. In general, for items ranked on a five-point Likert scale, the consensus rule is 80%,42 but there can be differences in how adjacent items are grouped or weighted toward consensus.43 For instance, Naughton et al44 quantified the Likert points from 1 (most important) to 5 (least important), and defined consensus for items scoring a median of 2.5 or less overall, when at least 80% of responders gave 1–3 points. More recent works proposed entropy-based consensus.45
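To make the example rule concrete, the sketch below encodes the criterion attributed above to Naughton et al: with ratings from 1 (most important) to 5 (least important), an item reaches consensus when its median is 2.5 or less and at least 80% of responders gave 1–3 points. The function name and the example ratings are hypothetical, and the working group may adopt a different rule.

```python
from statistics import median


def reaches_consensus(ratings, max_median=2.5,
                      agree_points=(1, 2, 3), min_share=0.80):
    """Check one checklist item against a median-plus-share consensus rule."""
    if not ratings:
        return False
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    # Share of responders whose rating falls in the agreeing range.
    share = sum(r in agree_points for r in ratings) / len(ratings)
    return median(ratings) <= max_median and share >= min_share


# Hypothetical ratings from ten responders for two items.
item_a = [1, 1, 2, 2, 3, 3, 2, 1, 4, 5]   # median 2, 80% gave 1-3
item_b = [1, 1, 1, 4, 4, 5, 5, 3, 2, 2]   # median 2.5, 60% gave 1-3

print(reaches_consensus(item_a))
print(reaches_consensus(item_b))
```

Item B shows why both conditions matter: its median meets the threshold, but too few responders rated it in the 1–3 range.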
Stage 4: development of the guideline and related products
On finalisation of the reporting checklist from the Delphi exercise, the expanded working group will develop the full PRECOG guideline. The manuscript will be posted to a public preprint website, for example, bioRxiv or medRxiv, before submission to a peer-reviewed journal and possibly presented as an abstract/poster at major international conferences, for example, the annual conference of the American Medical Informatics Association or the Society for Epidemiologic Research. It is expected that the PRECOG initiative will produce at least the following papers:
Guideline development protocol (this work).
A systematic review of causal and counterfactual prediction models in biomedical sciences.
PRECOG guideline.
Stage 5: publication and dissemination plan
After being posted on preprint servers, the aforementioned manuscripts will be submitted to peer-reviewed international journals for final publication. The authors’ list will be determined based on effective individual contributions, following the ‘contributor roles taxonomy’ (CRediT) (https://casrai.org/credit/), and might include additional contributors other than the working group members and external advisors. The dissemination strategy will be discussed during the workgroup meetings. In addition to conferences and publications, it is likely that social media platforms such as Twitter will be leveraged to inform on the PRECOG availability and utility.
Patient and public involvement
This study does not include patients. However, the participants of the working groups will, by definition, be involved in the design of the Delphi survey, in its evaluation, and in the finalisation of the PRECOG guideline (including authorship in papers). The participants of the Delphi survey can not only evaluate items but also suggest new ones and re-evaluate the items while the survey is open.
Ethics statements
Patient consent for publication
Acknowledgments
We thank the TRIPOD coauthors Dr G Collins (U Oxford, UK) and Dr KG Moons (UMC Utrecht, NL) and Dr N Peek (U Manchester, UK) for expressing their interest to join the PRECOG working groups.
Footnotes
Contributors JX wrote and submitted the protocol description. YG performed an initial literature review on reporting standards. FW and HX performed an initial literature review on counterfactual prediction models. RL advised on protocol procedures and ethical review. JB and MP conceived the idea.
Funding This work has been in part supported by National Institutes of Health (NIH)-National Institute of Allergy and Infectious Diseases (NIAID) grants no. R01AI145552 and R01AI141810 (MP), by National Institute on Aging (NIA) grants no. R33AG062884-03 (RL and MP) and 5R21AG068717-02 (JB and YG), by National Cancer Institute (NCI) grants no. 5R01CA246418-02, 3R01CA246418-02S1, 1R21CA245858-01A1, 3R21CA245858-01A1S1, and 1R21CA253394-01A1 (JB and YG), by the Centers for Disease Control and Prevention (CDC) grant no. U18DP006512 (JB, YG and MP), and by a University of Florida Informatics Institute Seed grant.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.