Objectives Patients with severe spontaneous intracranial haemorrhages, managed in intensive care units, face ethical issues regarding the difficulty of anticipating their recovery. Prognostic tools help clinicians in counselling patients and relatives and guide therapeutic decisions. We aimed to methodologically assess prognostic tools for functional outcomes in severe spontaneous intracranial haemorrhages.
Data sources Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses recommendations, we conducted a systematic review querying the Medline, Embase, Web of Science and Cochrane databases in January 2020.
Study selection We included development or validation of multivariate prognostic models for severe intracerebral or subarachnoid haemorrhage.
Data extraction We evaluated the articles following the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies and Transparent Reporting of multivariable prediction model for Individual Prognosis Or Diagnosis statements to assess the tools’ methodological reporting.
Results Of the 6149 references retrieved, we identified 85 eligible articles. We discarded 43 articles due to the absence of prognostic performance or predictor selection. Among the 42 articles included, 22 did not validate models, 6 developed and validated models and 14 only externally validated models. When adding 11 articles comparing developed models to existing ones, 25 articles externally validated models. We identified methodological pitfalls, notably the lack of adequate validations or insufficient performance levels. We finally retained three scores predicting mortality and unfavourable outcomes: the IntraCerebral Haemorrhages (ICH) score and the max-ICH score for intracerebral haemorrhages, and the SubArachnoid Haemorrhage International Trialists score for subarachnoid haemorrhages.
Conclusions Although prognostic studies on intracranial haemorrhages abound in the literature, they lack methodological robustness or show incomplete reporting. Rather than developing new scores, future authors should focus on externally validating and updating existing scores with large and recent cohorts.
- adult intensive & critical care
- statistics & research methods
Data availability statement
Data are available upon reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
This is the first systematic review of the methodological quality of prognostic tools for severe spontaneous intracranial haemorrhages managed in intensive care units.
A robust search strategy with no language restriction was performed, leading to a high number of eligible articles.
This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement, and we evaluated the articles following the Transparent Reporting of multivariable prediction model for Individual Prognosis Or Diagnosis statement to assess the tools’ methodological reporting and pitfalls.
This systematic review concerns two types of lesions, intracerebral haemorrhages and subarachnoid haemorrhages, that present different pathophysiologies and clinical courses but similar long-term consequences, leading us to suspect shared methodological issues.
We were not able to perform a meta-analysis due to the heterogeneity in the included models.
Patients with severe spontaneous intracranial haemorrhages, managed in intensive care units (ICUs), are at high risk of developing complications such as rebleeding or cerebral ischaemia,1 2 leading to high morbidity and mortality. Intracerebral haemorrhages (ICH) have a mortality rate of 40% at 1 month,3 while subarachnoid haemorrhages (SAH) have a mortality rate of 25% at 10 years.4 Survivors have a high rate of vegetative state or severe disabilities.5 These figures highlight the initial issues specific to severe strokes and the challenge physicians and surrogates face in deciding to continue invasive care.6 7 Indeed, the question arises as to whether advanced resuscitation is justified when the future appears unfavourable.8 When considering a limitation of care, the essential issue is to prevent inaccurate self-fulfilling prophecies by predicting outcomes reliably.9 In such settings, an individual patient’s prognosis may be difficult to assess because of the multiplicity of risk factors involved in the evolution of severe intracranial haemorrhages. Multivariable prognostic scores could assist clinicians in counselling patients and relatives and guide therapeutic decisions.
Previous reviews of prognostic tools,10–14 popular in the field of neurocritical care, have not focused on injuries managed in ICUs, for whom the issue of advanced care pursuits is a concern. Indeed, scores are reliable when validated in the population of interest. They also did not address the methodological quality of the selected articles. The PROgnosis RESearch Strategy (PROGRESS) group recently proposed a framework for prognosis concerns15 16 that led to the Transparent Reporting of multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement.17 These recommendations efficiently summarised the process for developing and validating a prognostic scoring system.
The objective of our systematic review was to assess the methodology of existing prognostic tools of functional outcomes in patients with severe spontaneous intracranial haemorrhage managed in ICUs. We chose to conduct this systematic review for the two types of lesions (ICH and SAH). While their pathophysiologies and clinical courses are different, the consequences for long-term functional outcomes are similar. The questions that arise at the beginning of the ICU stay about patients’ future and the complex ethical decisions are similar. While prognostic models may differ, the way to develop them should follow a similar modelling process. We suspected that studies presenting prognostic tools share the same methodological issues.
Materials and methods
This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement (online supplemental table S1).18 We searched Medline, Embase, Web of Science and the Cochrane databases on 7 December 2017 and updated on 14 January 2020, without date restriction. We used a query based on Medical Subject Heading terms and keywords. Online supplemental file S2 outlines the detailed search strategy.
We included all-language studies focusing on adults with severe spontaneous intracranial haemorrhage (ICH or SAH) managed in ICU, or specified explicitly as ‘severe’ or ‘high grade’ injury. We did not include criteria on the location, the cause of the haemorrhage or the type of cases (primary or secondary haemorrhage). We did not include paediatric studies or studies uniquely concerning traumatic injuries. We searched for the development and/or validation of prognostic models, predicting outcomes using variables collected before or at the beginning of their ICU stay. The targeted outcomes were mortality, functional outcomes or quality-of-life-related outcomes from ICU-discharge or hospital-discharge through to long-term outcomes. Our non-inclusion criteria were reviews or meta-analyses, full texts not found or conference abstracts, models developed without predictor selection, univariate models or the lack of reported prognostic performance. One reviewer (JS-P) screened references by title and abstract. The full eligible texts were assessed independently by four pairs of reviewers (YF–ML, FF–RC, DF–LB-C and JS-P–ED) and discussions resolved any discrepancies.
We predefined a standardised form for data extraction and evaluation of the risk of bias (online supplemental tables S3 and S4). For each eligible article, we collected the author’s name, year and journal, data source and study design, inclusion and exclusion criteria, sample size, population characteristics, predicted outcomes (mortality, functional outcomes and quality-of-life), prediction time (ie, the time when one calculates the prediction), horizon time (ie, the end of the prediction time window), predictive tools, development details (such as variables of the scoring systems), internal validation details, external validation details, missing data information and open comments regarding bias and limitations.
Articles and prognostic tools selection based on quality assessment
To include the articles, we followed the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS)19 and the TRIPOD statements.20 Specifically, they recommend developing a score from a learning sample set and validating the prognostic performance from an independent sample (internal and/or external validation). This step avoids reporting the prognostic capacities on the training sample only, because performance estimated without internal or external validation is overestimated.21 The articles reporting model development without any validation were thus not retained. They also recommend having a sufficient sample size and a sufficient number of events (known as the effective sample size). We considered at least 250 patients, with at least 50 events and 50 non-events, as sufficient. The modelling strategy must also consider enough events per predictor, usually at least 10, to avoid overfitting.22 23 We did not include articles that did not follow these recommendations.
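The sample-size criteria above can be sketched as a simple check. This is a minimal illustration only, not the form used in the review: the `StudySample` class, the function name and the example counts are hypothetical, and taking the rarer outcome class as the effective sample size is an assumption based on common practice in prediction modelling.

```python
from dataclasses import dataclass


@dataclass
class StudySample:
    """Counts reported by a hypothetical development or validation study."""
    n_patients: int
    n_events: int
    n_candidate_predictors: int


def sample_size_flags(s: StudySample,
                      min_patients: int = 250,
                      min_events: int = 50,
                      min_epp: float = 10.0) -> dict:
    """Return the pass/fail status of each sample-size criterion."""
    n_non_events = s.n_patients - s.n_events
    # Assumption: the effective sample size is the rarer of the two outcome classes.
    effective = min(s.n_events, n_non_events)
    # Events per predictor (EPP); requires at least one candidate predictor.
    epp = effective / s.n_candidate_predictors
    return {
        "enough_patients": s.n_patients >= min_patients,
        "enough_events": effective >= min_events,
        "enough_events_per_predictor": epp >= min_epp,
        "events_per_predictor": epp,
    }


# Illustrative counts, not taken from any included study.
example = StudySample(n_patients=300, n_events=60, n_candidate_predictors=8)
flags = sample_size_flags(example)
print(flags)
```

In this illustrative case the cohort is large enough and has enough events, but 60 events spread over 8 candidate predictors falls below the threshold of 10 events per predictor.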
Assessment of the performance of the prognostic tools should use discrimination (ability to differentiate between patients who do or do not experience the event, eg, area under the curve (AUC) receiver operating characteristic (ROC)), calibration (agreement between predictions from the model and observed outcomes) and global measures (simultaneous evaluation of calibration and discrimination, eg, Brier Score). Among the included articles, the retained prognostic tools were those presenting good prognostic performances reported on internal and/or external validation.
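As a minimal sketch of the three families of measures named above, the following computes a ROC AUC, a Brier score and a crude calibration summary from predicted risks and observed binary outcomes. The function names and toy data are hypothetical, and the mean difference used here is only a simple stand-in for calibration-in-the-large; it does not replace a calibration curve.

```python
def auc(y_true, y_prob):
    """Discrimination: probability that a random event receives a higher
    predicted risk than a random non-event (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Global measure: mean squared difference between risk and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mean_calibration_gap(y_true, y_prob):
    """Crude calibration summary: observed event rate minus mean predicted
    risk (0 means no systematic over- or under-prediction)."""
    n = len(y_true)
    return sum(y_true) / n - sum(y_prob) / n


# Toy data for illustration only.
outcomes = [0, 0, 1, 1]
risks = [0.1, 0.4, 0.35, 0.8]

print(auc(outcomes, risks))
print(brier_score(outcomes, risks))
print(mean_calibration_gap(outcomes, risks))
```

A tool can discriminate well yet be poorly calibrated, which is why the three quantities are reported separately.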
Patient and public involvement
This study has no patient or public involvement.
Description of studies
The electronic database search identified 6149 unique references. Screening of titles/abstracts and references checking of included articles and reviews identified 85 eligible papers for full-text review. We did not include 43 articles for the following reasons: 19 univariate models, 2 models without predictor selection and 22 multivariate models without performance reporting. Finally, we included 42 articles (figure 1).
All articles were in English. There were 11 articles published before 2010, 12 between 2010 and 2015 and 19 after 2015. The published teams were mainly from Europe (n=17, 40%) and North America (n=14, 33%). Patients were mostly recruited into an ICU (n=33, 79%). Inclusion criteria were heterogeneous in terms of location or aetiology of the haemorrhage. For ICH, most studies included only spontaneous ICH, some excluding malformations and/or coagulation disorder. For SAH, most included aneurysmal SAH. Online supplemental tables S3 and S4 present the information regarding inclusion and non-inclusion criteria of each study. The pooled mean age was 59.3 years (SD 13.7) (data not available for six studies). Fifty-three per cent (range 21%–73%) were female (missing data for five studies). The 42 eligible articles reported 128 prognostic tools (figure 1): five articles reported one tool, 16 reported two tools, 7 reported three tools and 14 articles more than three tools, differing by their predictors, their types of outcome or their horizon times. Regardless of the types of predicted outcomes, the sample sizes ranged from 68 to 1629 patients (median 290, IQR 128–413), and the number of events ranged from 21 to 786 (median 64.5, IQR 34–164). Regardless of the time of prediction, most of the prognostic tools predicted mortality (n=75, 59%) (figure 2). Fifty-one (40%) tools studied functional outcomes using the modified Rankin Scale (mRS) or the Glasgow Outcome Scale (GOS). The horizon time for mortality data was mostly short-term (67% at discharge or 1 month), unlike functional outcomes (14% at discharge or 1 month) (figure 2). One study predicted the cognitive status and physical quality-of-life at 12 months.
The 452 predictors of these 128 tools mainly involved baseline characteristics (n=95, 21%), admission clinical variables (n=104, 23%), biological measures (n=86, 19%), CT variables (n=95, 21%), ICU-evolution variables (n=29, 6%), existing scores (n=40, 9%) and others (n=3, 1%). Most variables were available on admission, others within 72 hours after ICU admission, and few were available throughout the ICU stay. The prediction time was sometimes unknown.
Model development studies
Twenty-eight studies developed prediction models. Online supplemental table S3 provides the complete standardised form and references. Twelve articles focused on patients with ICH only, 15 on patients with SAH only and one on patients with both ICH and SAH. Of the 16 articles on SAH, 14 (87%) reported a functional outcome, while they represented 6 (46%) of the 13 ICH articles. The primary statistical analysis used to develop the scoring system was logistic regression. Other analyses were linear models, Cox models or less well-known statistical methods such as decision tree analysis, Bayesian networks and artificial neural networks. One article did not specify the type of modelling used (see online supplemental table S3 for corresponding references). Predictor selection strategy, which describes the initial pool of variables and the analysed variables, was rarely mentioned.
Among the 28 included articles, 22 articles developed their tool without validation, that is, they reported the apparent prognostic capacities on the training sample only. They were thus not retained. However, a few of these studies were well conducted, with a large cohort and long-term outcome, and would deserve validations.24–27 Among the 28 included articles, six articles presented a development with internal validation (two using bootstrapping, three cross-validations, one temporal validation). One also reported additional external validation. Online supplemental table S5 lists the methods used to quantify prognostic performances. The authors seldom presented global performances. All reported the discrimination with the AUC of the ROC curve, while calibration measures were not systematic. Of the six studies that developed and validated models, two included fewer than 250 patients and one had fewer than 50 events. We did not retain them due to this insufficient sample size (figure 3).
Finally, three articles proposed a prognostic tool developed and validated based on recommendations: the FRESH score for SAH (excluding rupture of arteriovenous malformation),11 the ABC score for patients with aneurysmal SAH28 and the score by Degos et al for elderly patients with aneurysmal SAH.29 Table 1 summarises the collected information regarding source population, development approach, validation details and prognostic performances of these three retained scores.
External validation studies
Fourteen articles aimed to externally validate one or more existing models, most of which were not initially developed with severe injuries managed in ICUs. Eleven out of the 28 articles that developed a tool also compared their score to one or more existing models. Finally, 25 articles presented a stand-alone external validation. Online supplemental table S4 provides complete standardised form and references. Online supplemental table S5 lists the methods used to report prognostic performances. Most reported the AUC of the ROC curve; 15 articles had at least one calibration measurement. The authors rarely compared external validation cohorts to the population of the original article. One study proposed recalibration to predict another outcome than the development study.30
Of the 25 studies that externally validated models, 12 included fewer than 250 patients or less than 50 events (figure 3). There were four externally validated general scoring systems. The APACHE II, the SIRS summary score, the SOFA score and the SAPS II showed encouraging performance values when predicting short-term mortality. Because they did not include specific predictors of brain injuries, their use in clinical practice to predict functional or long-term outcomes is not appropriate (figure 3). Injury-specific predictors could extend these scoring systems to improve their predictive capacities and clinical utilities.
There were eight injury-specific externally validated scores. In the ICH population, we retained three externally validated scores: the ICH score,25 31–33 the modified ICH score (MICH)25 34 and the max ICH score.25 33 For the SAH, we retained five tools. Two tools were bivariate, including Glasgow Coma Scale or World Federation of NeuroSurgeons (WFNS) scale associated with CT features: a three-coloured grading system termed the VASOGRADE35 36 and the Hijdra score for aneurysmal SAH.37 38 Three tools were multivariate models: the HAIR score11 36 39 40 and the SubArachnoid Haemorrhage International Trialists (SAHIT) score for SAH,41 42 and the international subarachnoid aneurysm trial score for aneurysmal SAH.43 44 Tables 2 and 3 summarise, for ICH and SAH, respectively, the collected information regarding the source population, development approach, validation details and prognostic performances of these eight scores.
Retained prognostic scores from included studies
Finally, for each included study (Development and validation or Stand-alone external validation), we reviewed the levels of prognostic performances for the final selection of multivariate prognostic scores that can be easily applicable for practical use. Among the prognostic tools for ICH, we did not retain the MICH score because of the lack of reporting calibration that did not guarantee agreement between predictions and observed outcomes. We thus highlighted two scores (figure 3). The ICH score31 was externally validated in three large ICU cohorts, predicting 1-month, 3-month and 12-month mortality or functional outcome (mRS 4–6).25 32 33 The max ICH score25 predicted 3-month and 12-month mortality and functional outcome (mRS 4–6), based on CT predictors (lobar and non-lobar ICH volume, age, National Institutes of Health Stroke Scale, presence of intraventricular haemorrhage and anticoagulant therapy). This showed good performances in a large external ICU cohort.33 Table 2 presents the original publication, the external validation studies and corresponding performances (discrimination and calibration).
Among the retained SAH tools, the level of clinical utility and prognostic capacities was debatable. Tables 1 and 3 detail the strengths and limitations of each of these scores. The vast majority of tools presented high discrimination. We did not retain the Hijdra score37 because of weak discrimination or absence of calibration. Additionally, the VASOGRADE,35 the FRESH score,11 the ABC score28 and the Degos score for the elderly29 lacked reporting calibration or used the Hosmer-Lemeshow goodness-of-fit test. The ISAT score and HAIR score, which had a low calibration for high-risk SAH, would probably benefit from recalibration or updating.39 43 We thus only retained the SAHIT score41 (figure 3). In a single external validation,42 it predicted either an unfavourable outcome (mRS 3–6) or mortality at 6 months, based on clinical predictors (age, history of hypertension and WFNS preoperative neurological grade) and CT (Fisher grade, aneurysm size and location). It revealed good discrimination and calibration.
While studies labelled as ‘prognostic’ abound in the literature on intracranial haemorrhage, our systematic review dedicated explicitly to critical patients revealed a lack of methodological robustness. Of the 85 read articles, we identified six articles that developed a prognostic tool supported by a validation study and 25 external validation studies. After critical appraisal of the articles, we retained, for the ICH population, the ICH score,31 which has better performances for the shorter outcome, and the max ICH score.25 For the SAH population, we retained the SAHIT score for its high methodological quality.42
The ICH score,31 developed in 2001, has benefited from multiple external validations in many different populations. The American Heart Association guidelines45 recommend its reporting. In external validations with severe ICH, its performances could be better, particularly for longer term and functional outcomes.25 32 33 It would be interesting to consider updating or recalibrating this tool. The max ICH score,25 developed in 2017, was assessed on only one external cohort, where it showed good discrimination, satisfying calibration and better performances than the ICH score on the same sample.33 It would benefit from further validations in other large and recent cohorts. The SAHIT score, developed in 2018, predicted unfavourable outcome or mortality at 3 months in a low to severe SAH population.41 The single external validation in an ICU cohort revealed good prognostic performances that further studies have yet to confirm.42
In our systematic review, the authors rarely highlighted the clinical objective, which leads us to believe that clinical purposes did not drive most score elaborations. Functional outcomes in the modern setting of critical care make more sense than mortality outcomes for patients who are more likely to survive but face disabilities.46 The ordinal functional outcomes scales are almost systematically dichotomised (GOS 1–3 vs 4–5, mRS 4–6 vs 0–3 or mRS 3–6 vs 0–2). These thresholds, though never justified, should depend on the clinical objective. If the score’s purpose is to support clinicians in making ethically challenging decisions, such as withdrawal of care, it is not reasonable to place severe disabilities, vegetative state and death on the same unfavourable side. Besides, a prognostic tool on its own, as rigorous as it may be, is hardly capable of integrating the strong human dimension of such a complex decision. Multidisciplinary clinical teams should rely on a combination of considerations, which include multivariable scoring systems.
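To illustrate why the dichotomisation threshold matters, the sketch below applies two cut-offs to the same set of mRS grades. The function name and the toy cohort are hypothetical and serve only to show how the proportion labelled ‘unfavourable’ shifts with the threshold.

```python
def is_unfavourable(mrs: int, threshold: int) -> bool:
    """Label an mRS grade (0 = no symptoms, 6 = death) as unfavourable
    when it is at or above the chosen threshold."""
    if not 0 <= mrs <= 6:
        raise ValueError("mRS grades range from 0 to 6")
    return mrs >= threshold


# Toy cohort of mRS grades, for illustration only.
cohort = [0, 1, 2, 3, 3, 4, 5, 6]

# mRS 3-6 vs 0-2
rate_threshold_3 = sum(is_unfavourable(m, 3) for m in cohort) / len(cohort)
# mRS 4-6 vs 0-3
rate_threshold_4 = sum(is_unfavourable(m, 4) for m in cohort) / len(cohort)

print(rate_threshold_3, rate_threshold_4)
```

With these toy grades, the two patients at mRS 3 move from the unfavourable to the favourable group when the threshold changes, so the event rate and any model fitted to it change as well.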
If the clinical objective is instead to inform patients and their relatives of the evolution prospects, the condition they consider to be favourable should be determined by themselves and ideally over the very long-term.47 48 In our systematic review, the longest prediction horizon was 12 months, that is, before stabilisation of functional recovery and the ability to adapt to such a consolidated statement.49 Moreover, patient perception could weigh the different levels of functional disabilities.50 51 Indeed, survivors have a wide range of life-long consequences such as neuropsychological difficulties, memory problems, fatigue and physical complaints, that is, dimensions not explored with functional outcome scales.51 52 As these symptoms are not always apparent, only validated patient (or caregivers) reported questionnaires can reflect the subjective perception of their quality of life.51 53 In our systematic review, the only article mentioning quality of life concerns the FRESH score.11 Even though some methodological choices are questionable in this study, we think that it deserves attention because it surpasses the functional outcomes by integrating the quality of life as an objective of prediction.
In our systematic review, we identified several methodological pitfalls. A large proportion of eligible studies are wrongly labelled ‘prognostic models’. Some authors did not report prognostic performances, sometimes because they wrongly interpreted the odds ratio as a prognostic ability. These mistakes revealed considerable confusion in the literature between the notions of correlation and prediction.54 Some development studies only reported apparent prognostic performances. This lack of internal or external validation led to overestimating the performances of the prognostic tools.21 Several studies were based on a small sample size or a small number of events, which creates a risk of overfitting and lowers the credibility of the reported prognostic performances.55 These studies would benefit from external validations with recent and large cohorts. There was heterogeneity in the reporting of prognostic performances: discrimination was systematic, but only about half of the retained studies assessed calibration and 10% assessed global performance. Calibration curves, rarely reported, allow future external validations to assess the possible need for recalibrating or updating a score to adapt it to the population of interest. The popular Hosmer-Lemeshow goodness-of-fit test is known to perform poorly, making its use regrettable.56 We discarded several studies due to the absence of variable selection. The included studies rarely specified the predictor selection strategy, which describes the initial pool of variables and the analysed variables. This precision allows the reader to assess the risk of overfitting. The prediction time was sometimes unknown, making the score challenging to apply. Authors should clearly state this information to inform the user of when to calculate the prediction. Authors who studied long-term outcomes always chose to use logistic regression by excluding patients lost to follow-up, when time-to-event methods would have been more appropriate in the presence of such censoring.
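The data behind a calibration curve can be sketched as a table of mean predicted risk against observed event rate within groups of patients ordered by predicted risk. This is a minimal illustration under stated assumptions: the function name, the equal-size grouping and the toy data are hypothetical, and smoothed curves are often preferred in practice.

```python
def calibration_table(y_true, y_prob, n_groups):
    """Return (mean predicted risk, observed event rate, group size) for
    groups of patients ordered by increasing predicted risk."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        # Slice the ordered patients into nearly equal-sized groups.
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows


# Toy data for illustration only.
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

table = calibration_table(outcomes, risks, n_groups=2)
for mean_pred, obs_rate, size in table:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={size}")
```

Plotting observed against predicted values from such a table in an external cohort shows whether risks are systematically too high or too low, which is the information needed to decide on recalibration.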
Finally, this resulted in a very low number of prognostic tools that seemed methodologically correct and presented a reasonable prognostic performance level. However, weaker validation results do not mean that the model is incorrect. If scores’ development approaches were optimal, relevant predictors could be recalibrated and combined with new data to validate a strong tool.57 58
A consortium of experts published the TRIPOD statement in 2015, clearly setting out how to report prognostic information.17 Of the 85 full texts screened in our review, 35 (41%) were published after the TRIPOD publication in 2015. Of these, we finally retained 19 (54%) articles published after the TRIPOD publication, whereas we retained only 11 (22%) from the 50 articles published before the TRIPOD publication. Similar to Zamanipoor Najafabadi et al, we noticed a trend towards quality improvement, reinforced with the necessary ongoing validation of existing scores.59 Our systematic review revealed that some robust published scores, outlined in reviews focusing on non-severe to severe intracranial haemorrhages10–13 (eg, FUNC score,60 Essen ICH score61 and ICHOP score62), have not yet been validated in the ICU population. To use them reliably in such settings, they should be externally validated with critical patients. We also did not find tools dedicated to severe specific populations (such as haemorrhages secondary to malformation or patients with coagulation disorders). External validations would be interesting for these populations (eg, patients under anticoagulant63 or arteriovenous malformation64). Another option would be to extend existing scores with these risk predictors, such as the max ICH score, which includes the variable ‘presence of oral anticoagulant’.25 With the rapid evolution of therapeutic advances in neurocritical care, the ongoing prognostic studies should focus on temporal validation and updating/recalibrating existing good scores to ensure their performance validity.55 It is also possible to extend this by incorporating additional modern variables.
This review has several limitations. First, we aimed to include tools dedicated to ICH or SAH managed in the ICU. Because of the lack of severity classification for these pathologies, and heterogeneity of patients admitted to ICUs, we defined our own severity criteria, which are debatable. Second, only one assessor conducted the study screening on title/abstract. This may have resulted in some eligible studies being missed. Third, we did not use a formal tool to study the risk of bias such as the recent Prediction model Risk Of Bias ASsessment Tool (PROBAST) based on the TRIPOD.65 66 Following the TRIPOD recommendations, we built our own standardised form collecting information similar to the PROBAST items. Fourth, due to the heterogeneity in the included models, we could not perform a meta-analysis. Finally, as with any systematic review, our work is subject to publication bias. Similar to randomised clinical trials, we cannot exclude that unpublished studies may have negative results or effect sizes different from published studies.67 One consequence could be, for instance, the underrepresentation of external validation studies with non-confirmatory prognostic performances.
Our review identified several methodological pitfalls and incomplete reporting in prognostic articles on intracranial haemorrhages managed in ICU. Among the many published scores for ICH and SAH, some deserve further attention. Rather than developing new scores, future authors should focus on externally validating and updating well-developed existing scores with large and recent cohorts, relying on methodological syntheses such as the TRIPOD statement.17 57 68 We have chosen to emphasise the ICH score, the max ICH score and the SAHIT score for their superior prognostic performances. Nevertheless, they need ongoing validations, recalibrations and impact studies to improve them. The use of ‘patient-centred’ outcomes that have yet to be defined could also enhance the tools in the delicate, medical and ethical setting of critical care. Beyond all methodological issues, a patient-centred clinical purpose should guide prognostic tools if they are to be convincing.
Patient consent for publication
Contributors JS-P: conceptualisation and design of the review, data acquisition, statistical analysis, writing: original draft, review and editing. YF: conceptualisation, data acquisition, providing statistical expertise, writing: original draft, review and editing. FF: data acquisition, writing: review and editing. ML, LB-C, RC and DF: data acquisition, providing clinical expertise, writing: review and editing. ED: project management, conceptualisation, methodology, data acquisition, writing: original draft, review and editing.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests YF reports personal fees for statistical training from Sanofi and Biogen, outside the submitted work.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.