
Original research
Head-to-head comparison of 14 prediction models for postoperative delirium in elderly non-ICU patients: an external validation study
  1. Chung Kwan Wong1,
  2. Barbara C van Munster1,
  3. Athanasios Hatseras1,
  4. Else Huis in 't Veld1,
  5. Barbara L van Leeuwen2,
  6. Sophia E de Rooij1,
  7. Rick G Pleijhuis3
  1. 1Department of Geriatrics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  2. 2Department of Surgery, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  3. 3Department of Internal Medicine, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  1. Correspondence to Dr. Rick G Pleijhuis; r.g.pleijhuis{at}


Objectives Delirium is associated with increased morbidity, mortality, prolonged hospitalisation and increased healthcare costs. The number of clinical prediction models (CPM) to predict postoperative delirium has increased exponentially. Our goal is to perform a head-to-head comparison of CPMs predicting postoperative delirium in non-intensive care unit (non-ICU) elderly patients to identify the best performing models.

Setting Single-site university hospital.

Design Secondary analysis of prospective cohort study.

Participants and inclusion CPMs published within the timeframe of 1 January 1990 to 1 May 2020 were checked for eligibility (Preferred Reporting Items for Systematic Reviews and Meta-Analyses). For the time period of 1 January 1990 to 1 January 2017, included CPMs were identified in systematic reviews based on prespecified inclusion and exclusion criteria. An extended literature search for original studies was performed independently by two authors, including CPMs published between 1 January 2017 and 1 May 2020. External validation was performed using a surgical cohort consisting of 292 elderly non-ICU patients.

Primary outcome measures Discrimination, calibration and clinical usefulness.

Results 14 CPMs were eligible for analysis out of 366 full texts reviewed. External validation was previously published for 8/14 (57%) CPMs. C-indices ranged from 0.52 to 0.74, intercepts from −0.02 to 0.34, slopes from −0.74 to 1.96 and scaled Brier from −1.29 to 0.088. Based on predefined criteria, the two best performing models were those of Dai et al (c-index: 0.739 (95% CI: 0.664 to 0.813); intercept: −0.018; slope: 1.96; scaled Brier: 0.049) and Litaker et al (c-index: 0.706 (95% CI: 0.590 to 0.823); intercept: −0.015; slope: 0.995; scaled Brier: 0.088). For the remaining CPMs, model discrimination was considered poor with corresponding c-indices <0.70.

Conclusion Our head-to-head analysis identified 2 out of 14 CPMs as best-performing models with a fair discrimination and acceptable calibration. Based on our findings, these models might assist physicians in postoperative delirium risk estimation and patient selection for preventive measures.

  • geriatric medicine
  • surgery
  • internal medicine
  • delirium & cognitive disorders
  • risk management

Data availability statement

Data are available on reasonable request. The PRIDE database is available upon reasonable request.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Strengths and limitations of this study

  • This study encompasses the largest head-to-head comparison of clinical prediction models (CPM) for predicting postoperative delirium in elderly non-intensive care unit patients reported to date.

  • Prospectively collected data were reused for validation purposes.

  • Model variables not available in the dataset were substituted for using equivalent variables if available.

  • Evaluated performance measures included both classical statistical metrics and decision curve analysis, providing a solid basis for model comparison.

  • Identification of eligible CPMs during a 30-year time span was partly based on previously published systematic reviews, which might have resulted in relevant CPMs being overlooked.

  • The total number of events in the dataset was limited to 25 patients with delirium, resulting in limited power for included prediction models with a large number of variables.


Delirium is a mental disorder, characterised by an acute fluctuating disturbance in awareness and attention accompanied by deficits in cognitive domains such as memory, orientation, language and perception. It is typically caused by an underlying disturbance, such as infection or electrolyte imbalances. Delirium is also common in the postoperative period following (major) surgery and is associated with increased morbidity, mortality and prolonged hospitalisation, increasing healthcare costs.1 2 It is also known to reduce long-term cognitive function, even years after the patient was discharged from the hospital.2 Incidence of postoperative delirium reported in the literature ranges from 10% to 70%, depending on the type of surgery performed, characteristics of the patient population and criteria used for the diagnosis.3

Importantly, delirium is considered preventable in up to 40% of the cases.4 5 For example, non-pharmacological multicomponent interventions were shown effective in reducing the risk of developing delirium in the elderly.6 Yet, these interventions can be time-consuming and costly, limiting widespread application when resources are scarce. Adequate stratification of delirium risk is of great importance to make sure preventive interventions are provided to those patients expected to benefit most from them. Clinical prediction models (CPMs) can be applied for these purposes.7 Over the past few decades, the number of CPMs to predict postoperative delirium has steeply increased and continues to do so. CPMs can support patient selection for preventive measures by differentiating low-risk and high-risk patients based on the presence of risk factors associated with delirium. In order for CPMs to be used in a safe and responsible manner, it is of utmost importance that information on the development and overall performance of these models is made available to clinicians. Yet, this information is often lacking.8 For example, the majority of published CPMs for postoperative delirium have not been externally validated, therewith lacking essential information on model robustness and generalisability. Even when external validations have been performed, they are usually based on different patient cohorts, hampering direct comparison of model performance between CPMs.

The aim of this study is to perform a head-to-head comparison of discriminative power, calibration and clinical utility of previously published CPMs to predict postoperative delirium in non-intensive care unit (non-ICU) elderly patients. For this purpose, we used a single prospectively obtained validation cohort to externally validate multiple CPMs simultaneously.


Literature search

An extensive literature search was performed to identify CPMs eligible for external validation. The search comprised two parts, conducted separately.

First, we searched the MEDLINE database for systematic reviews focusing on the prediction of postoperative delirium in elderly non-ICU patients. Systematic reviews published between 1 January 1990 and 1 May 2020 that fulfilled the selection criteria (online supplemental figure 1) were selected. We then extracted all CPMs deemed eligible for validation based on inclusion and exclusion criteria from the systematic reviews, followed by removal of any duplicates.

Second, an extended literature search was performed using the MEDLINE database to cover the time periods not previously taken into account by systematic reviews. Search terms were carefully selected with support of a clinical librarian. A detailed overview of the final search strategy used for the extended search is provided in online supplemental figure 2.

Selection of studies

Studies were selected by two investigators (AH, EHV) who evaluated all search results independently in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Discrepancies between both investigators were solved by a third and fourth evaluator (CKW, RP). The same inclusion and exclusion criteria were applied to studies extracted from systematic reviews as well as studies identified through the extended search.

Studies were included when: (i) a prediction model was developed with predicted risk of delirium as the primary outcome and (ii) the primary focus was on elderly hospitalised surgical or mixed (surgical and medical) patients (defined as mean age ≥60 years). Studies were excluded when: (i) the primary focus was on deviant patient populations: delirium tremens, dementia, stroke, psychiatric disorders, acute kidney injury, shock, palliative phase, non-surgical (ie, medical) or intensive care patients, (ii) only external validation of a previously published model was performed, (iii) no prediction model was developed (ie, risk factors reported only or prediction based on a single variable), (iv) the published prediction model could not be reconstructed due to incomplete reporting of model parameters, (v) no full-text article or English translation was available, (vi) the study was considered non-original research and (vii) no incidence of delirium was reported. Finally, CPMs eligible for external validation were selected based on the availability of required model variables in the external validation cohort.

External validation cohort

External validation was performed using an independent dataset (Performance of Risk stratification Instruments for postoperative DElirium; PRIDE cohort), previously described by Jansen et al.9 In brief, the dataset comprises 292 elderly hospitalised patients who underwent various surgical procedures between 1 October 2011 and 1 June 2012 in the University Medical Center Groningen, The Netherlands. This dataset, containing prospectively obtained data, was used to externally validate multiple eligible prediction models simultaneously.

Minor deviations between original model variables and variables available in the dataset were resolved by using substitute parameters. CPMs with more than one variable missing in the validation cohort were excluded. In case of a single missing dichotomous variable, the CPM was still included and a sensitivity analysis was performed.

Patient and public involvement

During this study, there was no direct involvement of the public or patients in the design, conduct or reporting of the research. The results of this study are expected to enhance patient involvement, facilitating shared decision-making.

Head-to-head evaluation of overall model performance

To judge the selected CPMs on their merits and compare them head-to-head, key performance measures were evaluated regarding model discrimination, calibration and clinical usefulness.10

Model discrimination

Discrimination refers to the ability of the CPM to distinguish patients who develop delirium from those who do not develop delirium. Model discrimination is expressed as the area under the receiver operating characteristic curve, or ‘c-index’, which plots the sensitivity (true-positive rate) against 1−specificity (false-positive rate) for consecutive cut-offs of the predicted risk. Perfect discrimination gives a c-index of 1, and no discrimination (no better than the toss of a coin) results in a c-index of 0.50. Prediction models with c-indices between 0.9 and 0.99 are considered to have excellent discrimination, 0.8 and 0.89 good discrimination, 0.7 and 0.79 fair discrimination and 0.51 and 0.69 poor discrimination.11
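As an illustration of the definition above, the c-index can be computed by comparing every patient who developed delirium with every patient who did not. The sketch below is a minimal example and not the validation software used in this study; the function names and the handling of values at or below 0.50 (labelled 'none') are assumptions made for illustration.

```python
from itertools import product


def c_index(predicted_risk, outcome):
    """Fraction of (delirium, no delirium) patient pairs in which the patient
    with delirium received the higher predicted risk; ties count as 0.5."""
    cases = [p for p, y in zip(predicted_risk, outcome) if y == 1]
    controls = [p for p, y in zip(predicted_risk, outcome) if y == 0]
    if not cases or not controls:
        raise ValueError("need patients with and without the outcome")
    concordant = 0.0
    for case, control in product(cases, controls):
        if case > control:
            concordant += 1.0
        elif case == control:
            concordant += 0.5
    return concordant / (len(cases) * len(controls))


def discrimination_label(c):
    """Categories as listed in the text (assumed boundaries are inclusive
    at the lower end; 0.50 or below is labelled 'none')."""
    if c >= 0.9:
        return "excellent"
    if c >= 0.8:
        return "good"
    if c >= 0.7:
        return "fair"
    if c > 0.5:
        return "poor"
    return "none"


# Hypothetical toy data: 1 = delirium, 0 = no delirium
risks = [0.10, 0.40, 0.35, 0.80]
delirium = [0, 0, 1, 1]
print(c_index(risks, delirium), discrimination_label(c_index(risks, delirium)))
```

The pairwise count is equivalent to the area under the receiver operating characteristic curve for a binary outcome, which is why both terms are used interchangeably here.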

Model calibration

Calibration refers to the agreement between predicted and observed risk.12 It can be assessed graphically in a plot with predicted probabilities on the x-axis and the proportion of observed risk (delirium present or absent) on the y-axis (figure 1). Perfect predictions should be located on the reference line, described with an intercept of 0 and a slope of 1, indicating that predicted and observed outcomes are alike. The intercept compares the mean of all predicted risks with the mean observed risk. This parameter hence indicates the extent that predictions are systematically too low or too high. The slope is a measure of spread of predicted probabilities.13
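One common way to obtain these two parameters is to regress the observed outcome on the logit of the predicted risk. The sketch below assumes that approach: the slope comes from a logistic model with a free intercept and slope, and the intercept (calibration-in-the-large) from a model with the slope fixed at 1. It is an illustrative pure-Python implementation using Newton-Raphson, not the software used in this study, and it assumes all predicted risks lie strictly between 0 and 1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def calibration_slope(predicted_risk, outcome, iterations=50):
    """Fit logit(P(y=1)) = a + b * logit(predicted) and return (a, b).
    A slope b of 1 is ideal; b < 1 suggests predictions are too extreme."""
    x = [logit(p) for p in predicted_risk]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        mu = [expit(a + b * xi) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        # Gradient of the log-likelihood
        g0 = sum(y - m for y, m in zip(outcome, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(outcome, mu, x))
        # Entries of the (symmetric) information matrix
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def calibration_in_the_large(predicted_risk, outcome, iterations=50):
    """Intercept a in logit(P(y=1)) = a + logit(predicted), slope fixed at 1.
    a < 0 means risks are overestimated on average, a > 0 underestimated."""
    x = [logit(p) for p in predicted_risk]
    a = 0.0
    for _ in range(iterations):
        mu = [expit(a + xi) for xi in x]
        gradient = sum(y - m for y, m in zip(outcome, mu))
        information = sum(m * (1.0 - m) for m in mu)
        a += gradient / information
    return a


# Hypothetical toy data: predictions of 10% and 90%, observed 20% and 80%
risks = [0.1] * 5 + [0.9] * 5
delirium = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(calibration_slope(risks, delirium))  # slope below 1: too extreme
```

In this toy example the predictions are more extreme than the observed proportions, so the fitted slope falls below 1 while the intercept stays at 0.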

Figure 1

Flow chart indicating the selection process of included delirium prediction models.

(Scaled) Brier score

The Brier score is a composite measure based on the mean square error of predictions, assessing both discrimination and calibration.14 It can be used to compare performance between CPMs predicting binary outcomes (ie, delirium present or absent), with lower scores indicating superior models. A Brier score of 0 represents a perfect model. Scaled Brier scores were calculated to take the baseline incidence of delirium into account, facilitating result interpretation.
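The calculation can be sketched as follows, assuming the commonly used scaling in which the Brier score is compared with that of a non-informative model that assigns the observed incidence to every patient. The function names are illustrative and not taken from the study's software.

```python
def brier_score(predicted_risk, outcome):
    """Mean squared difference between predicted risk and observed outcome (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(predicted_risk, outcome)) / len(outcome)


def scaled_brier_score(predicted_risk, outcome):
    """1 - Brier / Brier_reference, where the reference model predicts the
    observed incidence for every patient (assumed scaling)."""
    incidence = sum(outcome) / len(outcome)
    reference = incidence * (1.0 - incidence)
    return 1.0 - brier_score(predicted_risk, outcome) / reference


# Hypothetical toy data with 25% incidence
delirium = [1, 0, 0, 0]
print(scaled_brier_score([0.25, 0.25, 0.25, 0.25], delirium))  # non-informative model
print(scaled_brier_score([1.0, 0.0, 0.0, 0.0], delirium))      # perfect model
print(scaled_brier_score([0.0, 1.0, 1.0, 1.0], delirium))      # worse than non-informative
```

Under this scaling a perfect model scores 1, a non-informative model scores 0, and a model that performs worse than simply predicting the incidence yields a negative value, which is how negative scaled Brier scores can arise.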

Clinical usefulness

Although important when assessing predictive value, traditional statistical metrics such as discrimination and calibration are not directly informative with regard to clinical value. As a means to overcome these limitations, Vickers and Elkin introduced the concept of decision curve analysis, providing a more holistic understanding of the clinical relevance of CPMs.15 In brief, decision curve analysis calculates a clinical ‘net benefit’ for CPMs in comparison to default strategies of imposing an intervention for all or no patients.16 Net benefit is calculated across a range of threshold probabilities, defined as the minimum probability of disease at which further intervention would be warranted.
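A minimal sketch of the net benefit calculation is given below. It uses the standard definition (true positives minus false positives weighted by the odds of the threshold probability, both divided by the number of patients) and assumes that a patient receives the intervention when the predicted risk is at or above the threshold. Function names and example data are hypothetical.

```python
def net_benefit(predicted_risk, outcome, threshold):
    """Net benefit of intervening in patients with predicted risk >= threshold."""
    n = len(outcome)
    true_pos = sum(1 for p, y in zip(predicted_risk, outcome) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(predicted_risk, outcome) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcome, threshold):
    """Net benefit of the default strategy of intervening in every patient."""
    incidence = sum(outcome) / len(outcome)
    return incidence - (1.0 - incidence) * threshold / (1.0 - threshold)


# The 'treat none' strategy has a net benefit of 0 at every threshold.
risks = [0.05, 0.10, 0.30, 0.60]
delirium = [0, 0, 1, 1]
for threshold in (0.05, 0.10, 0.20, 0.30):
    print(threshold,
          round(net_benefit(risks, delirium, threshold), 3),
          round(net_benefit_treat_all(delirium, threshold), 3))
```

A decision curve plots these values against the threshold probability; a model is of potential clinical use at thresholds where its net benefit exceeds that of both default strategies.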

Statistical analysis

Continuous baseline characteristics are presented as mean and SD in the case of normally distributed data, whereas skewed data are presented as median and IQR.

We used multiple performance measures to evaluate model performance based on previously published recommendations for reporting on external validation studies.10 These included: calibration plot (calibration-in-the-large) and model intercept, calibration slope, discrimination with concordance statistic and clinical usefulness with decision curve analysis.

As recommended by Steyerberg et al,12 we used the scaled Brier score as a combined measure of model discrimination and calibration instead of the goodness-of-fit (Hosmer-Lemeshow) test.17 18

Sensitivity and specificity rates were calculated for all models. Negative and positive predictive values strongly depend on delirium incidence and were therefore not reported.
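The dependence of predictive values on incidence can be illustrated with a short calculation. The sketch below computes sensitivity and specificity at a given cut-off and then derives the positive predictive value for two incidences using Bayes' rule; the cut-off, data and function names are hypothetical and serve only as an example.

```python
def sensitivity_specificity(predicted_risk, outcome, cutoff):
    """Sensitivity and specificity when risk >= cutoff is classified as high risk."""
    tp = fn = tn = fp = 0
    for p, y in zip(predicted_risk, outcome):
        high_risk = p >= cutoff
        if y == 1:
            tp += high_risk
            fn += not high_risk
        else:
            fp += high_risk
            tn += not high_risk
    return tp / (tp + fn), tn / (tn + fp)


def positive_predictive_value(sensitivity, specificity, incidence):
    """Bayes' rule: share of high-risk classifications that are true positives."""
    true_pos = sensitivity * incidence
    false_pos = (1.0 - specificity) * (1.0 - incidence)
    return true_pos / (true_pos + false_pos)


sens, spec = sensitivity_specificity([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1], cutoff=0.38)
print(sens, spec)
# Same test characteristics, different incidence, different predictive value
print(round(positive_predictive_value(0.8, 0.8, 0.09), 3))
print(round(positive_predictive_value(0.8, 0.8, 0.50), 3))
```

With identical sensitivity and specificity, the positive predictive value is considerably lower at an incidence of 9% than at 50%, which is why predictive values from one cohort do not transfer directly to another.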

Calculations were performed semi-automatically using R-based validation software V.2.18. Differences in discriminative power between CPMs were assessed by comparing areas under the curve using MedCalc V.20.015.

Handling of missing data

Missing data were reported separately for each model. A complete case analysis was performed without using imputation techniques. CPMs were excluded if >30% of the patients in the validation cohort had missing data for one or more of the variables included in the CPM.
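The rule described above can be sketched as follows, assuming each patient is represented as a dictionary in which a missing value is stored as None. The variable names in the example are hypothetical and do not correspond to fields in the validation cohort.

```python
def complete_cases(patients, model_variables):
    """Patients with a recorded value for every variable the model requires."""
    return [row for row in patients
            if all(row.get(variable) is not None for variable in model_variables)]


def fraction_incomplete(patients, model_variables):
    """Share of patients missing at least one required variable."""
    incomplete = len(patients) - len(complete_cases(patients, model_variables))
    return incomplete / len(patients)


def model_is_excluded(patients, model_variables, max_missing=0.30):
    """Apply the exclusion rule: more than 30% of patients incomplete."""
    return fraction_incomplete(patients, model_variables) > max_missing


# Hypothetical cohort of 10 patients; 4 lack a cognition score
cohort = [{"age": 70, "cognition": 27} for _ in range(6)]
cohort += [{"age": 81, "cognition": None} for _ in range(4)]

print(model_is_excluded(cohort, ["age"]))               # no missing data
print(model_is_excluded(cohort, ["age", "cognition"]))  # 40% incomplete
```

Because the rule is evaluated per model, the same cohort can yield a different complete-case subset, and therefore a different number of events, for each CPM.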

Sensitivity analysis

To evaluate how sensitive CPM outputs are to changes in inputs, a sensitivity analysis was performed for variables in the validation cohort that could not be implemented exactly as described in the original studies. Analyses were repeated by changing the specific variables to extreme values, to investigate their impact on model performance.

Assessment of quality

For quality assessment of included articles, we used the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) 20-item checklist for prediction model development.20 The following checklist items were deemed not applicable and were therefore excluded: details of treatments received, actions to blind assessment of outcome and actions to blind assessment of predictors.


Literature search

A total of four systematic reviews were identified that reported on CPMs for postoperative delirium in elderly non-ICU patients.9 21–23 Altogether, the systematic reviews covered a time period from 1 January 1990 to 1 January 2017 with partial overlap. After removal of duplicates, 56 unique studies were further evaluated, resulting in 30 CPMs for validation after application of exclusion criteria (figure 1). Model variables required for validation were available in the validation cohort for 9 out of 30 (30%) CPMs, leaving 9 CPMs extracted from systematic reviews eligible for external validation.

The extended systematic search, covering a time period from 1 January 2017 to 1 May 2020 not covered by previously published systematic reviews, resulted in 3405 titles after removal of duplicates. A total of 391 abstracts were selected for review of full texts. No full text was available for 79 abstracts, leaving 312 articles for further evaluation. After application of exclusion criteria (detailed in figure 1), the extended search resulted in 32 additional CPMs suitable for validation. Model variables required for validation were available in the validation cohort for 5 out of 32 (16%) CPMs, leaving 5 CPMs identified through the extended search eligible for external validation.

External validation

Baseline characteristics of the external validation cohort are shown in table 1. In brief, the cohort consisted of 292 elderly hospitalised patients with a mean age of 66 years (SD ±8 years). All patients underwent surgery (general, oncological, vascular, hepatobiliary or ‘other’), of which the vast majority (90%) concerned elective procedures. A total of 25 patients (9%) developed delirium postoperatively.

Table 1

Baseline characteristics of external validation cohort

An overview of all CPM variables (table 2) and their matched variables from the validation cohort (online supplemental table 1) is provided. Risk factors most frequently used in the included CPMs were increasing age and pre-existing impaired cognition. External validation was previously published for 8 out of 14 (57%) CPMs (table 3). In all cases, c-indices of previously externally validated CPMs were higher compared with our findings.

Table 2

Overview of study populations, diagnostic instruments and model variables for all models included for analysis

Table 3

Performance of the included clinical prediction models on external model validation

Overall performance of clinical prediction models for postoperative delirium

Head-to-head evaluation of overall model performance was assessed for 14 included CPMs simultaneously (table 3). Calculated c-indices ranged from 0.52 to 0.74, intercepts from −0.02 to 0.34, slopes from −0.74 to 1.96, Brier scores from 0.07 to 0.22 and scaled Brier scores from −1.29 to 0.088.

For the vast majority (12 out of 14) of included CPMs, model discrimination was considered poor with corresponding c-indices <0.70 (figure 2). Model calibration and clinical usefulness for all included CPMs are represented graphically as calibration plots and clinical decision curves, respectively, in online supplemental figure 3. A positive net benefit was observed in the 5%–20% and 10%–30% threshold probability range for CPMs developed by Dai et al and Litaker et al, respectively, suggesting superiority to the ‘treat none’ strategy at these thresholds. For Ettema et al, positive net benefit was observed in the 10%–15% threshold probability. CPMs developed by Pompei et al,24 Rudolph et al,25 26 Carrasco et al,27 Kim et al,28 de Wit et al,29 Pendlebury et al,30 Halladay et al,31 Ten Broeke et al32 and Zhang et al33 showed limited net benefit.

Figure 2

Head-to-head comparison of discriminative power of delirium prediction models. Discriminative power of externally validated delirium prediction models is reported as c-indices with associated 95% CIs, ranked from low to high. A c-index of 0.5 resembles a situation in which the model has no discriminative power, that is, the model predicts no better than flipping a coin. Only 2 out of 14 validated models showed fair discrimination with c-indices >0.70 (0.71 and 0.74 for the models developed by Litaker et al and Dai et al, respectively) and 95% CIs with lower bounds >0.50. Discriminative power of the remaining 12 models was considered poor.

The two best performing CPMs were Dai et al (c-index: 0.739; 95% CI: 0.664 to 0.813; intercept: −0.018; slope: 1.96; Brier score: 0.077, scaled Brier score: 0.049) and Litaker et al (c-index: 0.706; 95% CI: 0.590 to 0.823; intercept: −0.015; slope: 0.995; Brier score: 0.074, scaled Brier score: 0.088) (table 3). Graphical representations of discrimination, calibration and clinical usefulness of both models are shown in figure 3.

Figure 3

Discrimination, calibration and clinical utility of best performing models. Panels A and B show the receiver operating characteristic (ROC) curve of the delirium prediction models by Litaker et al and Dai et al, respectively, with the area under the ROC curve (c-index) indicating the discriminative power of the model. A graphical representation of the calibration of both models is shown in panels C and D, plotting the predicted probability (x-axis) with corresponding 95% CI against the actually observed occurrence of delirium in the validation cohort (y-axis). The model by Litaker et al showed adequate calibration (panel C), correctly differentiating patients at low risk of delirium (20%). The model by Dai et al correctly identified patients at low risk (20%). Panels E and F show decision curve analyses as a measure of clinical utility of both models. For the models by Litaker et al and Dai et al, a positive net benefit was observed in the 10%–35% threshold probability range (panel E) and the 5%–20% threshold probability range (panel F), respectively.

On secondary analysis, there was no significant difference in model performance between the CPMs developed by Dai et al and Litaker et al. Yet, the discriminative power of Dai et al significantly differed from almost 50% of all included CPMs. Direct comparison of model discrimination between the remaining 12 CPMs showed no significant difference (online supplemental table 2).

Sensitivity analysis

In case no matching variable was available in our database, analyses were repeated using substitute variables to examine the possible dependency of results on the definition of the risk factors. This was performed for the following variables: activities of daily living, infection, comorbidities, severity of illness and memory problems. No significant differences in CPM performance were observed for any of the variables (data not shown). In case of minute differences in CPM performance between different substitute variables, the variable resulting in the best overall CPM performance was ultimately selected.


The goal of this study was to identify clinical prediction models for delirium developed and published since 1990 and to compare their performance head-to-head. Overall, we identified 62 CPMs that were developed for predicting postoperative delirium risk over the last 30 years, of which 14 (23%) could be externally validated using our independent cohort. As studies comparing similar models head-to-head are lacking, caregivers find themselves confronted with the difficult task to select the best-suited model from the great variety of models available. In our study, the two best performing models were those of Dai et al and Litaker et al, with c-indices of 0.739 and 0.706, respectively, regarded as adequate discrimination.34 35 Both models showed acceptable calibration, sufficiently stratifying patients in different risk groups. Based on these findings, these two models were considered most promising for guiding patient selection for preventive measures out of 14 evaluated models.

Risk factors for delirium have been studied extensively in the past few decades, hence a multitude of identified risk factors exist for delirium in hospitalised elderly.36–38 Clinical prediction models based on these risk factors provide an integrated approach in delirium risk estimation. Many CPMs for delirium were developed in specific niche populations which may hamper their generalisability and applicability in daily practice in the overall hospitalised populations. For example, many models contained highly specific biomarkers that are not readily available in most hospitals. In addition, instruments used to determine cognitive impairment often differed between models, making a direct comparison challenging. This was reflected by our finding that only 14 out of 62 studies could be validated despite the fact that we made use of a prospectively collected patient database containing over 200 distinct variables.

Preoperative stratification of patients based on estimated risk for postoperative delirium could identify those patients expected to benefit most from preventive measures. Although there is no conclusive evidence that different drugs are effective in preventing delirium,39 the evidence for non-pharmacological multicomponent interventions is considered sufficiently robust for clinical practice recommendations in elderly non-ICU patients.6 In healthcare institutions, applying multicomponent non-pharmacological measures to all patients would result in a high burden on scarce human and material resources. The labour-intensive and costly nature of multicomponent interventions requires appropriate selection of patients who are expected to benefit most from such interventions or for whom certain interventions could be omitted. In addition, prediction models can be used to inform patients regarding their individual risk to develop postoperative delirium, providing a solid basis for shared decision-making.

Assessment of model performance in external validation cohort

There is broad consensus that CPMs must be validated in independent patient cohorts prior to clinical application. In reality, external validation studies are often lacking, as was also the case for 6 out of 14 (43%) CPMs included in our study. A possible explanation might be that the information provided in model development studies is often lacking specific details (eg, an intercept in the case of logistic regression analysis) to reproduce the original model formula. Other reasons might be the time-intensive nature of external validation and the apparent tendency to develop new CPMs rather than evaluating existing ones.

Methodological guidelines recommend external validation in terms of discrimination and calibration to assess model robustness and generalisability.10

We found that the discriminative power determined in the original studies was higher in all cases (∆ c-indices ranging from 0.116 to 0.391) compared with our validation despite the nature of our study population, consisting solely of postoperative patients. A possible explanation is the tendency of overfitting in the case of narrow validation when the same database is (partly) used for model derivation and validation purposes.

Although assessment of calibration performance is an important measure to interpret CPM performance in addition to model discrimination, it has generally received little attention. As shown by Calster et al, poorly calibrated CPMs can be misleading and potentially harmful for clinical decision-making.40 In our current study, model calibration was assessed for all included CPMs and compared head-to-head.

In addition to conventional statistical performance measures, there is a growing interest in the use of decision curve analysis to evaluate net clinical benefit of CPMs in clinical practice.15 Decision curve analysis incorporates the consequences of the decisions made on the basis of a CPM, regarding impact on utilities, costs and harms. It is therefore considered a direct measure of clinical value.16 In the case of delirium risk prediction, a false-positive result (ie, patient falsely stratified as high-risk) is usually not harmful to the patient. False-negative outcomes (ie, patient falsely stratified as low-risk), however, could result in withholding adequate preventive measures for delirium development, exposing the patient to an increased risk for medical complications, prolonged hospitalisation and long-term adverse effects.2 The medical, psychological and economic effects of false-negative results are therewith considered to outweigh those of false-positive results.

Guidelines for transparent reporting

To be able to adequately assess potential usefulness and risk of bias of prediction models, full and clear reporting of information on all aspects is a prerequisite.41 Yet, multiple reviews concluded that reporting of model development is poor with insufficient information described in all aspects, from descriptions of patient data to statistical modelling methods.41–43 In response, a collaborative network of international experts developed methodological guidelines, like the TRIPOD checklist, to facilitate accurate, complete and transparent reporting.44 In this study, we noticed an improving trend of overall quality of reporting of studies since the introduction of the TRIPOD statement in 2015, although this finding was only based on 14 publications.

Enhancing clinical applicability of CPMs by using an online platform

To facilitate clinical application, published prediction models are sometimes made available as a digital calculator through a website or mobile device app, usually dedicated to a single model. As a result, the landscape of digital calculators is highly fragmented. This confronts healthcare professionals with new challenges. An example includes the current lack of transparency, that is, lack of insight in underlying model formulas, source codes or characteristics of the derivation cohort, turning digital calculators into a ‘black box’. Another challenge is how to ascertain their quality and performance when external validations or head-to-head comparisons are not available. The current lack of standardisation results in limited scalability as well as relatively high costs for hosting and updating digital calculators. Indeed, multiple examples exist of websites and apps that are no longer supported or even withdrawn several years after the initial project funding ceased to exist. Even when a prediction model meets all the above-mentioned requirements, this is still no guarantee that the model is actually applied in clinical practice. To facilitate clinical implementation, prediction models should be easily accessible in the clinical workflow, that is, integrated in the electronic health record system, digital protocols or decision support systems. The current variation in prediction models made available through different websites and apps, however, hampers (scalable) possibilities for integration.

To address the above-mentioned issues, we made use of an existing cloud-based platform that facilitates the standardised creation, head-to-head comparison and integration of CPMs. After identifying the best performing CPMs in a given target population, an intuitive user interface can be added automatically to facilitate CPM use (online supplemental figure 4). In addition, direct integration of CPMs in the clinical workflow (ie, electronic health record system) is expected to further increase their impact on clinical decision-making.7 Before the CPMs evaluated in the current study can be generally applied in a clinical setting, however, further validations in different cohorts are encouraged to further consolidate our findings in terms of model robustness and generalisability in non-surgical populations.


Over the last few decades, the number of CPMs developed to predict postoperative delirium has increased exponentially. Overall reproducibility was limited due to the requirement of specific variables not commonly available in daily practice and a lack of reported details to reconstruct model formulas. Nearly half of the CPMs included in our study had previously not been validated in an independent cohort. Our head-to-head analysis of 14 CPMs identified two best-performing models with a fair discrimination and acceptable calibration. Corresponding clinical usefulness was considered promising based on decision curve analysis. Based on our findings, these models might assist physicians in postoperative delirium risk estimation and selection of elderly non-ICU patients for preventive measures, although further validations in different cohorts are encouraged.


Ethics statements

Patient consent for publication

Ethics approval

This study does not involve human participants.


The authors are thankful to Carolien Jansen and clinical librarian Sjoukje van der Werf for granting permission to use the PRIDE validation cohort and providing support in conducting the systematic search, respectively.




  • Contributors BCvM, SDR and RGP designed the study. AH and EHiV conducted the systematic search supervised by CKW and RGP. CKW and RGP externally validated selected clinical prediction models. BLvL provided data for external validation. CKW, RGP, AH and EHiV wrote the manuscript. BCvM, BLvL and SDR critically revised the manuscript. CKW submitted the manuscript. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. RGP accepts full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.