Towards a decision support tool for intensive care discharge: machine learning algorithm development using electronic healthcare data from MIMIC-III and Bristol, UK
  1. Christopher J McWilliams1,
  2. Daniel J Lawson2,
  3. Raul Santos-Rodriguez1,
  4. Iain D Gilchrist3,
  5. Alan Champneys1,
  6. Timothy H Gould4,
  7. Mathew JC Thomas4,
  8. Christopher P Bourdeaux4
  1. Engineering Mathematics, University of Bristol, Bristol, UK
  2. Integrative Epidemiology Unit, Population Health Sciences, University of Bristol, Bristol, UK
  3. Department of Experimental Psychology, University of Bristol, Bristol, UK
  4. Intensive Care Unit, University Hospitals Bristol NHS Foundation Trust, Bristol, UK
  Correspondence to Dr Christopher J McWilliams; chris.mcwilliams{at}bristol.ac.uk

Abstract

Objective The primary objective is to develop an automated method for detecting patients that are ready for discharge from intensive care.

Design We used two datasets of routinely collected patient data to test and improve on a set of previously proposed discharge criteria.

Setting Bristol Royal Infirmary general intensive care unit (GICU).

Patients Two cohorts derived from historical datasets: 1870 intensive care patients from GICU in Bristol, and 7592 from Medical Information Mart for Intensive Care (MIMIC)-III.

Results In both cohorts few successfully discharged patients met all of the discharge criteria. Both a random forest and a logistic classifier, trained using multiple-source cross-validation, demonstrated improved performance over the original criteria and generalised well between the cohorts. The classifiers showed good agreement on which features were most predictive of readiness-for-discharge, and these were generally consistent with clinical experience. By weighting the discharge criteria according to feature importance from the logistic model we showed improved performance over the original criteria, while retaining good interpretability.

Conclusions Our findings indicate the feasibility of the proposed approach to ready-for-discharge classification, which could complement other risk models of specific adverse outcomes in a future decision support system. Avenues for improvement to produce a clinically useful tool are identified.

  • clinical audit
  • health informatics
  • information technology

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.


Strengths and limitations of this study

  • Training data from multiple source domains is leveraged to produce general classifiers.

  • The restrictive feature representation tested could be expanded to better exploit the richness of available data and boost performance.

  • Our approach has the potential to streamline the discharge process in cases where patient physiology makes them clear candidates for a de-escalation of care.

  • High-risk patients would require additional levels of decision support to facilitate complex discharge planning.

Introduction 

Demand for intensive care unit (ICU) beds is rising at a time when the resource is constrained.1 In order to optimise the allocation of this resource, patients should be discharged from the ICU as soon as they no longer require the specialist input provided there. The reduced ICU capacity caused by discharge delay can result in the delayed admission of patients requiring critical care.2 3 Furthermore, patients remaining in the ICU after they are medically fit to leave are at risk of iatrogenic harm and may experience detrimental effects on physical rehabilitation and psychosocial well-being.4

The identification of individuals that are ready to leave ICU is a key component of patient flow through the hospital. At present this identification is a manual process, relying on physicians reviewing patients on a ward round at a standard point in time. There is a lack of formal guidance to inform discharge readiness and as such the process is sensitive to both the decision-making heuristics of individual clinicians and structural factors within the hospital.5 A number of studies have looked to address this problem by attempting to standardise the discharge process.

In a scoping review of these studies Stelfox et al 6 noted that, while a range of tools have been developed to characterise discharge readiness, most studies have been single-centre and have not conducted comparative evaluations of different tools.

Increasingly, ICUs are using clinical information systems (CISs) to collect, store and display physiological data. The availability of such routinely collected patient data presents the opportunity to apply methods from data science, with the potential to transform healthcare in a number of ways.7 8 Two particular avenues for development are the automation of simple tasks9 and the implementation of decision support systems,10 both of which could reduce the cognitive load of clinicians and free up scarce resources for tasks that require human expertise. This work considers the ICU discharge process, which has accessible data from routine collection and requires a simple but important binary decision that could benefit from an evidence-based approach. Indeed, several statistical models have recently been developed to predict the risk of adverse events following intensive care discharge.11–15 Such risk models are invaluable tools for clinical decision making and, in the context of ICU discharge, can provide information with which to plan complex de-escalations of care. For example, patients deemed to be at high risk of readmission may benefit from continued close monitoring,16 since early detection of deterioration is a strong predictor of outcome.17 18

In our previous work on the psychology of clinical decision-making we have demonstrated the effectiveness of simple ‘nudge’ based interventions in changing clinical practice.19–21 Building on this foundation, we were motivated to develop a classifier to automatically flag patients who appear physiologically fit for discharge. The intention is that such a screening tool could streamline morning ward rounds by allowing staff to focus their attention on the patients most likely to be dischargeable. The tool could also prompt clinicians to consider discharge decisions at other times of day, outside of normal rounds. In 2003 Knight proposed a set of nurse-led discharge (NLD) criteria22 with a similar aim: to expedite discharge from a high-dependency unit by allowing nurses to discharge patients who were clearly well enough to leave. These criteria represent a general and highly conservative set of constraints on physiology that characterise a patient as suitable for care on an acute ward (level 1 care). High-risk patients are unlikely to meet these criteria, but may still be dischargeable by a consultant. In this study we use routinely collected patient data to retrospectively evaluate Knight’s criteria, and then improve on their performance using machine learning methods. To this end we study two historical cohorts. One cohort consists of patients treated on the general intensive care unit (GICU) at the Bristol Royal Infirmary between January 2015 and April 2017, while the second consists of patients selected from the Medical Information Mart for Intensive Care (MIMIC)-III database23 (see Methods section for details).

Methods

Discharge criteria

The NLD criteria proposed by Knight22 consist of a set of constraints on various routinely collected vital signs and laboratory results. If a patient meets all the criteria for a period of at least 4 hours, Knight states that they may be safely discharged by a nurse. The motivation behind developing these criteria was to facilitate discharge by nurses in cases where the decision was clear, and there is some evidence of improved bed allocation when using such a nurse-led system.22 24 25 In order to test the NLD criteria on historical patient data we codified the constraints (see online supplementary file section A) into 15 binary tests, which are defined in table 1. For criteria that were not assigned numeric values in the original publication (B1-4, central nervous system) we used the ‘normal’ bounds as defined in our CIS.

Supplemental material

Table 1

Codified version of the discharge criteria for application to electronic health record data. Here the 15 criteria have been grouped into intuitive subsets and each assigned a test ID (‘R0’ to ‘B4’). According to the original specification, if all 15 criteria are met for a period of at least 4 hours the patient can be safely discharged
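
As an illustration of this codification, the minimal Python sketch below evaluates criteria as binary tests over extracted feature values. The test IDs and thresholds shown are illustrative placeholders, not the exact bounds given in table 1.

```python
# Sketch of the codified criteria as binary tests over a patient's extracted
# feature values. Thresholds and test IDs here are illustrative placeholders.
def evaluate_nld_criteria(features: dict) -> dict:
    """Return a pass/fail flag for each codified criterion."""
    return {
        "R0": features["resp_min"] >= 10 and features["resp_max"] <= 25,  # respiratory rate
        "C0": 50 <= features["hr_min"] and features["hr_max"] <= 100,     # heart rate
        # ...the remaining criteria follow the same pattern...
        "B4": features["bun"] <= 25,                                      # blood urea nitrogen
    }

def rfd_by_original_specification(features: dict) -> bool:
    """Original rule: a patient is RFD only if all 15 criteria are met."""
    return all(evaluate_nld_criteria(features).values())
```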

Cohort selection

Subjects for this study were selected from two distinct historical data sources to form two patient cohorts. The inclusion criteria are detailed in online supplementary file section B. The first data source consists of the routinely collected data from 1870 patients treated on the GICU at the Bristol Royal Infirmary. We refer to the cohort selected from this dataset as GICU. The second data source was derived from the MIMIC-III database,23 from which we selected patients who were admitted to medical or surgical intensive care since this approximates the patient type in GICU. We restricted our analysis to the ‘Metavision’ subset of MIMIC-III, since the labelling of the variables required to evaluate the NLD criteria was found to be more consistent in this portion of the database. Furthermore, we selected only the first intensive care stay of any given hospital admission, and only those stays for which there was a recorded callout (ready-for-discharge [RFD]) time. Following these criteria, we arrived at a subset of 7592 patients from MIMIC-III, forming the cohort we refer to hereafter as MIMIC.
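
The inclusion logic for the MIMIC cohort can be sketched as a pandas filter. The flat-file layout and column names below are assumptions for illustration, not the actual MIMIC-III schema.

```python
import pandas as pd

# Hypothetical flat table of ICU stays; column names are illustrative only.
stays = pd.read_csv("icu_stays.csv", parse_dates=["intime", "callout_time"])

mimic_cohort = stays[
    (stays["dbsource"] == "metavision")            # Metavision subset only
    & (stays["is_first_icu_stay"])                 # first ICU stay of the admission
    & (stays["callout_time"].notna())              # recorded callout time required
    & (stays["careunit"].isin(["MICU", "SICU"]))   # medical or surgical intensive care
]
```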

The use of two cohorts was motivated by two concerns. First, by including the MIMIC cohort, we significantly increased the volume of data available for training classifier algorithms. Second, the use of two cohorts allowed us to study the generalisation of our results between different patient populations under different healthcare systems.

Readiness-for-discharge

The key to testing and improving on the discharge criteria was the ability to identify, from the historical data, patients who were RFD and not-ready-for-discharge (NRFD). Whereas previous models have looked to predict the occurrence of adverse events following ICU discharge,12 15 we wanted to learn to classify those patients who appear physiologically fit to leave the unit. These are subtly different tasks. The former requires the identification of patients at risk of negative outcomes from those who have already been declared fit for discharge, while the latter looks to identify, from a sample of ICU patients, those who are no longer in need of critical care. Clearly the latter is an easier task. In order to train a classifier for this task it was necessary to define instances of the positive (RFD) and negative (NRFD) classes. Both datasets (GICU and MIMIC) contain a callout for each patient, which marks the time at which a patient was declared clinically ready to leave the ICU. A patient was defined as RFD at their time of callout, provided they had a positive outcome after leaving ICU. Conversely, patients with a negative outcome were defined as NRFD at their time of callout. A positive outcome was defined as the patient leaving hospital alive without readmission to ICU. A negative outcome was defined as either readmission to ICU during the same hospital admission, or in-hospital mortality after discharge from ICU. We note that it is more conventional to use readmission (or mortality) within 48 hours to define a negative outcome related to ICU care.12 26 However, this practice is not universal,27 and in our case it was not possible because of limitations in the data available locally.
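
The labelling rule reduces to a few lines of code, as in this sketch (field names are hypothetical):

```python
# Sketch of the class labelling at callout time; field names are hypothetical.
def label_at_callout(stay: dict) -> str:
    """RFD if the patient left hospital alive without ICU readmission;
    NRFD if they were readmitted during the same hospital stay or died
    in hospital after ICU discharge."""
    negative_outcome = stay["icu_readmission"] or stay["inhospital_mortality"]
    return "NRFD" if negative_outcome else "RFD"
```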

Given the low rates of negative outcome following callout in both MIMIC and GICU (see table 2), we generated further instances of the negative class in order to balance the class sizes. Conceptually this is equivalent to providing more instances from which the classifier can learn the physiological characteristics of patients requiring ongoing critical care. To do this we sampled patients between 3 and 8 days prior to their callout (see online supplementary file section B: figures 1–3), under the assumption that patients were NRFD at this point in time, regardless of their eventual outcome state (positive or negative). Patients within the first 24 hours of their ICU stay were omitted from this sample. Full details of the sampling procedure are given in online supplementary file section B.
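
A minimal sketch of this sampling scheme, with times measured in hours since ICU admission (the helper below is hypothetical):

```python
import random

# Sketch of the negative-class augmentation: sample each stay at a random
# point 3-8 days before callout, excluding the first 24 hours of the stay.
def sample_nrfd_time(callout_h: float) -> float | None:
    earliest = max(24.0, callout_h - 8 * 24)   # not within first 24 h of the stay
    latest = callout_h - 3 * 24                # at least 3 days before callout
    if earliest > latest:
        return None                            # stay too short for a valid sample
    return random.uniform(earliest, latest)
```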

Table 2

Patient characteristics for the two cohorts. Discharge delay defined as length of time between callout and discharge from intensive care unit (ICU). Readmission to ICU defined as readmission during same hospital stay. Negative outcome is in-hospital mortality and/or readmission

Feature extraction

We used the same feature set to evaluate the NLD criteria and to train machine learning classifiers. We constructed either one or two features corresponding to each of the NLD criteria, depending on the criteria in question and on data availability. For example, the features ‘resp min’ and ‘resp max’ were used to test the criterion R4, whereas the single feature ‘bun’ was used to test B4. Where possible the feature values were calculated from a 4 hours sample window, as specified by the original NLD criteria. In the cases where no data was available during the 4 hours window, an extended 36 hours window was used. This extended window was mainly relevant for infrequently measured laboratory test results (see online supplementary file section C: table 1). Full details and justification of the feature extraction procedure are provided in online supplementary file section C. Since this feature set is somewhat restrictive, consisting of 18 physiological features, we also defined an extended feature set that included the following extra features: age, sex, body mass index (BMI) and hours since admission.
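
The window-with-fallback logic might be sketched as follows, assuming each variable's measurements are available as (time, value) pairs for one patient:

```python
# Sketch of the windowed feature extraction: take values from the 4-hour
# window preceding the sample time, and fall back to a 36-hour window when
# nothing was recorded (typical for infrequent laboratory results).
def extract_min_max(measurements, sample_time_h):
    for window_h in (4, 36):
        values = [v for t, v in measurements
                  if sample_time_h - window_h <= t <= sample_time_h]
        if values:
            return min(values), max(values)
    return None, None  # left missing; handled later by imputation
```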

To produce the results presented in the main text, missing feature values were imputed using k-nearest neighbour imputation.28 Full details of the imputation procedure are given in online supplementary file section D, along with a complete case analysis that addresses the sensitivity of our results to this imputation strategy. When training and testing the machine learning classifiers, features were standardised by subtracting the mean and dividing by the SD. The feature matrices for the imputed and complete case data sets are visualised using the t-Distributed Stochastic Neighbour Embedding (t-SNE) algorithm29 in online supplementary file section D: figures 4 and 5.
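
With scikit-learn, this preprocessing amounts to chaining an imputer and a scaler; the toy matrix and the choice of k below are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# k-nearest-neighbour imputation of missing feature values, followed by
# standardisation to zero mean and unit SD. Toy data for illustration.
X = np.array([[18.0, np.nan, 7.40],
              [22.0, 95.0, np.nan],
              [16.0, 98.0, 7.38],
              [20.0, 92.0, 7.45]])

X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)
```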

Analysis of NLD criteria

Knight originally specified that all 15 criteria must be met in order to allow safe discharge by a nurse.22 Following this specification, we evaluated the criteria for both MIMIC and GICU, determining which instances were classified as RFD and NRFD, and comparing these results to ground truth. We then further investigated the performance of the NLD criteria as a classification system by relaxing the constraint that all 15 tests must be passed in order to make an RFD classification. Instead we used the NLD criteria to produce probability estimates of being RFD, by summing the number of tests passed and dividing by 15 to produce a normalised output between 0 and 1. In this formulation each of the 15 criteria contributes equally to the RFD probability. Subsequently we weighted each of the criteria according to a measure of feature importance (see below) in order to improve their predictive performance. Using the probability outputs, it was possible to evaluate the performance of the NLD criteria in the same way as the machine learning classifiers described below.
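
Both the equally weighted score and its weighted variant reduce to a normalised dot product, as in the sketch below (inputs are hypothetical):

```python
import numpy as np

# Probability-style output from the criteria: the fraction of tests passed,
# optionally weighted (eg, by the logistic model's feature importances).
def nld_score(passed: np.ndarray, weights: np.ndarray | None = None) -> float:
    """`passed` is a boolean vector over the 15 criteria."""
    if weights is None:
        weights = np.ones(len(passed))   # original formulation: equal weights
    return float(np.dot(passed, weights) / weights.sum())
```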

Machine learning classifiers

To improve on the performance of the NLD criteria, we trained and tested two machine learning classifiers: a random forest (RF)30 and a logistic classifier (LC).31 These two algorithms were chosen for their simplicity of implementation and the ease of interpreting their predictive output. The training methodology was intended to produce classifiers that make good use of training data drawn from multiple source domains, while generalising well to new patient populations. As such we employed multiple-source cross-validation.32 A single iteration of this procedure is as follows. Each source dataset is split into train and test data. For GICU, 30% of the data is held out for testing. For MIMIC, an equally sized test set is held out (~10%). Multiple-source cross-validation is then used to optimise the hyper-parameters on the training data (see online supplementary file section E) with two folds, one derived entirely from MIMIC and the other derived entirely from GICU. The optimised classifier is then retrained on the full training data (MIMIC and GICU), and its performance is tested on the held-out test data. This procedure is repeated over 100 random train-test splits to produce estimates of the mean and SD of classifier performance.
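
One iteration of this procedure could be sketched with scikit-learn as follows, assuming feature matrices X_gicu, y_gicu, X_mimic and y_mimic are already loaded; the hyper-parameter grid is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, PredefinedSplit,
                                     train_test_split)

# Hold out 30% of GICU, and an equally sized test set from MIMIC.
X_g_tr, X_g_te, y_g_tr, y_g_te = train_test_split(X_gicu, y_gicu, test_size=0.3)
X_m_tr, X_m_te, y_m_tr, y_m_te = train_test_split(X_mimic, y_mimic,
                                                  test_size=len(X_g_te))

# Two cross-validation folds, one per source (fold 0 = GICU, fold 1 = MIMIC),
# so each fold validates on one source while training on the other.
X_train = np.vstack([X_g_tr, X_m_tr])
y_train = np.concatenate([y_g_tr, y_m_tr])
fold_ids = np.concatenate([np.zeros(len(X_g_tr)), np.ones(len(X_m_tr))])

search = GridSearchCV(RandomForestClassifier(),
                      param_grid={"n_estimators": [100, 500]},
                      cv=PredefinedSplit(fold_ids),
                      scoring="roc_auc")      # refit=True (default) retrains the
search.fit(X_train, y_train)                  # optimised model on all training data
```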

In order to determine the feature importances for each classifier, and therefore understand which features were most predictive of readiness-for-discharge, we calculated the permutation feature importance.33 In short, this procedure involves iteratively randomly permuting the values of each feature, and calculating the average loss of classifier performance (we used the area under the receiver operating characteristic [ROC] curve) resulting from this feature randomisation. The overall performance of our trained classifiers, and of the NLD criteria, was characterised by producing ROC and precision-recall (PRC) curves,34 and by evaluating a suite of common performance metrics.
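
A sketch of this computation with scikit-learn, continuing the variables from the previous sketch and scoring by AUROC:

```python
import numpy as np
from sklearn.inspection import permutation_importance

# Permutation feature importance on the held-out test data, scored by AUROC.
X_test = np.vstack([X_g_te, X_m_te])
y_test = np.concatenate([y_g_te, y_m_te])

result = permutation_importance(search.best_estimator_, X_test, y_test,
                                scoring="roc_auc", n_repeats=10)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"(SD {result.importances_std[i]:.3f})")
```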

Results

The original specification of the NLD criteria proved to be highly conservative as expected, producing low false positive and true positive rates for both cohorts (online supplementary section D: tables 2–5). The true positive rates for MIMIC and GICU were 1.1% and 6.6%, respectively. Varying the threshold number of criteria required to make an RFD classification allowed us to produce ROC and PRC curves for the NLD criteria. These curves are illustrated in figure 1 for a single train-test data split. On this data split the NLD criteria obtained precisions of greater than 0.7 up to a recall of 0.6 for both cohorts. The RF using the extended feature set showed large performance gains on this data split, with precisions of greater than 0.8 up to a recall of 0.8.

Figure 1

Performance of the nurse-led discharge criteria and random forest with extended feature set (RFext) evaluated on held-out data for a single train-test split. Left: receiver operating characteristic curves with associated area-under-curve scores. Right: precision-recall curves. AUC, area-under-curve; GICU, general intensive care unit; NLD, nurse-led discharge; RF, random forest.

In general, the machine learning classifiers outperformed the NLD criteria. These performances, averaged over all 100 train-test data splits, are summarised in table 3. The RF performed better than the LC on MIMIC according to all performance metrics, when using both the original and extended feature sets. On GICU the RF and LC produced similar scores. For this cohort, the LC with the original feature set narrowly outperformed the RF according to all metrics, but led on only three metrics (area under the receiver operating characteristic curve [AUROC], partial AUROC [pAUROC] and Brier score) when the extended feature set was used. Overall, when training and testing on the imputed dataset, the RF with the extended feature set showed the best performance. The complete case analysis (online supplementary file section D: table 6) produced qualitatively similar results, but there was a clearer distinction between the classifiers, with the LC performing better on GICU and the RF performing better on MIMIC. Average receiver operating characteristics are summarised for all classifiers in online supplementary file section D: tables 7 and 8.

Table 3

Performance metrics for the various classification systems

Broadly the two classifiers agreed as to which features were important in classifying patients as RFD (table 4). Eight of the features ranked in the top 10 by the LC were also ranked in the top 10 by the RF, and the Spearman’s rank correlation coefficient between the feature rankings was 0.761 (p=0.00002). Both classifiers ranked gcs_min and airway as the two most important features by a significant margin. There was little change in these feature rankings under the complete case analysis (online supplementary file section D: table 9). We attempted to improve the classification performance of the NLD criteria by weighting each of the criteria according to the corresponding feature importance given by the logistic classifier. This weighting produced small performance gains over the original criteria (see NLDweighted in table 3), but not enough to warrant their use instead of a machine learning classifier in a clinical setting.

Table 4

Feature importances given by the random forest (RF) and logistic classifier (LC), evaluated over 100 train-test data splits. Importance values are given as: mean (SD). Features are ranked according to mean importance value, and the table is ordered according to the ranking given by the LC

Discussion

Identifying which patients are suitable for ICU discharge is complex.1 6 Delayed and out-of-hours discharges are associated with increased mortality,35 and patients in ICU who could be managed on the ward put an unnecessary strain on resources. The determination of RFD status is influenced by many unmeasured factors, such as ICU census,25 and this leads to unwarranted variation in clinical decision making. Furthermore, the decision to declare someone fit for discharge is based on the judgement of individual clinicians and is likely to be given a lower priority than decisions about treatment options for patients who are more unwell.

In this study we have put forward the concept of a decision support tool that would prompt clinicians to consider discharging a patient when they appear physiologically RFD. Such a prompt would occur by means of a dashboard notification, or ‘nudge’.20 It would need to be sufficiently sensitive to recommend high numbers of potential discharges, while providing enough specificity to retain clinician engagement. Here we have detailed the development of two machine learning algorithms intended for such a purpose, and demonstrated their performance improvement over a previously published set of criteria that were originally aimed at discharge automation.22 At a threshold specificity of 0.7, the algorithm with the best overall performance achieved mean sensitivities of 0.8909 and 0.9049 for the GICU and MIMIC cohorts, respectively (online supplementary file section D: table 7). This represents a relatively high rate of false positives and suggests that further development is required before a tool based on this approach could be deployed clinically.

The features identified as most important by the classifiers were clinically meaningful. Clinicians will recognise that coma score, respiratory function and renal function are strongly related to successful ICU discharge. Under the LC certain features, such as body temperature and creatinine, appeared to be less important than we expected. This may be, in part, a consequence of patient heterogeneity on the GICU.36 For example, body temperature may be predictive for patients with infection, yet much of this predictive power is lost in our attempt to fit a general model for the whole ICU population. Similarly, although creatinine levels are indicative of renal function, persistently high creatinine should not count against discharge readiness for patients with chronic renal failure. The ability of the RF to better model such non-linear feature dependencies may explain why it gave a higher rank to these features.

In general, the performance of both classifiers would benefit from an expanded feature representation. The feature set we used was chosen to be directly analogous to the features tested by the discharge criteria. This feature set is restrictive, having originally been designed to be manually recorded by nurses using paper charts. We demonstrated that adding four extra features (age, sex, BMI and hours since admission) improved classification performance. However, machine learning methods have the power to further exploit the richness of the data held in electronic charting systems by including more physiological parameters and learning the most predictive feature representation of these parameters.37 One barrier to this approach is the challenge of harmonising the data, especially when combining data from different sources. This is one reason that we did not include diagnosis codes or severity-of-illness scores in this study, although they have previously been shown to be predictive of adverse events following discharge.11 12 During a patient’s stay in ICU, many of their physiological parameters are controlled by clinical intervention, and their expected physiological state depends on their medical history (see, eg, guidelines on acceptable levels of haemoglobin in different patient types).38 Therefore, conditioning features on medical interventions and applying methods for patient subtyping36 39 are two improvements that we expect could significantly boost performance. Also, although the complete case analysis did not qualitatively alter our results, the development of a more sophisticated multiple-imputation strategy40 would likely improve performance by making best use of the available training data and exploiting the value in missingness.41

A range of different tools and methods have previously been proposed with the aim of improving ICU discharge practice. These range from criteria to evaluate discharge readiness22 42 to guidelines for discharge planning and education.6 Additionally, a number of risk models have been developed to predict adverse outcomes following ICU discharge.11–13 15 43 In particular, Badawi and Breslow demonstrated that mortality and readmission should be modelled independently as separate outcomes.12 Clearly a comparative evaluation of the existing tools is required in a clinical setting. We argue that a future decision support system for discharge planning should draw elements from all these methods. A screening algorithm, such as the one we have outlined here, could notify clinicians of dischargeable patients in cases where the decision is clear. Decisions about high-risk patients, which are frequently required, would benefit from an extra level of decision support, such as individual predictions of mortality and readmission risk.12 The increasing availability of intensive care research datasets44 45 is sure to improve the performance and generality of such models, particularly as methods from transfer learning are applied.15 Ultimately the benefit from these models comes from the manner in which they are deployed. We have shown in previous work that subtle changes to the presentation of information can have a significant impact on clinical decision-making.20 The aggregate effect of the small improvements produced by such approaches could be widely beneficial.46 We suggest that the proposed decision support system would maximise engagement by addressing issues of model interpretability,47 48 and could leverage clinical expertise by learning online with a human-in-the-loop.49

Conclusion

This work outlines a framework for the use of machine learning algorithms to identify patients that are physiologically fit for discharge from the ICU. A decision support tool based on these methods could contribute to the solution of this significant clinical and operational problem by streamlining the discharge process and reducing unnecessary ICU stay. We have identified a number of improvements that would be required before the deployment and testing of such a tool in a clinical setting, and highlighted how the tool would benefit from the inclusion of multiple complementary modelling frameworks. As more patient data becomes available in the wider hospital setting there is extensive scope to use data-driven methods, such as the one presented here, to improve patient flow through hospitals.

Acknowledgments

We would like to thank Graeme Palmer, Amy Weaver and Russell McDonald-Bell for their support in accessing and understanding the GICU data.


Footnotes

  • Contributors CJM is the main author and conducted the data processing and analysis. RS-R and DJL made important technical and methodological contributions. IDG, AC and CPB drove the study concept and design. The clinical expertise of THG, MJCT and CPB informed all stages of the project, in particular study design and interpretation of results. All authors contributed to writing the manuscript and approved the final version.

  • Funding This work was supported in part by EurValve (Personalised Decision Support for Heart Valve Disease), Project Number: H2020 PHC-30-2015, 689617. CJM was funded by the Elizabeth Blackwell Institute Catalyst Fund. DJL is funded by Wellcome Trust and Royal Society Grant Number WT104125MA.

  • Competing interests None declared.

  • Ethics approval CAG guidelines followed and study protocol presented to University Hospitals Bristol NHS Foundation Trust institutional review board.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement Feature matrices will be made available on Dryad. Python code for analysis and processing is available on request from the corresponding author.

  • Patient consent for publication Not required.