Objectives To develop predictive models for blood culture (BC) outcomes in an emergency department (ED) setting.
Design Retrospective observational study.
Setting ED of a large teaching hospital in the Netherlands between 1 September 2018 and 24 June 2020.
Participants Adult patients from whom BCs were collected in the ED. Data on demographics, vital signs, medications administered in the ED, and laboratory and radiology results were extracted from the electronic health record if they were available by the end of the ED visit.
Main outcome measures The primary outcome was the performance of two models (logistic regression and gradient boosted trees) to predict bacteraemia in ED patients, defined as at least one true positive BC collected at the ED.
Results In 4885 out of 51 399 ED visits (9.5%), BCs were collected. In 598/4885 (12.2%) visits, at least one of the BCs was true positive. Both a gradient boosted tree model and a logistic regression model showed good performance in predicting BC results with area under curve of the receiver operating characteristics of 0.77 (95% CI 0.73 to 0.82) and 0.78 (95% CI 0.73 to 0.82) in the test sets, respectively. In the gradient boosted tree model, the optimal threshold would predict 69% of BCs in the test set to be negative, with a negative predictive value of over 94%.
Conclusions Both models can accurately identify patients with low risk of bacteraemia at the ED in this single-centre setting and may be useful to reduce unnecessary BCs and associated healthcare costs. Further studies are necessary for validation and to investigate the potential clinical benefits and possible risks after implementation.
- diagnostic microbiology
- accident & emergency medicine
- internal medicine
Data availability statement
Data are available on reasonable request. The data that support the findings of this study are available from the corresponding author on a reasonable request and when allowed by local privacy regulations.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
These models are based on routinely collected clinical data that are available at the end of a visit at the emergency department and are therefore applicable to implement in clinical practice.
Free-text data, such as physician and nurse reports, could not be used due to privacy concerns.
These models should not be used in patients at high risk for bloodstream infections caused by pathogens that are usually reported as contaminants, such as with central line associated infections.
This is a single-centre study and further studies on validation and implementation are necessary to investigate possible risks and likely benefits of these models.
Over 20% of adult emergency department (ED) visits occur due to serious infections.1 Current diagnostic modalities cannot sufficiently distinguish between bacterial and non-bacterial disease during an early stage of a diagnostic workup, for instance in case of a possible bacteraemia (bloodstream infection).2 However, timely distinction between bacterial and non-bacterial disease can reduce unnecessary diagnostic tests and treatment with antibiotics. In case of a bacteraemia, blood cultures (BCs) are the gold-standard test. Unfortunately, turnaround times of BC results of 24–72 hours make these cultures unhelpful for timely diagnosis of bacterial infections at the ED. Accurate and early identification of patients with a high or low risk of bacteraemia may be a first step to help distinguish bacterial from non-bacterial disease early.
Bacteraemia is associated with high morbidity and mortality, which makes missing a possible bacteraemia very harmful.3 Therefore, physicians order BCs frequently and the overall BC yields are low.2 Around 11%–15% of collected BCs are positive and studies show that up to half of those are false positives through contamination.4–6 These contaminated BCs can also lead to unnecessary downstream diagnostics, antibiotic overuse and increased hospital length of stay.7–9 Currently, we are unable to recognise patients with low risk of bacteraemia, in which we could safely withhold BC testing and even antibiotics.
Machine learning already has a significant impact on healthcare. Machine learning models can use many data points from large numbers of patients to detect subtle patterns that may go unnoticed by healthcare professionals. These insights may support the swift assessment of a patient and selection of the appropriate diagnostic and treatment strategies. Complex situations in which multiple physiological mechanisms interact are well suited to machine learning decision support.10 The diagnostic workup of suspected bacterial infections is such an area.
In this paper, we aim to create predictive models for BC outcomes in the ED setting which may help reduce unnecessary BCs and provide physicians with an additional tool to help decide whether or not antibiotic treatment is needed. We specifically focus on creating a machine learning pipeline that can be easily adapted and available in many settings.
We performed a retrospective observational study on data from the electronic health records (EHR) of Amsterdam UMC, location VU University Medical Center, between 1 September 2018 and 24 June 2020. The VU University Medical Center is a large teaching hospital with an estimated 28 000 ED presentations annually. The study adhered to the ‘transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)’.11
We included all adult patients who presented to the ED and in whom at least one BC was taken during their ED stay because a bacterial infection was suspected on clinical grounds. We included patients of all medical specialties. Whenever a patient presented to the ED multiple times during the study period, each encounter was classified as a unique visit.
All data that were available under local privacy regulations were extracted from the EHR. The data included demographic information, vital signs, laboratory results and information about imaging procedures and administered medications in the ED. Data on comorbidities or medication usage at home were not available. We only used data that would be available before the end of the ED visit, which is the time when the prediction can potentially have clinical consequences on the use of BCs and initiation of antibiotic therapy. The data extracted from the EHR were further preprocessed for predictive modelling. Details about preprocessing are described in online supplemental e-methods and e-tables 1–4.
We aimed to predict bacteraemia, which was defined as at least one positive BC with a pathogenic microorganism collected during the ED visit.
AWB and MS mapped all microorganisms to be pathogens or contaminants based on previous literature under supervision of WJW.2 4 12 13 Online supplemental e-table 5 lists all organisms that we classified as contaminants. Then, we assigned the most important result to a specific BC set (prioritising positive over contamination over negative). Afterwards, the combination of all BC sets in a unique ED visit was mapped to represent a visit with growth of a clinically significant pathogen in at least one BC set (positive) or a visit with only negative or contaminated cultures (negative).
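The two-step mapping described above (isolates to BC set, BC sets to visit) can be sketched as follows. This is a minimal illustration and not the study code: the function names and the two-organism contaminant set are hypothetical stand-ins (the actual list is in online supplemental e-table 5), and an empty list is assumed to represent a BC set without growth.

```python
# Priority when one BC set yields several results:
# positive (pathogen) over contamination over negative.
PRIORITY = {"negative": 0, "contamination": 1, "positive": 2}

# Hypothetical, abbreviated contaminant list for illustration only.
CONTAMINANTS = {"Staphylococcus epidermidis", "Micrococcus luteus"}


def classify_set(organisms):
    """Label one BC set by its most important result."""
    labels = [
        "contamination" if organism in CONTAMINANTS else "positive"
        for organism in organisms
    ]
    # An empty list (no growth) falls back to "negative".
    return max(labels, key=PRIORITY.get, default="negative")


def classify_visit(bc_sets):
    """Return 1 if any BC set of the visit grew a pathogen, else 0.

    Visits with only negative or contaminated sets are mapped to 0.
    """
    return int(any(classify_set(s) == "positive" for s in bc_sets))


# One visit with two BC sets: a contaminant in the first, a pathogen in the second.
visit = [["Staphylococcus epidermidis"], ["Escherichia coli"]]
label = classify_visit(visit)
```

Under these assumptions, a visit whose only growth is a listed contaminant is labelled 0, which mirrors the "later mapped to be negative" handling of contaminated cultures.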
Model development and feature selection
We used all variables that were reported in over 10% of the ED visits as features. We also created indicator features for all variables to indicate whether this variable was measured or not. The dataset was randomly split into a training (75%) and test (25%) set for model development. We used median imputation except for some situations where imputation based on domain knowledge was used (see online supplemental e-table 6 for details). Median imputation is a practical and adequate solution for handling missing data in non-linear models. Furthermore, the combination of median imputation and indicator features as we used is also adequate for linear models, especially with data missing not at random.14 Additional standard scaling around the mean was applied. We trained the models on the training set using the full set of features, since the models we used are robust to unimportant features.
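As an illustration of the preprocessing described above, the sketch below applies median imputation, a "was measured" indicator and standard scaling to a single feature. It is not the study pipeline: the function names and the example values are hypothetical, and we assume here that the median, mean and SD are learnt on the training set only and then reused for the test set.

```python
import statistics


def fit_preprocessor(train_column):
    """Learn imputation and scaling parameters from training values (None = missing)."""
    observed = [v for v in train_column if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in train_column]
    mean = statistics.mean(imputed)
    sd = statistics.pstdev(imputed) or 1.0  # guard against constant columns
    return {"median": median, "mean": mean, "sd": sd}


def transform(column, params):
    """Return (scaled_value, measured_indicator) pairs for one feature."""
    rows = []
    for v in column:
        measured = 0 if v is None else 1
        filled = params["median"] if v is None else v
        rows.append(((filled - params["mean"]) / params["sd"], measured))
    return rows


# Hypothetical laboratory values; None marks a visit without a measurement.
train_values = [8.0, None, 12.0, 30.0, None]
test_values = [None, 20.0]

params = fit_preprocessor(train_values)
train_rows = transform(train_values, params)
test_rows = transform(test_values, params)  # reuses the training parameters
```

The indicator column lets a model distinguish an imputed median from a measured value that happens to equal the median, which is the reason the two are combined.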
We used a gradient boosted tree model and a logistic regression model with L1 regularisation. These different model classes are known to be suitable for our type of data, which is limited in size and of mixed type. We used gradient boosted trees as a powerful representative of tree-based models, which can uncover complex feature interdependencies and non-linearities. We also used a simpler logistic regression for comparison, since its coefficients are easier to interpret.
Within the training set, a fivefold cross-validated grid search was performed to find the hyperparameters that optimise the model’s performances. An overview of the pipeline from raw data to model can be found in the e-methods section of online supplemental appendix.
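The idea of a fivefold cross-validated grid search can be sketched in plain Python as below. Several parts are assumptions made for illustration and are not taken from the study: the synthetic data, the candidate values for the regularisation parameter `C`, the use of AUROC as the selection metric, the stratified fold assignment, and the small proximal gradient solver that stands in for a library implementation of L1-regularised logistic regression.

```python
import math
import random


def auroc(y_true, scores):
    """Fraction of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_l1_logistic(X, y, C, lr=0.1, epochs=200):
    """L1-penalised logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    lam = 1.0 / (C * n)  # smaller C means a stronger penalty
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalised
        for j in range(p):
            v = w[j] - lr * grad_w[j]
            # Soft-thresholding step that produces exact zeros.
            w[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return w, b


def predict_scores(model, X):
    w, b = model
    return [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]


def stratified_folds(y, k, seed=0):
    """Deal shuffled indices into k folds with similar class balance."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    idx.sort(key=lambda i: y[i])  # stable sort keeps the shuffle within each class
    return [idx[f::k] for f in range(k)]


def grid_search_cv(X, y, grid, k=5):
    """Return the C with the highest mean held-out AUROC, plus all mean scores."""
    folds = stratified_folds(y, k)
    results = {}
    for C in grid:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            model = fit_l1_logistic([X[i] for i in train], [y[i] for i in train], C)
            scores = predict_scores(model, [X[i] for i in held_out])
            fold_scores.append(auroc([y[i] for i in held_out], scores))
        results[C] = sum(fold_scores) / k
    return max(results, key=results.get), results


# Synthetic training data: one informative feature and one noise feature.
rng = random.Random(42)
y = [i % 2 for i in range(100)]
X = [[(yi - 0.5) * 3 + rng.gauss(0, 1), rng.gauss(0, 1)] for yi in y]

best_C, cv_auroc = grid_search_cv(X, y, grid=[0.01, 0.1, 1.0])
```

With the strongest penalty in this toy grid, all coefficients are shrunk to zero and the held-out AUROC falls to chance level, which is the kind of trade-off the grid search is meant to resolve.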
Modelling was performed using Python V.3.7.9 (Python Software Foundation, http://www.python.org) and the scikit-learn package (V.0.23.1).
The model performances were tested using the area under curve of the receiver operating characteristics (AUROC), together with the area under the precision recall curve (AUPRC) since we had imbalanced outcome classes. We also reported Brier scores and F1-scores during cross-validation as well as on the test set. The model calibration is presented in calibration plots.
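For readers less familiar with these metrics, the sketch below computes each of them from scratch on a small, made-up set of outcomes and predicted probabilities. The data, the function names and the 0.5 cut-off used for the F1-score are illustrative assumptions; the average precision shown is one common step-wise estimate of the AUPRC and assumes there are no tied scores.

```python
def auroc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for p, y in zip(y_prob, y_true) if y == 1]
    neg = [p for p, y in zip(y_prob, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_prob):
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(y_prob, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)


def f1_score(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall at a given probability cut-off."""
    pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for y, yh in zip(y_true, pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, pred) if y == 1 and yh == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Made-up outcomes (1 = true positive BC) and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.05, 0.10, 0.80, 0.30, 0.40, 0.20, 0.60, 0.70]

metrics = {
    "auroc": auroc(y_true, y_prob),
    "auprc": average_precision(y_true, y_prob),
    "brier": brier_score(y_true, y_prob),
    "f1": f1_score(y_true, y_prob),
}
```

AUROC and AUPRC depend only on the ranking of the predictions, whereas the Brier score also penalises poorly calibrated probabilities, which is why the two kinds of metric are reported side by side.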
The model’s output was the probability for the BC to be positive. To provide a clinically meaningful result, we report on two preselected probability thresholds; BCs with a predicted probability above the threshold are predicted to be positive. First, we show performances at the optimal sensitivity-specificity threshold, based on maximisation of the sensitivity-specificity sum or minimisation of the sensitivity-specificity difference.15 These approaches are useful when omission errors (false negatives) should be avoided and provide a diagnostic test with the power to rule out a diagnosis.15 16 Furthermore, we present model performances on a threshold that retains a sensitivity of 90%, which is in line with our goal of using it to identify patients in whom we can safely withhold collecting a BC.
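The two threshold rules can be sketched as follows. The example covers maximisation of the sensitivity-specificity sum and the 90% sensitivity criterion; the data and function names are made up for illustration, and candidate thresholds are assumed to be the observed predicted probabilities, with a prediction counted as positive when it is at or above the threshold.

```python
def sens_spec(y_true, y_prob, threshold):
    """Sensitivity and specificity when p >= threshold is called positive."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        positive = p >= threshold
        if y == 1:
            tp += int(positive)
            fn += int(not positive)
        else:
            fp += int(positive)
            tn += int(not positive)
    return tp / (tp + fn), tn / (tn + fp)


def max_sum_threshold(y_true, y_prob):
    """Threshold that maximises sensitivity + specificity."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: sum(sens_spec(y_true, y_prob, t)))


def threshold_for_sensitivity(y_true, y_prob, target=0.90):
    """Highest threshold whose sensitivity is still at least `target`."""
    for t in sorted(set(y_prob), reverse=True):
        if sens_spec(y_true, y_prob, t)[0] >= target:
            return t
    return min(y_prob)  # fallback; the lowest threshold always gives sensitivity 1


# Made-up outcomes (1 = true positive BC) and predicted probabilities.
y_true = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.05, 0.25, 0.10, 0.15, 0.50, 0.20, 0.70, 0.30, 0.35, 0.80]

t_sum = max_sum_threshold(y_true, y_prob)
t_sens = threshold_for_sensitivity(y_true, y_prob)
```

In this toy example, the sum-maximising threshold accepts one missed positive in exchange for perfect specificity, whereas the sensitivity-preserving threshold is lower and misses none.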
Patient and public involvement
Patients were not involved in setting the research question, design of the study, outcome measures and interpretation of the study.
We identified 51 399 ED visits by 41 280 unique adult patients in the VU University Medical Center between 1 September 2018 and 24 June 2020. One or more BC samples were taken in 4885 (9.5%) of those visits. In 598/4885 (12.2%) of those visits, at least one of the cultures was a true positive. In 254/4885 (5.2%) of the visits, at least one of the cultures was contaminated (later mapped to be negative). Overall, 4074/4885 (83.4%) visits had only truly negative cultures. Table 1 shows the baseline characteristics of the study population stratified by culture outcomes.
The gradient boosted tree model’s AUROC in the cross-validation (training) sets and internal test set were 0.77 (SD=0.03) and 0.77 (95% CI 0.73 to 0.82), respectively. The logistic regression model’s AUROC in the cross-validation and internal test set were 0.75 (SD=0.02) and 0.78 (95% CI 0.73 to 0.82). The AUROCs of both models are shown in figure 1. Table 2 shows the corresponding performance scores. The calibration plots are presented in online supplemental e-figure 1.
Gradient boosted trees
Feature importances for non-linear tree-based models only indicate the magnitude and not the directionality (positive/negative) of the effect. We therefore present the feature contributions using Shapley additive explanations (SHAP) values, as depicted in figure 2.17 These are distributions of local contributions per feature and per data point. Figure 2 shows the 20 most important features that drive predictions in the gradient boosted tree model (see online supplemental e-tables 2–4 for the full lists of features). This model recognises bilirubin values to be the strongest predictor of a positive BC. We see that high (red) bilirubin values are associated with a higher risk of a positive BC (right on the x-axis). Conversely, high (red) potassium levels are associated with a lower risk of a positive BC (left on the x-axis).
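To make the notion of a local, signed contribution per feature concrete, the sketch below computes exact Shapley values for a single prediction by enumerating all feature subsets. This is a simplified illustration and not the tree-specific algorithm used for figure 2: absent features are replaced by a single reference value, and the risk function, its coefficients and the patient values are entirely hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        contributions.append(total)
    return contributions


def toy_risk(z):
    """Hypothetical risk score with one interaction term (not a fitted model)."""
    bilirubin, potassium, age = z
    return (0.05 + 0.002 * bilirubin - 0.01 * potassium
            + 0.001 * age + 0.00002 * bilirubin * age)


patient = [40.0, 4.0, 80.0]    # hypothetical bilirubin, potassium, age
reference = [10.0, 4.0, 60.0]  # hypothetical reference patient

phi = shapley_values(toy_risk, patient, reference)
```

The contributions add up to the difference between the prediction for the patient and the prediction for the reference, and a feature that equals its reference value (potassium here) receives a contribution of zero. A positive sign indicates that the feature pushes this prediction towards a positive BC.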
The 20 features with the largest absolute coefficients in the logistic regression model are presented in figure 3. Age and lymphocyte counts are the strongest predictors. A high age is associated with a higher risk of a positive BC, whereas a high lymphocyte count is associated with a lower risk (see online supplemental e-table 7 for a full list of coefficients). Due to the imputation and the fact that physiological parameters are not strictly independent of each other, no valid estimation of the ORs can be provided.
The models’ sensitivity and specificity depend on the probability threshold that is used to predict a positive or negative BC. Table 3 presents model performances for the optimal sensitivity-specificity threshold and a threshold that retains a sensitivity of 90%. The optimal threshold in the gradient boosted tree model would predict 69% of BCs in the test set to be negative, with a negative predictive value of over 94%. An extensive list of thresholds and corresponding performances in both sets can be found in online supplemental e-tables 8 and 9.
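The two quantities reported for a rule-out threshold, namely the share of BCs predicted to be negative and the negative predictive value among them, can be computed as in the sketch below. The data are made up and do not reproduce the study results; the function name is hypothetical, and a prediction is assumed to be negative when the probability lies below the threshold.

```python
def rule_out_summary(y_true, y_prob, threshold):
    """Return (fraction predicted negative, negative predictive value)."""
    predicted_negative = [y for y, p in zip(y_true, y_prob) if p < threshold]
    fraction_negative = len(predicted_negative) / len(y_true)
    if not predicted_negative:
        return fraction_negative, float("nan")  # NPV undefined without negatives
    npv = predicted_negative.count(0) / len(predicted_negative)
    return fraction_negative, npv


# Made-up outcomes (1 = true positive BC) and predicted probabilities.
y_true = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.05, 0.25, 0.10, 0.15, 0.50, 0.20, 0.70, 0.30, 0.35, 0.80]

fraction_negative, npv = rule_out_summary(y_true, y_prob, threshold=0.50)
```

Raising the threshold increases the share of cultures that could be withheld but lowers the negative predictive value, so the two numbers need to be read together.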
Medication administered in the ED
In coming to the final models, we evaluated the effects of excluding different groups of features, such as medications given in the ED. Excluding all ED medication features led to comparable model performances (see online supplemental e-table 10 for details). When including the ED medication features, almost none provided predictive value, except for the administration of antibiotics (see online supplemental e-figures 2 and 3). Because this event may be associated with the physician’s suspicion of bacteraemia, we decided to exclude ED medication features in order to retain a model that can augment physician decision making instead of depending on it.
We present two models that aim to predict the outcome of a BC that is drawn during an ED visit. Both a gradient boosted tree model and a logistic regression model show comparably good performance in predicting BC results with AUROCs of 0.77 (95% CI 0.73 to 0.82) and 0.78 (95% CI 0.73 to 0.82) in the test sets, respectively. In a population where the physician has made the decision to draw a BC, the models can identify patients in the ED with low risk for bacteraemia and can be useful to reduce unnecessary BCs and provide physician decision support on the necessity of antibiotic therapy.
Many studies have aimed to identify factors associated with positive BCs or predict BC outcomes. A 2012 systematic review reported on 35 studies that evaluated the performance of clinical variables to detect bacteraemia.2 Those clinical variables alone seemed insufficient to detect bacteraemia and further studies on this subject have focused on more advanced predictive models to detect bacteraemia. A 2015 systematic review presented fifteen machine learning models that predicted BC outcomes.18 An additional few were published since.19–22
The various studies on this subject have been conducted in different settings, where the reasons for drawing BCs vary. We focused on the ED setting, as the legacy of a probable diagnosis of infection at the ED greatly influences decision-making throughout the hospital stay, especially with regards to antibiotic treatment.23 Based on the 2015 systematic review, only two other studies have been carried out fully in an ED setting.18 24 25 Those models showed AUROCs of 0.75 and 0.74 in the test sets. Of those two studies, the one by Shapiro et al24 has had the most influence on clinical practice, as the Shapiro decision rule has been studied and used in hospitals around the world.5 24 26 Our algorithms, with AUROCs of 0.77 and 0.78, perform at least as well as the Shapiro model and are only based on regularly captured EHR data. The major difference between previous models and our study is that those earlier models were trained on data that were prospectively collected by researchers. This manual data collection resulted in few missing values, with 97.6% of laboratory data being available.24 This will not occur in clinical practice and may lead to dramatic losses in predictive performance in implementation studies, when missing values need to be imputed in order to do any prediction. Therefore, these models have less potential for daily use in clinical practice and it will be difficult to implement them successfully.
Another aspect of the manual data collection in earlier studies is that predictors like the suspicion of endocarditis, which was an important predictor of BC outcomes, could be used.24 These are very specific data that will rarely be available in the EHR, which again limits the translation to clinical practice and automation of the prediction within an EHR environment. As we illustrate here, the use of data that are not routinely captured in clinical practice is one of the key reasons why none of these prediction models have been implemented in clinical practice yet.18 In contrast, the overarching approach we used, with a machine learning pipeline that incorporates variables measured in certain percentages of patients, ensures that the use of this pipeline in other hospitals will produce usable models that are slightly adapted to the particular setting of that hospital. Our approach can thus straightforwardly be implemented in various settings in clinical practice, without the need for additional data capture.
Most of the literature on BC predictions focuses on the intensive care unit (ICU) setting. Recent examples are models created by Roimi et al and van Steenkiste et al.20 21 Those models show excellent performances with AUROCs of up to 0.98 in the critical care setting. These models are trained on temporal trends that have occurred over a period of at least 48 hours, in contrast with the short and heterogeneous ED visits during which patients are not constantly monitored and where time-series data are rarely captured. Also, the approaches taken for most ICU models seem to overfit the training data and will likely perform worse in an external validation. This is underscored in the model by Roimi et al, in which the AUROC decreases from 0.92 to 0.60 during external validation.21
The main clinical value of our predictive model lies in the ability to identify patients at low risk of a positive BC, in a population where the physician has decided that a BC draw should be performed. The prediction can be made at the end of the ED visit and can identify patients in whom we can safely withhold BC testing. Even in cases where BCs are already taken, there would be the option to not go through with the analyses, where most of the costs and associated harms are incurred. We showed that we would be able to withhold BC draws or analyses in almost 70% of the population while still retaining a negative predictive value of over 94%.
Our algorithm also has added value with regard to treatment selection, especially in cases with high diagnostic uncertainty at the end of an ED visit. The BC outcome prediction can be used as a decision support tool to decide whether or not antibiotic treatment is needed. Estimated rates of unnecessary antibiotic use at the ED are over 30%, and it has been described as the most preventable cause of antibiotic resistance.27 Predictions of negative BCs can be an additional argument for withholding antibiotic treatment at that point and may help avoid unnecessary courses of empiric broad-spectrum antibiotics that can sometimes be given for several days due to delays in the turnaround time of BC results.28 When a specific infection such as pneumonia is very likely, antibiotic treatment will be initiated regardless of the BC draw. However, in these cases our algorithm can still be used to withhold unnecessary BC testing.
Another clinically relevant aspect of this study is that we were able to show that routine laboratory results are associated with positive BCs. A low lymphocyte count appears to be related to a positive BC. This association has been described in earlier studies, but this variable has not been included in bacteraemia prediction models up until now.29 30 Bilirubin is another notably strong predictor of a positive BC. Elevated bilirubin levels have been observed in patients with sepsis, and it is included in prognostic scores, such as the Sequential Organ Failure Assessment (SOFA) score for sepsis.31 32 The association with positive BCs of other variables such as thrombocyte counts, temperature, blood pressure, heart rate and age is in line with previous studies.2 24 33
The main strength that distinguishes this work from what has been done before is the comprehensive pipeline from raw data to model. The preprocessing and feature engineering phases were conducted in collaboration with a machine learning scale-up company (Pacmed, the Netherlands), which has considerable experience with machine learning in healthcare. The strategy towards the selection of features and algorithms that were used to predict BC outcomes presents a significant improvement over currently accepted methods in the medical literature, since it provides a way to adapt the models to a specific hospital environment and thereby use the strengths of machine learning. Our pipeline used all available data so that the models themselves would decide on the importance of any feature.
With this approach, the models were not limited by the selection of features through current medical knowledge and had the potential to discover unknown associations with bacteraemia. Throughout preprocessing stages, we put emphasis on only using data that would routinely be available at the end of the ED visit, when the final treatment and admission decisions have to be made. This approach facilitates straightforward implementation of the models in clinical practice, without the need for additional data capture. Finally, we compare the results of the more complex gradient boosted tree model with a simpler logistic regression that is easier to understand for physicians, to improve the overall interpretability.
There are several important limitations within this study. First, defining a positive BC is difficult. Our definition of contamination, namely BCs that grew microorganisms that are generally considered contaminants, is in line with previous literature.2 4 12 13 However, we were not able to incorporate clinical characteristics when determining the positivity of the outcome, as is often done in practice. Therefore, it is still possible that samples that were mapped as contamination actually represented a true pathogen according to the operational definition in practice. However, the true positive rate of collected BCs in our population was somewhat higher than those described in previous literature.4 6 33 34 This may be due to conservative mapping of pathogens to likely contaminants. A related limitation is that the model should not be used when a physician wants to detect a clinically relevant bloodstream infection with pathogens that we considered to be contaminants, as with suspected central line-associated bloodstream infections (CLABSI). Our algorithm should be used as additive to the clinical pretest probability of bacteraemia, based on syndromes with a high likelihood of bacteraemia reported in earlier studies.35
Another limitation of this study is that various potentially predictive variables could not be adequately extracted from the EHR system. Comorbidities, medication at home and placement of lines are not well documented within the EHR and this data would not be reliable enough to use in a prediction model. Furthermore, we were not able to use free-text data due to privacy concerns. Therefore, we could not use physician and nurse reports.
A final important limitation is that this study is performed in a single-centre setting and external validation of the models is necessary. Not all variables will be available in each hospital worldwide due to heterogeneity between healthcare systems. A strength of using machine learning algorithms in clinical practice, as opposed to static and general risk scores such as the Shapiro decision rule, is that they can adapt to the local situation and change over time. However, to maintain this advantage, a dedicated effort to use our extensive data pipeline in each individual hospital is necessary in order to adapt to the local situation. This requires a considerable time investment.
Our current study gives rise to several potential follow-up studies. First, external validation is a key aspect to ensure that we find a true signal of positive BCs and that there is little overfitting to confounding factors in our single centre. External validation of the exact algorithm we used in our hospital is hard, since all variables need to be measured in the other centre as well. There are therefore two main options for external validation. We can either use our complete pipeline to create a modified model that we test specifically in a different centre, or simplify the current model by selecting only the most important and generally measured features, so that the exact model can be tested in other settings. Furthermore, we also need to prospectively validate the findings through an integration of the model into the real-time EHR environment before we come to an intervention study with our model. This way, we can observe whether the model performance remains stable over time or whether systematic retraining protocols are needed.
Additionally, there is a need to further explore variables that are highly associated with positive BCs. If we start measuring such factors in clinical practice, then they can easily be incorporated in the algorithms. For example, various studies have shown that procalcitonin can predict BC positivity with good performance.19 36 We would be very interested to see the performance of a model created based on our pipeline in a hospital that regularly measures procalcitonin, as this may improve the performance substantially. Another important step could be to include additional clinical information by using free-text data.
In conclusion, we created two models that predict BC outcomes in the ED with AUROCs of 0.77 (95% CI 0.73 to 0.82) and 0.78 (95% CI 0.73 to 0.82) in this single-centre setting. The models are based on routinely captured clinical data and are therefore well suited for implementation in clinical practice. Further research is necessary for external and prospective validation of the models and implementation studies to identify potential benefits and possible risks. The main value of these models lies in the ability to identify patients at low risk of bacteraemia, which can help reduce unnecessary BC testing and provides an additional tool to decide whether antibiotic treatment is needed. Based on the model predictions, we would be able to withhold BC testing in almost 70% of the population with few omissions.
Patient consent for publication
The study protocol was assessed by the local Medical Ethics Review Committee, which determined that a formal ethical review was not required according to the Dutch Medical Research Involving Human Subjects Act (WMO) (IRB number: IRB00002991; case: 2020.486). An appropriate registration with the local privacy officer was performed and the need for informed consent was waived. Only pseudonymised data were used in this study.
Contributors AWB and MS contributed equally to this paper. PWBN was the principal investigator and guarantor of the study. AWB, MS, LCAP and PWBN designed the study. ESvdE, LCAP, AYvdZ, PWGE, MHHK, AWB, MS and PWBN contributed to the acquisition of data. LM, JZ, MGS, AWB, MS, WJW and PWBN did the preparation of the data. LM, JZ and MGS were responsible for analysis of the data. AWB, MS, PWBN, TCM, PWGE and RdJ contributed to the interpretation of the data. AWB, MS and LM drafted the first version of the manuscript. All authors revised the manuscript and approved the final version for publication.
Funding This project was funded by a research grant (no grant number available) from the Dutch federation for acute internal medicine (NVIAG).
Disclaimer The funder had no involvement in any part of the study.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.