Early warning score adjusted for age to predict the composite outcome of mortality, cardiac arrest or unplanned intensive care unit admission using observational vital-sign data: a multicentre development and validation

Objectives Early warning scores (EWS) alerting for in-hospital deterioration are commonly developed using routinely collected vital-sign data from the whole in-hospital population. As these in-hospital populations are dominated by those over the age of 45 years, the resultant scores may perform less well in younger age groups. We developed and validated an age-specific early warning score (ASEWS) derived from statistical distributions of vital signs.
Design Observational cohort study.
Setting Oxford University Hospitals (OUH), July 2013 to March 2018, and Portsmouth Hospitals (PH) NHS Trust, January 2010 to March 2017, within the Hospital Alerting Via Electronic Noticeboard database.
Participants Hospitalised patients with electronically documented vital-sign observations.
Outcome Composite outcome of unplanned intensive care unit admission, mortality and cardiac arrest.
Methods and results Statistical distributions of vital signs were used to develop an ASEWS to predict the composite outcome within 24 hours. The OUH development set consisted of 2 538 099 vital-sign observation sets from 142 806 admissions (mean age (SD): 59.8 (20.3)). We compared the performance of ASEWS with the National Early Warning Score (NEWS) and our previous EWS (MCEWS) on an OUH validation set consisting of 581 571 observation sets from 25 407 emergency admissions (mean age (SD): 63.0 (21.4)) and a PH validation set consisting of 5 865 997 observation sets from 233 632 emergency admissions (mean age (SD): 64.3 (21.1)). ASEWS performed better in the 16–45 years age group in the OUH validation set (AUROC 0.820 (95% CI 0.815 to 0.824)) and the PH validation set (AUROC 0.840 (95% CI 0.839 to 0.841)) than NEWS (AUROC 0.763 (95% CI 0.758 to 0.768) and AUROC 0.836 (95% CI 0.835 to 0.838), respectively) and MCEWS (AUROC 0.808 (95% CI 0.803 to 0.812) and AUROC 0.833 (95% CI 0.831 to 0.834), respectively). Differences in performance were not consistent in the older age group.
Conclusions Accounting for age-related changes in vital signs allows deterioration to be detected more accurately in younger patients.


REVIEWER
Guy Ludbrook, University of Adelaide, Australia. I have been a co-author on other work with one of the paper's authors.

GENERAL COMMENTS
This manuscript examines the performance of an early warning system (EWS) derived from a large database when age adjustment is made in two bands (above and below 45 years of age), compared with the more usual addition of age as a variable. It finds statistically significant, but small in absolute terms, improvements in performance in predicting major adverse events when compared with more widely used models. In general, the data are well explained and the results clear. There are some specific questions.
Elective and maternity admissions were excluded. As deterioration after, for example, elective surgery is an increasingly recognised issue, should scores not be broadly applicable to have good clinical utility? Why is 45 years chosen as the dividing point? The data provided have a number of age bands; for example, it is notable from Figure 2 and Appendix B that alert thresholds vary more continuously than a simple split above or below 45 years. What is the impact of using different or more dividing points? Does this, at an infinite extension, devolve to NEWS-type performance? It is noted that the absolute number of adverse events is much smaller in the under-45 group; does this impact the model performance?
It appears that abnormal observations occurring within 24 hours of an adverse event are used as predictors. What is the impact of accounting for trends in observations? If an abnormal observation is followed by more normal values, does this alter predictions? Might the data be interpreted by clinicians differently? Why is 24 hours chosen? Gains in performance for patients under 45 years are stated as marginal, and Appendix A shows this improvement is small relative to the uncertainty of model performance overall. While the use of the EWS in clinical practice is acknowledged, can the authors comment on how specifically this may enhance practice outcomes or system efficiency? For example, has any economic analysis of the use of these systems been conducted, and/or compared with alternative methods of adverse event prevention?
1.
Materials and methods, Data Source, page 5. It would be a great help to have a more detailed description of the kinds of hospitals the patient data are derived from (size, catchment area, services provided, number of beds and admissions), to help the reader understand what type of patients the datasets are derived from.

We added in-text information on the types of hospitals (page 5) based on the reviewer's suggestion. Additional publicly available information about the hospitals is included in Appendix A, Table A2. We did not include this in the main text because we do not have consistent information about all hospitals, such as size.

2.
Materials and methods, model description, page 7: "In (5,6), the alerting thresholds corresponded…". The sentence needs clarification. I presume the numbers in parentheses refer to two studies; it would be a great help to have that spelled out more clearly.

We added more detail on the EWS systems referenced by the citations in the same line (page 7 in the manuscript).

3.
Results. It says that 233,510 patients were included in the PH validation set; however, the number is 233,632 according to Figure 1.

Thanks for pointing this out. We checked our code and dataset, and the correct number is 233,632; 233,510 appears to be from an outdated version of the manuscript. We amended the number in the Results section (page 8 in the manuscript) and in Table 1.
1.
Elective and maternity admissions were excluded. As deterioration after, for example, elective surgery is an increasingly recognised issue, should scores not be broadly applicable to have good clinical utility?

Maternity admissions usually have their own scores. We agree with the reviewer's point on elective admissions. However, the primary objective of this paper is to check whether the inclusion of age has additional value compared with existing scores, so we used a population comparable to that of NEWS/MCEWS, which did not include elective surgical admissions. We will consider this in our future work because the incidence of events across elective surgical admissions is also low. This has been added to the Discussion section of the paper (page 9).

2.
Why is 45 years chosen as the dividing point? The data provided have a number of age bands. For example, it is notable from Figure 2 and Appendix B that alert thresholds vary more continuously than a simple split above or below 45 years. What is the impact of using different or more dividing points? Does this, at an infinite extension, devolve to NEWS-type performance?

The median age in the development sets of both ViEWS (currently known as NEWS) and MCEWS was greater than 60 years. We therefore chose 45 years to give sufficient separation from the median age in the baseline systems (last sentence on page 5).
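The idea of band-specific alerting thresholds can be illustrated with a minimal sketch. The 45-year cut-off comes from the paper, but the choice of vital sign (heart rate), the function name and the threshold values below are illustrative assumptions, not the actual ASEWS thresholds derived from the statistical distributions.

```python
# Illustrative sketch of an age-banded alerting sub-score.
# The threshold values are hypothetical placeholders, NOT the
# ASEWS thresholds from the study; only the 45-year cut-off is
# taken from the paper.

def heart_rate_subscore(heart_rate: float, age: int) -> int:
    """Return an illustrative EWS sub-score for heart rate.

    Younger patients tolerate higher resting heart rates, so the
    upper alerting threshold is shifted upward for the under-45 band.
    """
    if age < 45:
        upper, lower = 115, 45   # hypothetical under-45 band
    else:
        upper, lower = 105, 50   # hypothetical 45-and-over band
    if heart_rate > upper or heart_rate < lower:
        return 3  # alerting score
    return 0      # within the band's normal range
```

With more dividing points, each band's thresholds would converge toward the continuous age-specific distributions shown in Figure 2, which is the intuition behind the reviewer's question about infinite extension.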

3.
It is noted that the absolute number of adverse events is much smaller in the under-45 group; does this impact the model performance?

The imbalanced dataset is certainly a limitation for model development and is a common problem in the field. Despite the low numbers, our proposed methodology performs better for younger patients than the existing baselines.

4.
It appears that abnormal observations occurring within 24 hours of an adverse event are used as predictors. What is the impact of accounting for trends in observations? If an abnormal observation is followed by more normal values, does this alter predictions? Might the data be interpreted by clinicians differently?

The reviewer raises a very good point. Current EWS systems consider the observations to be independent and identically distributed (i.i.d.) and therefore do not account for trends in the observations. Currently, clinicians only examine the most recently collected set of measurements. We hypothesise that accounting for trends may improve performance, and this is an area of future study. This is mentioned in paragraph 4 of the Discussion section (page 9 in the manuscript).
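The contrast between the current i.i.d.-style practice (alerting on the most recent observation set alone) and a trend-aware alternative can be sketched as follows. Both functions and the slope heuristic are hypothetical illustrations of the general idea, not the study's method or any deployed EWS logic.

```python
# Hypothetical contrast between current EWS practice and a simple
# trend-aware adjustment. The 0.5 damping factor is an arbitrary
# illustrative choice.

def latest_only_score(scores: list[int]) -> int:
    """Current practice: alert on the most recent EWS value alone."""
    return scores[-1]

def trend_adjusted_score(scores: list[int]) -> float:
    """Illustrative adjustment: damp the alert when recent scores are
    falling back toward normal, and raise it when they are rising."""
    if len(scores) < 2:
        return float(scores[-1])
    slope = scores[-1] - scores[-2]  # change since the previous set
    return scores[-1] + 0.5 * slope
```

Under this sketch, an abnormal observation followed by more normal values yields a lower adjusted score than the latest value alone, which is exactly the scenario the reviewer describes.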

5.
Why is 24 hours chosen?

We chose 24 hours because it is the most commonly assessed timeframe in relevant studies, especially for the baseline studies considered in our paper, namely MCEWS and NEWS (citations 3