Objective Susceptibility of patients with cancer to COVID-19 pneumonitis has been variable. We aim to quantify the risk of hospitalisation in patients with active cancer and use a machine learning algorithm (MLA) and traditional statistics to predict clinical outcomes and mortality.
Design Retrospective cohort study.
Setting A single UK district general hospital.
Participants Data on total hospital admissions between March 2018 and June 2020, all active cancer diagnoses between March 2019 and June 2020 and clinical parameters of COVID-19-positive admissions between March 2020 and June 2020 were collected. 526 COVID-19 admissions without an active cancer diagnosis were compared with 87 COVID-19 admissions with an active cancer diagnosis.
Primary and secondary outcome measures 30-day and 90-day post-COVID-19 survival.
Results In total, 613 patients were enrolled with male to female ratio of 1:6 and median age of 77 years. The estimated infection rate of COVID-19 was 87 of 22 729 (0.4%) in the patients with cancer and 526 of 404 379 (0.1%) in the population without cancer (OR of being hospitalised with COVID-19 if having cancer is 2.942671 (95% CI: 2.344522 to 3.693425); p<0.001). Survival was reduced in patients with cancer with COVID-19 at 90 days. R-Studio software determined the association between cancer status, COVID-19 and 90-day survival against variables using MLA. Multivariate analysis showed increases in age (OR 1.039 (95% CI: 1.020 to 1.057), p<0.001), urea (OR 1.005 (95% CI: 1.002 to 1.007), p<0.001) and C reactive protein (CRP) (OR 1.065 (95% CI: 1.016 to 1.116), p<0.008) are associated with greater 30-day and 90-day mortality. The MLA model examined the contribution of predictive variables for 90-day survival (area under the curve: 0.749); with transplant patients, age, male gender and diabetes mellitus being predictors of greater mortality.
Conclusions An active cancer diagnosis carries a threefold increase in the risk of hospitalisation with COVID-19. Increased age, urea and CRP predict mortality in patients with cancer. MLA complements traditional statistical analysis in identifying prognostic variables for outcomes of COVID-19 infection in patients with cancer. This study provides proof of concept for MLA in risk prediction for COVID-19 in patients with cancer and should inform a redesign of cancer services to ensure safe delivery of cancer care.
- risk management
Data availability statement
All data relevant to the study are included in the article or uploaded as supplemental information.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
The study uses novel analytical methods derived from machine learning to evaluate risk of COVID-19 in patients with cancer from hospitalisation to mortality.
Statistical and machine learning methods are compared to develop a profile of factors that can worsen outcomes from COVID-19 in patients with cancer.
The study analyses COVID-19 outcomes in patients with solid organ cancer in a cohort covering a single UK metropolitan region only. Haematological malignancies were not analysed.
Patients with COVID-19 and cancer who did not require admission to hospital were not included in this study.
SARS-CoV-2 leads to COVID-19.1 2 This highly transmissible disease has led to a global pandemic contributing to significant morbidity and mortality. Increased susceptibility and severity of COVID-19 are attributed to increasing age, smoking status, chronic obstructive pulmonary disease, diabetes mellitus, obesity, cardiovascular disease as well as cancer.3–6 In addition, the prevalence of all types of active or previous cancer in the UK is reported at 2.5 million cases with an incidence of 1000 newly diagnosed cases each day.7 Increased susceptibility to COVID-19 in patients with cancer has been attributed to immune suppression and cancer treatments such as cytotoxic chemotherapy and immunotherapy.8 9 However, it is still not established whether this translates into increased hospitalisation, illness severity or mortality risk. Risk-adjusted models quote a mortality risk of between 25% and 39% in patients with cancer hospitalised with COVID-19.10 With increasing prevalence of COVID-19 in the UK, the impact of cancer on COVID-19 remains an area of active concern.
In addition, machine learning algorithms (MLAs) are increasingly applied in healthcare settings due to their prognostic utility.11–13 They are able to map a large number of observed variables (features) to target outcomes, and through statistical analysis, find relationships without human instruction.14 15 This utility has been exploited in cancer research to model risk of susceptibility, survival and recurrence.15 For example, in breast cancer, algorithms have been developed for tasks ranging from detecting breast tumours to determining the prognostic significance of the tumour’s morphological features.14 Moreover, the ability to integrate diverse variables including clinical, biochemical, histopathological, genomic and proteomic data could lead to more reliable predictive models to determine disease outcome.16 17 Furthermore, the scalability of MLA distinguishes it from traditional statistical modelling, such as regression analysis, by its ability to perform non-linear modelling using large volume data sets and a greater number of variables from registries.11 12 MLA models for risk prediction are starting to be validated in large studies.18 Thus, this nascent technique holds promise for developing better risk assessment and prognostic algorithms to support healthcare delivery and individualised patient care.
This study aims to quantify the risk of hospitalisation in patients with active cancer using specific differences in clinicopathological and biochemical parameters between patients with COVID-19 with cancer and those without cancer through developing an MLA. We seek to identify the most important determinants of high risk of susceptibility and mortality from a diverse range of variables. This will both provide proof of concept for our method as well as inform the recalibration of cancer services to ensure safe care for patients with cancer during the pandemic.
A single UK centre retrospective cohort study was conducted. Data on total hospital admissions between March 2018 and June 2020, covering all hospital admissions before and during the COVID-19 pandemic, were obtained from the local information technology department. These records were linked with the Somerset cancer database to extrapolate the total number of patients with active cancer who were admitted during the study period. Furthermore, all active solid organ cancer diagnoses between March 2019 and June 2020 were obtained from the local cancer network. This was used to determine the total number of patients with active cancer with COVID-19, with the denominator being the total and active cancer population in the Dudley, West Midlands (UK) region. Biochemical and haematological parameters in the first 48 hours of admission along with 30-day and 90-day post-COVID-19 survival were determined.
Patients below the age of 18 years and those with non-solid organ cancers were excluded. Moreover, patients who attended the emergency department and were not admitted were also excluded. COVID-19 diagnosis was established with a positive reverse transcriptase PCR test from an oropharyngeal swab. Criteria for admission to hospital and critical care were determined by individual clinical assessment and oxygen requirement as well as ventilatory support. Data security was maintained through the REDCap uploading system.
Binary logistic regression analyses with survival status at 30 days as the dependent variable were used to estimate the univariable association with mortality for each explanatory variable. Age-adjusted associations were calculated in a similar way by including age at admission as a continuous variable in each model after checking the assumption of a linear effect of age on the log odds. Both forward and backward stepwise methods were used to determine the final multivariable model. These analyses were performed with SPSS V.25.0.
Patient and public involvement
Patients and the public were not involved in the design and conduct of this study.
MLA: data preprocessing
R-Studio software was used to determine the association between cancer status, COVID-19 and 90-day survival against variables in an MLA. The conduct and reporting of our MLA was done in accordance with best practice guidance.11
The proportion of missing data was calculated for each variable, and variables with less than 40% missing data were included in the analysis. This resulted in 33 variables being included for imputation of missing data, further preprocessing and model development. The decision to limit the proportion of missing data to 40% was an arbitrary one, based on a compromise between a limit high enough to enable the inclusion of as many available variables as possible and low enough to enable the use of more data to predict imputable missing values with the k-nearest neighbours algorithm.
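The variable screen described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R code: the record layout, the column names and the helper names `missing_fraction` and `select_columns` are hypothetical, and `None` is assumed to mark a missing value.

```python
def missing_fraction(rows, column):
    """Fraction of records in which `column` is missing (None)."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def select_columns(rows, columns, max_missing=0.40):
    """Keep columns whose missing fraction is strictly below `max_missing`."""
    return [c for c in columns if missing_fraction(rows, c) < max_missing]


# Hypothetical toy records, not study data.
rows = [
    {"age": 81, "crp": 120.0, "ferritin": None},
    {"age": 67, "crp": None, "ferritin": None},
    {"age": 74, "crp": 45.0, "ferritin": 300.0},
    {"age": 90, "crp": 210.0, "ferritin": None},
    {"age": 58, "crp": 12.0, "ferritin": None},
]
kept = select_columns(rows, ["age", "crp", "ferritin"])
print(kept)  # ferritin (80% missing) is dropped; age and crp are kept
```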
All data within the gender variable were replaced with ‘F’ (female) and ‘M’ (male). The documented ethnicities were replaced with the three categories of ‘European’, ‘South Asian’ or ‘Afro-Caribbean’. The blood pressure information was split into systolic and diastolic pressures. A new derived variable of ‘mean arterial pressure’ was calculated from the estimate: (diastolic pressure+(pulse pressure/3)). Numerical values were extracted from the entered data for oxygen saturation; for example, 97% would be changed to 97. A further derived variable was created from the difference in time between the date of onset of symptoms and date of hospital admission. This time interval was recorded in days.
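A minimal Python sketch of these derived variables follows. It assumes blood pressure was entered as a 'systolic/diastolic' string and oxygen saturation as text such as '97%'; the source does not state the raw formats, and all function names are hypothetical. Pulse pressure is taken as systolic minus diastolic pressure.

```python
from datetime import date


def split_blood_pressure(reading):
    """Split an assumed 'systolic/diastolic' string such as '120/80'."""
    systolic, diastolic = reading.split("/")
    return int(systolic), int(diastolic)


def mean_arterial_pressure(systolic, diastolic):
    """Estimate from the text: diastolic pressure + (pulse pressure / 3)."""
    pulse_pressure = systolic - diastolic
    return diastolic + pulse_pressure / 3


def parse_saturation(text):
    """Extract the number from an oxygen saturation entry such as '97%'."""
    return int(text.strip().rstrip("%"))


def days_from_onset(onset, admission):
    """Interval in days between symptom onset and hospital admission."""
    return (admission - onset).days


systolic, diastolic = split_blood_pressure("120/80")
map_estimate = mean_arterial_pressure(systolic, diastolic)
saturation = parse_saturation("97%")
interval = days_from_onset(date(2020, 3, 20), date(2020, 3, 27))
print(round(map_estimate, 1), saturation, interval)
```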
The overall data set was partitioned into training and test sets. The training set was used purely for model training and hyperparameter tuning. The test set would only be used for model evaluation against new data. Partitioning was by a random allocation, while ensuring an identical distribution of patients who died at 90 days between both training and test sets. Seventy-five per cent of patients were allocated to the training set, with the remaining going into the test set (online supplemental figure 1).
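The partitioning step can be illustrated with a stratified random split. The Python sketch below is not the authors' implementation: `stratified_split`, the seed and the toy outcome vector are hypothetical. It splits each outcome group separately so that the proportion of 90-day deaths is the same (up to rounding) in both sets.

```python
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_indices, test_indices), splitting each outcome class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for outcome in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == outcome]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical outcomes: 1 = died by 90 days, 0 = survived.
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)
train_deaths = sum(labels[i] for i in train_idx)
test_deaths = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), train_deaths, test_deaths)
```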
Imputation of missing values
Missing values were replaced with predicted values using a k-nearest neighbours model. This method designated the variable containing a missing value as the outcome variable within a predictive model. The missing value was then predicted from the k most similar patients, with similarity judged on their other variables. The value k is a hyperparameter, which was set to 10 after values of 5 and 10 were compared and showed no difference.
This imputation was performed separately on the training and test data sets in order to minimise overfitting of the final model, which could arise if the training data set influenced the imputation of values into the test data set.
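The idea behind k-nearest neighbours imputation can be sketched as follows. This Python illustration rests on assumptions the source does not state: numeric variables on comparable scales, Euclidean distance over the variables both patients have recorded, and the neighbour mean as the prediction. The function name `knn_impute` and the toy records are hypothetical.

```python
import math


def knn_impute(rows, target, k=10):
    """Fill missing `target` values with the mean of the k most similar complete rows."""
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no complete rows available to impute from")
    imputed = []
    for row in rows:
        filled = dict(row)
        if row[target] is None:
            def distance(other, row=row):
                # Compare only variables recorded for both patients.
                shared = [key for key in row
                          if key != target
                          and row[key] is not None
                          and other[key] is not None]
                if not shared:
                    return math.inf
                total = sum((row[key] - other[key]) ** 2 for key in shared)
                return math.sqrt(total / len(shared))

            nearest = sorted(donors, key=distance)[:k]
            filled[target] = sum(d[target] for d in nearest) / len(nearest)
        imputed.append(filled)
    return imputed


# Hypothetical scaled records, not study data.
rows = [
    {"age": 0.10, "urea": 0.20, "crp": 10.0},
    {"age": 0.20, "urea": 0.10, "crp": 20.0},
    {"age": 0.90, "urea": 0.80, "crp": 100.0},
    {"age": 0.15, "urea": 0.15, "crp": None},
]
completed = knn_impute(rows, "crp", k=2)
print(completed[3]["crp"])  # mean of the two closest patients
```

To mirror the separation described above, such a function would be called once on the training partition and once on the test partition, never on the pooled data.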
All numerical variables within the training set were standardised so that they were on comparable scales for model training. For each such variable, the mean was subtracted from each value before dividing the result by the SD, giving each variable a mean of 0 and an SD of 1.
The same process was applied to the test set, using the means and SDs from the training set to avoid overfitting.
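A short Python sketch of this scaling step is given below. The values are hypothetical, the helper names are not from the source, and the use of the sample SD (rather than the population SD) is an assumption. The key point it shows is that the test set is transformed with the training set's mean and SD.

```python
import statistics


def fit_scaler(values):
    """Mean and sample SD, estimated on the training data only."""
    return statistics.mean(values), statistics.stdev(values)


def standardise(values, mean, sd):
    """Subtract the mean and divide by the SD."""
    return [(v - mean) / sd for v in values]


# Hypothetical ages in years.
train_age = [60.0, 70.0, 80.0]
test_age = [75.0, 90.0]

mean, sd = fit_scaler(train_age)
train_scaled = standardise(train_age, mean, sd)
test_scaled = standardise(test_age, mean, sd)  # training statistics reused
print(train_scaled, test_scaled)
```

Because the test values are scaled with training statistics, they need not have a mean of 0 themselves.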
The following models were trained using 10-fold cross-validation:
Lasso and elastic-net generalised linear model.
Neural network with one hidden layer.
Gradient boosted machine.
Hyperparameter tuning during cross-validation was optimised against area under the receiver operating characteristic curve as a metric. The random forest model was built with 500 trees.
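The fold structure behind 10-fold cross-validation can be sketched as follows. This Python illustration covers only how patients are rotated between training and validation folds; model fitting and tuning are omitted, and `k_fold_indices`, the seed and the sample size are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield training, validation


# Hypothetical training set of 25 patients.
splits = list(k_fold_indices(25))
print(len(splits), [len(validation) for _, validation in splits])
```

Each patient appears in exactly one validation fold, so every candidate hyperparameter setting is scored on data that was not used to fit it.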
Predictions of probabilities of survival to 90 days were made on the test set by each of the five trained models. The known survival outcomes to 90 days and predicted probabilities from each model were used to plot receiver operating characteristic curves for model comparison.
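The area under the curve used for this comparison can be computed directly from outcomes and predicted probabilities. The Python sketch below uses the rank interpretation of the AUC (the probability that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor, with ties counted as half). The data and the name `roc_auc` are hypothetical.

```python
def roc_auc(outcomes, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(outcomes, scores) if y == 1]
    negatives = [s for y, s in zip(outcomes, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test-set outcomes (1 = survived to 90 days) and predictions.
survived = [1, 1, 1, 0, 0]
predicted = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(survived, predicted)
print(round(auc, 3))  # 5 of 6 pairs ranked correctly
```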
In total, 22 729 patients with active cancer were identified by the local cancer network in the Dudley, West Midlands region, out of a catchment size of 426 658 patients in the region. Eighty-seven of 22 729 (0.4%) patients with cancer in the Dudley region were admitted with COVID-19 compared with 526 of 404 379 (0.1%) patients without cancer during the study period (OR of being hospitalised with COVID-19 if having cancer: 2.942671 (95% CI: 2.344522 to 3.693425); p<0.001). The types of cancer in our cohort are detailed in figure 1. Thus, the risk of hospital admission with COVID-19 was approximately threefold higher in the presence of an active cancer diagnosis.
Excluding those with incomplete data, the mean age of patients with cancer was 77.8 (SD=12.3) years compared with 70 (SD=17.5) years in patients without cancer (t-test; p<0.001). The male:female ratio was similar between the two groups. The majority of patients were of Caucasian ethnicity, with a similar distribution of diabetes, cardiovascular disease, transplant recipient status and smoking status in both groups. Moreover, the median white cell count (p=0.096) and C reactive protein (CRP) (p=0.115) did not differ significantly between patients with cancer and those without cancer. Thus, apart from age, the cancer and non-cancer groups affected by COVID-19 had similar baseline characteristics. This is summarised in table 1.
A χ² test of cancer status against hospitalisation with COVID-19, with patients without cancer not hospitalised with COVID-19 (404 379) as the comparison group, yields a p value of <2.2e-16, implying that there is an association between having cancer and hospitalisation with COVID-19 (table 2). The OR of being hospitalised with COVID-19 if having cancer is 2.942671 (95% CI: 2.344522 to 3.693425).
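The arithmetic behind the OR and its interval can be checked with a short Python sketch. It assumes the 2×2 table uses the counts exactly as quoted in the text (87 and 22 729 for cancer, 526 and 404 379 for no cancer) and a Wald (Woolf) interval on the log scale. Table 2 itself is not reproduced here, so the cell layout is an assumption; under it, the calculation agrees with the reported figures.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio (a*d)/(b*c) with a Wald 95% CI computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, lower, upper


# Assumed cell layout: (hospitalised, comparison count) for each group.
estimate, lower, upper = odds_ratio_ci(a=87, b=22729, c=526, d=404379)
print(round(estimate, 2), round(lower, 2), round(upper, 2))  # 2.94 2.34 3.69
```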
After training and hyperparameter tuning by 10-fold cross-validation, predictions of probability of 90-day survival were made on the test set data. This is shown in the receiver operating curves plotted for model comparison (figure 2).
Since we accepted variables with up to 40% missing values (online supplemental figure 2), imputation was performed using a separate k-nearest neighbours algorithm, whereby a prediction of a missing value was made based on other available values, having been trained on the other patient data.
Our initial age-adjusted univariate analysis identified age, CRP, urea, creatinine, estimated glomerular filtration rate (GFR), haemoglobin and low initial blood pressure as significantly correlating with mortality risk (online supplemental table 1). A further multivariate analysis of 33 out of 213 clinical variables with >60% data completeness showed increased age (HR 0.915 (95% CI: 0.870 to 0.960), p<0.001), urea (HR 1.005 (95% CI: 1.002 to 1.007), p<0.001) and CRP (HR 1.065 (95% CI: 1.016 to 1.116), p<0.001) to be associated with greater risk of 30-day and 90-day mortality (table 3).
Kaplan-Meier survival analysis revealed reduced overall survival for patients with COVID-19 and cancer (figure 4). However, log-rank analysis did not show significant difference between patients with COVID-19 with cancer and those without cancer (log-rank p=0.172).
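The product-limit calculation behind a Kaplan-Meier curve can be sketched as follows. This Python illustration uses hypothetical follow-up times, not the study data, and does not implement the log-rank test; the function name `kaplan_meier` is an assumption.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    `events` holds 1 for a death and 0 for a censored observation.
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, e in zip(times, events) if e})
    for t in death_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up in days; 0 marks a patient censored alive.
times = [5, 10, 10, 20, 30]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print([(t, round(s, 2)) for t, s in curve])
```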
Our study demonstrates that the presence of active cancer increased by threefold the risk of hospitalisation with COVID-19. Moreover, higher CRP and urea are associated with greater mortality at 30 and 90 days post-diagnosis of COVID-19. These findings show that patients with cancer who develop COVID-19 are likely to have a more severe form of the infection that would require supportive care in hospital. It also provides tools for monitoring patient response to treatment with high urea and CRP being poor prognostic markers and a likely consequence of severe COVID-19. This has implications for how we can deliver safe care to patients with cancer in the ongoing pandemic as well as emerging from it given the restrictions on cancer services.
Several studies have reported prevalence and mortality risk of COVID-19 in patients with cancer with a systematic review by Zarifkar et al identifying 110 studies covering 10 countries.19 The pooled prevalence of active cancer in COVID-19-positive hospitalised patients was 2.6% (95% CI: 1.8% to 3.5%) across 37 cohort studies. Furthermore, there was a noticeable difference in the prevalence between western countries (5.6%, 95% CI: 4.5% to 6.7%) and China (1.7%, 95% CI: 1.3% to 2.3%) reflecting the underlying cancer prevalence. In addition, in-hospital mortality of 14.1% (95% CI: 9.1% to 19.8%) for cancer and COVID-19 was derived from 17 retrospective cohort studies covering 904 patients.19 The mortality rate of 12.6% in a Brazilian cohort was also similarly reported.20 This indicated that patients with COVID-19 with cancer had a fivefold greater risk of death compared with patients without cancer without other comorbidities.19 21 However, there was significant heterogeneity between these studies (I2=55.9%, p<0.01) with the type of cancer, stage and treatment regimen only specified in eight studies along with incomplete follow-up. Furthermore, Liang et al reported a 28% prevalence of lung cancer among hospitalised patients with cancer with COVID-19.9 This reflects higher COVID-19 mortality rates in patients with specific cancer including lung and haematological malignancy.9 19 20 Further studies have reported 3.5-fold increase in intensive care unit (ICU) admission or need for mechanical ventilation in patients with COVID-19 with cancer.9 More recent studies have also examined the impact of COVID-19 on patients with cancer. 
In an analysis of 306 patients with COVID-19 with cancer, Russell et al identified factors including male gender, age greater than 60 years, Asian ethnicity, cancer diagnosis of greater than 2 years, haematological malignancy and a high CRP as being associated with increased mortality risk.22 A large population-based study by Lee et al comparing 23 266 patients with cancer with 1 784 293 patients without cancer identified a 60% increased risk of COVID-19 in patients with cancer, with those on chemotherapy or immunotherapy having a 2.2-fold increased risk of contracting COVID-19.23 This increased susceptibility could be explained through immune compromise or simply greater exposure through more frequent hospital visits. Even in this large study, subgroup analysis evaluating the impact of tumour type, stage and treatment regimens was not performed. Furthermore, a multicentre study comparing UK (n=468) and European Union (EU) (n=924) patients with cancer with COVID-19 showed a worse mortality rate at 30 days and 6 months independent of age, gender, tumour stage and treatment through a multivariable regression model.24 Moreover, Mehta et al showed increased risk of COVID-19 in 218 patients with cancer in New York which was associated with age, comorbidities and elevated lactate dehydrogenase (LDH).25 Although a few studies have indicated no increased risk of severity or mortality from COVID-19 in patients with cancer,26 27 the consensus thus far in the literature has coalesced around the idea that patients with cancer in general have a higher risk of severe events and mortality from COVID-19 infection. Using admission risk as a surrogate marker of severity in COVID-19, our results are consistent with the literature, showing a threefold higher risk of admission with COVID-19 in the presence of cancer, which will likely impact the delivery of care to these particular subgroups of patients.
However, the majority of current data are from retrospective cohort studies, using traditional statistical techniques on a limited selection of variables with a relatively small number of participants. For this reason, subgroup analysis has been difficult. Since cancer is a diverse condition from the clinical to genomic spheres, with an equally diverse range of treatments, considering it as a monolithic structure would not allow meaningful conclusions to be drawn from such analyses. Having a multicentre approach and application of novel big-data analysis techniques such as MLA may enable a more reliable and rapid analysis of data to discover associations in time-critical situations such as delivering healthcare in a global pandemic.
There is likely to be a surge in demand for cancer services as well as predicted poor long-term survival in patients with cancer due to delays in diagnosis and treatment.11 28 Over the first national UK lockdown, there was an 84% reduction in urgent cancer referrals which modelling predicted would lead to 181 additional lives lost or 3316 life-years lost with an average presentation delay of 2 months per patient.29 Although having cancer puts a patient at increased risk of hospitalisation with COVID-19, this must be balanced against risks of delayed treatment leading to disease progression to incurable stages.30 Particular cancers where timely intervention is critical such as pancreatic, lung and haematological malignancy should not have delays to treatment, whereas others including prostate and non-melanoma skin cancer treatment may be safely delayed in selected patients.30 Several strategies, including delays to surgery or chemotherapy, switching to oral or monotherapy treatment regimens, strict infection control protocols, online consultation, use of hypofractionated radiotherapy and provision of intensive care support to these patients, are essential to mitigate risk.10 30–32 This may be supplemented where possible with COVID-19 free ‘cold’ sites to reduce risk of transmission and prevent anti-cancer treatment-induced COVID-19.32 Thus, categorisation of patients according to risk, minimising patient exposure and considering alternative regimens to control cancer form the basis of current recommendations including the European Society for Medical Oncology (ESMO) expert consensus and UK National Health Service (NHS) guidelines.4 32 Furthermore, these data do not support delays in cancer treatment to reduce risk of COVID-19 transmission in patients with cancer.
Several biochemical markers have been associated with a severe COVID-19 disease course. Zhou et al identified a raised D-dimer above 1 µg/mL to be associated with a higher mortality risk.33 Furthermore, they identified low albumin, raised LDH, troponin, ferritin and interleukin-6 to be more prevalent in non-survivors. In addition, raised CRP and low GFR were associated with a more severe disease outcome with 18% deaths recorded in a renal transplant cohort in keeping with the severe disease course predicted in this group of immunocompromised patients.34 Our model identified raised urea and CRP in addition to transplant status as predictors of greater mortality risk, which may lower the threshold for admission or prompt earlier referral for intensive care support. However, our algorithm could not specify the direction or size of this interaction, which is a limitation of such models.
MLAs are increasingly being used to support healthcare applications including cancer diagnosis, outcomes and recurrence.16 17 35 MLA can be used to learn from established data sets and identify hidden patterns between a large number of variables to support individualised decision-making.36 Nonetheless, the technique requires training data sets, appropriately selected analysis method, as well as testing data sets to establish internal and external validity.16 35 MLAs have been shown to improve the accuracy of predicting cancer susceptibility, recurrence and mortality by 15%–25%.37 Moreover, we have shown through our modelling that the findings of both MLA and traditional statistical analysis are complementary and may be used to generate a risk prediction scoring system in patients with cancer with COVID-19.
However, there are several limitations in the method and data presented. MLA remains an experimental technique and is still very dependent on the quality of input data. Issues including noise, bias, outliers, missing or duplicate data can lead to misclassifications in any risk prediction model, which may be mitigated by larger data sets.15 As such, few MLAs have achieved validation or widespread clinical application. In our study, confounders including smoking status or respiratory comorbidity were not assessed, which could influence outcomes in patients with cancer. Patients with active cancer who tested positive for COVID-19 in the community but did not require hospital admission could not be evaluated. Having a broad inclusion criterion with all solid organ cancers, while beneficial for looking at overall impact on patients with cancer, does not capture the granularity of how individual cancers may differ in their impact on patients with COVID-19. For example, haematological malignancy, lung cancer and metastatic disease were associated with adverse outcomes from COVID-19 infection.38 Our data set was underpowered to perform relevant subgroup analyses on these patients. Although all patients with active cancer were analysed, variation in the stage of cancer and treatment protocols was not accounted for. Thus, the MLAs are limited by the quality of data input and rely on imputation as part of model development, which needs external validation once developed, which we have not performed. Nonetheless, this study provides proof of concept to investigate this question in a collaborative manner using larger data sets.
COVID-19 has had an enormous impact on both individuals and healthcare systems. How we deliver safe and effective care to these patients within the confines of our healthcare systems is predicated on identifying those most at risk from this disease. MLAs provide an additional tool for risk assessment to delineate factors with poor prognosis. This will enable us to reconfigure our healthcare systems to provide safe care to these more vulnerable patients.
Patient consent for publication
This study involves human participants and was approved by Russells Hall Ethics Committee (REC Reference 20/EE/0139; IRAS ID28233). Participants gave informed consent to participate in the study before taking part.
Contributors AA—conceptualisation, protocol development, ethics approval, proofreading, manuscript writing and approval of final manuscript. FM—data analysis, manuscript drafting and proofreading. NA—data analysis, machine learning algorithm and figures. MR—conceptualisation and data collection. PN—conventional statistical analysis. OOO—conceptualisation, reading and correcting manuscript drafts, and approval of final manuscript. AA and OOO are responsible for overall content and serve as guarantors.
Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors. However, OOO is supported by the National Cancer Institute (grant # CA221704).
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.