Article Text


Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry
  1. Sunil Gupta1,
  2. Truyen Tran1,2,
  3. Wei Luo1,
  4. Dinh Phung1,
  5. Richard Lee Kennedy3,
  6. Adam Broad4,
  7. David Campbell4,
  8. David Kipp4,
  9. Madhu Singh4,
  10. Mustafa Khasraw3,4,
  11. Leigh Matheson5,
  12. David M Ashley3,4,5,
  13. Svetha Venkatesh1
  1. 1Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Victoria, Australia
  2. 2Department of Computing, Curtin University, Perth, Western Australia, Australia
  3. 3School of Medicine, Deakin University, Geelong, Victoria, Australia
  4. 4Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
  5. 5Barwon Southwest Integrated Cancer Service, Geelong, Victoria, Australia
  1. Correspondence to Professor Svetha Venkatesh; svetha.venkatesh{at}


Objectives Using the prediction of cancer outcome as a model, we have tested the hypothesis that through analysing routinely collected digital data contained in an electronic administrative record (EAR), using machine-learning techniques, we could enhance conventional methods in predicting clinical outcomes.

Setting A regional cancer centre in Australia.

Participants Disease-specific data from a purpose-built cancer registry (Evaluation of Cancer Outcomes (ECO)) from 869 patients were used to predict survival at 6, 12 and 24 months. The model was validated with data from a further 94 patients, and results compared to the assessment of five specialist oncologists. Machine-learning prediction using ECO data was compared with that using EAR and a model combining ECO and EAR data.

Primary and secondary outcome measures Survival prediction accuracy in terms of the area under the receiver operating characteristic curve (AUC).

Results The ECO model yielded AUCs of 0.87 (95% CI 0.848 to 0.890) at 6 months, 0.796 (95% CI 0.774 to 0.823) at 12 months and 0.764 (95% CI 0.737 to 0.789) at 24 months. Each was slightly better than the performance of the clinician panel. The model performed consistently across a range of cancers, including rare cancers. Combining ECO and EAR data yielded better prediction than the ECO-based model (AUCs ranging from 0.757 to 0.997 for 6 months, AUCs from 0.689 to 0.988 for 12 months and AUCs from 0.713 to 0.973 for 24 months). The best prediction was for genitourinary, head and neck, lung, skin, and upper gastrointestinal tumours.

Conclusions Machine learning applied to information from a disease-specific (cancer) database and the EAR can be used to predict clinical outcomes. Importantly, the approach described made use of digital data that is already routinely collected but underexploited by clinical health systems.

Statistics from

Strengths and limitations of this study

  • This is the first study using machine learning of administrative and registry data for cancer survival prediction.

  • A single prognosis model is produced across all cancers, improving prediction accuracy on rare cancers.

  • This is a retrospective study in a single centre.


Over the past two decades, there has been an explosion in the use of digital footprints to monitor and predict human behaviours. The source of data used for this purpose is our online use of the internet, the emails we send and transactions we make. Analysis of these footprints through machine-learning techniques (MLT) has been exploited in the public domain by government and business to predict behaviours and inform investment decisions. In research, MLT have also been used to analyse gene expression data1 ,2 and for medical image analysis.3 ,4 However to date, there has been little exploration of these methodologies in the clinical setting. We hypothesised that MLT may offer a paradigm shift in clinical medicine that can address core issues with large and complex data sets. These techniques offer the potential to derive adaptive systems from diverse data sets, discover latent connections between data items and to predict outcomes.

Most hospitals routinely collect large digital electronic administrative records (EAR). These are primarily used for organisational financial management. Historically, they have not been used extensively for clinical or research purposes. If these large data sets are able to be exploited using MLT, it may open the way to optimise the use of collected administrative data to assist in predicting patients’ outcome, planning individualised patient care, monitoring resource utilisation and improving institutional performance.5 ,6 The accurate assessment of comorbid status would improve assessment of prognosis and guide treatment decisions.7–10 Other important information that may be contained or inferred from an EAR includes geographical and demographic data, socioeconomic status and history of healthcare facility utilisation.2 ,11 ,12

In this study, using cancer outcome prediction as a model, we wished to test the hypothesis that routinely collected digital health data, if analysed by state-of-the-art, validated, MLT could be used to assist conventional tools in predicting clinical outcomes.

Accurate prediction of survival in patients with cancer remains a challenge due to the ever-increasing heterogeneity and complexity of cancer, treatment options and patient populations. If achieved, reliable predictions could assist personalised care and treatment, and improve institutional performance in cancer management. In current practice, clinicians use data collected at the bedside in consultations, medical records or purpose-built cancer registries to aid prognostication and decision-making.

The notion of using MLT to predict cancer prognosis from clinical and pathological data is not a new one.13 ,14 However, with the advent of more sophisticated and better validated techniques, not only is more accurate prediction possible, but the range of data incorporated into decision aids can be increased.15–17 The need to improve cancer care systems by creating linkages between registries and epidemiological surveillance through analysis of complex and large clinical databases has recently been highlighted.18 ,19

In this study, we tested the capability of MLT to predict patient outcomes in a heterogeneous cohort of patients with cancer. We have interrogated two data sets:first, a purpose-built cancer-specific registry (Evaluation of Cancer Outcomes, ECO, from Victorian Cancer Outcomes Network in partnership with the Barwon South Western Regional Integrated Cancer Service) containing demographic and tumour-related data items according to an Australian nationally agreed protocol; and second, a hospital digital data set containing information about the patient's previous admissions and presentations (EAR). Finally, in a test group of 94 patients, we examined the performance of machine-learning methods in aiding a panel of expert clinicians in predicting patient survival.

Patients and methods

Study design

This is a retrospective study using the EAR and a specialised cancer registry (ECO) from Barwon Health, the only public tertiary institution in a region of Australia with more than 350 000 residents. With a unified hospital identity number in use across the region, Barwon Health's EAR provides a single point of access for information on patient encounters with the health system, including hospitalisations, ED visits, medications and treatments. In addition, the Andrew Love Cancer Centre at Barwon Health has a specialised cancer registry called ECO, which captures clinical data for patients in the region. ECO records information on demographics, primary tumour and metastatic tumour, cancer stage, tumour size, lymph nodes and breast tumour-specific information. Treatment type, outcomes, including death, and recurrence information (primary and metastatic) are also recorded. Box 1 shows the variables used for survival prediction. The cohort for this study consists of 963 patients identified in ECO who were first diagnosed in year 2009. The study completion date was 31 October 2012; therefore, all patients had at least 2 year and 10 months follow-up. Among these patients, 736 patients also had records in the EAR.

Box 1

Evaluation of Cancer Outcomes (ECO) variables used for survival prediction

Patient demographics

 Post code



Tumour characteristics

 Primary site (in International Classification of Diseases (ICD)-10 code)

 Tumour stream

 Morphology (in ICD-O-3 code)

 Histological grade

 Metastatic sites

 Most valid basis of diagnosis

 Performance status diagnosis

 Stage basis (pathological or clinical)

 Stage (TNM)

 Tumour size

 Nodes taken

 Positive nodes

Breast cancer related variables

 Oestrogen receptor

 Progesterone receptor

 Human epidermal growth factor receptor 2 (HER2)


The analyses centred on predicting cancer survival since the date of diagnosis, defined as the date of tumour resection. Each patient was a unit of observation in the predictive problem: patient data collected prior to the diagnosis date were used to construct the independent variables; survival status in a period following the assessment was the dependent variable. Two analyses were performed: the first compared survival prediction made by machine-learning models and the clinician panel, based on only information from ECO. The second analysis evaluated the added discriminative power provided by EAR, by comparing the best machine-learning models using three sets of predicting variables: variables from ECO (box 1), variables from EAR (see online supplementary appendix) and the union of the two.

Although a survival analysis model (eg, a proportional hazards model20) is commonly used in modelling risk factors, such models are not designed to predict events. In this study, survival was directly modelled using classification models to optimise prediction accuracy.

Comparing predictions by machine-learning models and clinician

In the first analysis, all 963 patients in the ECO registry were randomly divided into a derivation cohort of 869 patients and a validation cohort of 94 patients (table 1). To collect clinician prediction, patients in the validation cohort were assigned to a panel of five oncologists for survival prediction. For each patient, the oncologist was asked to estimate the survival probabilities based on the independent variables in box 1. All clinicians estimated the patient's survival status by producing a probability for each of the three time periods—6 months, 1 year and 2 years. When making this assessment, the clinicians did not have knowledge of the treatment type offered or given to the patient. Three machine-learning models were trained on the derivation cohort using the same set of independent variables, one for each prediction period. Each of the machine-learning models was an ensemble of 400 support vector machines (SVMs)21 with linear kernel (ie, the output of the model was the average of 400 SVM outputs in Platt's a posteriori probabilities22). Ensemble was used to control the variability introduced by L1 feature selection. Each of the SVMs was trained using a random 80% subsampling (without replacement) of the derivation cohort.23 The soft margin parameter (C) of SVM was selected through cross-validation. Two measures were taken to improve the training process. First, to compensate for the imbalance between the two outcomes (there were more survivals than deaths), we oversampled the non-surviving cases by 50% in each training subsample. Next, variable selection was performed through fitting a generalised linear model with elastic net regularisation24 (α parameter set to 0.1 and λ parameter selected using fivefold internal cross-validation), and variables with zero coefficients were removed. After the machine-learning models were constructed, they were applied to predict survival probabilities for each patient in the validation cohort. The clinician and model predictions were validated with the actual outcomes in the ECO registry. Prediction performance was measured using the area under the receiver operating characteristic curve (AUC), also known as the C-statistic,25 and 95% CIs of AUCs were computed using 1000 bootstrap samples of validation cohort.

Table 1

Characteristics of derivation and validation cohorts

Comparing discriminative information from specialised registry and routine data

The second analysis compared the discriminative power of two data sources (ECO and EAR). In this analysis, clinician predictions were not solicited. Among the 869 patients in the derivation subset of cohort 1, only 664 had records in the EAR and these patients were included in the second analysis (cohort 2, table 1). Survival prediction models were derived based on three sets of independent variables: (1) independent variables from EAR (EAR only); (2) independent variables from ECO (ECO only) and (3) the union of the two sets (EAR+ECO). Similar to the previous analysis, the models were trained using 400 random subsamples comprising 80% data of the cohort 2, and the modelling process was identical. However, the models were evaluated not using the validation cohort. Instead, for each 80% subsample, the remaining 20% was used to compute the AUC and its 95%CI.

The Wilcoxon rank-sum test was applied to answer the following comparison problems:

  1. Does ECO only provide more discriminative power than EAR only?

  2. Does EAR+ECO provide more discriminative power than EAR only?

  3. Does EAR+ECO provide more discriminative power than ECO only?

Details of the machine-learning model and the predictor variables can be found in the online supplementary appendix.


The cohorts for the two analyses are summarised in table 1. The comparison between the algorithmic predictions and the clinician predictions are summarised in table 2. The model had comparable performance to that of the clinicians, with the performance of the machine-learning model marginally better (AUC ranging from 0.76 to 0.87) than that of the clinicians (AUC ranging from 0.75 to 0.79) for all three prediction periods. This similarity in accuracy between algorithmic predictions and the clinician predictions was observed across different cancer types. Consider the predictions for 6-month survival. Of 15 breast cancer cases, the clinicians made 15 correct predictions and the algorithm made 14; of 18 lung cancer cases, the clinicians made 13 correct predictions and the algorithm made 14; of 7 haematological cases, the clinicians and the algorithm made all predictions correctly. Similar results were observed on 12-month and 24-month survival predictions for different cancers.

Table 2

Performance of survival prediction: comparison between machine-learning method and clinicians

Prediction of 6-month survival using the three models is shown in table 3. There were no deaths from breast cancer during this period. Comparing the ECO model with the EAR model, AUCs were comparable for colorectal, genitourinary, haematological, head and neck, and skin tumours. The EAR model was significantly better (p<0.05) for rare tumours, central nervous system (CNS), upper gastrointestinal and unassigned primary source tumours. For each tumour type, the model using ECO and EAR data yielded similar or better performance than the models using information from only one of the two databases. AUCs for the combined model ranged from 0.76 to 1.0. The combined data model showed particularly improved performance over ECO data (p < 0.05) for all tumour streams except breast and CNS tumours.

Table 3

Prediction performance of machine-learning algorithms: 6-month survival

Data for 12-month survival prediction is shown in table 4. Cancer-specific ECO data yielded better prediction than EAR data (p<0.05) for gynaecological, haematological, lung, skin and unknown primary cancers. Otherwise, ECO and EAR models yielded generally similar results. The model using combined data performed better than EAR (p<0.05) for all tumour streams other than CNS, head and neck and upper gastrointestinal tumours. The model using combined data was better than (p<0.05) ECO for all cancers except breast, CNS, gynaecological and haematological cancers.

Table 4

Prediction performance of machine-learning algorithms: 12-month survival

Table 5 shows data for 24-month survival prediction by the three models. The ECO model yielded superior prediction (p<0.05) to the EAR model for breast, genitourinary, gynaecological, lung, skin and unknown primary cancers, while the EAR model was superior to the ECO model for haematological and head and neck tumours. Once more, the model that performed the best was that derived from ECO and EAR data with AUCs ranging from 0.71 to 0.97 across the range of cancers and particularly enhanced performance for all cancers except breast, colorectal, gynaecological and unknown primary tumours compared with the ECO. In summary, over all time periods, the performance of the combined model was better than ECO (p<0.05) for genitourinary, head and neck, lung, skin, and upper gastrointestinal tumours.

Table 5

Prediction performance of machine-learning algorithms: 24-month survival

One of the key advantages of using MLT is that it can combine the large number of non-clinical factors with the few clinical risk factors. In this study, the model selected most of the known clinical risk factors including patient age, cancer staging, performance status and tumour size. In addition, it also found some useful non-clinical risk factors, including the type of the last hospital admission (emergency vs elective), the frequency of ED visits within the previous 3 and 6 months (related to cancer and other medical conditions).


In this study, using cancer outcome prediction as a model, we wished to test the hypothesis that routinely collected digital health data, if analysed by MLT, could be used to assist conventional tools in predicting clinical outcomes.

Applying machine learning to data from the EAR alone predicted clinical outcomes with reasonable accuracy. Using the purpose-built ECO data set, the predictive tool also performed well across a broad range of cancer types, and in both cases the predictive accuracies were at least as good as that of a panel of five expert clinicians. Importantly, a predictive tool derived from the purpose-built clinical registry and administrative data had even greater predictive ability.

The wealth of administrative data contained in the EAR includes information on comorbid conditions and previous clinic and hospital attendances as well as a drug history. There is considerable potential to use this data to improve clinical care across a spectrum of diseases.5 ,6

Most patients in the study were followed up for 3 years, which may not be adequate to capture all oncological outcomes, especially for those cancers with low mortality rate. We have designed this study as retrospective and in a single centre; it will be of major interest to observe how it performs in a variety of settings. The number of cases used to assess performance of the models is relatively small. The strengths include the comparison of machine-learning tools with expert clinical opinion and the fact that very detailed and well-validated data was available both directly related to the cancer and that contained in the EAR. The generic nature of this approach makes it unnecessary to generate separate predictive models for different types of cancer. This was a particular advantage for rarer forms of cancer where predications using more conventional methods are very challenging.

Predictive tools derived from clinical data items have considerable potential to improve clinical care, but must be suitably optimised and shown to perform equally well in diverse clinical settings.26 ,27 Clinical databases have become more widely available and increasingly complex in recent years. The extent and complexity of data available to clinicians means that novel approaches to managing data and supporting clinical decisions are needed. Machine-learning approaches can not only cope with complex data sets, but also adapt in real time and across different clinical settings.

The approach used in this study offers superior performance to previous machine-learning approaches in predicting cancer survival.13–17 Previous models have been derived for single cancer types, or for a limited range of cancers. The model described here performed well across a wide range of cancers. One advantage of this generic approach may be the ability to predict outcomes in less common cancers where limited data might preclude development of specific models. The fact that our model derived from administrative and cancer-related data performed slightly better than a panel of expert clinicians validates the potential utility of the model and suggests that it may be useful in assessing quality of care and also in settings where specialist care is not available. An alternative approach to borrow information across different cancer types is called multitask learning. We are currently exploring this approach as well.

Clinical outcomes in any illness are determined by specific factors related to the illness itself and also by the patient's general state of health and by the presence of other chronic medical conditions often coded in an EAR if the individual traffics the health service.7–10 As well, a particularly novel and important aspect of the use of historical data from the EAR in machine learning is that it effectively captures the healthcare institution’s current and previous performance. These data can be applied to any individual entering the system with a newly diagnosed cancer, as we have modelled here. As well, they could also be used for quality and performance monitoring.

In conclusion, machine learning applied to information from a disease-specific (cancer) database and the EAR can be used to predict outcomes. Improved prediction of outcome has the potential to help clinicians make more meaningful decisions about treatment and to assist with planning of future social and care needs. Most importantly, the approach described makes use of digital data that is already routinely collected but underexploited by clinical health systems.


View Abstract
  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

    Files in this Data Supplement:


  • Contributors All authors of this research paper have directly participated in the planning, execution or analysis of the study. All authors of this paper have read and approved the final version submitted.

  • Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None.

  • Ethics approval Ethics approval was obtained from the Hospital and Research Ethics Committee at Barwon Health (number 12/83). Deakin University has reciprocal ethics authorisation with Barwon Health.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.

Request permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.