White cell count in the normal range and short-term and long-term mortality: international comparisons of electronic health record cohorts in England and New Zealand

Objectives: Electronic health records offer the opportunity to discover new clinical implications for established blood tests, but international comparisons have been lacking. We tested the association of total white cell count (WBC) with all-cause mortality in England and New Zealand.
Setting: Primary care practices in England (ClinicAl research using LInked Bespoke studies and Electronic health Records (CALIBER)) and New Zealand (PREDICT).
Design: Analysis of linked electronic health record data sets: CALIBER (primary care, hospitalisation, mortality and acute coronary syndrome registry) and PREDICT (cardiovascular risk assessments in primary care, hospitalisations, mortality, dispensed medication and laboratory results).
Participants: People aged 30–75 years with no prior cardiovascular disease (CALIBER: N=686 475, 92.0% white; PREDICT: N=194 513, 53.5% European, 14.7% Pacific, 13.4% Maori), followed until death, transfer out of practice (in CALIBER) or study end.
Primary outcome measure: HRs for mortality were estimated using Cox models adjusted for age, sex, smoking, diabetes, systolic blood pressure, ethnicity and total:high-density lipoprotein (HDL) cholesterol ratio.
Results: We found ‘J’-shaped associations between WBC and mortality; the second quintile was associated with lowest risk in both cohorts. High WBC within the reference range (8.65–10.05×10⁹/L) was associated with significantly increased mortality compared to the middle quintile (6.25–7.25×10⁹/L); adjusted HR 1.51 (95% CI 1.43 to 1.59) in CALIBER and 1.33 (95% CI 1.06 to 1.65) in PREDICT. WBC outside the reference range was associated with even greater mortality. The association was stronger over the first 6 months of follow-up, but similar across ethnic groups.
Conclusions: Clinically recorded WBC within the range considered ‘normal’ is associated with mortality in ethnically different populations from two countries, particularly within the first 6 months. Large-scale international comparisons of electronic health record cohorts might yield new insights from widely performed clinical tests.
Trial registration number: NCT02014610.

Total white cell counts can be affected by many factors, such as infections, autoimmune diseases, medication and haematological conditions. As in our recent study on eosinophil counts, 1 we sought to differentiate between a patient's long-term 'stable' total white cell count and results obtained when the patient had an 'acute' condition that may alter leukocyte counts. We used other information in the electronic health record (prescriptions, diagnoses, symptoms, hospitalisations) to assess whether the patient was clinically 'acute' or 'stable' at the time of the blood test, adapting a set of criteria proposed by the eMERGE consortium (electronic Medical Records and Genomics) 2 for studying genetic determinants of stable leukocyte counts. The criteria were: in hospital on the date of blood test, vaccination in the previous 7 days, anaemia diagnosis within the previous 30 days, symptoms or diagnosis of infection within the previous 30 days, prior diagnosis of myelodysplastic syndrome, prior diagnosis of haemoglobinopathy, cancer chemotherapy or G-CSF within 6 months before index date, use of drugs affecting the immune system (such as methotrexate or steroids) within the previous 3 months, prior diagnosis of HIV infection, prior splenectomy or prior dialysis.
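The rule set above can be illustrated with a minimal sketch. This is not the study's code: the `Event` record, the event labels, the inclusive handling of window boundaries and the conversion of 3 and 6 months to 91 and 183 days are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Look-back windows in days. The text gives 7 days, 30 days, 3 months and
# 6 months; expressing months as 91 and 183 days is an assumption.
WINDOWS = {
    "vaccination": 7,
    "anaemia": 30,
    "infection": 30,
    "immune_drug": 91,
    "chemo_or_gcsf": 183,
}

# Conditions that count if recorded at any time up to the blood test.
EVER_BEFORE = {"myelodysplastic", "haemoglobinopathy", "hiv",
               "splenectomy", "dialysis"}


@dataclass
class Event:
    kind: str   # hypothetical label, e.g. "vaccination" or "hiv"
    when: date  # date the event was recorded


def classify(test_date, events, in_hospital=False):
    """Return 'acute' if any criterion is met at the blood test, else 'stable'."""
    if in_hospital:
        return "acute"
    for ev in events:
        if ev.when > test_date:
            continue  # events recorded after the test are ignored
        if ev.kind in EVER_BEFORE:
            return "acute"
        window = WINDOWS.get(ev.kind)
        if window is not None and (test_date - ev.when).days <= window:
            return "acute"
    return "stable"
```

For example, a vaccination recorded 4 days before the test gives 'acute', whereas one recorded 31 days before gives 'stable'.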

Survival analysis
The main analysis used a Cox proportional hazards approach 3 to model the association between white cell count quintiles and time to death, adjusting for the following Framingham risk factors, chosen a priori: age, sex, total:HDL cholesterol ratio, systolic blood pressure, diabetes status, and current smoking status. As it was not possible to combine the raw datasets, we used quintiles based on the CALIBER dataset for both CALIBER and PREDICT analyses. In CALIBER, the baseline hazard of the Cox model was stratified by general practice.
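Because only the cut-points, not the raw data, need to be shared between cohorts, applying the quintiles from one cohort to another can be sketched as follows. The function names are hypothetical and the quantile method (Python's default 'exclusive' method) is an assumption; the study's analyses were done in R.

```python
from bisect import bisect_right
from statistics import quantiles


def quintile_cutpoints(reference_values):
    """Four cut-points splitting the reference cohort into five groups."""
    return quantiles(reference_values, n=5)


def assign_quintile(value, cutpoints):
    """Quintile (1 to 5) of a value, using cut-points from the reference cohort.

    A value equal to a cut-point is placed in the higher group (an assumption).
    """
    return bisect_right(cutpoints, value) + 1
```

The cut-points would be derived once from the reference cohort and then passed unchanged to `assign_quintile` for every individual in either cohort, so that the categories mean the same thing in both analyses.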
All analyses were done using R software (version 3.0.2), 4 using the survival 5 package for Cox regression. Restricted cubic splines were used to investigate the relationship between continuous variables and time to death as a check on the modelling assumption of linearity. We assessed the proportional hazards assumption by plotting the scaled Schoenfeld residuals against log time. 3 We found that the hazard ratio varied over time (Supplementary Figure S7), so we split the follow-up time into two intervals: the first 6 months and the period after 6 months.
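Splitting follow-up at 6 months amounts to turning each person's record into one or two (start, stop, event) intervals, so that a separate hazard ratio can be estimated in each period. A minimal sketch, assuming time is measured in years and using hypothetical period labels:

```python
def split_followup(time_years, died, cut=0.5):
    """Split one follow-up record at `cut` years (0.5 = 6 months).

    Returns a list of (start, stop, event, period) tuples. The death, if any,
    is attached to the interval in which follow-up ends.
    """
    if time_years <= cut:
        return [(0.0, time_years, died, "first_6_months")]
    return [
        (0.0, cut, False, "first_6_months"),        # survived the first period
        (cut, time_years, died, "after_6_months"),  # remaining follow-up
    ]
```

A person who died after 2 years therefore contributes an event-free interval from 0 to 0.5 years and an interval from 0.5 to 2 years ending in death.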

Multiple imputation: CALIBER
In CALIBER, the cohort consisted of individuals with a white cell count record and no prior history of cardiovascular disease, whether or not they had had a formal cardiovascular disease assessment. Total white cell count data were therefore completely observed, but some people did not have records of other baseline covariates: total cholesterol, HDL cholesterol, blood pressure or smoking status. We wished to examine interactions between total white cell count and predictor variables in their association with mortality, so we chose a method of multiple imputation which would account for these interactions. For the primary analysis we therefore used multiple imputation using chained equations (MICE) 6 with Random Forest multiple imputation. 7 Random Forest is a machine learning regression method which can automatically account for nonlinearities in the associations between predictor variables, and can handle large numbers of variables without encountering problems due to collinearity. 8 It has been shown to perform well on test datasets derived from CALIBER, 8 but can be slow on very large datasets, so we carried out the imputation within general practice, thus also accounting for between-practice variability. As Random Forest multiple imputation is a fairly new method, we also performed a secondary analysis using normal-based multiple imputation in MICE. With normal-based MICE we included the general practice as a fixed effect (i.e. a categorical variable) and carried out the imputation separately by age group and sex, which accounted for interactions between these variables without requiring too many additional parameters in the imputation models.
Imputation models included all the variables in the substantive model, event indicator and time in the form of the marginal Nelson-Aalen cumulative hazard, as well as the following auxiliary variables: counts of white cell subtypes (neutrophils, basophils, eosinophils, monocytes and lymphocytes), renal function (estimated glomerular filtration rate), and diastolic blood pressure. We used 10 iterations and generated 10 imputations. We verified that the number of iterations was sufficient by inspecting plots of chain means and variances. We combined the results of analysis using Rubin's rules.
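The marginal Nelson-Aalen cumulative hazard used as an imputation covariate is the running sum, over distinct follow-up times, of deaths divided by the number still at risk. A minimal sketch of the estimator (not the study's code; the function name is hypothetical):

```python
from collections import Counter


def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard at each distinct follow-up time.

    times  : follow-up time for each individual
    events : True if the individual died at that time, False if censored
    Returns a dict mapping each distinct time to the cumulative hazard there.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)   # everyone leaves the risk set at their own time
    at_risk = len(times)
    cumhaz = 0.0
    result = {}
    for t in sorted(leaving):
        cumhaz += deaths.get(t, 0) / at_risk
        result[t] = cumhaz
        at_risk -= leaving[t]
    return result
```

Each individual's covariate value in the imputation model would then be the cumulative hazard evaluated at that individual's own follow-up time.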
The two methods of multiple imputation used in CALIBER data (normal-based and Random Forest MICE) yielded almost identical estimates (Supplementary Figure S8).

Multiple imputation: PREDICT
The baseline population in PREDICT consisted of all individuals undergoing cardiovascular risk assessment in primary care. Cardiovascular risk factors were almost completely observed but about 30% of people did not have a record of total white cell count. As individuals with a measurement of white cell count would be a (possibly biased) sample, we carried out analyses both limited to those with a white cell count measurement and among the entire baseline population, using multiple imputation to handle missing values. We imputed missing values using Random Forest. Imputation models included all the variables in the substantive model, event indicator and time as the marginal Nelson-Aalen cumulative hazard. 9 We generated 10 imputed datasets and combined the results of analysis using Rubin's rules.
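Rubin's rules combine the per-imputation results by averaging the estimates and adding the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance. A minimal sketch, assuming the estimates are log hazard ratios with their squared standard errors (the function name is hypothetical):

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances using Rubin's rules.

    estimates : one coefficient (e.g. log hazard ratio) per imputed dataset
    variances : the squared standard error of each estimate
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)
```

With 10 imputed datasets, the between-imputation variance is inflated by a factor of 1.1 before being added to the within-imputation variance.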
Complete case analysis of PREDICT data yielded similar estimates to the main imputed analysis (Supplementary Figure S9).

Supplementary Table S1.

Data sources
Full general practice record: All coded data. Some free text available.

Supplementary Figure S1. Scaled Schoenfeld residuals for multiply adjusted hazard ratio for mortality comparing quintiles of total white cell count: CALIBER
Hazard ratios were adjusted for age, sex, smoking, diabetes, systolic blood pressure, ethnicity and total:HDL cholesterol ratio.

Supplementary Figure S2. Multiply adjusted hazard ratios for all-cause mortality by category of total white cell count, by age group: CALIBER
Hazard ratios were adjusted for age, sex, smoking, diabetes, systolic blood pressure, ethnicity and total:HDL cholesterol ratio. P values: * <0.05, ** <0.01, *** <0.001.

Supplementary Figure S5. Multiply adjusted hazard ratios for all-cause mortality by category of total white cell count, by sex: CALIBER
Hazard ratios were adjusted for age, sex, smoking, diabetes, systolic blood pressure, ethnicity and total:HDL cholesterol ratio. P values: * <0.05, ** <0.01, *** <0.001.

Supplementary Figure S7. Multiply adjusted hazard ratios for all-cause mortality by category of total white cell count, by ethnicity: PREDICT
Hazard ratios were adjusted for age, sex, smoking, diabetes, systolic blood pressure, ethnicity and total:HDL cholesterol ratio. P values: * <0.05, ** <0.01, *** <0.001. The 'Other' group is not shown because most categories had too few events for the calculation of hazard ratios.