

Enhancing risk stratification for use in integrated care: a cluster analysis of high-risk patients in a retrospective cohort study
Sabine I Vuik (1), Erik Mayer (2), Ara Darzi (1,2)
  1. Institute of Global Health Innovation, Imperial College, St Mary's Hospital, London, UK
  2. Department of Surgery, Imperial College, St Mary's Hospital, London, UK
Correspondence to Sabine I Vuik; s.vuik{at}


Objective To show how segmentation can enhance risk stratification tools for integrated care, by providing insight into different care usage patterns within the high-risk population.

Design A retrospective cohort study. A risk score was calculated for each person using a logistic regression, which was then used to select the top 5% high-risk individuals. This population was segmented based on the usage of different care settings using a k-means cluster analysis. Data from 2008 to 2011 were used to create the risk score and segments, while 2012 data were used to understand the predictive abilities of the models.

Setting and participants Data were collected from administrative data sets covering primary and secondary care for a random sample of 300 000 English patients.

Main measures The high-risk population was segmented based on their usage of 4 different care settings: emergency acute care, elective acute care, outpatient care and GP care.

Results While the risk strata predicted care usage at a high level, usage varied significantly within the high-risk population. Four different groups of high-risk patients could be identified. These 4 segments had distinct usage patterns across care settings, reflecting different levels and types of care needs. The 2008–2011 usage patterns of the 4 segments were consistent with the 2012 patterns.

Discussion Cluster analyses revealed that the high-risk population is not homogeneous, as there exist 4 groups of patients with different needs across the care continuum. Since the patterns were predictive of future care use, they can be used to develop integrated care programmes tailored to these different groups.

Conclusions Usage-based segmentation augments risk stratification by identifying patient groups with different care needs, around which integrated care programmes can be designed.


Strengths and limitations of this study

  • This study uses a large data set containing patient-level linked primary and secondary care administrative data.

  • Rather than focusing only on emergency care, this study looks at patterns of usage across different care settings to support the development of integrated care programmes.

  • Where previous studies have focused on how to identify or manage high-risk patients, this study explores the different patient groups within the high-risk stratum.

  • The data used were for a random sample of English patients, and may not reflect local trends.

  • No data were available in linked format for other care settings, such as accident and emergency, mental health, community and social care.


In healthcare, a small number of patients accounts for a disproportionally large share of usage.1 ,2 Risk stratification can be applied to identify and target this group. Risk stratification divides a population based on different levels of risk of a specific outcome, and is often presented as a core process to achieve integrated, personalised care.3–5 For each stratum, a tailored care model can be developed which addresses the specific needs of the patients. Many of the interventions for high-risk patients are primary care-led integrated care programmes, like virtual wards, case management and enhanced services and access.4 ,6–11

Risk stratification methods often focus on predicting emergency hospitalisations.3 ,12–15 Unplanned hospitalisations, including readmissions, are chosen because they are costly for a health system, may indicate low quality care and have a negative impact on patient experience.16 ,17 As such, unplanned hospitalisations are reflective of all elements of the triple aim of healthcare—quality of care, patient experience and cost18—and can be considered a ‘triple fail event’.16 Moreover, since preventing emergency hospitalisations to the acute setting requires effective primary care, they are also an important metric for integrated care.19

However, risk stratification based on emergency hospitalisations has important limitations. First, this approach only looks at one element of care. While the risk of an emergency hospitalisation can be expected to correlate with the overall use of emergency acute care, usage of other care services may vary. A patient with an emergency hospitalisation may be under treatment with a specialist; or regularly visit a general practitioner (GP); or not access ambulatory care at all. In order to design effective integrated care programmes that link up the appropriate care providers, understanding care use across all settings is crucial.

Second, detailed information on the characteristics of the high-risk patients, such as age, morbidities and socioeconomic status, is lost in the final risk score. All patients who end up in the top stratum have high risk scores, but the factors driving this high score can be very different. When developing interventions, these factors should be taken into account to understand which patients are most likely to respond to different interventions.12 ,20

The aim of this study is to show how usage-based segmentation can enhance risk stratification tools used for integrated care by, first, taking into account care usage across multiple care settings and, second, providing insight into the characteristics of different patient groups within the high-risk stratum.


Study design

To show how segmentation can augment risk stratification, we applied both methods to a large patient database. We first trained a risk prediction model to generate risk scores for each patient. Based on these risk scores, we identified the high-risk patient population. In this group, we applied a cluster analysis to a range of different usage variables. The different clusters were analysed and profiled to understand the different patient types that exist within a high-risk group.

The analyses were conducted for hypothetical ‘historical’ (2008–2011) and ‘future’ (2012) data sets. The historical data set reflects the information that would be available to healthcare professionals conducting risk stratification and cluster analysis at the end of 2011, while the future data set was used to understand how accurately the models predicted actual usage in the following year.


STATA (V.14) (Stata Statistical Software: Release 14. [program]. College Station, Texas: StataCorp LP, 2015) was used to perform the cluster analyses and calculate the pseudo-F statistics. For all other analyses, including the risk prediction, SPSS (V.23) (IBM SPSS Statistics for Macintosh, Version 23.0 [program]. Armonk, New York: IBM Corp, 2015) was used.


A data set covering primary and secondary care use for a random sample of 300 000 English patients was constructed from Clinical Practice Research Datalink (CPRD) and Hospital Episode Statistics (HES) data (CPRD ISAC approval under protocol 14_211R). Patients were eligible for inclusion if they were registered with a CPRD-participating GP practice during the entire study period of 2008 up to and including 2012, and if their HES records could be linked to CPRD. Other than those two criteria, the sample was entirely random. The CPRD data set is broadly representative of the age, sex and ethnicity composition of the UK population.21 In England, Clinical Commissioning Groups (CCGs) are responsible for the planning and commissioning of care for local populations. The sample size in this study was set at 300 000, which is similar to the population of a CCG in the 75th centile,22 to reflect a typical local population in England.

The final data set included patient demographics, long-term condition (LTC) diagnoses and usage variables. We selected four high-level usage variables for the cluster analysis of high-risk patients: inpatient emergency hospitalisations, inpatient non-emergency hospitalisations, outpatient attendances and GP visits. These usage variables were used to reflect different care settings that may be incorporated in integrated care models. For the cluster analysis, the usage variables were log-normalised and standardised to reduce the impact of outliers and give equal weight to each variable.
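The transformation described above can be sketched as follows. This is a minimal illustration rather than the study's own code: the paper does not state the exact log transform, so log(1 + x) is assumed here because usage counts can be zero, and the function name `log_standardise` and the example counts are hypothetical.

```python
import math
from statistics import mean, pstdev


def log_standardise(values):
    """Log-transform counts, then rescale to mean 0 and standard deviation 1.

    Assumption: log(1 + x) is used so that zero counts remain defined.
    """
    logged = [math.log1p(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        # All values identical: nothing to scale.
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]


# Hypothetical GP visit counts for five patients, including one outlier.
gp_visits = [0, 2, 5, 12, 150]
scaled = log_standardise(gp_visits)
print([round(s, 2) for s in scaled])
```

The log step compresses the outlier before scaling, and the standardisation puts each usage variable on a common scale so that no single variable dominates the Euclidean distances used in clustering.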

Risk stratification

We calculated our own risk prediction score, reflecting predictor variables used in the Patients at Risk of Re-hospitalisation (PARR) tool, the Combined Predictive Model and other commonly used risk prediction algorithms. The risk model was trained to predict emergency hospitalisations in 2012, using a stepwise logistic regression.14 ,23 The number of emergency hospitalisations in 2011 was included as one of the predictor variables, as well as a range of other variables used in previous risk models,13–15 ,24 as detailed in online supplementary appendix 1. The logistic regression on the training set excluded a number of diagnosis variables after stepwise elimination, as well as the 75+ flag.

To validate the model, a split sample validation method was used. Using the random sample function of SPSS, half of the sample was defined as the training set and the other half as the test set. When the risk model was applied to the test set, the area under the receiver operating characteristic (ROC) curve was 0.75. This is in line with other models predicting emergency hospitalisations, for which this value ranges from 0.55 to 0.83.13 ,24 The test population was stratified into three groups, which comprised the top 5% highest risk patients (‘High risk’), the top 5–20% (‘Medium risk’) and the remaining 80% of the population (‘Low risk’), in accordance with general risk stratification practice.2 ,15 ,17
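The stratification step can be illustrated with a short sketch. It assumes that the strata are assigned by rank on the predicted risk score, with the 5% and 15% shares taken from the text; the function name `stratify`, the tie handling and the example scores are hypothetical and not taken from the study.

```python
def stratify(risk_scores, high_share=0.05, medium_share=0.15):
    """Label each score 'High', 'Medium' or 'Low' by rank from the top."""
    n = len(risk_scores)
    # Patient indices ordered from highest to lowest risk score.
    order = sorted(range(n), key=lambda i: risk_scores[i], reverse=True)
    n_high = round(n * high_share)
    n_medium = round(n * medium_share)
    labels = ["Low"] * n
    for rank, idx in enumerate(order):
        if rank < n_high:
            labels[idx] = "High"
        elif rank < n_high + n_medium:
            labels[idx] = "Medium"
    return labels


# Hypothetical scores for 100 patients: 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
labels = stratify(scores)
counts = {s: labels.count(s) for s in ("High", "Medium", "Low")}
print(counts)
```

Ranking rather than applying a fixed score threshold keeps the stratum sizes constant, which is convenient when a programme has capacity for a set share of the population.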


For the segmentation analysis, the k-means algorithm was used to cluster the patients based on their historical usage. This method was selected as it is efficient and produces roughly similar-sized segments.25 Clustering solutions ranging from 2 to 8 clusters were explored for the high-risk stratum. To identify the optimal number of clusters, the pseudo-F statistic was calculated for all the clustering solutions using STATA. This statistic is commonly used in healthcare clustering studies,26–30 and is one of the best criteria to determine the number of clusters.31 It compares the between-cluster with the within-cluster sum-of-squares, and a large pseudo-F statistic indicates distinct clusters.32 In addition, the different clustering solutions were also explored using Ward's linkage clustering and post hoc analysis, as detailed in online supplementary appendix 2. The k-means and Ward's clustering analyses used the Euclidean distance measure.
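To make the pseudo-F comparison concrete, the sketch below computes the statistic for a given cluster assignment. It assumes the Calinski-Harabasz form, that is, the between-cluster sum of squares divided by (k - 1) over the within-cluster sum of squares divided by (n - k), where k is the number of clusters and n the number of patients. The function name `pseudo_f` and the toy data are hypothetical; the study itself used STATA for this calculation.

```python
def pseudo_f(points, labels):
    """Pseudo-F (assumed Calinski-Harabasz form) for one clustering solution.

    points: list of equal-length numeric tuples (one per patient)
    labels: list of cluster labels, same length as points
    """
    n = len(points)
    dims = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    grand = [sum(p[d] for p in points) / n for d in range(dims)]

    between = 0.0  # between-cluster sum of squares
    within = 0.0   # within-cluster sum of squares
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dims)]
        between += len(members) * sum(
            (centroid[d] - grand[d]) ** 2 for d in range(dims))
        within += sum(
            (p[d] - centroid[d]) ** 2 for p in members for d in range(dims))

    # Undefined when within == 0 (every point sits on its centroid).
    return (between / (k - 1)) / (within / (n - k))


# Toy one-dimensional example with two well-separated groups.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
good = pseudo_f(points, [0, 0, 1, 1])   # groups match the structure
poor = pseudo_f(points, [0, 1, 0, 1])   # groups cut across the structure
print(good, poor)
```

In practice, the statistic would be computed for each candidate number of clusters, and solutions at or near a peak would then be inspected, as was done for the 2 to 8 cluster solutions here.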

The clusters were evaluated based on their validity, through statistical tests confirming the differences between clusters, and their stability, by comparing future care usage of each cluster to the historical pattern.


To create profiles for the segments, the usage variables as well as demographic characteristics were analysed to see if they differed significantly across segments. For the non-normal usage and LTC count variables, a Kruskal-Wallis test was used. For the continuous age and risk score variables, an ANOVA test was used, and for the binary morbidity variables and the 2012 emergency hospitalisation flag, a χ² test. Where these tests found significant variation across segments, the results were then explored pairwise between segments to identify which segment or segments were significantly different from others. For this, Mann-Whitney U tests, Student's t-tests and z-tests were used, respectively. To account for the multiplicity problem that occurs when performing multiple tests, the Bonferroni method was used to adjust the significance level.33–35
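The Bonferroni adjustment for pairwise comparisons can be sketched as below. The paper does not report the unadjusted significance level or the exact number of comparisons used for the adjustment, so the 0.05 level, the choice to adjust for all pairs of four segments, and the function name `bonferroni_pairs` are assumptions made for illustration.

```python
from itertools import combinations


def bonferroni_pairs(groups, alpha=0.05):
    """Return all pairwise comparisons and the Bonferroni-adjusted alpha."""
    pairs = list(combinations(groups, 2))
    # Divide the overall significance level by the number of tests.
    return pairs, alpha / len(pairs)


# Four segments give six pairwise comparisons.
pairs, adjusted_alpha = bonferroni_pairs([1, 2, 3, 4])
print(len(pairs), round(adjusted_alpha, 5))


def is_significant(p_value, threshold=adjusted_alpha):
    """Compare a single pairwise p value with the adjusted threshold."""
    return p_value < threshold
```

A pairwise p value is then treated as significant only if it falls below the adjusted threshold, which limits the chance of a false positive across the whole family of tests.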


The final data set contained 298 111 people with a complete record across the variables, of whom 149 320 were allocated to the test set used for the analyses below. When the population was stratified based on risk, predictive variables such as age, LTCs and historical care usage were all found to increase with each risk stratum (see table 1). In addition to historical usage, future usage of all care types also increased for the high-risk stratum.

Table 1

Strata characteristics

For the high-risk population, k-means cluster analyses were performed for 2–8 clusters and the pseudo-F statistic was obtained for each solution. A peak was observed around the 3-cluster and 4-cluster solutions. On exploring these two sets of clusters, we found that the 4-cluster solution included an additional, contrasting usage pattern, and it was therefore selected.

The cluster analysis aims to optimise the distance between groups for the clustering variables, and statistical tests confirm that historical usage is significantly different across segments (see table 2). In addition, non-clustering variables, including future usage, age, number of LTCs and most disease prevalence variables, also differ significantly across the clusters.

Table 2

Clusters within the high-risk population

The clusters demonstrate a great variation in future care usage within the high-risk stratum (see figure 1). Emergency care usage, which defines high-risk patients, is high for all clusters. Nevertheless, clusters 1 and 3 have emergency care usage rates that lie closer to the medium risk stratum than the high-risk average. Non-emergency hospitalisations and outpatient attendances for clusters 3 and 4 are at or even below the medium risk rate. GP care, on the other hand, is more homogeneous, with the rates for each cluster close to the high-risk average.

Figure 1

Mean future care usage for the risk strata—high (H), medium (M) and low (L)—and the four high-risk clusters—1, 2, 3 and 4.

While each care setting has high and low usage clusters, these are not consistently the same clusters. Each cluster has a unique pattern of usage rates (see figure 2). Cluster 1 has high usage across most care types, with the exception of emergency care. Cluster 4 has the opposite pattern, with high emergency care use but low usage of other care types. Clusters 2 and 3 have high and low usages across all settings, respectively. The differences between the clusters are strongest for historical care usage, on which the cluster analysis is based. However, each cluster exhibits the same pattern of usage in 2012.

Figure 2

Patterns of usage for the four high-risk clusters—emergency care hospitalisations (Emg), non-emergency hospitalisations (NonE), outpatient attendances (OP) and general practitioner visits (GP) versus the high-risk population mean.


Principal findings

The low, medium and high risk strata broadly correlate with care usage. For all care settings, the high-risk stratum has the highest historical and future usage. However, this study shows that, within the high-risk stratum, there is significant variation in care needs across the care continuum. The high-risk group can be split into four segments with different care usage rates, characteristics and care priorities.

Comparing historical and future usage for the four clusters, similar patterns can be observed, indicating that cluster analysis of historical data can help predict future needs. However, future usage rates were closer to the group mean for all clusters and all care settings than historical rates. This can be at least partially explained by regression to the mean (RTM), which is known to affect care usage predictions.12 ,36 ,37 RTM describes the phenomenon where exceptionally high or low observations tend to be followed by less extreme observations in repeated measurements.38 This effect is compounded if participants are stratified based on baseline measurements, which is the case when patients are clustered based on their 2008–2011 usage.
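Regression to the mean under selection on a baseline measurement can be illustrated with a small simulation. This is a sketch on synthetic data only: the distributions, parameter values and variable names are hypothetical and are not estimates from the study data.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

N_PATIENTS = 20_000
TOP_SHARE = 0.05

# Each hypothetical patient has a stable underlying usage level plus
# independent noise in each period (illustrative values only).
underlying = [random.gauss(5.0, 1.0) for _ in range(N_PATIENTS)]
baseline = [u + random.gauss(0.0, 2.0) for u in underlying]
follow_up = [u + random.gauss(0.0, 2.0) for u in underlying]

# Select the top 5% using the baseline measurement only.
n_top = round(N_PATIENTS * TOP_SHARE)
top = sorted(range(N_PATIENTS), key=lambda i: baseline[i],
             reverse=True)[:n_top]

baseline_top_mean = mean(baseline[i] for i in top)
follow_up_top_mean = mean(follow_up[i] for i in top)
population_mean = mean(follow_up)

print(f"Selected group, baseline period:  {baseline_top_mean:.2f}")
print(f"Selected group, follow-up period: {follow_up_top_mean:.2f}")
print(f"Whole population, follow-up:      {population_mean:.2f}")
```

Because the group is selected partly on period-specific noise, its follow-up mean moves back towards the population mean while remaining above it, which mirrors the less extreme 2012 usage rates observed for the clusters.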

Comparison to previous studies

This study shows that, while integrated care and case management initiatives often are indiscriminately aimed at high-risk patients, the actual needs of these patients vary widely. Many studies have discussed how best to identify,13 ,14 ,39 ,40 or care for,6 ,8 ,10 ,11 ,36 ,41 the high-risk population, but few have used data analysis to better understand different types of high-risk patients.

A major strength of this study is its reliance on data from primary and acute care, to create a more comprehensive picture of care needs. While some risk prediction models, such as the Combined Predictive Model, include usage of non-acute care settings as predictor variables,15 this detail is lost in the final risk score and the stratification. A usage-based segmentation analysis, as demonstrated in this study, can be used to bring out this detail.

Limitations and future research

While primary and secondary care data were used in this study to understand care needs across the continuum, the picture is still incomplete. No patient-level linked data were available on the usage of the Accident and Emergency (A&E) department, mental health, community and social care, and these were therefore left out of scope. This is an important limitation, as many initiatives will require integration of these settings. Future research should be performed using more extensive data sets where these are available.

Another limitation is that the population used in this study is a random sample of patients in England. In this specific sample, the LTC prevalence was relatively low. This could be attributable to the fact that conditions were identified based on coded diagnoses in the administrative data rather than from disease registries, but it could also be a characteristic of our sample. Local populations may see different sizes or types of segments within their risk strata. Moreover, this study uses a custom risk prediction algorithm. If providers are using a specific risk model, they are encouraged to replicate the analysis using their own population data and risk strata.

Implications for integrated care

Segmenting the high-risk stratum using cluster analysis can help tailor and target integrated care programmes. For example, cluster 1 uses relatively little emergency care, but has a high usage of non-emergency and outpatient care. Patients in this segment may not be the best target for primary care-led interventions aimed at reducing emergency hospitalisations, as their overall usage of emergency care is low and they may already be under management of a specialist.

Cluster 2 has the highest usage rates, the highest risk score and the most LTCs. Surprisingly, this segment is also the youngest of the four, with an average age of 67. Overall, high care usage makes this cluster a worthwhile target for interventions aimed at reducing care use. As patients in this cluster have extensive care needs across different settings, they would likely benefit from care coordination and case management initiatives.

At 83 years, cluster 3 is the oldest segment. Despite their old age, disease prevalence among the patients in this cluster is generally lower. This is reflected in their lower than average care use across all settings. This segment shows that while interventions often focus on elderly patients,6 ,36 ,42 this population group does not necessarily have the highest care usage.

Cluster 4 has one of the highest usage rates for emergency care, combined with a lower use of all other care services. Even GP care, which varies little for the other clusters, is below average for this group. This could indicate a lack of preventive primary care: patients in this cluster have on average 1.7 LTCs, but their low usage of primary care could be causing complications which require emergency care. This would make cluster 4 a prime target for enhanced services and primary care-led interventions focused on preventing complications and emergency hospitalisations.

However, it is important to note that the above implications are theoretical and have not been confirmed in practice. Future research is needed to translate the theoretical concepts presented in this paper into actionable information, including effective interventions and implementation.


This paper shows that a high risk of emergency hospitalisation is not unequivocally linked to high overall care needs, or a particular pattern of care use across other care settings. While risk stratification based on emergency hospitalisation can predict general care usage rates, within the high-risk stratum, there exist four very different patient types. Cluster analysis can enhance risk stratification by identifying groups of high-risk patients with unique care patterns across the care continuum, around which integrated care programmes can be designed.




  • Twitter Follow Sabine Vuik @sabinevuik

  • Database This study is based on data from the Clinical Practice Research Datalink obtained under licence from the UK Medicines and Healthcare Products Regulatory Agency. However, the interpretation and conclusions contained in the study are those of the authors alone.

  • Contributors SIV designed the study, created the database, analysed the data and drafted and revised the paper. She is the guarantor. EM contributed to the design of the study, analysed the results and revised the draft paper. AD contributed to the design of the study and revised the draft paper. All have approved the final version for publication.

  • Funding This study was partially funded by the Sowerby eHealth Forum, sponsored by the Peter Sowerby Foundation.

  • Disclaimer The funder had no role in the study design or analysis, or in the drafting and submission of this paper. The researchers worked independently of the funders.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement The technical appendix is available in the online supplementary files, and statistical code is available from the corresponding author.
