Objectives The aim of this study was to identify, with soft clustering methods, multimorbidity patterns in the electronic health records of a population ≥65 years, and to analyse such patterns in accordance with the different prevalence cut-off points applied. Fuzzy cluster analysis allows individuals to be linked simultaneously to multiple clusters and is more consistent with clinical experience than other approaches frequently found in the literature.
Design A cross-sectional study was conducted based on data from electronic health records.
Setting 284 primary healthcare centres in Catalonia, Spain (2012).
Participants 916 619 eligible individuals were included (women: 57.7%).
Primary and secondary outcome measures We extracted data on demographics, International Classification of Diseases version 10 chronic diagnoses, prescribed drugs and socioeconomic status for patients aged ≥65. Following principal component analysis of categorical and continuous variables for dimensionality reduction, machine learning techniques were applied for the identification of disease clusters in a fuzzy c-means analysis. Sensitivity analyses, with different prevalence cut-off points for chronic diseases, were also conducted. Solutions were evaluated from clinical consistency and significance criteria.
Results Multimorbidity was present in 93.1%. Eight clusters were identified with a varying number of disease values: nervous and digestive; respiratory, circulatory and nervous; circulatory and digestive; mental, nervous and digestive, female dominant; mental, digestive and blood, female oldest-old dominant; nervous, musculoskeletal and circulatory, female dominant; genitourinary, mental and musculoskeletal, male dominant; and non-specified, youngest-old dominant. Nuclear diseases were identified for each cluster independently of the prevalence cut-off point considered.
Conclusions Multimorbidity patterns were obtained using fuzzy c-means cluster analysis. They are clinically meaningful clusters which support the development of tailored approaches to multimorbidity management and further research.
- chronic conditions
- cluster analysis
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
Most studies focus on diseases rather than individuals as the unit of analysis in assessing multimorbidity patterns (hard clustering forces each individual to belong to a single cluster, whereas soft clustering allows elements to be simultaneously classified into multiple clusters).
Reliable and valid identification of disease clusters is needed for the development of evidence-based clinical practice guidelines and pathways of care for patients that correspond to the wide spectrum of diseases in patients with multimorbidity.
Soft clustering analysis allows for diseases to be linked simultaneously to multiple clusters and is more consistent with clinical experience than other approaches frequently found in the literature.
The different cut-off points (prevalence filters) applied to obtain multimorbidity patterns permitted the identification of common nuclear diseases which remained independent of their prevalence.
The literature provides support for the aetiopathophysiological and epidemiological associations between conditions forming part of the same cluster.
The term multimorbidity widely refers to the existence of numerous medical conditions in a single individual.1 In many regions of the world there is evidence that a substantial, and probably growing, proportion of the adult population is affected by multiple chronic conditions. Moreover, the association of multimorbidity with increasing age leading to a two-fold prevalence in the final decades of life has been proven.2 Multimorbidity has been estimated to be at around 62% between 65 and 74 years, and around 81.5% after 85 years.3 Its true extent is, however, difficult to gauge as there is no agreed definition or classification system.4–7
Most of the published literature focuses on diseases rather than individuals as the unit of analysis in assessing multimorbidity patterns.8 Orienting the analysis of multimorbidity patterns at the level of the individual, rather than the disease, could have crucial implications for patients. In the current context of limited evidence on interventions for unselected patients with multimorbidity, such an approach would allow better understanding of population groups, and facilitate the development and implementation of strategies aimed at prevention, diagnosis, treatment and prognosis. It would also elicit essential information for the development of clinical guidelines and pathways of care, and lead to a better understanding of the nature and range of the required health services.9 10
Cluster analysis involves assigning individuals so that the items (diseases) in the same cluster are as similar as possible, while individuals belonging to different clusters are as dissimilar as possible. The identification of clusters is based on similarity measures and their choice may depend on the data or the purpose of the analysis.11 12 Hard clustering forces each element to belong to a single cluster, whereas soft clustering (also referred to as fuzzy clustering) allows elements to be simultaneously classified into multiple clusters.
Empirical evidence is needed on how both established and novel techniques influence the identification of multimorbidity patterns. A recent systematic review recommended that future epidemiological studies cover a broad selection of health conditions in order to avoid missing potentially key nosological associations and enhance external validity. When many conditions are considered, the clustering of individuals based on morbidity data will encounter high-dimensional issues. This is particularly important when a clustering-based approach is adopted to assess the impact of multimorbidity on individual health outcomes and health service uses.2 8 13–15
The identification of multimorbidity patterns seems to be implicitly dependent on the prevalence of the included diseases.2 8 16 17 However, to the best of our knowledge no previous study has analysed the identification of multimorbidity patterns explicitly based on the prevalence of the diseases.
The aim of this study was to identify, with soft clustering methods, multimorbidity patterns in the electronic health records of a population ≥65 years, and to analyse such patterns in accordance with the different prevalence cut-off points applied.
A cross-sectional analysis was carried out in Catalonia (Spain), a Mediterranean region of 7 515 398 inhabitants (2012). The Catalan Health Institute provides universal coverage and operates 284 primary healthcare centres (PHC).
Since 2006 the Information System for Research in Primary Care (SIDIAP) database includes anonymised longitudinal electronic health records from primary and secondary care which gather information on demographics, diagnoses, prescriptions and socioeconomic status.18 In our study the inclusion criteria were individuals aged 65–99 years on 31 December 2011 with at least one PHC visit since 2012. Only participants who survived until 31 December 2012 (index date) were included in the analysis.
Diseases were coded in the SIDIAP using the International Classification of Diseases version 10. An operational definition of multimorbidity was the simultaneous presence of more than one of the selected 60 chronic diseases previously identified by the Swedish National study of Aging and Care in Kungsholmen (SNAC-K).19
Additional variables included in the study were sociodemographics (age, sex, socioeconomic status (MEDEA index20)), clinical variables (including number of chronic diseases and invoiced drugs) and use of health services (number of visits to family physicians, nurses and emergency services).
Descriptive statistics were used to summarise overall information. Disease prevalence was computed for all the included population. Descriptive analyses were stratified by the presence of multimorbidity. Comparison was performed using Student’s t-test or Mann-Whitney test for continuous variables and χ2 test for categorical ones.
In order to obtain the most representative clusters all patients were included irrespective of whether they presented multimorbidity or not. Sex and age variables, together with chronic diseases selected by prevalence, were included in the analysis. The number of features to be considered varied from the 62 original ones (no prevalence filtering applied) to 54 and 49, for a 1% and 2% prevalence threshold, respectively.
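The prevalence filtering described above can be illustrated with a minimal sketch. It is not the authors' code (the analyses were run in R); the record layout, the function name `filter_by_prevalence` and the toy disease names are assumptions made for illustration only. Sex and age, which the study always retained as features, are left out of the toy data.

```python
def filter_by_prevalence(records, threshold):
    """Keep disease columns whose prevalence is >= threshold.

    records: list of dicts mapping disease name -> 0/1 (hypothetical layout).
    threshold: minimum prevalence as a proportion, e.g. 0.02 for 2%.
    """
    n = len(records)
    diseases = sorted(records[0].keys())
    # Prevalence is computed on the whole included population
    prevalence = {d: sum(r[d] for r in records) / n for d in diseases}
    kept = [d for d in diseases if prevalence[d] >= threshold]
    return kept, prevalence


# Toy population of 200 individuals with three hypothetical diseases
records = [
    {
        "hypertension": 1 if i < 142 else 0,  # 71.0% prevalence
        "rare_a": 1 if i < 3 else 0,          # 1.5% prevalence
        "rare_b": 1 if i < 1 else 0,          # 0.5% prevalence
    }
    for i in range(200)
]

kept_none, prevalence = filter_by_prevalence(records, 0.0)  # no filtering
kept_1pct, _ = filter_by_prevalence(records, 0.01)          # >=1% filter
kept_2pct, _ = filter_by_prevalence(records, 0.02)          # >=2% filter
```

In this toy example the feature set shrinks as the threshold rises, in the same way that the study's feature count went from 62 to 54 and 49.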
Due to the large number of diseases, a principal component analysis for categorical and continuous data (PCAmix) was implemented to reduce complexity. With this technique both continuous and dichotomous variables were simultaneously processed through the application of Multiple Correspondence Analysis to the binary variables and PCA to the continuous ones. Using Karlis-Saporta-Spinaki criterion to select the optimal number of dimensions to retain, the data set of 49 features per individual per 2% prevalence cut-off was transformed to a new dimensionally reduced data set of 13 continuous features per individual, which concentrated most of the variability of the newly transformed data set.21
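The text does not state the retention rule explicitly. As commonly cited, the Karlis-Saporta-Spinaki criterion keeps the components whose eigenvalue exceeds 1 + 2√((p − 1)/(n − 1)), where p is the number of variables and n the number of individuals. The sketch below assumes that form of the rule; the function names and the toy eigenvalues are hypothetical, and the eigenvalues actually obtained from PCAmix are not reported in this section.

```python
import math


def kss_threshold(n_individuals, n_variables):
    """Eigenvalue cut-off under the assumed form of the rule:
    1 + 2 * sqrt((p - 1) / (n - 1))."""
    return 1.0 + 2.0 * math.sqrt((n_variables - 1) / (n_individuals - 1))


def n_components_to_retain(eigenvalues, n_individuals, n_variables):
    """Count the eigenvalues strictly above the cut-off."""
    cutoff = kss_threshold(n_individuals, n_variables)
    return sum(1 for ev in eigenvalues if ev > cutoff)


# Toy example with made-up eigenvalues: 101 individuals, 5 variables
toy_eigenvalues = [2.1, 1.5, 1.3, 0.6, 0.5]
toy_cutoff = kss_threshold(101, 5)
toy_retained = n_components_to_retain(toy_eigenvalues, 101, 5)

# Cut-off for the study's sample size at the 2% prevalence filter
study_cutoff = kss_threshold(916619, 49)
```

With a sample as large as the one in this study the cut-off is only slightly above 1, so under this assumption the rule behaves much like a strict eigenvalue-greater-than-one criterion.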
Once the transformed data set was obtained, clusters of chronic conditions at baseline were identified using the fuzzy c-means clustering algorithm.22 This machine learning technique assigns every individual to every cluster according to the individual's characteristics, giving each individual a membership degree factor in (0, 1) with respect to each pattern. This flexibility allows patients to belong to more than one multimorbidity pattern.23
The main parameters in this clustering procedure were the number of clusters and a fuzziness parameter, denoted m, which ranged from just above 1 to infinity. High m values produce a fuzzy set of clusters, so that individuals are equally distributed across clusters, whereas lower ones generate non-overlapped clusters. Further details on the stability and validation techniques applied to obtain the best fuzzy c-means parameters and the set of centroids are presented in online supplementary additional file 1.
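To make the role of the two parameters concrete, the following is a minimal pure-Python sketch of the standard fuzzy c-means updates. It is illustrative only and is not the authors' implementation: the function names are hypothetical, initial centroids are passed explicitly rather than drawn from random seeds, and the toy points stand in for the PCAmix-transformed features.

```python
import math


def _membership_row(point, centroids, power):
    """Membership of one point in every cluster; each row sums to 1."""
    dists = [math.dist(point, c) for c in centroids]
    d_min = min(dists)
    if d_min == 0.0:
        # Point coincides with a centroid: share membership among exact hits
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    # u_j is proportional to d_j ** (-power); scaling by d_min avoids overflow
    weights = [(d_min / d) ** power for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def _update_centroids(points, memberships, m):
    """Centroid j is the mean of the points weighted by u_ij ** m."""
    dim = len(points[0])
    centroids = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centroids.append([
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)
        ])
    return centroids


def fuzzy_c_means(points, init_centroids, m=1.1, n_iter=100, tol=1e-6):
    """Alternate membership and centroid updates until centroids stabilise."""
    centroids = [list(c) for c in init_centroids]
    power = 2.0 / (m - 1.0)  # m must be > 1
    for _ in range(n_iter):
        memberships = [_membership_row(p, centroids, power) for p in points]
        new_centroids = _update_centroids(points, memberships, m)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    memberships = [_membership_row(p, centroids, power) for p in points]
    return centroids, memberships


# Toy data: two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
init = [(0, 0), (10, 10)]
_, crisp = fuzzy_c_means(points, init, m=1.1)  # value used in the study
_, fuzzy = fuzzy_c_means(points, init, m=3.0)  # a fuzzier alternative
```

Comparing `crisp` and `fuzzy` on the toy data shows the behaviour described above: with m close to 1 each point's membership is concentrated almost entirely in one cluster, while a larger m spreads it more evenly across clusters.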
To describe the multimorbidity patterns, frequencies and percentages of diseases (P) in each cluster were calculated. Observed/expected ratios (O/E ratios) were calculated by dividing disease prevalence in the cluster by disease prevalence in the overall population. As the membership of each individual to any of the clusters was given by a membership degree factor, and not as a binary variable, the observed disease prevalence (O) in a cluster was computed as the sum of the disease membership degree factors corresponding to all individuals suffering the disease. Exclusivity, defined as the proportion of patients with the disease included in the cluster over the total number of patients with the disease, was also calculated. Further details on how these ratios were computed using the membership factors are given in online supplementary additional file 1. A disease was considered to be part of a multimorbidity cluster when the O/E ratio was ≥2 or the exclusivity value was ≥25%.24 Cluster names were also defined taking into account the dominant gender or age in the cluster compared with the overall sample distribution.
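The membership-weighted quantities can be sketched as follows. The exact formulas are given in the supplementary file, which is not reproduced here, so this is one plausible reading rather than the authors' computation: in particular, taking the cluster size as the sum of membership factors is an assumption, and the function names and toy values are hypothetical.

```python
def cluster_disease_stats(memberships, has_disease):
    """O/E ratio and exclusivity of one disease in every cluster.

    memberships: list of rows, memberships[i][j] = degree of individual i
                 in cluster j (each row sums to 1).
    has_disease: list of 0/1 flags, one per individual.
    """
    n = len(has_disease)
    total_cases = sum(has_disease)
    expected = total_cases / n  # prevalence in the overall population
    stats = []
    for j in range(len(memberships[0])):
        # Sum of membership factors over the individuals with the disease
        weighted_cases = sum(u[j] * x for u, x in zip(memberships, has_disease))
        # Assumed cluster size: sum of membership factors over everyone
        cluster_size = sum(u[j] for u in memberships)
        observed = weighted_cases / cluster_size
        stats.append({
            "oe_ratio": observed / expected,
            "exclusivity": weighted_cases / total_cases,
        })
    return stats


def belongs_to_pattern(stat, oe_cutoff=2.0, exclusivity_cutoff=0.25):
    """Inclusion rule stated in the text: O/E >= 2 or exclusivity >= 25%."""
    return stat["oe_ratio"] >= oe_cutoff or stat["exclusivity"] >= exclusivity_cutoff


# Toy example: four individuals, two clusters, one disease
toy_memberships = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_disease = [1, 1, 0, 0]
toy_stats = cluster_disease_stats(toy_memberships, toy_disease)
toy_flags = [belongs_to_pattern(s) for s in toy_stats]
```

In the toy data the disease is flagged for the first cluster through the exclusivity criterion even though its O/E ratio stays below 2, which shows why the two criteria are combined with "or".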
We conducted a sensitivity analysis by modifying the prevalence threshold for disease inclusion in the cluster analysis. For chronic diseases we considered as alternatives no filtering, and ≥1% and ≥2% filters among the included population. In order to conform to the Karlis-Saporta-Spinaki rule, a different number of dimensions of the transformed data set were retained to construct the clusters for every prevalence cut-off: 13 dimensions for the 2% prevalence, 14 dimensions for the 1% prevalence and 17 dimensions with no filtering. The content of each cluster was compared across filtering approaches in terms of diseases associated with that cluster, characteristics of the included population and cluster size. Clinical evaluation of the consistency and significance of these solutions was also conducted.
The analyses were carried out using R V.3.3.1 (R Foundation for Statistical Computing, Vienna, Austria). The significance level was set at 0.05.
Patient and public involvement
Patients were not involved in this study, which was based on anonymised data.
In this study, 916 619 individuals were included (women: 57.7%; mean age: 75.4 (SD: 7.4)), and 853 085 (93.1%) of them met multimorbidity criteria (figure 1).
Participants’ characteristics are summarised in table 1. Statistically significant differences were present between the multimorbidity and non-multimorbidity groups for all the variables included in the analysis (table 1).
Among the 60 SNAC-K chronic diseases, the most prevalent were: hypertension (71.0%), dyslipidaemia (50.9%), osteoarthritis and other degenerative joint diseases (32.8%), obesity (28.7%), diabetes (25.1%) and anaemia (18.3%) (table 2).
Eight multimorbidity patterns were identified using the fuzzy c-means algorithm with a fuzziness parameter of m=1.1, after computing different validation indices to obtain the optimal number of clusters (online supplementary additional file 1). This number was the same for the three different prevalence thresholds: no filtering, and ≥1% and ≥2% filters. The cluster formed by the most prevalent diseases was designated non-specified, youngest-old dominant (O/E ratio <2 and exclusivity <20%). The remaining seven clusters were specific: nervous and digestive; respiratory, circulatory and nervous; circulatory and digestive; mental, nervous and digestive, female dominant; mental, digestive and blood, female oldest-old dominant; nervous, musculoskeletal and circulatory, female dominant; and genitourinary, mental and musculoskeletal, male dominant (table 3). Table 3 shows the results, considering a 2% prevalence filter, for each pattern based on the 15 diseases with the highest O/E ratios.
Women were more represented than men in almost all clusters, from 52.7% for respiratory, circulatory and neurological to 83.6% for mental, nervous and digestive, female dominant. The exception was genitourinary, mental and musculoskeletal, male dominant in which men made up 90.9% due to the presence of male reproductive system diseases (table 4).
The highest O/E ratio and exclusivity value were observed in nervous and digestive for Parkinson, parkinsonism and other neurological diseases (17.0% and 74.3%, and 15.9% and 69.4%, respectively). The lowest values were found in non-specified, youngest-old dominant. Clusters 1–3 presented the highest median number of visits with circulatory and digestive being associated with the greatest number of visits over a 1 year period (median 18 visits), and the non-specified, youngest-old dominant pattern presenting the lowest median number of visits which was equal to 5 (table 4). Online supplementary additional file 2 shows tables of variables characterising each cluster in baseline study for 1% and for no prevalence cut-off points.
Multimorbidity patterns varied according to requirements for minimal prevalence of selected conditions in the population. As an example, figure 2 depicts the composition of cluster 1 according to prevalence levels of disease, and the other clusters are shown in online supplementary additional file 3. Disease prevalence varied more greatly in the less populated patterns (eg, non-specified, youngest-old dominant) (online supplementary additional file 3). Nevertheless, there was a group that remained in some clusters across all prevalence levels, for instance, some in neurological and digestive (Parkinson and parkinsonism, other neurological diseases, chronic liver diseases, chronic pancreas, biliary tract and gall bladder diseases) formed part of the cluster regardless of changes in cut-off prevalence (online supplementary additional file 3). The selected level of prevalence resulted in changes in O/E ratios, with some of them doubling their values.
The soft clustering method we employed identified eight multimorbidity patterns, regardless of the prevalence selected. The non-specified, youngest-old dominant cluster included the largest number of individuals and those who presented the smallest multimorbidity prevalence. In this pattern diseases did not exhibit an association higher than chance because values of the O/E ratio and exclusivity were less than 2 and 20%, respectively. This suggests that such patients could change group during their lives. Two clusters presenting gender dominance were observed: nervous, musculoskeletal and circulatory, female dominant was predominantly made up of women >70 years, while genitourinary, mental and musculoskeletal, male dominant was mostly formed of men of the same age. Such patterns represent 61% of the elderly participants included in the study. The rest had fewer individuals and some diseases were over-represented such as Parkinson and parkinsonism in nervous and digestive, and asthma in respiratory, circulatory and nervous.
We observed that some diseases with O/E ratios ≥2 were consistently associated with each other as part of the same clusters (for instance, nervous and digestive; respiratory, circulatory and nervous; circulatory and digestive; and mental, nervous and digestive, female dominant) regardless of the prevalence threshold that had been set. They can be considered core components of those clusters. Further research is needed to establish the role of these conditions from a longitudinal perspective.
Comparison with the literature
Comparison with other studies is hindered by variations in methods, data sources and structures, populations and diseases studied. Nevertheless, there are similarities with other authors. The non-specified pattern is the one most replicated in the literature, for example, Prados-Torres et al who employed an exploratory factor analysis25 and our group with k-means.24 Specifically, although the age range and the exclusivity threshold in our previous study were different, the hard clustering method provided clusters that overlap with some of the patterns obtained in this study, since both clustering results were predominantly defined by the O/E ratio (≥2) criteria. However, the soft approach allows a more flexible distribution of the individual and diseases.
Recent research has provided support for physiopathological and genetic associations that explain the observed multimorbidity patterns. For instance, neurological and digestive included chronic liver disease which has been linked to Parkinson through the accumulation of toxic substances in the brain (ammonia and manganese) and neuroinflammation.26 A higher risk of Parkinson among patients with chronic hepatitis C virus has also been reported (OR: 1.35),27 in addition to associations between digestive diseases and neurodegenerative ones (eg, Parkinson and Alzheimer) through the microbiome-gut-brain axis.27 A possible link between microbiota and digestive diseases such as chronic pancreatitis and pancreatic cancer has also been suggested.28 29 For the respiratory, circulatory and neurological cluster there is evidence of an association between chronic bronchial pathology, particularly asthma and chronic obstructive pulmonary disease (COPD), and the risk of cardiovascular events.30 Longitudinal studies have observed an increased risk of developing Parkinson among individuals suffering from asthma and/or COPD.31 32 The association between asthma and allergy is known, and its coexistence defines a specific phenotype. For the circulatory and digestive cluster, non-alcoholic fatty liver disease has been associated with the development of atrial fibrillation,33 and hepatitis C infection with an increase in the risk of developing cardiovascular and cerebrovascular events.34 In addition, anaemia has been associated with advanced stages of chronic renal diseases and erythropoietin deficiency.35 Iron deficiency anaemia has been associated with an increased risk of stroke36 through thromboembolic phenomena secondary to reactive thrombocytosis. Chronic kidney disease produces auricle injuries (dilatation, fibrosis) and systemic inflammation, both of which can favour the onset and maintenance of atrial fibrillation.37
Strengths and limitations
A major strength of this study is that it has employed a large, high-quality database made up of primary care records representative of the Catalan population aged ≥65 years.18 Patterns of multimorbidity have been studied based on the whole eligible sample. This approach is epidemiologically robust as the prevalence of diseases has been estimated on the whole sample rather than limited to patients with multimorbidity.2 Another strength is that individuals rather than diseases have been considered as the unit of analysis.8 24 Such an approach permits a more realistic and rational monitoring of participants than cohort studies in order to analyse multimorbidity patterns over time. Moreover, the use of different prevalence cut-offs to obtain multimorbidity patterns has allowed the identification of nuclear diseases. We selected the higher prevalence (2%) because the patterns obtained had more clinical representativeness. The inclusion of all the potential diagnoses may have signified a greater complexity that would have hindered both the interpretation of findings and comparison with other studies.
Compared with hierarchical clustering, fuzzy c-means cluster analysis is less susceptible to: outliers in the data, choice of distance measure and the inclusion of inappropriate or irrelevant variables.38 Nevertheless, some disadvantages of the method are that different solutions for each set of seed points can occur and there is no guarantee of optimal clustering.11 To minimise this shortcoming, we carried out 100 cluster realisations with different seeds to finally use the average result of all of them. In addition, the method is not efficient when a large number of potential cluster solutions are to be considered.38 To address this limitation, we computed the optimal number of clusters using analytical indexes (online supplementary additional file 1).
Other limitations need to be taken into account. The dimensional reduction method performed in this work to reduce data complexity was PCAmix. Such methods can produce low percentages of variation on principal axes and make it difficult to choose the number of dimensions to retain. In order to decide on the most suitable number of dimensions we applied the Karlis-Saporta-Spinaki rule27 which resulted in a 13-dimensional space for the 2% prevalence cut-off. Furthermore, the feasibility of developing clinical practice guidelines in accordance with these patterns might prove difficult due to the dimension of the diseases included in each pattern. Nonetheless, new clinical practice guidelines should consider the diseases that are over-represented (O/E ratio ≥2).
Implications for practice, policy and research
Soft clustering methods offer a new methodological approach to understanding the relationships between specific diseases in individuals. This is an essential step in improving the care of patients and health systems. Analysing multimorbidity patterns permits the identification of patient subgroups with different associated diseases. Our analysis focuses on groups of patients as opposed to diseases. In this case, a disease is present in all patterns (clusters), but in different degrees. In this context, the O/E ratios are used to measure which diseases are over-represented in each cluster and to lead the clinical practice guidelines. The inclusion of varying cut-off points (prevalence filters) of the diseases that form the multimorbidity patterns allowed us to identify common nuclear diseases that remained independent from the prevalence that build such patterns.
It is noteworthy that 60% of the population ≥65 years was included in multimorbidity patterns made up of the most prevalent diseases. The rest of the population was grouped into five more specific patterns which permitted their better management.
While clinical guidelines are currently aimed at covering the management of the diseases found in the non-specified, youngest-old dominant cluster, there is a lack of information regarding the associated diseases in the other patterns. The challenge will be to refocus healthcare policy from that based on individual diseases, with the accompanying consequences (increased risk of functional decline, poorer quality of life, greater use of services, polypharmacy and increased mortality), to a multimorbidity orientation.39
Further investigation on this topic is called for with particular focus on five major issues. First, the genetic study of these patterns will help the identification of risk subgroups. Second, research is needed on the lifestyle and environmental factors (diet, physical exercise, toxic substances) associated with such patterns. Third, longitudinal studies should be performed to establish the onset order of the core diseases. Fourth, alternative approaches to handle covariates in cluster analysis should be addressed in future analysis plans. Recently, a new method that allows the covariates to be incorporated into the membership factor to model individual probabilities of cluster membership has been proposed.40 And fifth, the characteristics of the diseases in the same cluster and their potential implication on the quality of primary care should be ascertained in greater detail.
Our findings suggest non-hierarchical cluster analysis identified multimorbidity patterns and phenotypes of certain subgroups of patients that were more consistent with clinical practice.
CV and QF-B contributed equally.
Contributors All authors contributed to the design of the study, revised the article and approved the final version. CVF, ARL and SFB obtained the funding. CVF, QFB and SFB drafted the article. CVF, QFB, SFB, MGC, MCB, FF, JMV and ARL contributed to the analysis and interpretation of data. CVF, QFB and SFB wrote the first draft, and all authors contributed ideas, interpreted the findings and reviewed rough drafts of the manuscript.
Funding This work was supported by a research grant from the Carlos III Institute of Health, Ministry of Economy and Competitiveness (Spain), awarded on the 2016 call under the Health Strategy Action 2013–2016, within the National Research Program oriented to Societal Challenges, within the Technical, Scientific and Innovation Research National Plan 2013–2016 (grant number PI16/00639), cofunded with European Union ERDF funds (European Regional Development Fund) and Department of Health of the Catalan Government, in the call corresponding to 2017 for the granting of subsidies from the Strategic Plan for Research in Health (Pla Estratègic de Recerca i Innovació en Salut, PERIS) 2016–2020, modality research oriented to primary care (grant number SLT002/16/00058) and from the Catalan Government (grant number AGAUR 2017 SGR 578).
Disclaimer The views expressed in this publication are those of the author(s) and not necessarily those of the National Health Service, the National Institute for Health Research or the National Department of Health.
Competing interests None declared.
Patient consent for publication Not required.
Ethics approval The protocol of the study was approved by the Committee on the Ethics of Clinical Research, Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol) (P16/151). All data were anonymised and the confidentiality of EHR was respected at all times in accordance with national and international law.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.