Original research
Variables associated with COVID-19 severity: an observational study of non-paediatric confirmed cases from the general population of the Basque Country, Spain
  1. Kalliopi Vrotsou1,2,
  2. Rafael Rotaeche3,
  3. Maider Mateo-Abad2,
  4. Mónica Machón1,2,
  5. Itziar Vergara1,2
  1. 1Primary Care Group, Biodonostia Institute for Health Research, Donostia-San Sebastián, Spain
  2. 2Research Network in Health Services in Chronic Diseases (REDISSEC), Kronikgune Health Services Research Institute, Baracaldo, Spain
  3. 3Alza Health Center, Osakidetza-Basque Health Service, Donostia-San Sebastian, Spain
  1. Correspondence to Kalliopi Vrotsou; kalliopi.vrotsoukanari{at}


Objectives To investigate which were the most relevant sociodemographic and clinical variables associated with COVID-19 severity, and uncover how their inter-relations may have affected such severity.

Design A retrospective observational study based on electronic health record data.

Participants Individuals ≥14 years old with a positive PCR or serology test, between 28 February and 31 May 2020, belonging to the Basque Country (Spain) public health system. Institutionalised individuals and those admitted to a hospital-at-home unit were excluded from the study.

Main outcome measure Three severity categories were established: primary care, hospital/intensive care unit admission and death.

Results A total of n=14 197 cases fulfilled the inclusion criteria. Most variables presented statistically significant associations with the outcome (p<0.0001). The Classification and Regression Trees recursive partitioning methodology (based on n=13 792) suggested that, among all associations, those with age, sex, stratification of patient healthcare complexity, chronic consumption of blood and blood-forming organ drugs and nervous system drugs, as well as the total number of chronic Anatomical Therapeutic Chemical types, were the most relevant. Psychosis also emerged as a potential factor.

Conclusions Older cases are more likely to experience more severe outcomes. However, sex, underlying health status and chronic drug consumption may interfere with and alter the ageing effect. Understanding the factors related to outcome severity is of key importance when designing and promoting public health intervention plans for the COVID-19 pandemic.

  • COVID-19
  • public health
  • statistics & research methods

Data availability statement

Data may be obtained from a third party and are not publicly available. The data of the current study are stored in a server of our institution. Sharing them with external investigators will be evaluated on an individual basis and will require an approval by the Osakidetza central services. The corresponding author should be contacted.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • Over 13 000 confirmed, non-institutionalised COVID-19 cases aged ≥14 years were explored.

  • Electronic health records data were a valuable source of information in this study.

  • The three-category outcome severity—primary care only, hospitalised/intensive care unit care and death—was studied in a joint manner.

  • The Classification and Regression Trees methodology made it possible to explore the large sample and the numerous variables of interest in a flexible way.

  • Information on COVID-19 symptoms was not properly registered during the first pandemic wave.


In December 2019, the new coronavirus SARS-CoV-2 caused the first outbreak of COVID-19 in China; soon afterwards, on 11 March 2020, the WHO declared it a pandemic. The rapid expansion of the virus, along with its high death toll and serious health consequences, has made the COVID-19 outbreak one of the worst health crises worldwide in almost a century.

Since the first infections were detected in Spain, the statistics have situated this country among the most affected in Europe, both in terms of total cases and in deaths per million people.1 International literature on COVID-19 is rapidly growing.2–7 The research conducted so far in Spain has focused mainly on predicting the evolution of the pandemic8 or the factors associated with mortality.9 Hospitalised individuals have also been described,10 and the variables related to severe outcomes in these populations have been explored.11 12 But so far, none of the previous works has considered the gradient of the COVID-19 severity by studying a multiple category outcome.

The autonomous community of the Basque Country, situated in the north of Spain, has its own public health system (Osakidetza), which offers healthcare coverage to some 2.3 million people. Since 2009, Osakidetza has promoted an integrated healthcare model, by coordinating its different care levels and offering a more holistic approach to patient care.13 It has an extensive electronic health records infrastructure, where information on patient health data and episodes of care is stored. The objective of the present observational study was to describe a large series of COVID-19-infected individuals during the first pandemic wave, establish their infection severity level based on electronic health record data, and explore which characteristics may be associated with that severity. To this end, the Classification and Regression Trees (CART) methodology was applied. This statistical technique splits the sample into mutually exclusive subgroups that share the same characteristics and can be particularly useful when analysing big data sets.14


Data source and variables

All information was extracted from the electronic health records of the Basque Country Public Health System-Osakidetza, via the Osakidetza Business Intelligence tools. Data extraction covered the period between 28 February 2020 and 31 May 2020, corresponding to the first detected case in the Basque Country and the end of the first pandemic wave in Spain. Only the health records of individuals ≥14 years old with a COVID-19 positive PCR or antibody test were included, as no antigen tests were performed at that time. Data of cases living in residential homes or those admitted to a hospital at home unit were excluded.

The following variables were studied: age, sex and income level derived from the pharmaceutical co-payment scheme (<18 000€, 18 000–100 000€, >100 000€). Chronic medication consumption was explored using the Anatomical Therapeutic Chemical (ATC) system at the first level. Polypharmacy, defined as the consumption of five or more chronic drugs, and the number of ATC types consumed were derived. Chronic pathologies based on the International Classification of Diseases (ICD)-9 codes, COVID-19 symptoms registered during consultations and influenza vaccination in the year 2019 were also considered. The Osakidetza stratification according to patient healthcare complexity was studied. Based on a series of health data and the use of health services during the previous year, this variable classifies individuals into four categories, ranging from less to more severe: prevention and promotion of healthy population, self-management support, disease management and case management. Pluripathological individuals belong to the last category. This classification is renewed at the beginning of every calendar year, for all individuals ≥14 years registered in the Osakidetza system for at least the previous 6 months. A detailed description can be found elsewhere.15
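
The two derived medication variables can be illustrated with a minimal Python sketch. It is not the authors' extraction code: it assumes one ATC code per chronic drug, counts duplicate codes as a single drug, and takes the first-level ATC type to be the leading letter of the code. The function names and the example patient are hypothetical.

```python
from typing import Iterable

POLYPHARMACY_THRESHOLD = 5  # "five or more chronic drugs", as defined in the text


def atc_first_level(atc_code: str) -> str:
    """Return the first-level ATC group: the leading letter of the code."""
    return atc_code.strip().upper()[0]


def summarise_chronic_medication(atc_codes: Iterable[str]) -> dict:
    """Derive polypharmacy and the number of distinct first-level ATC types.

    Assumption of this sketch: one ATC code per chronic drug, and
    duplicate codes count as one drug.
    """
    drugs = {code.strip().upper() for code in atc_codes if code.strip()}
    atc_types = {atc_first_level(code) for code in drugs}
    return {
        "n_chronic_drugs": len(drugs),
        "polypharmacy": len(drugs) >= POLYPHARMACY_THRESHOLD,
        "n_atc_types": len(atc_types),
        "atc_types": sorted(atc_types),
    }


# Hypothetical patient: two group B drugs, two group C drugs, one group N drug.
example = summarise_chronic_medication(
    ["B01AC06", "B01AA03", "C09AA02", "C10AA05", "N05BA06"]
)
```

In this hypothetical case the patient meets the polypharmacy definition while consuming only three ATC types, which shows why the two variables carry different information.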

Given that the data were anonymous and clinical analyses could not be conducted, it was assumed that the severity of a case would be indicated by the most demanding level of medical attention received within the study period. Four severity levels were initially identified: primary care attention only (PC), hospitalisation without intensive care unit (ICU) admission (hospital), ICU admission (ICU) and death. During the pandemic, several emergency ICU units were set up within hospitals across the Basque Country. Nevertheless, this information was not reflected in the electronic health records. As a result, cases admitted to such ICUs were registered as hospital admissions. It was therefore necessary to merge hospital and ICU admissions into one category in the current work. Cases meeting the inclusion criteria were considered only once in the current analyses.
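
The outcome rule described above (most demanding care level, with hospital and ICU merged) can be sketched in Python as follows. The event labels and the default to PC for a case without recorded events are assumptions made for illustration; the actual record fields are not described in the text.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Three-category outcome, ordered from least to most severe."""
    PC = 1            # primary care attention only
    HOSPITAL_ICU = 2  # hospital and ICU admissions, merged into one category
    DEATH = 3


# Hypothetical event labels, used only for this illustration.
EVENT_TO_SEVERITY = {
    "primary_care": Severity.PC,
    "hospital": Severity.HOSPITAL_ICU,
    "icu": Severity.HOSPITAL_ICU,
    "death": Severity.DEATH,
}


def case_severity(events):
    """Return the most demanding care level among a case's recorded events.

    A case with no recorded event defaults to PC (an assumption of this
    sketch, not something stated in the text).
    """
    levels = [EVENT_TO_SEVERITY[event] for event in events]
    return max(levels, default=Severity.PC)
```

For instance, `case_severity(["primary_care", "icu"])` returns `Severity.HOSPITAL_ICU`, so each case contributes a single outcome even when it has several episodes of care.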

Patient and public involvement

Due to the study design, no patient and public involvement was considered. Nonetheless, two of the authors are medical doctors, who have offered valuable support during this work.

Statistical analysis

Continuous variables are presented as means with SDs, while medians and IQRs (Q1–Q3) are given for discrete variables. Categorical variables are presented with frequencies and percentages (%). Three-group unadjusted comparisons were performed with one-way analysis of variance, the Kruskal-Wallis test and the χ² test, respectively. The Jonckheere-Terpstra and Mantel-Haenszel χ² tests, both of which test for a trend along the three severity groups, were additionally applied.16
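
As an illustration of a trend test across ordered groups, the sketch below computes the Mantel-Haenszel (linear-by-linear association) χ² statistic, M² = (N − 1)r², where r is the Pearson correlation between row and column scores. It assumes that this is the statistic meant in the text and that equally spaced scores are used; neither detail is stated by the authors, and the counts in the usage line are invented.

```python
import math


def mantel_haenszel_trend(table, row_scores=None, col_scores=None):
    """Mantel-Haenszel chi-square for linear trend: M^2 = (N - 1) * r^2.

    `table[i][j]` is the count in row i and column j. Rows and columns are
    scored 1, 2, 3, ... unless explicit scores are supplied.
    Returns (statistic, p_value) on 1 degree of freedom.
    """
    n_rows, n_cols = len(table), len(table[0])
    u = row_scores or list(range(1, n_rows + 1))
    v = col_scores or list(range(1, n_cols + 1))
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n = sum(table[i][j] for i, j in cells)

    mean_u = sum(u[i] * table[i][j] for i, j in cells) / n
    mean_v = sum(v[j] * table[i][j] for i, j in cells) / n
    cov = sum((u[i] - mean_u) * (v[j] - mean_v) * table[i][j] for i, j in cells)
    var_u = sum((u[i] - mean_u) ** 2 * table[i][j] for i, j in cells)
    var_v = sum((v[j] - mean_v) ** 2 * table[i][j] for i, j in cells)

    r = cov / math.sqrt(var_u * var_v)
    statistic = (n - 1) * r * r
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: rows = drug not consumed / consumed,
# columns = PC, hospital/ICU, death.
statistic, p_value = mantel_haenszel_trend([[70, 20, 10], [40, 35, 25]])
```

A large statistic indicates that the proportion of consumers changes monotonically along the three severity groups, which a plain χ² test of independence does not specifically assess.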

Classification and Regression Trees

The CART methodology is a non-parametric statistical tool, which can be very useful when handling big data sets with many variables. This statistical technique partitions the sample into smaller homogeneous groups that share the same characteristics. The splitting process starts considering the whole sample that is then recursively partitioned into mutually exclusive subsamples according to the most important variables, selected among all candidate variables. Important variables in CART are those that minimise the variability of the outcome within each subsample. This process results in a tree-like structure with multiple levels, which offers a visual representation of which variables affect the outcome the most. At the same time, it allows understanding the inter-relations the indicated factors may have with one another. CART analysis is a flexible option for data sets with correlated variables, as in our case.14 17

The starting point of the tree structure is the root node and each split is an offspring node. Offspring nodes that do not split any further are called terminal nodes. In the current analyses, splitting was based on the entropy criterion and each variable was allowed only once per tree branch. For a stopping rule, the number of terminal nodes and the observations included in each of them were considered. A tree with 10 terminal nodes, each including at least 1% of the valid sample data, was selected. Cost-complexity pruning was applied. Variables with significance levels p>0.010 in the three-group comparisons and those with a total frequency <1% of the valid sample were excluded from the CART stage, while missing data were omitted.14 Analyses were performed with the SAS software V.9.4 (Copyright 2016 by SAS Institute). The SAS proc hpsplit procedure was used for tree construction.
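
The entropy criterion mentioned above can be illustrated with a small Python sketch that scores candidate cut-points of a numeric variable such as age. This is a simplified illustration and not a reimplementation of proc hpsplit: the function names are hypothetical, the 1% minimum is applied here to each child of a single split, and pruning is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def entropy_gain(labels, go_left):
    """Entropy reduction achieved by a binary split.

    `go_left[i]` is True when observation i is sent to the left child.
    """
    left = [y for y, flag in zip(labels, go_left) if flag]
    right = [y for y, flag in zip(labels, go_left) if not flag]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values, labels, min_leaf_fraction=0.01):
    """Return (cut, gain) for the cut-point with the highest entropy gain.

    Splits leaving a child with fewer than `min_leaf_fraction` of the
    observations are skipped. Observations with value < cut go left.
    """
    n = len(values)
    min_leaf = max(1, math.ceil(min_leaf_fraction * n))
    best = (None, 0.0)
    for cut in sorted(set(values))[1:]:
        go_left = [v < cut for v in values]
        n_left = sum(go_left)
        if n_left < min_leaf or n - n_left < min_leaf:
            continue
        gain = entropy_gain(labels, go_left)
        if gain > best[1]:
            best = (cut, gain)
    return best


# Invented toy data: age in years and outcome group.
ages = [30, 40, 50, 70, 80, 90]
outcomes = ["PC", "PC", "PC", "Hos/ICU", "Hos/ICU", "Hos/ICU"]
cut, gain = best_threshold(ages, outcomes)
```

A full tree would repeat this search over all candidate variables within each child node, which is how one variable (for example, age) can end up at the root while others appear only in specific branches.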


A total of n=14 197 COVID-19 cases fulfilled the inclusion criteria. Of these, n=9722 (68.5%) received PC attention only, n=3710 (26.1%) had a hospital or ICU admission (n=3630 and 80, respectively), and n=765 died (5.4%). Most cases were detected via PCR (n=8933), and this detection method was the most prevalent in all three outcome groups (PC: 51.0%, hospital/ICU: 87.7%, death: 93.3%). Table 1 presents the baseline information of the sample. Overall, mean age was 53.7 (SD: 17.4) years, and it increased with outcome severity. Most infected cases were women, but at the same time this sex group presented lower infection severity. In particular, women were more prevalent in PC (68.4%), whereas more men were observed in the hospital/ICU and death groups. As far as the healthcare complexity stratification variable was concerned, the PC outcome group presented the highest percentage of healthy individuals (36.1%), while case management was most prevalent in the death outcome group (36.6%). Based on the available information, individuals with an annual income <18 000€ were more prevalent in the hospital/ICU and death groups, and those with higher income received mostly PC attention. Finally, the death group had the highest percentage of individuals with an influenza vaccination in the previous year. This observation was consistent for cases <65 and ≥65 years of age, even though the corresponding percentages of the older cases were higher. All comparisons were statistically significant.

Table 1

Baseline information of the COVID-19 cases during the first wave of the pandemic

Chronic medication consumption data are presented in table 2. Overall, the most consumed medications were those for the nervous system (38.7%), alimentary tract and metabolism (33.0%), and cardiovascular system (30.2%). With the exception of musculoskeletal system and antiparasitic products, insecticides and repellents, an increasing consumption trend with severity was observed in all other ATC types. The consumption of alimentary tract and metabolism disorders (A), blood and blood-forming organs (B), cardiovascular system (C) and nervous system diseases drugs (N) exceeded 60% in the death group. Both polypharmacy and the number of ATC types consumed were associated with infection severity.

Table 2

Chronic medication consumption of the COVID-19 sample

Regarding chronic diseases, the most prevalent condition was related to mental pathologies (table 3). In particular, 30% of the sample had received a diagnosis corresponding to the ICD-9 neurotic, personality or other non-psychotic mental disorders. Hypertension was the next most prevalent condition (21%), followed by diseases of the blood and blood-forming organs (11.3%) and diseases of the oesophagus, stomach and duodenum (10.4%). Diabetes mellitus was present in 8.5% of the sample. With the exception of neurotic, personality or other non-psychotic conditions, which presented the same distribution along the three outcome groups, the prevalence of the most frequent pathologies increased with COVID-19 severity. A similar trend was seen in the total number of chronic diseases. Non-infectious enteritis and colitis, and allergic asthma were the only chronic conditions presenting a descending prevalence with outcome severity, but percentage differences were low.

Table 3

Chronic diseases of the COVID-19 cases in the three outcome groups

Classification and Regression Trees

The CART process indicated that age, sex, healthcare complexity stratification, the ATC categories of blood and blood-forming organ medication (B) and nervous system drugs (N), along with the number of ATC types consumed, would be the most relevant variables in understanding the main case characteristics associated with the outcome. During this process, the variable of psychosis was also flagged as important. In spite of its low prevalence (2.9%), psychosis was given a lot of weight in the older section of the population. The inclusion of this pathology resulted in a less parsimonious model, with ATC-N drugs placed in an additional tree level. Nonetheless, given that psychosis was the single variable resulting in a node with a death majority, and that other authors have already suggested an association between antipsychotic drugs and mortality in COVID-19 cases,9 presenting the corresponding findings was considered relevant. Therefore, the CART process was repeated twice, first excluding and afterwards including psychosis.

Excluding psychosis

The tree generated by the CART process is depicted in figure 1. Most cases <64.7 years of age (81.4%, node 1) received PC attention. In this tree branch, men presented 15.3% more hospital/ICU admissions compared with women. Among men, those with worse health (node 8) had 19.2% more hospital/ICU admissions, compared with the rest (node 7). The majority of men with worse baseline health status who consumed ≥3 ATC types experienced a hospital/ICU admission.

Figure 1

CART model without psychosis. Percentages up to three decimal places are given. Due to rounding, node percentages may not add to 100%. Pre/Pro & Self: Prevention and promotion, and Self-management support. Disease & Case: Disease management and Case management. White nodes represent groups with a higher percentage of PC care. Grey nodes represent groups with a higher percentage of hospital/ICU admission. ATC, Anatomical Therapeutic Chemical; CART, Classification and Regression Trees; Hos/ICU, hospital and intensive care unit; PC, primary care.

Cases ≥64.7 years of age mainly had a hospital/ICU outcome (52.6%), with a considerable death prevalence (21.2%). Those with worse baseline health (node 6) had 4.6% more hospital/ICU admissions and 15.8% more deaths, compared with the rest. Cases with better baseline health status (node 5) were further split according to ATC-B consumption. Death occurred in 23.2% of blood and blood-forming organ drug consumers, compared with 9.1% of non-consumers. Within this last group (node 9), the majority of women received PC attention, while hospital/ICU was the most prevalent outcome in men. In a similar way, among node 6 cases, men presented a worse evolution than women. Finally, men consuming chronic medications for the nervous system (node 18) had 17.9% more deaths compared with non-consumers. The 10 terminal nodes can easily be ordered from the least severe (ie, <64.7-year-old women, node 3) to the most severe outcome groups (ie, ≥64.7-year-old disease and case management men who consume nervous system drugs, node 18).

Including psychosis

The resulting CART model when the variable of psychosis was included in the recursive process is presented in figure 2. Psychosis was one of the main variables of this model and the single split variable for node 6. Inclusion of this pathology added one more level to the CART tree, with chronic nervous system drugs being a split variable for node 15. Cases with psychosis had a death rate of 50%. The ATC-N consumers presented a lower PC percentage and a higher death percentage compared with non-consumers. No other changes were observed compared with the figure 1 model. In this case, the most severe outcome group was ≥64.7-year-old disease and case management cases with psychosis (node 12).

Figure 2

CART model with psychosis. Percentages up to three decimal places are given. Due to rounding, node percentages may not add to 100%. Pre/Pro & Self: Prevention and promotion, and Self-management support. Disease & Case: Disease management and Case management. White nodes represent groups with a higher percentage of PC care. Grey nodes represent groups with a higher percentage of hospital/ICU admission, and grey-striped node groups with a higher percentage of death. ATC, Anatomical Therapeutic Chemical; CART, Classification and Regression Trees; Hos/ICU, hospital and intensive care unit; PC, primary care.


The present work has studied the sociodemographic and clinical characteristics of a large number of Spanish COVID-19 cases from the first pandemic wave. According to the information extracted from electronic health record data, the variables of age, sex, previous pathologies and chronic drug consumption may be decisive in understanding infection severity.

Both age and male sex have been flagged as important risk factors by previous COVID-19 research.2 3 7 9 11 12 18 The importance of age is probably indisputable, given the deterioration of the body’s immune mechanisms and the loss of its capacity to adapt to the environment.19 The present data appear to reflect this known ageing effect. In relation to the variable of sex, women presented consistently higher PC and lower hospital/ICU percentages in the splits where sex was present. Data from various countries suggest that women have better COVID-19 infection outcomes than men.7 20 Women are considered to have stronger immune systems.21 Even though the exact mechanisms responsible for these differences in the COVID-19 context are still unclear and probably multifactorial,20 certain works hypothesise that low androgen levels can have a protective role against this disease.22 The current data, in conjunction with previous evidence, call for a better understanding of the role of sex in the current pandemic. Sex-specific analyses of future wave data should be planned. But more importantly, high-quality prospective studies collecting sex-disaggregated data are needed.23

The healthcare complexity stratification variable was present in both main tree arms. It should be mentioned that the way CART divided this four-category variable into a binary one, by merging the two less severe versus the two more severe groups, was imposed by the data, not the investigators. Worse health status at the time of the infection was associated with more hospitalisations for younger cases, and mainly with more deaths among older individuals. The inclusion of this stratification variable in the CART model is a relevant finding. Tools that stratify the general population, identifying those at greater risk, can be an asset for public health prevention programmes. In the COVID-19 literature, the stratification approach has so far mainly focused on hospitalised patients.12 24 25 One meta-analysis of in-hospital cases, however, claimed that in COVID-19 infections, underlying health conditions are even more important than age.26 Our data suggest that, at least at the local level, this very stratification variable can offer valuable information and its implementation may be worth considering when setting up public health action plans. The study of similar indicators used in other health systems is encouraged.

As far as drug consumption was concerned, chronic blood and blood-forming organ drugs (B) and drugs for the nervous system (N) both appeared as important variables for cases ≥64.7 years of age. Cases consuming those drugs presented higher severity levels. ATC-N was the most frequent medication across all three outcome groups. ATC-B had the steepest rise in consumption from one severity level to the next. Several neurological manifestations after a COVID-19 infection have been described in the literature, with the virus perceived by certain authors as a threat to the whole nervous system.27 It is probable that individuals already suffering from chronic neurological conditions are indeed more likely to present worse outcomes once infected.28 29 Blood-related parameters like systolic and diastolic pressure, red and white cell counts, platelets and lymphocytes, among others, have been highlighted as significant predictors in different COVID-19 diagnostic models.7 An association between certain ATC-B drugs and higher odds of death in infected cases has also been observed.9 Chronic anticoagulation treatment is referenced as protective against COVID-19 mortality by some,30 and ineffective by others.31 COVID-19 cases present a high frequency of thrombotic events, which is leading to an expansion of anticoagulation drug use when treating the disease.32 But in patients already receiving such drugs prior to infection, drug–drug interactions and infection severity should be carefully assessed before any antiviral therapy is given, or before switching from oral to parenteral antithrombotic administration.33 The worse severity seen among ATC-B consumers in the current data may also reflect an increased risk for patients already under anticoagulation therapy. Poor outcomes due to therapeutic decisions and drug–drug interactions cannot be excluded either. Our continuing COVID-19 work will refine future data explorations. Obtaining, for example, ATC data at the second or third level, as well as information on inpatient treatments, will offer more insight into these associations.

Psychosis was a relevant variable in the CART process. Antipsychotic drugs belong to the ATC-N medication type, which is probably why allowing for the inclusion of psychosis relocated this drug group further down in the tree structure. Older patients with worse baseline health and psychosis had the highest death rate among all CART nodes. We can only hypothesise about the mechanisms that could explain such a finding. On the one hand, individuals with psychotic disorders present excess mortality compared with the general population, mainly due to lifestyle choices, associated comorbidities and medication side effects.34 On the other hand, the treatment management of these cases is challenging, as alteration or abrupt cessation of their current medication could potentially lead to a sudden health deterioration or even death.35 This could happen, for example, during hospital and ICU admissions. In the present sample, 75% of the cases who died in the psychosis node had been admitted to a hospital during the study period. The available information does not reveal whether death took place during the admissions, nor what the inpatient treatment regime was. An observational US study of >60 000 cases claimed that psychiatric disorders are a risk factor associated with higher COVID-19 diagnosis, with psychosis presenting greater risk ratios versus mood and anxiety disorders. The same study also reported an increased risk of first-time psychiatric disorders for survivors.36 Others have suggested that antipsychotic use9 and schizophrenia spectrum disorders37 are associated with higher COVID-19 mortality. Even though more research in this direction is required, the available data seem to highlight the need for close monitoring of cases with psychiatric disorders.

The total number of chronically consumed ATC types was an important variable among cases <64.7 years of age. This variable, which could also be perceived as an indicator of the associated comorbidities, stresses even more the importance that underlying pathologies may have in determining the severity of the infection outcome.26

In this work, a surrogate outcome variable has been used. Assuming that more intensive care levels represented worse COVID-19 status is a decision also taken by previous authors.11 38–40 The available data do not allow studying if admissions and deaths may have been due to other health problems. The female prevalence of this sample was greater than that seen in other COVID-19 publications,3 4 7 but nonetheless similar to previous studies performed in this country.9 11 In the Spanish reality, women traditionally assume the caretaker’s role for younger and older members of their families, while they also occupy more home-assisting jobs41 and health-related professions.42 All these conditions may imply higher exposure rates to the virus, which may offer a possible explanation for the sample’s sex distribution.

The current study has certain limitations. The information used is based exclusively on electronic health record data within the previously defined dates. After that period, the severity of certain cases may have worsened. Nonetheless, the end study date corresponds to the end of the first COVID-19 wave in our area, when new infections and deaths were very low. This, in combination with the large study sample, should have minimised the effect of possible outcome variations. No COVID-19 symptoms are presented. An attempt to register these symptoms was incorporated into the Osakidetza electronic records early on after the outbreak. But the number of symptoms and the registration format evolved over the study period; PC and hospital registrations differed; the medical staff mostly annotated symptoms in text format; and, most importantly, such registration was totally missing in many cases. During analysis, an effort to recode text annotations and homogenise information from primary care and hospital data was made. In spite of that, and due to the frequency of missing values, the representativeness of the corresponding data could not be assumed. Symptoms are probably more relevant for algorithms discriminating cases from non-cases.43 During the first pandemic wave, no mass population testing was performed in Spain; but at the end of that wave, serology tests were administered to the health professionals and allied services of our geographical area. Thus, identified cases were either symptomatic, close contacts of cases or individuals working in the health sector. However, the profession of the cases was not an available piece of information in this sample. Working with health records makes recovering missing data or refining variable information a very difficult task. This was also the case with the income level. Its broad categories may have obscured a more appropriate exploration. On the other hand, the high frequency of missing income level data seen in the death group is due to the ‘unsubscriptions’ of the dead cases from the medication dispensing registry. It is important to note that the target of the Basque public health system is health coverage based on the health needs and not the earnings of individuals.

One of the main strengths of this study is its large sample size. The consideration of three outcome groups is another advantage, which allows for a better visualisation of the different severity levels of the disease. Finally, implementing the CART methodology assisted in translating a complex and multifactorial reality into an easy-to-follow picture. Our findings make clinical sense and are supported by previous evidence. They appear to endorse the need for public health prevention plans that consider population characteristics. At the same time, they highlight that for a multifactorial problem to be properly treated, not only the factors affecting it, but also the inter-relations between them should be thoroughly studied. The COVID-19 pandemic may be a new starting point in the public health paradigm. The necessity for public health promoters to work hand in hand with investigators and data analysts has become indisputable under the current circumstances. Prevention plans should be based on rigorous data and a sound understanding of them. This is the only way to ensure that possible reorganisation and estimation of future resources can reach optimal results.

Ethics statements

Ethics approval

The project has been approved by the ethics committee CEIm de Euskadi on 22 July 2020 (reference code: PI2020087).



  • Contributors IV, RR and MM planned this study and obtained permission from the Osakidetza central services to explore the corresponding data. MM-A set the filters and performed the extraction of the electronic health record data. KV and MM-A were both responsible for data cleaning and recoding. KV and MM-A performed all statistical analyses. The input of IV and RR ensured a clinically meaningful perspective of all presented analyses and results. MM performed literature searches. KV drafted the first manuscript version. All authors read and contributed to the consecutive manuscript versions.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.