Original research
Investigating the effect of macro-scale estimators on worldwide COVID-19 occurrence and mortality through regression analysis using online country-based data sources
  1. Sabri Erdem1,
  2. Fulya Ipek2,
  3. Aybars Bars3,
  4. Volkan Genç3,
  5. Esra Erpek4,
  6. Shabnam Mohammadi3,
  7. Anıl Altınata3,
  8. Servet Akar4
  1. 1Department of Business Administration, Dokuz Eylül University, Izmir, Turkey
  2. 2Faculty of Physical Therapy and Rehabilitation, Hacettepe University, Ankara, Turkey
  3. 3Social Sciences Institute, Dokuz Eylül University, Izmir, Turkey
  4. 4Department of Internal Medicine, Division of Rheumatology Atatürk Education and Research Hospital, Izmir Katip Celebi University, Izmir, Turkey
  1. Correspondence to Fulya Ipek; fulya.ipek{at}


Objective To investigate macro-scale estimators of the variations in COVID-19 cases and deaths among countries.

Design Epidemiological study.

Setting Country-based data from publicly available online databases of international organisations.

Participants The study involved 170 countries/territories, each of which had complete COVID-19 and tuberculosis data, as well as specific health-related estimators (obesity, hypertension, diabetes and hypercholesterolaemia).

Primary and secondary outcome measures The worldwide heterogeneity of the total number of COVID-19 cases and deaths per million on 31 December 2020 was analysed by 17 macro-scale estimators around the health-related, socioeconomic, climatic and political factors. In 139 of 170 nations, the best subsets regression was used to investigate all potential models of COVID-19 variations among countries. A multiple linear regression analysis was conducted to explore the predictive capacity of these variables. The same analysis was applied to the number of deaths per hundred thousand due to tuberculosis, a quite different infectious disease, to validate and control the differences with the proposed models for COVID-19.

Results In the model for the COVID-19 cases (R2=0.45), obesity (β=0.460), hypertension (β=0.214), sunshine (β=−0.157) and transparency (β=0.147); whereas in the model for COVID-19 deaths (R2=0.41), obesity (β=0.279), hypertension (β=0.285), alcohol consumption (β=0.173) and urbanisation (β=0.204) were significant factors (p<0.05). Unlike COVID-19, the tuberculosis model contained significant indicators like obesity, undernourishment, air pollution, age, schooling, democracy and Gini Inequality Index.

Conclusions This study proposes new predictors that explain the global variability of COVID-19. Thus, it might assist policymakers in developing health policies and social strategies to deal with COVID-19.

Trial registration number Registry (NCT04486508).

  • COVID-19
  • public health
  • health policy

Data availability statement

Data are available in a public, open access repository. The data that support the findings of this study are openly available in Open Science Framework at

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This study might represent recent macro-scale data of 170 countries and COVID-19 cases and deaths for 1 year.

  • It is free from the vaccine effect.

  • It contains a reliable predictive model incorporating social, economic and political indicators as well as health-related ones.

  • Some macro-scale estimators were not recently published but we assume that there are no remarkable changes annually.

  • There might be some interruption or reliability issues in COVID-19 cases and deaths data flow from countries.


COVID-19, caused by SARS-CoV-2, has spread to countries worldwide. On 11 March 2020, the WHO characterised the outbreak as a pandemic. With new variants emerging, experts now assume that SARS-CoV-2 will remain endemic, continue to circulate in communities and impose a massive burden of disease and death for years.1

The cumulative incidence (total number of cases per million) of COVID-19 (TotalCase-COV19) and mortality (total number of deaths per million) of COVID-19 (TotalDeath-COV19) vary significantly among countries. Besides, old age, diabetes, high blood pressure, obesity, pregnancy, immunosuppression, cancer, and cardiovascular, respiratory (asthma and chronic obstructive pulmonary disease, etc), and chronic kidney disease are all well-known health-related factors that contribute to COVID-19 morbidity and mortality.2 3 Aside from these clinical factors and comorbidities affecting infection progression, there might be additional demographic, social, economic and environmental factors at the macro-level differentiating countries in terms of TotalCase-COV19 and TotalDeath-COV19.

The term ‘macro-level data’ (eg, Human Development Index (HDI) 2020 by United Nations (UN)) could be defined as periodical (eg, weekly, monthly, quarterly or annual) and long-term data gathered, summarised and published by government, private, public, national and international organisations about a specific subject (eg, health, energy, society, economy, politics) for the world, continents, regions, countries, cities or territories. This kind of data shows the current position and trend of a subject in a comparable manner, guides policymakers in countries and organisations in setting or updating their strategies, and helps them take action on that subject. Although healthcare executives and regulatory authorities have implemented regulations and recommendations such as wearing masks, social distancing, staying at home, isolation and lockdowns to limit the spread of SARS-CoV-2, considering these macro indicators will help steer national and international health policymakers to combat the COVID-19 pandemic.

To the best of our knowledge, there has been limited information on how the nature and dynamics of SARS-CoV-2 vary across the world.4–8 These studies have revealed that demographic, climatic, environmental, socioeconomic, and political indicators could help predict or clarify variations in TotalCase-COV19 and TotalDeath-COV19 among countries. In particular, the elderly ratio, the prevalence of comorbidities, population size, and health expenditure were shown to have been associated with the incidence and mortality of COVID-19 at various levels according to the aforementioned studies. However, these findings have been based on the early phase of the outbreak. The current study has been based on data (the cumulative number of cases and deaths) just before starting vaccinations around the world. Additionally, the findings might provide more consistent details about the progression of the disease because of covering data for a longer period. Therefore, the current study aims to identify health-related and socioeconomic macro-scale indicators that could explain some of the substantial variations in TotalCase-COV19 and TotalDeath-COV19 worldwide.


Publicly accessible data on TotalCase-COV19 and TotalDeath-COV19 were obtained from the Johns Hopkins University Center for Systems Science and Engineering.9 While the pandemic was still ongoing, mass immunisation started in several countries in December 2020. To eliminate the effect of immunisation, TotalCase-COV19 and TotalDeath-COV19 data in this study were censored at 31 December 2020, when vaccination had not yet become widespread. Our dataset is available in Open Science Framework. Health-related indicators from the WHO, and socioeconomic, climatic and political indicators from the UN, World Inequality Database and World Bank, used to explain variability in TotalCase-COV19 and TotalDeath-COV19 among countries, are summarised in table 1. Our dataset involves 17 macro-scale variables for 170 countries.

Table 1

The list and the definitions of indicators and the outcome variable used in the analysis

Statistical analyses

In this study, we conducted multiple linear regression analyses to explain variations in our target variables (ie, TotalCase-COV19 and TotalDeath-COV19). Before the analysis, we tested all the assumptions of multiple linear regression analysis by:

  • Drawing ‘actual target variable versus predicted target value’ scatter plot diagram for testing linearity.

  • Calculating standardised residual values for each case to detect outliers (values outside ±3).

  • Drawing a partial regression plot diagram for each predictor versus target variable for detecting the variance of predictors.

  • Calculating variance inflation factor (VIF) for each predictor variable in the model for testing multicollinearity.

  • Calculating the Durbin-Watson test statistic for detecting autocorrelation of consecutive errors.

  • Drawing ‘regression predicted value versus regression standardised residual’ plot diagram for detecting heteroscedasticity.

  • Drawing ‘observed cumulative probability versus expected cumulative probability (P–P)’ plot diagram of standardised regression residual for testing normality of residuals.
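Two of the numerical checks in the list above can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the SPSS or Minitab procedure used in the study; the function names are hypothetical, and the standardised residual is assumed to be the raw residual divided by the residual standard error.

```python
from math import sqrt

def standardised_residuals(residuals, n_params):
    """Divide each residual by the residual standard error.

    Assumption: the standard error is sqrt(SSE / (n - n_params)),
    where n_params is the number of fitted coefficients.
    """
    n = len(residuals)
    s = sqrt(sum(e * e for e in residuals) / (n - n_params))
    return [e / s for e in residuals]

def flag_outliers(residuals, n_params, limit=3.0):
    """Return the indices of cases whose standardised residual lies outside +/- limit."""
    z = standardised_residuals(residuals, n_params)
    return [i for i, value in enumerate(z) if abs(value) > limit]

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over SSE.

    Values near 2 suggest no first-order autocorrelation; values near 0 or 4
    suggest positive or negative autocorrelation respectively.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator
```

The statistic returned by `durbin_watson` still has to be compared with tabulated critical values for the given sample size and number of predictors.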

To limit variance inflation and suppress potential interactions among predictors, we transformed all variables into standardised form [(x−mean(x))/SD(x)], which gave more acceptable VIF values (VIF=1/(1−R2)<2.0), where R2 is obtained from a regression in which the inspected variable is treated as the dependent variable and the remaining variables as its predictors. We applied the Durbin-Watson statistic to test for autocorrelation in consecutive error terms and found the error terms to be independent in all models, since the test statistics lay between the lower critical value and 4 minus the lower critical value.10 After inspecting the plots and histograms, we concluded that the assumptions of multiple linear regression were largely met when standardised values of all predictors and target variables were used in all models. The exception was a slight violation in the variance of the error terms, seen in the P–P plots and in the scatter plot of predicted against observed values used to assess homoscedasticity; this would be a concern only if the other test results were unsatisfactory.
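The standardisation and VIF formulas quoted above can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the helper names are hypothetical, the sample SD is used, and each auxiliary regression is fitted without an intercept, which is equivalent to a fit with an intercept once every column has mean zero.

```python
from statistics import mean, stdev

def standardise(column):
    """Return z-scores (x - mean(x)) / SD(x) for one variable (sample SD assumed)."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(predictors, target):
    """R2 of a no-intercept least-squares fit; all columns must be standardised."""
    k = len(predictors)
    xtx = [[sum(p * q for p, q in zip(predictors[i], predictors[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * y for p, y in zip(predictors[i], target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * predictors[j][i] for j in range(k))
              for i in range(len(target))]
    sse = sum((y - f) ** 2 for y, f in zip(target, fitted))
    sst = sum(y * y for y in target)  # target mean is zero after standardising
    return 1 - sse / sst

def vif(columns):
    """VIF = 1 / (1 - R2), regressing each variable on all the others."""
    z = {name: standardise(col) for name, col in columns.items()}
    return {name: 1 / (1 - r_squared([z[o] for o in z if o != name], z[name]))
            for name in z}
```

A call such as `vif({"obesity": [...], "hypertension": [...]})` returns one VIF per variable, which can then be compared with the chosen threshold.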

We conducted multiple linear regression analyses for different scenarios and created predictive models to determine the indicators relevant to each scenario. After testing the dataset against the assumptions of multiple linear regression, we investigated each scenario/model through a best subset approach, which evaluates subsets of predictors to identify applicable candidate models and so guides the next step of finding the best one. We compared candidate models using criteria such as R2 (max is preferred), SE (min is preferred), Mallows’ Cp (min is preferred), Akaike Information Criterion (min is preferred) and Bayesian Information Criterion (min is preferred).11
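The best subset idea can be sketched as an exhaustive search that scores every predictor subset. The code below is illustrative only and is not the Minitab routine used in the study. It assumes standardised variables and no intercept, Gaussian-likelihood forms of AIC and BIC with additive constants dropped (so absolute values may differ from software output), and Mallows’ Cp computed against the full-model error variance. All function names are hypothetical.

```python
from itertools import combinations
from math import log

def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def sse_no_intercept(cols, y):
    """Residual sum of squares of a least-squares fit without a constant term."""
    k = len(cols)
    xtx = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * v for p, v in zip(cols[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * cols[j][i] for j in range(k)) for i in range(len(y))]
    return sum((v - f) ** 2 for v, f in zip(y, fitted))

def best_subsets(columns, y):
    """Score every non-empty predictor subset; return rows sorted by BIC (min first)."""
    names, n = list(columns), len(y)
    full_sse = sse_no_intercept([columns[c] for c in names], y)
    s2_full = full_sse / (n - len(names))  # error variance of the full model
    rows = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            sse = sse_no_intercept([columns[c] for c in subset], y)
            rows.append({
                "subset": subset,
                "cp": sse / s2_full - n + 2 * size,   # Mallows' Cp
                "aic": n * log(sse / n) + 2 * size,    # constants dropped
                "bic": n * log(sse / n) + size * log(n),
            })
    return sorted(rows, key=lambda row: row["bic"])
```

With 17 candidate predictors the search covers 2^17 − 1 = 131 071 subsets, which is why the criteria serve to shortlist models rather than to pick one automatically.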

After determining the variables to be incorporated in a model, we applied a hierarchical regression model with forward selection and stepwise methods. Our models contain no constant term because the constants were all statistically insignificant and were therefore discarded. Using information from the best subset analysis, the literature and best practices, we tried a variety of models by evaluating the use of an intercept, the handling of missing values and outliers, the regression technique (eg, enter, stepwise, forward/backward selection), and the alpha thresholds to enter (0.05 through 0.15) and to remove (0.05 through 0.15) when accepting or rejecting the next variable in the model. We used SPSS V.24 and the Minitab V.19 trial version to analyse the data.


The descriptive statistics of all variables are provided in table 1. We considered different possible models for TotalCase-COV19 and TotalDeath-COV19, including both common and distinct estimators that could explain the variations among countries. We therefore listed all possible significant models for TotalCase-COV19 and TotalDeath-COV19 using the best subset method, which is widely used in regression analyses, as shown in online supplemental tables A and B.

Based on the best subset analysis of TotalCase-COV19 and TotalDeath-COV19, the indicators ‘sunshine’, ‘obesity’, ‘hypertension’, ‘urbanisation’, ‘schooling’, ‘alcohol’, ‘democracy’, ‘transparency’ and ‘HDI’ were found significant in most models. Additionally, ‘age’, ‘GINI’ (Gini Inequality Index) and ‘undernourishment’ were found significant for modelling TotalDeath-COV19, and ‘cholesterol’ was found significant for modelling TotalCase-COV19. However, it should be noted that the best subset analysis covers only the 139 countries with complete data for all variables at issue. For this reason, the significance of the recommended model may differ from that of the best subset models, depending on the missing values and how they are handled. In all these regression analyses, we found that the regression model fit and all parameter fits were significant (p<0.05).

Urbanisation, schooling, HDI, transparency, democracy and GINI are all inter-related factors, although they showed no multicollinearity. Thus, countries with desirable socioeconomic conditions are likely to have high levels of urbanisation, education, human development, transparency and democracy, as well as low income inequality.

Table 2 shows the results of the regression model for the determinants of TotalCase-COV19 worldwide. The model indicates that transparency, sunshine, obesity and hypertension could help to explain the variability of TotalCase-COV19 across countries. The R2 value shows that the model with these four indicators explains 44.9% of all variability. The generated model is therefore acceptable, and the roles of the variables are consistent with the literature.
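For orientation, the fitted cases model can be written out explicitly. The coefficients below are the β values reported in the abstract; treating them as standardised coefficients of a model without a constant is an assumption based on the description in the statistical analyses, and z denotes a standardised variable.

```latex
\[
\widehat{z}_{\text{TotalCase-COV19}}
  = 0.460\, z_{\text{obesity}}
  + 0.214\, z_{\text{hypertension}}
  - 0.157\, z_{\text{sunshine}}
  + 0.147\, z_{\text{transparency}},
\qquad R^2 \approx 0.45
\]
```

Read this way, the negative sign on sunshine means that countries with more sunshine tend to have fewer cases per million, holding the other three indicators fixed.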

Table 2

Parameter testing results for linear regression for predicting ‘total cases per million people’

Table 3 shows the results of the regression model for the determinants of TotalDeath-COV19 worldwide. The regression model revealed that urbanisation and alcohol, in addition to obesity and hypertension, might help to explain the variation in TotalDeath-COV19 across countries. The R2 value shows that the model with these four predictors explains 40.9% of all variability. The Durbin-Watson value shows no evidence of autocorrelation at the 0.05 significance level.
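The deaths model can be written in the same way. As before, the coefficients are the β values reported in the abstract, and reading them as standardised coefficients of a model without a constant is an assumption.

```latex
\[
\widehat{z}_{\text{TotalDeath-COV19}}
  = 0.279\, z_{\text{obesity}}
  + 0.285\, z_{\text{hypertension}}
  + 0.173\, z_{\text{alcohol}}
  + 0.204\, z_{\text{urbanisation}},
\qquad R^2 \approx 0.41
\]
```

Under this reading, a country one SD above the mean in obesity prevalence, and average on the other three indicators, would be predicted to lie about 0.279 SD above the mean in deaths per million.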

Table 3

Parameter testing results for linear regression for predicting ‘total deaths per million people’

Finally, we analysed the number of deaths per hundred thousand due to tuberculosis (TBDeath) using the same 17 macro-scale indicators to validate and control our proposed models for TotalCase-COV19 and TotalDeath-COV19. Because tuberculosis is an infectious disease quite dissimilar from COVID-19, the TBDeath model was expected to differ from our proposed models. In line with our expectations, it differed in the variables involved: apart from obesity, its significant factors included ‘air pollution’, ‘undernourishment’ and ‘GINI’, which do not appear in our proposed models (table 4). The VIF values of the age and schooling variables are relatively large but still tolerable, since they do not exceed a threshold of 5.12

Table 4

Parameter testing results for linear regression for predicting ‘deaths due to tuberculosis’


Principal findings

The present study, which aims to explain the heterogeneity in TotalCase-COV19 and TotalDeath-COV19 worldwide by analysing health-related, socioeconomic, climatic and political macro indicators of 170 countries, yielded two regression models. The well-established comorbidities obesity and hypertension were significant in both models. Additionally, sunshine and transparency were important in the TotalCase-COV19 model, and alcohol and urbanisation in the TotalDeath-COV19 model.

Interpretation and comparison with previous studies

Previous studies proposed that the comorbidities (obesity, hypertension, pulmonary disease, etc) contributed to the risk of COVID-19 and progression to severe disease by increasing the expression of ACE2 and/or transmembrane protease serine 2 on host lung cells and heightening the permissiveness of viral infection.13–15 In this study, the TotalCase-COV19 and TotalDeath-COV19 were greater in countries with a higher prevalence of obesity and hypertension. Global analyses of the variability of COVID-19 mortality among countries, as in our study, revealed that TotalDeath-COV19 was positively correlated with deaths due to comorbidities, including cardiovascular, chronic respiratory and kidney diseases, obesity and cancer.5 7 16

The association between sunshine and COVID-19 is debated in the literature. Studies finding sunshine hours positively linked to COVID-19 growth claimed that on sunny days, outdoor activities increased along with a decline in adherence to lockdown rules, resulting in increased virus exposure and transmission.17–19 Other studies suggested that sunny weather could help prevent the spread of COVID-19 by minimising air pollution and increasing vitamin D production.20–22 Those studies reported that ultraviolet light from the sun can inactivate SARS-CoV-2, reduce outdoor transmission or increase immune resistance through vitamin D production. Our research supports these results, finding a high TotalCase-COV19 in countries with low sunshine.

Public corruption and GINI, reflecting corruption in the flow of products, resources and services inside a nation, were found to be associated with TotalCase-COV19 and TotalDeath-COV19.23 24 The present study further supports these studies by indicating that there are more TotalCase-COV19 in countries with higher transparency (ie, lower corruption) levels. Thus, not only socioeconomic indicators but also political indicators could play a critical role in the COVID-19 outbreak. Lack of access to reliable sources of information, as well as misinformation and inadequate communication, might lead to people disregarding government health alerts. Policies that promote the effective distribution of government budgets in public goods and services like healthcare and education, as well as policies that encourage transparency and information flow, could improve the accuracy of information about COVID-19’s spread.23 25

There is no country-level analysis demonstrating an association between alcohol consumption and the TotalCase-COV19 and TotalDeath-COV19 in the literature. Recent studies have emphasised that increasing alcohol consumption is linked to excessive production of proinflammatory cytokines by hepatic cells, resulting in higher levels of inflammatory markers (interleukin-8 and tumour necrosis factor-α), which has also been observed in patients with COVID-19.26–28 As a result, it was hypothesised that acute and chronic alcohol consumption might suppress the immune system, leading to reduced resistance to acute respiratory disease and SARS-CoV-2 infection, as well as facilitating the progression of COVID-19.29 30 The present study supports these findings, demonstrating that TotalDeath-COV19 is higher in countries with high alcohol consumption.

Urbanisation has the potential to exacerbate disease and mortality by creating a slew of challenges such as public transportation, health and economic disparities, substandard living facilities, insufficient freshwater supplies, and ineffective sanitation and ventilation systems.4 6 Existing literature has shown that dense urban populations could cause less social distance, more trade and more human mobility, leading to multiple infection routes and more rapid spread of COVID-19.31 32 However, no significant relation was observed between urbanisation and TotalCase-COV19, although some of the best subset models include it. One explanation may be the difficulty of identifying infected people in countries with low socioeconomic status owing to inadequate testing capacity for COVID-19, which would make the effect of urbanisation on TotalCase-COV19 hard to observe. In addition, developed countries with a larger urban population and advanced healthcare systems have longer life expectancies and a higher percentage of the population over 65 years, which could lead to higher mortality rates in high-income countries.8 Moreover, as tougher restriction policies for elderly people have been implemented, urbanisation might be a significant factor in TotalDeath-COV19 rather than TotalCase-COV19. Our study thus showed that countries with higher urban populations have higher TotalDeath-COV19, in line with other global COVID-19 studies.6 33

Best subset regression analyses show that heterogeneity in TotalCase-COV19 and TotalDeath-COV19 across countries may be attributed to variations in macro indicators including obesity and hypertension, urbanisation, schooling, transparency, democracy and HDI, as well as alcohol consumption and sunshine. Additionally, while the elderly population, GINI and undernourishment were related to TotalDeath-COV19, hypercholesterolaemia was related to TotalCase-COV19. Although the elderly population was a major parameter in nearly all TotalDeath-COV19 models in our analysis, it was left out of the proposed models because it suppresses other estimators.

Other worldwide and national studies suggested that socioeconomic, political, climatic, environmental and ecological factors are correlated with TotalCase-COV19 and TotalDeath-COV19 in different contexts and at different levels.4–8 34–37 These studies revealed that countries with a high elderly population and level of democracy, as well as low levels of schooling and human development, had the highest mortality rate. Socioeconomic inequalities like low socioeconomic status, inadequate schooling, limited access to healthcare, and income inequality induce difficulty in accessing and affording healthy food, use of tobacco and alcohol, low physical activity, and lower use of preventive medicine. This situation might increase the prevalence of comorbidities by altering the gut microbiome, increasing local inflammation and compromising immunity. Hence, the high-risk population becomes more susceptible to COVID-19.2

To control the COVID-19 models, we performed a regression analysis to estimate TBDeath using the same macro-scale estimators. In this regression model, tuberculosis is associated with GINI, air pollution, undernourishment, age, democracy, schooling and obesity, consistent with the literature.38–40 As expected, there were remarkable differences between our COVID-19 models and the tuberculosis model, since they are different infections with their own (disease-specific) characteristics.


There are certain limitations of the present study. First, not all health-related, socioeconomic, climatic and political indicators were as up to date as the 2020 COVID-19 deaths and cases. However, it is reasonable to expect that macro-scale indicators would not change significantly over a few years. Second, only 139 of the 170 countries were involved in the best subset regression analyses, since only these countries had complete data. Third, definitions of COVID-19 cases and illness based on symptoms and diagnosis varied by country and over time, which may have biased some countries’ case counts and so increased the residuals in the model. Additionally, interruptions in the flow of data on COVID-19 deaths and cases from countries could bias the model. To some extent, these errors (ie, bias) were minimised by standardising each country’s data relative to other countries’ data using the overall mean and SD. A similar issue arose in determining the cause of death, in that some deaths could have been recorded as due to respiratory problems, other infections or multiple organ failure. Finally, the reliability of the PCR test, the most widely preferred procedure for diagnosing COVID-19 cases across the world, could also be regarded as an important limitation, since false positive and false negative results might cause deviations in both the COVID-19 case counts and the models.

Conclusion and implications of findings

In conclusion, this original study reveals that health-related, as well as social, economic, climatic and political macro indicators, have been noteworthy in the COVID-19 outbreak. The findings of this study do not claim that these macro indicators cause disease or death directly because it does not conduct any clinical or laboratory research. Essentially, the present study, demonstrating that these macro indicators explain the variation in the number of COVID-19 cases and deaths across 170 countries, may serve as a basis for future clinical trials. Additionally, if it is assumed that COVID-19 would have a lasting effect in the world for an extended period, this study might assist policymakers in developing short-term and long-term health and social strategies to enhance these factors associated with COVID-19. Following the complete vaccination in nations, further research can be done to show how the vaccine affects these models.

Ethics statements

Patient consent for publication

Ethics approval

Ethical committee approval was received from İzmir Katip Çelebi University/Non-interventional Clinical Researches Ethics Board (protocol number: 2021-GOKAE-0287).


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Contributors SE is responsible for the overall content as guarantor. SE and SA had full access to all of the data in the study and take responsibility for the integrity and the accuracy of the data analysis. Concept and design—SE, SA and FI. Acquisition, analysis or interpretation of data—SE and FI. Drafting of the manuscript—SE and FI. Critical revision of the manuscript for important intellectual content—SE and FI. Statistical analysis—SE. Administrative, technical or material support—FI, AB, VG, EE, SM and AA. Supervision—SA.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.