Importance of sex and gender factors for COVID-19 infection and hospitalisation: a sex-stratified analysis using machine learning in UK Biobank data

Objective To examine sex and gender roles in COVID-19 test positivity and hospitalisation in sex-stratified predictive models using machine learning.
Design Cross-sectional study.
Setting UK Biobank prospective cohort.
Participants Participants tested between 16 March 2020 and 18 May 2020 were analysed.
Main outcome measures The endpoints of the study were COVID-19 test positivity and hospitalisation. Forty-two variables describing individuals' demographics, psychosocial factors and comorbidities were used as likely determinants of outcomes. A gradient boosting machine was used for building prediction models.
Results Of 4510 individuals tested (51.2% female, mean age=68.5±8.9 years), 29.4% tested positive. Males were more likely to be positive than females (31.6% vs 27.3%, p=0.001). In females, living in more deprived areas, lower income, increased low-density lipoprotein (LDL) to high-density lipoprotein (HDL) ratio, working night shifts and living with a greater number of family members were associated with a higher likelihood of a positive COVID-19 test. In males, greater body mass index and LDL to HDL ratio were the factors associated with a positive test. Older age and adverse cardiometabolic characteristics were the most prominent variables associated with hospitalisation of test-positive patients in both overall and sex-stratified models.
Conclusion High-risk jobs, crowded living arrangements and living in deprived areas were associated with increased COVID-19 infection in females, while high-risk cardiometabolic characteristics were more influential in males. Gender-related factors have a greater impact on females; hence, they should be considered in identifying priority groups for COVID-19 vaccination campaigns.


STRENGTHS AND LIMITATIONS OF THIS STUDY
⇒ A unique feature of the study is the investigation of numerous psycho-socio-cultural factors in conjunction with clinical and laboratory factors made feasible through machine learning algorithms.
⇒ The assessment of sex-stratified algorithms to elucidate the most influential factors in both sexes adds novel information as it has rarely been done to date.
⇒ The first limitation of this study is selection bias due to the lack of systematic and random testing across the UK.
⇒ This analysis was done using the baseline diagnostic data that were collected from 2006 to 2010; therefore, misclassification of determinants is possible. However, previous studies of the UK Biobank have shown a high correlation between baseline and follow-up data for a subsample of patients who had further visits for imaging.
⇒ Finally, the relatively low predictive model performance (test positivity: area under curve (AUC): 0.570 (95% CI: 0.537 to 0.604); hospitalisation: AUC: 0.60 (95% CI: 0.534 to 0.665)) reflects, as expected, other influences not captured in the UK Biobank.

INTRODUCTION
The novel coronavirus SARS-CoV-2 (COVID-19) pandemic has led to more than 250 million reported positive cases and over 5 million deaths worldwide as of November 2021. 1 As vaccination is rolling out, continuous efforts are made to establish risk factors for the disease and find vulnerable populations across the globe. [2][3][4][5][6][7][8] There has been an imbalance in infection susceptibility, severity, and mortality between males and females. 9 These differences are multifactorial and have been attributed to a combination of biological (ie, genetic, hormonal) and psycho-socio-cultural differences (gender). [9][10][11][12][13] While 'sex' refers to a set of biological attributes in humans and animals, 'gender' refers to the roles, behaviours and identities of individuals that form throughout life. 14 15 It is increasingly recognised that both sex and gender play significant roles in health outcomes, 14 15 including the acquisition of infections and response to infection and treatments. 16 The WHO statement, 16 addressing sex and gender in epidemic-prone infectious diseases, outlines that differences between males and females can lead to differences in activity patterns in work and in family roles, which may increase the risk of exposure to infectious disease in a particular setting. 16 Therefore, lack of consideration of the influences of sex and gender in COVID-19 contraction can potentially hinder the effectiveness of COVID-19 vaccination prioritisation strategies.
With the introduction of the COVID-19 vaccines, identifying those most vulnerable to infection with accurate prediction models that account for more complex relationships is urgent. Machine learning algorithms can explore non-linear and complex relationships by considering biological and psycho-socio-cultural factors and their interactions together; however, these methods are still underused in the medical field, and few predictive models have been tailored to each sex. [17][18][19] Therefore, we examined sex-related and gender-related factors associated with SARS-CoV-2 test positivity and COVID-19 hospitalisation in the UK Biobank (UKB) cohort and developed sex-stratified predictive models using machine learning methods.

METHODS
Study population
This is a cross-sectional study of UKB data. The UKB (https://www.ukbiobank.ac.uk/) is a prospective cohort study that collects health, lifestyle, genetic and imaging data for over 500 000 randomly selected participants in the UK. 20 Baseline data collection took place between 2006 and 2010 across England, Scotland and Wales. The age of participants at recruitment was between 40 and 69 years. 20 Data were collected in 22 assessment centres through four main methods: touchscreen questionnaires, verbal interviews, physical measures and biological sampling. 20 For this study, only data from England between 16 March 2020 and 18 May 2020 were used due to the unavailability of COVID-19 test results from Scotland and Wales at the time of analysis.

Patient and public involvement
No patients were involved in the design, conduct, reporting or dissemination plans of our research.

COVID-19 test results
COVID-19 test results were available from Public Health England data for 4510 participants and were linked to UKB baseline data. 21 The primary endpoint of the study was test positivity. We defined testing positive as having at least one positive test result. The test results were available from 16 March 2020 to 18 May 2020. The secondary endpoint of the study was being hospitalised for a COVID-19-related illness. For this purpose, we chose test-positive patients who had at least one positive result in an inpatient setting. The results are from samples taken from the combined nose, throat, sputum or lower respiratory tract.
The analysis for infection was done using real-time PCR tests for SARS-CoV-2.

Baseline characteristics
Patients' demographics, psychosocial (gender-related) variables, anthropometric variables and comorbidities that were collected at baseline (2006-2010) were used for analysis. We used these baseline data for socioeconomic status (SES) and occupation since studies have shown a high correlation between baseline and follow-up data for a subsample of patients who had further visits for imaging. 22

Gender-related psychosocial variables
A multistep approach for identifying gender-related variables was used, based on the GOING-FWD (Gender Outcomes INternational Group: to Further Well-being Development) methodology, 23 the Women Health Research Network's gender framework (ie, gender identity, gender roles, gender relationships and institutionalised gender), and the Canadian Institutes of Health Research sex and gender modules. 14 15 24 Based on this approach and data availability, the following variables were selected: multiple deprivation indices (online supplemental file 1), employment, jobs involving night shift, education level, smoking status, alcohol consumption, number of children, household crowding, housing ownership, income, leisure and social activities, risk-taking behaviours and neuroticism score. Details of the variables are available in the UKB data dictionary (https://biobank.ctsu.ox.ac.uk/crystal/search.cgi).

Comorbidities
Baseline comorbidities were self-reported and collected using a touch screen device, that is, hypertension, diabetes, chronic obstructive pulmonary disease, asthma, allergy, history of stroke, heart attack, angina, deep vein thrombosis, pulmonary embolism and cancer. In addition, the number of medications and long-standing illness, disability, or infirmity were also included and coded as dichotomous variables. These variables were selected based on their significance as demonstrated by a number of previous investigations on the UKB. 2 3 5-8 18 22 25-27

Physical and biological characteristics
Variables used as measures of physical and biological characteristics were body mass index (BMI) (defined as weight (kg)/height (m)²), waist to hip ratio (WHR), levels of vitamin D, haemoglobin A1c (HbA1c), high-density lipoprotein (HDL) and low-density lipoprotein (LDL); all coded as continuous variables.
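The derived anthropometric and lipid measures can be illustrated with a minimal Python sketch. The function names and the example participant are hypothetical and are not taken from the UKB; the sketch only restates the BMI definition given above and the simple ratios (WHR, LDL to HDL) referred to in the analysis.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def ratio(numerator: float, denominator: float) -> float:
    """Generic ratio, used here for waist to hip and LDL to HDL."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Hypothetical participant; values are illustrative, not UKB data.
bmi = body_mass_index(weight_kg=81.0, height_m=1.80)
whr = ratio(94.0, 100.0)    # waist (cm) / hip (cm)
ldl_hdl = ratio(3.6, 1.2)   # LDL / HDL, both in the same unit
```

Both components of each ratio must be expressed in the same unit for the result to be meaningful.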

Statistical analyses
Descriptive statistics were reported as mean and SD for continuous variables and frequency and percentage for categorical variables. Group-based differences (test negative vs positive) in baseline characteristics were compared using the independent Student's t-test for continuous variables (or its non-parametric counterpart for skewed distributions) and the χ² test for categorical variables. P values of less than or equal to 0.05 were considered statistically significant. A complete case analysis (pairwise deletion) approach was used for dealing with missing data in the descriptive analysis. Bonferroni type adjustments were used to correct for multiple testing and the results after adjustment were used for interpretation. Data analysis was performed using R-Studio (V.1.3.1093) and R software.
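As an illustration of the categorical comparison and the multiple-testing correction described above, the following Python sketch runs a Pearson χ² test on a 2×2 table and applies a Bonferroni adjustment. It is a sketch under stated assumptions, not the authors' R code: the cell counts are approximations rebuilt from the reported percentages, no continuity correction is applied (the paper does not say whether one was used), and the extra p values passed to the adjustment are made up.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for observed, row, col in ((a, row1, col1), (b, row1, col2),
                               (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each p value by the number of tests and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Approximate counts rebuilt from the reported percentages (assumption):
# 2201 males with 696 positive, 2309 females with 630 positive.
stat, p = chi_square_2x2(696, 2201 - 696, 630, 2309 - 630)

# The second and third p values are invented purely to show the adjustment.
adjusted = bonferroni([p, 0.04, 0.20])
```

With these approximate counts the unadjusted p value is of the same order as the reported p=0.001 for the male versus female difference in test positivity.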

Machine learning
The data were split into 70% training and 30% test sets, where the test set was only used for the evaluation of the final models. The training set was used to develop gradient boosting decision tree models 28-30 using the 'gbm3' R package for predicting SARS-CoV-2 test positivity and hospitalisation. To reduce the effect of class imbalance on model development, 31 a bootstrap oversampling approach was used. For each endpoint, that is, test positivity or hospitalisation, three different models were developed: (1) a model for males, (2) a model for females, and (3) an overall model for males and females combined. Calibration was then performed using the isotonic regression method. Using k-fold cross-validation is considered a best practice in developing machine learning models. 32 Therefore, for developing each model, a 10-fold cross-validation was performed on the training set. We also used a grid search procedure to find the best combination of hyperparameters (eg, learning rate, interaction depth, bagging fraction and the minimum number of observations in terminal nodes) (online supplemental file 1) using fivefold cross-validation on the training dataset and the area under the receiver operating characteristic curve (AUROC) metric as the criterion. A Bernoulli distribution was used for classification models. These resulted in three trained models for SARS-CoV-2 test positivity and three models for SARS-CoV-2 hospitalisation.
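The data preparation steps described above (70/30 split, bootstrap oversampling of the training set and enumeration of a hyperparameter grid) can be sketched in Python as follows. This is an illustration only: the original analysis was run in R, stratifying the split by outcome is an assumption, the helper names are hypothetical, and the grid values are simply those reported in the figure captions rather than the full grid from the supplemental file.

```python
import itertools
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx


def bootstrap_oversample(indices, labels, seed=0):
    """Resample smaller classes with replacement until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced += rows
        balanced += rng.choices(rows, k=target - len(rows))
    return balanced


# Toy outcome vector with 29.4% positives, mirroring the reported rate.
labels = [1] * 294 + [0] * 706
train_idx, test_idx = stratified_split(labels)

# Oversampling is applied to the training rows only; the test set is untouched.
balanced = bootstrap_oversample(train_idx, labels)

# Candidate hyperparameter combinations (values taken from the figure captions).
grid = {
    "shrinkage": [0.001, 0.01],
    "interaction_depth": [3, 5, 7],
    "bag_fraction": [0.5, 1.0],
    "n_minobsinnode": [5, 10, 15],
}
candidates = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
```

Each entry of `candidates` would then be scored by cross-validated AUROC on the training data, and the best-scoring combination retained.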
To provide an unbiased estimate of the model generalisation errors, the performance of the trained models was assessed and reported on the test set. Confusion matrix-derived metrics including accuracy, precision, recall (sensitivity), specificity as well as area under curve (AUC) score were used as the performance measures. 33 We used the best threshold of the receiver operating characteristic (ROC) curve as the cut-off for determining precision/recall and sensitivity/specificity. Another metric which focuses on predictions of the positive class is the Area Under the Precision-Recall Curve (AUPRC). 34 Interpretation of AUPRC is dependent on the class distribution of the outcome as the minimal achievable value is dependent on that distribution, 35 and the AUPRC value of a random classifier is the rate of the positive class. 35

Most influential variables
For identifying the most influential variables, permutation-based feature importance was used. 36 This approach measures a feature's (variable's) importance by calculating the increase of the model's prediction error after permuting the feature. We reported partial dependence plots (PDPs) using the 'pdp' package in R to understand the marginal effect of a feature on the predicted outcome. A PDP demonstrates how the response variable changes as we change the value of a feature while taking into account the average effect of all the other features in the model. 37 The Y axis shows how the predicted value changes with change in predictor variables. If the line in the plot is constant near zero, it means that the variable has no effect on the model. A negative value means that a specific value of the predictor variable is less likely to predict the correct class of outcome, whereas a positive value means the predictor variable has a positive impact on predicting the correct class. 38 A locally estimated scatterplot smoothing line is fit to show the trend.
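The two interpretation tools described above can be sketched in plain Python. This is a minimal illustration, not the 'gbm3' or 'pdp' implementation: the prediction error is taken to be 1 − AUC (an assumption; the paper does not state which error measure was permuted against), `toy_model` is a hypothetical stand-in for a fitted model, and the data are simulated.

```python
import random


def auc(scores, labels):
    """Rank-based AUROC: probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, labels, feature, repeats=20, seed=0):
    """Mean increase in prediction error (1 - AUC) after shuffling one column."""
    rng = random.Random(seed)
    base_error = 1.0 - auc([predict(r) for r in rows], labels)
    increases = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        error = 1.0 - auc([predict(r) for r in permuted], labels)
        increases.append(error - base_error)
    return sum(increases) / repeats


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    return [
        sum(predict(dict(r, **{feature: v})) for r in rows) / len(rows)
        for v in grid
    ]


# Simulated data: the outcome depends on "bmi" but not on "noise".
rng = random.Random(42)
rows = [{"bmi": rng.uniform(18, 40), "noise": rng.random()} for _ in range(200)]
labels = [1 if r["bmi"] + rng.gauss(0, 3) > 29 else 0 for r in rows]


def toy_model(row):
    """Hypothetical fitted model: the score rises with BMI and ignores noise."""
    return 0.02 * row["bmi"]


imp_bmi = permutation_importance(toy_model, rows, labels, "bmi")
imp_noise = permutation_importance(toy_model, rows, labels, "noise")
pdp_bmi = partial_dependence(toy_model, rows, "bmi", [20, 30, 40])
```

A feature the model does not use, such as `noise` here, leaves the error unchanged when permuted, whereas permuting an informative feature increases it; the partial dependence values trace the average prediction across the chosen grid.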

RESULTS
Of 4510 patients (51.2% females; mean age 68.5±8.88 years) who were tested, 29.4% were positive. Females were less likely to be positive (males: 31.6% vs females: 27.3%, p=0.001). In descriptive analyses, there was a difference in age between test-positive and test-negative individuals (p<0.001); specifically, those younger than 60 years (test positive vs negative: 28% vs 21.8%) and those older than 80 years (test positive vs negative: 6.2% vs 5.7%) were over-represented among test-positive individuals. Similarly, there was a significant difference in test positivity among ethnicities (minority ethnicity: test positive vs negative: 13.2% vs 7.6%, p<0.001).

Figure 1 Overall results: partial dependency plots for predicting test-positive results for first 10 most influential variables using permutation methods in the overall population using shrinkage=0.01, bag fraction=0.5, interaction depth=5, cross-validation fold=10, train fraction=0.7 and n.minobsinnode=10 as hyperparameters from grid search. The X axis is the predictor variable in the model. The Y axis shows how the predicted value changes with change in predictor variables. If the line in the plot is constant near zero, it means that the variable has no effect on the model. A negative value means that a specific value of the predictor variable is less likely to predict the correct class of outcome, whereas a positive value means the predictor variable has a positive impact on predicting the correct class. A locally estimated scatterplot smoothing line is fit to show the trend. The number beside each variable shows the order of feature importance and most influential variables for each model. For identifying the most influential variables, permutation-based feature importance was used. This approach measures a feature's (variable's) importance by calculating the increase of the model's prediction error after permuting the feature. HDL, high-density lipoprotein; LDL, low-density lipoprotein.

Figure 2 Results for females: partial dependency plots for predicting test-positive results for first five most influential variables using permutation methods in females using shrinkage=0.01, bag fraction=1, interaction depth=7, cross-validation fold=10, train fraction=0.8 and n.minobsinnode=15 as hyperparameters from grid search. HDL, high-density lipoprotein; LDL, low-density lipoprotein.

Figure 3 Results for males: partial dependency plots for predicting test-positive results for first five most influential variables using permutation methods in males using shrinkage=0.001, bag fraction=1, interaction depth=3, cross-validation fold=10, train fraction=0.8 and n.minobsinnode=5 as hyperparameters from grid search. BMI, body mass index; HDL, high-density lipoprotein; LDL, low-density lipoprotein.

Machine learning-based prediction models for SARS-CoV-2 positive test
The AUCs for test positivity in the overall, male-specific and female-specific models were 0.570 (95% CI: 0.537 to 0.604), 0.575 (95% CI: 0.529 to 0.621) and 0.561 (95% CI: 0.512 to 0.609), respectively. The performance of the gradient boosted decision tree models is summarised in table 3. Figures 1-3 illustrate the order of variable importance and partial dependence plots used for interpreting the results and direction of each variable in the models. The prediction models for the overall study population suggest that an increased LDL to HDL ratio, WHR and age were associated with a higher likelihood of test positivity. Additionally, individuals who worked night shifts or lived in a more deprived area (lower environment and education scores) as well as those participating in social activities, including sports clubs, bars and religious groups, had a higher risk of having a positive test. In contrast, individuals with higher education levels, higher income, and those with daily or almost daily alcohol consumption were less likely to have a positive result.
The sex-specific models for test positivity showed that gender factors were more important in females, whereas in males, biological factors were significant contributors to test positivity. Females who lived in more deprived areas (increased environment score), had increased LDL to HDL ratio, worked night shifts and had a greater number of family members in their household were more likely to test positive. Moreover, those with income greater than 100 000 were less likely to test positive (figure 2). In comparison, males with greater BMI and LDL to HDL ratio, those living in a more deprived area (greater score) and those of black British ethnicity were more likely to test positive (figure 3).

Machine learning-based prediction models for COVID-19-related hospitalisation
The AUCs for hospitalisation in test-positive patients in the overall, male-specific and female-specific models were 0.60 (95% CI: 0.534 to 0.665), 0.544 (95% CI: 0.453 to 0.635), and 0.612 (95% CI: 0.532 to 0.692), respectively. The performance of the gradient boosted decision tree models is summarised in table 4. Among the 1326 test-positive patients, 932 (70.3%) were hospitalised (females: 413 (44.3%)). Figures 4-6 illustrate the order of variable importance and partial dependence plots used for interpreting the results and direction of each variable in the models. The result of the overall model to predict hospitalisation in test-positive patients revealed that those with higher HbA1c level, older age, greater BMI, higher LDL to HDL ratio and greater number of medications had greater risk of being hospitalised (figure 4).
The sex-stratified models revealed that older age, a higher level of HbA1c, LDL to HDL ratio, a greater number of medications and higher housing score (showing more deprived areas) were the most influential variables in predicting hospitalisation in test-positive females (figure 5), whereas older age, an increased HbA1c level, WHR, LDL to HDL ratio and BMI were the most influential variables associated with hospitalisation in test-positive males (figure 6).
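The reported proportions can be cross-checked from the counts given in the text, and they also give the no-skill AUPRC baseline, which the Methods note equals the positive-class rate. The short Python sketch below uses only the counts quoted above; the variable names are illustrative.

```python
tested = 4510
test_positive = 1326
hospitalised = 932
hospitalised_females = 413

positivity_rate = test_positive / tested
hospitalisation_rate = hospitalised / test_positive
female_share_hospitalised = hospitalised_females / hospitalised

# A random (no-skill) classifier has an AUPRC equal to the positive-class rate,
# so these are the baselines against which an AUPRC would be read.
auprc_baseline = {
    "test positivity": round(positivity_rate, 3),
    "hospitalisation": round(hospitalisation_rate, 3),
}
```

Because 70.3% of test-positive patients were hospitalised, an AUPRC for the hospitalisation models would need to exceed roughly 0.70 to indicate any skill, compared with roughly 0.29 for the test positivity models.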

DISCUSSION
The present evaluation of individuals tested for SARS-CoV-2 in a UK cohort demonstrates the importance of gender-related factors along with clinical characteristics in predicting COVID-19 test positivity, hence providing guidance for identifying vaccination priority groups in the general population. While factors related to the gender role of individuals were the most influential determinants in females, cardiometabolic risk factors played a key role in males. Such a sex-specific cluster of factors associated with adverse outcomes was attenuated when considering the rate of COVID-19-related hospitalisation among test-positive individuals. Notably, older age and cardiometabolic diseases, including diabetes, obesity and dyslipidaemia, were most influential regardless of sex.
Emerging evidence has shown sex differences in contracting the infection and in its severity. While most investigations have focused on biological factors as the potential culprit, few have incorporated gender determinants. Various modulating mechanisms have been suggested, including genetic factors (hormone-regulated expression of genes), the difference in innate and adaptive immune responses, as well as gendered factors such as lifestyle, behavioural and psychosocial factors. 11 13 39 Our findings reinforce and advance the current evidence related to the impact of metabolic comorbidities and older age in the COVID-19 pandemic. Obesity has been linked with impaired pulmonary function and suppression of immune response and has been recognised as one of the most important factors in contracting COVID-19 infection, the severity of the disease and mortality. 4 40 Higher BMI and WHR are some of the more influential measures that have shown a dose-response relationship with test positivity and disease severity. 4 6 40 A study on the UKB data demonstrated a more than 50% increase in COVID-19 infection in obese and severely obese patients compared with non-obese individuals. 40 Investigations of different databases in various countries have also reported older age and comorbidities as the most important factors associated with clinical severity and mortality. 41 Specifically, diabetes has been associated with more severe disease manifestation. 42 This is consistent with our results, which demonstrated HbA1c level as one of the most prominent factors for hospitalisation in the overall and sex-stratified models. Preliminary studies have demonstrated the association between HDL level, infection and disease severity. While the mechanism of this correlation is unknown, this can be due to the antioxidant, antithrombotic and anti-inflammatory role of HDL cholesterol. 43
A major and novel contribution of this study is the application of a sex-oriented and gender-oriented lens to inform the understanding of COVID-19 infection by conducting sex-disaggregated analyses and incorporating both sex-related and gender-related determinants. Studies have shown socioeconomic disadvantages such as living in a more deprived area and lower education to be associated with increased risk of infection and disease severity 2 4 ; these factors prevail in individuals more likely to work in service-based occupations, be self-employed or live in crowded households. 3 4 44 Moreover, less access to healthcare is another factor that leads to greater infection risk and worse outcomes. By the same token, our study demonstrated that those who live in a more deprived area with lower SES and education were more susceptible to COVID-19 infection, but the impact of gender determinants was more significant among females. Women's role as caregivers within family and society increases their risk of infection. 39 Moreover, women are more likely to work as frontline workers, including nursing positions, increasing their exposure to the virus. 3 22 39 In our current study, environment score, household arrangement, working night shifts and income were among the most important factors for females. In contrast, obesity, LDL to HDL ratio and alcohol consumption were among the most influential factors in men.
The results of this study serve as an important guide for vaccination prioritisation policies. While essential workers and elderly individuals have already been targeted for vaccination, the next step will be the identification and prompt vaccination of individuals in higher risk groups in the general population. Although factors such as diabetes and obesity might be important, psychosocial risk factors such as lower SES, education level, living in a more deprived environment, risky occupation and household crowding should also be taken into account. Individuals, especially females, exhibiting such high-risk gendered factors should be prioritised for vaccination.

Strengths and limitations
A unique feature of the study is the investigation of numerous lifestyle, socioeconomic, mental and behavioural factors representing different dimensions of gender in conjunction with clinical and laboratory factors made feasible through machine learning algorithms. Furthermore, the assessment of sex-stratified algorithms to elucidate the most influential factors in both sexes adds novel information as it has rarely been done to date.
The study should also be interpreted considering some limitations. The first limitation of this study is selection bias due to the lack of systematic and random testing across the UK. Second, this analysis was done using the baseline data that were collected from 2006 to 2010 and not data collected at the time of COVID-19 infection diagnosis. Therefore, misclassification of determinants is possible. However, previous studies of the UKB have shown a high correlation between baseline and follow-up data for a subsample of patients who had further visits for imaging. 22 Moreover, a disproportionately higher representation of Caucasians in the study may make the results less generalisable to other ethnic groups. Finally, though we attempted to test various clinical, social and demographic factors to predict test positivity, it is essential to note that acquisition of infection is a multifactorial phenomenon that cannot be easily encoded using a small set of variables. Our relatively low predictive model performance reflects, as expected, other influences not captured in the UKB. Similar results were obtained using the XGBoost method in a recent study on the UKB dataset, whereby slightly superior performance to our gradient boosting machine model was obtained, which further supports the interpretation that this is the expected accuracy for predictive models on this dataset. 18 While similar features were obtained for predicting severity (hospitalisation and fatality) in that study, combining mortality with hospitalisation for assessing severity may explain its better model performance compared with our models, which considered hospitalisation only. The difference observed in the performance of our model compared with the aforementioned study may also be explained by the lower power and heterogeneity in our sample. Moreover, since the predictive accuracy of the model is relatively low, the risk factors deduced may not be strong enough to predict the outcomes.
Additionally, with the emergence of the dominant delta and omicron variants, further studies are needed to elucidate the risk factors, though we suspect them to remain the same. Finally, the small sample size for this analysis may limit generalisability.

CONCLUSIONS
Sex-specific risk patterns of COVID-19 test positivity exist, with gender-related factors being more relevant in females and biological factors in males. Specifically, SES, education level, number of people living in a household and high-risk jobs were associated with a higher likelihood of contracting the infection in female individuals, whereas cardiometabolic disease and obesity were more strongly associated in males. COVID-19-related hospitalisation was associated with similar clinical factors regardless of sex. This study highlights the importance of prioritising high-risk groups using psychosocial determinants along with clinical factors as a targeted approach for vaccination of more at-risk populations to contain the SARS-CoV-2 pandemic.