Original research
Predictors of incident SARS-CoV-2 infections in an international prospective cohort study
  1. Anthony Lin1,
  2. Eric Vittinghoff2,
  3. Jeffrey Olgin1,
  4. Noah Peyser1,
  5. Sidney Aung1,
  6. Sean Joyce1,
  7. Vivian Yang1,
  8. Janet Hwang1,
  9. Robert Avram1,
  10. Gregory Nah1,
  11. Geoffrey H Tison1,
  12. Alexis Beatty1,
  13. Ryan Runge1,
  14. David Wen1,
  15. Xochitl Butcher1,
  16. Cathy Horner1,
  17. Helena Eitel1,
  18. Mark Pletcher2,
  19. Gregory M Marcus1
  1. 1Department of Medicine, University of California San Francisco, San Francisco, California, USA
  2. 2Epidemiology and Biostatistics, University of California San Francisco, San Francisco, California, USA
  1. Correspondence to Dr Gregory M Marcus; Greg.Marcus{at}

Abstract

Objective Until effective treatments and vaccines are made readily and widely available, preventative behavioural health measures will be central to the SARS-CoV-2 public health response. While current recommendations are grounded in general infectious disease prevention practices, it is still not entirely understood which particular behaviours or exposures meaningfully affect one’s own risk of incident SARS-CoV-2 infection. Our objective is to identify individual-level factors associated with one’s personal risk of contracting SARS-CoV-2.

Design Prospective cohort study of adult participants from 26 March 2020 to 8 October 2020.

Setting The COVID-19 Citizen Science Study, an international, community and mobile-based study collecting daily, weekly and monthly surveys in a prospective and time-updated manner.

Participants All adult participants over the age of 18 years were eligible for enrolment.

Primary outcome measure The primary outcome was incident SARS-CoV-2 infection confirmed via PCR or antigen testing.

Results 28 575 unique participants contributed 2 479 149 participant-days of data across 99 different countries. Of these participants without a history of SARS-CoV-2 infection at the time of enrolment, 112 developed an incident infection. Pooled logistic regression models showed that increased age was associated with lower risk (OR 0.98 per year, 95% CI 0.97 to 1.00, p=0.019), whereas increased number of non-household contacts (OR 1.10 per 10 contacts, 95% CI 1.01 to 1.20, p=0.024), attending events of at least 10 people (OR 1.26 per 10 events, 95% CI 1.07 to 1.50, p=0.007) and restaurant visits (OR 1.95 per 10 visits, 95% CI 1.42 to 2.68, p<0.001) were associated with significantly higher risk of incident SARS-CoV-2 infection.

Conclusions Our study identified three modifiable health behaviours, namely the number of non-household contacts, attending large gatherings and restaurant visits, which may meaningfully influence individual-level risk of contracting SARS-CoV-2.

  • COVID-19
  • public health
  • health informatics

Data availability statement

Data are available upon reasonable request. All data relevant to the study are included in the article or uploaded as supplemental information.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This large international cohort study with more than 2.4 million participant-days of data from participants in 99 different countries provides unprecedented geographical diversity for a study analysing individual-level factors associated with risk of SARS-CoV-2.

  • All participants included in this study were free of SARS-CoV-2 infection early in the pandemic, allowing for real-time ascertainment of significant individual-level behaviours and exposures related to higher risk of incident infection.

  • Using PCR or antigen testing as the gold standard for SARS-CoV-2 infections relied on a participant’s development of symptoms, index of suspicion and access to testing facilities, but ensured our study identified risk factors associated with true infection and increased specificity over traditional methods of symptom reporting.

Introduction

The novel coronavirus (SARS-CoV-2) global pandemic has created a major public health crisis for nearly every country and community in the world. Responses to mitigate transmission have varied by government, but have generally been grounded in known respiratory virus disease prevention practices. Current strategies have included a combination of social distancing, limitations to travel and public gatherings, increased handwashing practices, and use of face masks. While these interventions are believed to reduce human-to-human transmission, efforts to study these interventions have been limited as they rely on individual-level behaviours that are dynamic with policy changes and can be difficult to capture at scale. Furthermore, the politicisation of social distancing recommendations1–3 makes it difficult to fully understand levels of compliance at the individual level and calls for a larger evidence base for recommendations like handwashing, face mask wearing, and limiting human contact, large social gatherings and visits to restaurants. Identifying predictors of infection requires a longitudinal cohort study. The information gleaned from the longitudinal characterisation of SARS-CoV-2 infection risk factors may be crucial to understanding which strategies are most effective and can further inform public policy. Moreover, such data may help elucidate the individual behaviours directly under one’s control to influence one’s personal risk of contracting SARS-CoV-2.

While previous prospective studies have focused primarily on symptom detection and the constellation of symptoms associated with SARS-CoV-2 infection,4–7 mobile technology provides an opportunity to study the effects of various exposures and behaviours that can be ascertained prospectively, repeatedly and in nearly real time. The majority of previous research regarding SARS-CoV-2 has focused on hospitalised individuals, primarily those who already have the disease, and predictors of disease severity as opposed to those pertinent to developing infection. This is not surprising as accumulating sufficient numbers to characterise non-infected individuals at baseline and then follow them over time is generally time-consuming and would require enrolment of particularly large numbers to derive useful results. While systematic reviews and meta-analyses of previous studies have investigated the efficacy of behavioural interventions,8 9 we are not aware of a longitudinal cohort study in which risk factors have been characterised in detail prior to infection and exposures and behaviours tracked as individuals contracted (or did not contract) SARS-CoV-2 in the community.

Given the widespread use of smartphones and associated mobile apps, the technology is now available to regularly query large populations to assess patterns in SARS-CoV-2 infection rates based on individual-level exposures and behaviours. We have previously demonstrated the utility of this technology in characterising ambulatory cardiovascular risk factors.10–14 In this study, we sought to use prospectively collected information from the COVID-19 Citizen Science Study to identify individual characteristics, exposures or behaviours associated with an increased risk of contracting SARS-CoV-2.

Methods

Study design

The COVID-19 Citizen Science Study is a mobile application that enables longitudinal and time-updated collection of health survey and location data from thousands of global participants. The application was developed by a team of investigators at the University of California, San Francisco using the Eureka Digital Research Platform. Enrolment began on 26 March 2020 and is ongoing ( The current analysis included participant information collected until 8 October 2020. Enrolment is available to all adults over the age of 18 years and has been facilitated by press releases, social media and word of mouth.

Informed electronic consent was obtained remotely using the mobile application at the time of study enrolment.

Data collection

Surveys collected information about demographics, medical comorbidities, SARS-CoV-2 infection status, daily behaviours, environmental or social exposures, and symptoms. Surveys were written in English and met the Flesch-Kincaid criteria for eighth-grade reading level ( Participants received a baseline survey at the time of enrolment ascertaining general demographic information such as age, race/ethnicity, sex, education level, MacArthur subjective social status, occupation, smoking patterns, presence of children or pets at home, and pre-existing medical comorbidities. After completing the baseline survey, participants then received daily surveys that enquired about current symptoms, household contacts and non-household contacts; weekly surveys that assessed changes to individual-level behaviours such as sleep, exercise, social distancing efforts, hand hygiene and use of face masks while out in public; and monthly surveys that collected information regarding employment, mood and alcohol consumption (online supplemental appendix 1).

The MacArthur Subjective Social Status Ladder was used as a previously validated single-item question to capture the socioeconomic status of study participants, with higher point ratings indicating higher subjective social status.15 16 Occupation was dichotomised based on working in healthcare or not. Exercise was defined as self-reported physical activity that lasted at least 20 min and resulted in heavy breathing or ‘break[ing] a sweat’, and was categorised into never/rarely, <1 time/month, <1 time/week, approximately weekly, 2–4 days/week and >4 days/week. Alcohol use was categorised into none, >0–7 standard drinks per week, >7–14 standard drinks per week and >14 standard drinks per week. Smoking activity was differentiated by use of cigarettes, e-cigarettes or marijuana and then dichotomised by any use in the last 30 days or not. Daily contacts were defined as any non-household individuals with whom the participant came within 1.83 meters (6 feet) during the course of the day.
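As an illustration of the alcohol categorisation described above, the following is a minimal sketch in Python. The function name and the category labels are hypothetical and chosen only for this example; the study's analyses were performed in Stata, and this is not the authors' code.

```python
def alcohol_category(drinks_per_week: float) -> str:
    """Map weekly standard drinks to the four categories named in the text.

    The function name and label strings are illustrative assumptions.
    """
    if drinks_per_week < 0:
        raise ValueError("drinks_per_week must be non-negative")
    if drinks_per_week == 0:
        return "none"
    if drinks_per_week <= 7:
        return ">0-7"      # more than 0, up to and including 7
    if drinks_per_week <= 14:
        return ">7-14"     # more than 7, up to and including 14
    return ">14"


if __name__ == "__main__":
    for drinks in (0, 3, 7, 10, 20):
        print(drinks, alcohol_category(drinks))
```

Each upper bound is treated as inclusive, which follows the ‘>0–7’ and ‘>7–14’ notation used in the text.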

Participants were queried regarding PCR or antigen testing at baseline and during the weekly survey. Using triggered logic, related questions distinguished between evidence of active infection with the PCR test from other tests, such as antibody tests (the latter were not considered sufficient to constitute incident infection). All participants who reported a positive PCR or antibody test for SARS-CoV-2 prior to enrolling in the study were excluded from this analysis. Self-reported positive PCR tests for SARS-CoV-2 were validated by contacting a sample of participants and obtaining documentation of test results (online supplemental appendix 2).

Patient and public involvement

The COVID-19 Citizen Science Study, which remains open to any interested adult with a smartphone, was designed to answer questions most relevant to patients and the lay public, with an emphasis on identifying clinically relevant behaviours and exposures that can be modified or influenced by any individual. The study was launched using the National Institutes of Health-supported Eureka Digital Research Platform, which was heavily influenced by prior work designing and implementing the Health eHeart Study17—from the beginning, these studies have included patients as key stakeholders, such as the Patient-Centered Outcomes Research Institute-supported Health eHeart Alliance,18 to assure that the user experience was relatable and understandable to interested participants around the world. Modifications of questions and the basic content of some research questions were derived from participant feedback received ad hoc and as a result of campaigns to solicit novel research questions from participants for incorporation into the study. All participants in the COVID-19 Citizen Science Study are encouraged to help with recruitment, with regular reminders via text messages, push notifications and newsletters to share the link and/or ‘text back’ with friends and family members. Results are disseminated back to COVID-19 Citizen Scientists in the form of data visualisations and text shared via newsletters, the study website and links sent via text message or app-based push notification.

Statistical analyses

Baseline continuous variables are presented using mean and SD or median and IQR, while categorical variables are presented as frequencies (percentages), and compared between participants who reported incident infection and those remaining infection-free using t-tests for continuous variables and χ2 tests for categorical variables. Pooled logistic regression models for repeated SARS-CoV-2 test results self-reported on the weekly surveys were used to identify factors, obtained from the baseline and earlier weekly and daily surveys, associated with incident infection. We considered demographics; pre-existing medical conditions; behavioural contributors such as mask wearing, hand hygiene and social distancing efforts; and individual exposures such as number of non-household contacts, large gatherings, and visits to gyms, restaurants and movie theatres. Exposures from earlier weekly and daily surveys were averaged over measurements obtained 4–21 days prior to the weekly survey providing the SARS-CoV-2 test result. All variables associated with SARS-CoV-2 infection with p values <0.1 in the pooled logistic regression models adjusting for only a three-knot restricted cubic spline in calendar date were included in a fully adjusted pooled logistic regression model. In a sensitivity analysis, backward deletion was used to select a more parsimonious pooled logistic regression model retaining covariates with p values <0.05. These models all used robust SEs to account for clustering of the repeated weekly SARS-CoV-2 test results by participant. Additionally, recognising the importance of geographical location, sensitivity analyses restricted to US participants were performed accounting for clustering by county-based Federal Information Processing Standards (FIPS) codes and zip codes. All analyses used complete case data. Two-tailed p values <0.05 were considered statistically significant. All statistical analyses were performed using Stata V.16.
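The exposure averaging described above (measurements obtained 4–21 days before the weekly survey that reported the test result) can be sketched as follows. This is a minimal Python illustration, not the authors' Stata code. The function name, the data layout (a mapping from survey date to reported value) and the choice to treat both ends of the window as inclusive are assumptions made for this example.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Mapping, Optional


def lagged_exposure_mean(
    daily_values: Mapping[date, float],
    test_survey_date: date,
    min_lag_days: int = 4,
    max_lag_days: int = 21,
) -> Optional[float]:
    """Average an exposure over the window 4-21 days before a weekly survey.

    Both window boundaries are treated as inclusive (an assumption).
    Returns None when no measurement falls inside the window, which mirrors
    a complete-case approach in which that record would be dropped.
    """
    window_start = test_survey_date - timedelta(days=max_lag_days)
    window_end = test_survey_date - timedelta(days=min_lag_days)
    in_window = [
        value
        for survey_day, value in daily_values.items()
        if window_start <= survey_day <= window_end
    ]
    return mean(in_window) if in_window else None


if __name__ == "__main__":
    # Hypothetical daily counts of non-household contacts for one participant.
    contacts = {
        date(2020, 6, 8): 50,   # 22 days before the survey: outside the window
        date(2020, 6, 9): 4,    # 21 days before: inside
        date(2020, 6, 26): 2,   # 4 days before: inside
        date(2020, 6, 28): 100, # 2 days before: outside
    }
    print(lagged_exposure_mean(contacts, date(2020, 6, 30)))
```

Excluding the most recent days keeps exposures that could not plausibly have caused the reported result, or that may have changed in response to symptoms, out of the average. That reading of the 4-day lag is an interpretation and is not stated in the text.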

Results

After excluding 628 participants with prevalent SARS-CoV-2 infection, 28 575 individuals without a history of SARS-CoV-2 infection at baseline contributed 2 479 149 participant-days of data to the COVID-19 Citizen Science Study across 99 different countries, including all 50 states in the USA (figure 1). The mean proportion of participants who completed at least one health survey during a study week was 88.6%±5.0% and the mean proportion of participants who completed at least one health survey during a study month was 98.1%±1.6% (online supplemental tables 1 and 2). Of the total study population, 112 participants (0.4%) developed a SARS-CoV-2 infection during the study period. Differences in participant demographics, baseline comorbidities, behaviours and exposures between participants who became infected during the study period and those who did not are displayed in table 1.
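The crude incidence figures implied by these counts can be reproduced with a short calculation. The sketch below is in Python and the function names are hypothetical. The rate per 100 000 participant-days is derived here for illustration and is not reported in the article. It also assumes that all participant-days count as time at risk.

```python
def cumulative_incidence(cases: int, participants: int) -> float:
    """Proportion of participants who developed an incident infection."""
    return cases / participants


def incidence_rate(cases: int, participant_days: int, per: int = 100_000) -> float:
    """Crude incidence rate per `per` participant-days of follow-up."""
    return cases / participant_days * per


if __name__ == "__main__":
    cases = 112
    participants = 28_575
    participant_days = 2_479_149

    proportion = cumulative_incidence(cases, participants)
    rate = incidence_rate(cases, participant_days)
    print(f"Cumulative incidence: {proportion:.2%}")
    print(f"Rate per 100 000 participant-days: {rate:.2f}")
```

The cumulative incidence rounds to the 0.4% reported above.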

Figure 1

Location of all study participants. The blue shading represents the number of participant-days by county within the USA and by nation in the world. The red shading illustrates all participants infected by SARS-CoV-2 during the study period.

Table 1

Demographics, comorbidities and behavioural risk factors of participants in the COVID-19 Citizen Science Study assessed at the time of enrolment, divided by participants who later tested positive for COVID-19 during the study period and participants who did not

After adjusting only for age, sex, race/ethnicity and calendar date, older age, higher education level, higher subjective social status and increased alcohol use were associated with lower risk, while working in healthcare, a history of HIV, e-cigarette use, lower exercise frequency, increased number of recent contacts, attending gatherings with at least 10 people, and visiting movie theatres and restaurants were each associated with a higher risk of incident SARS-CoV-2 infection (table 2). Importantly, pertinent factors that failed to exhibit statistically significant relationships included common medical comorbidities like hypertension, diabetes, coronary artery disease, congestive heart failure, atrial fibrillation, asthma or chronic obstructive pulmonary disease, as well as handwashing practices and mask wearing frequency. Pooled logistic regression models that incorporated all eligible predictors showed that increased age was associated with lower risk of developing a SARS-CoV-2 infection, whereas increased number of contacts, attending events of at least 10 people and visits to restaurants were associated with significantly higher risk of later testing positive for SARS-CoV-2 (figure 2). Backward stepwise deletion did not change any of the statistically significant relationships (online supplemental table 3). Similarly, the sensitivity analysis using county-based FIPS and zip codes as random effects in USA-based data did not meaningfully change the results (online supplemental tables 4 and 5).
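The odds ratios in the abstract are scaled per 10 units of exposure (for example, OR 1.95 per 10 restaurant visits, 95% CI 1.42 to 2.68). The Python sketch below shows how such an estimate can be rescaled to a single unit and how an approximate Wald z statistic can be recovered from a reported CI. The function names are hypothetical. The calculation assumes the log-odds are linear in the exposure and that the CI is symmetric on the log scale, as is usual for logistic regression output. It is a reader's check, not part of the authors' analysis.

```python
import math


def rescale_odds_ratio(odds_ratio: float, block_size: float, new_size: float = 1.0) -> float:
    """Rescale an OR reported per `block_size` units to an OR per `new_size` units.

    Assumes the log-odds are linear in the exposure.
    """
    return odds_ratio ** (new_size / block_size)


def wald_z_from_ci(odds_ratio: float, ci_low: float, ci_high: float,
                   z_crit: float = 1.959964) -> float:
    """Approximate Wald z statistic recovered from an OR and its 95% CI.

    Assumes the interval is symmetric on the log-odds scale.
    """
    standard_error = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    return math.log(odds_ratio) / standard_error


if __name__ == "__main__":
    per_visit = rescale_odds_ratio(1.95, block_size=10)
    z = wald_z_from_ci(1.95, 1.42, 2.68)
    print(f"OR per single restaurant visit: {per_visit:.3f}")
    print(f"Approximate Wald z: {z:.2f}")
```

Under these assumptions the per-visit odds ratio is about 1.07, and the recovered z statistic exceeds 3.29, which is consistent with the reported p<0.001.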

Figure 2

Forest plot of all eligible predictors in pooled logistic regression models. Higher scores in the MacArthur Subjective Social Status Ladder reflect participants with self-reported higher socioeconomic standing. Large gathering was defined as any gathering in which 10 or more people were present. For predictors marked with an asterisk (*), the reference group was non-Hispanic white participants.

Table 2

Minimally adjusted odds of incident SARS-CoV-2 infection

Discussion

Among an international cohort free of SARS-CoV-2 at baseline and tracked longitudinally, prospectively and in a time-updated manner, increased number of daily non-household contacts within 1.83 meters (6 feet), events of 10 or more individuals and restaurant visits each independently predicted a higher risk of developing SARS-CoV-2 infection. Increased age was associated with a lower risk of subsequently developing SARS-CoV-2 infection.

As of 22 March 2021, there have been over 123 million confirmed cases of SARS-CoV-2 and over 2.7 million SARS-CoV-2-related deaths worldwide.19 The pandemic has been exacerbated by a recent resurgence of a ‘second wave’ of SARS-CoV-2 cases and confirmation of new strains with potentially increased transmissibility. The pandemic has spurred international efforts to improve testing capabilities,20 identify therapies to treat the novel coronavirus21 and develop vaccines designed to prevent it.22 23 Even as vaccines from biopharmaceutical companies like Pfizer and Moderna are being delivered, distribution to members of the public has been slow in nearly every country and community, with only countries like Israel, the United Arab Emirates, Chile and the UK managing to administer at least 40 vaccine doses per 100 people.24 Until and if production, distribution, administration and acceptability of approved vaccines can satisfy the overwhelming need throughout the international community, the identification of preventative health behaviours under an individual’s control is crucial to the SARS-CoV-2 public health response.

The COVID-19 Citizen Science Study launched on 26 March 2020 and has been ongoing while recommendations to limit disease transmission continue to evolve at variable rates across the globe. The study has been prospectively collecting data through the initial shelter-in-place recommendations in early 2020 and continues to capture changes in behavioural health patterns as the second spike of SARS-CoV-2 infections mounts. Our study observed a higher risk of SARS-CoV-2 infection among individuals who reported higher numbers of recent contacts. In a similar vein, increased attendance of events of 10 or more people and restaurant visits were associated with increased odds of developing SARS-CoV-2 infection. Given our general understanding of disease transmission for respiratory viruses and recent research characterising the asymptomatic transmission of SARS-CoV-2,25 26 these findings are bolstered by biological plausibility. They add to previous research supporting the use of government-mandated physical distancing policies to reduce SARS-CoV-2 incidence27 28 and suggest that behaviours to minimise human-to-human interaction could be effective means to lower one’s individual risk of contracting SARS-CoV-2. To our knowledge, this is the first longitudinal cohort to determine that such behaviours among individuals prior to infection actually influence risk.

While the lower risk among older individuals may at first glance appear counterintuitive, it may reflect greater adoption of protective behaviours and compliance with social distancing among older individuals, especially given data reporting high incidence of SARS-CoV-2 in nursing homes29 as well as disproportionately higher rates of hospitalisation and death in older populations infected with SARS-CoV-2.30 31 If such phenomena were operative, the fact that we were unable to detect differences in such behaviours (such as significant relationships with hand hygiene or mask wearing) may be due to collinearity with age and/or suboptimal ascertainment of the actual protective approaches used by older individuals. Also contrary to most reports, medical comorbidities thought to increase one’s risk of morbidity and mortality from SARS-CoV-2,32 33 such as hypertension, diabetes, congestive heart failure, chronic obstructive pulmonary disease, cancer and history of myocardial infarctions, were not retained predictors in our multivariable models, suggesting that prior comorbidities may affect one’s response to SARS-CoV-2, but may not play a large role in an individual’s risk of contracting SARS-CoV-2.

While previous studies have observed benefits in universal masking at the community level,34 35 our study did not reveal a clear association between an individual’s mask wearing behaviour and their risk for SARS-CoV-2 infection. Similarly, self-reported frequency of handwashing did not consistently correlate with SARS-CoV-2 incidence. Simple frequencies of mask wearing and handwashing behaviours may be too confounded or measured too imprecisely to observe a consistent trend in our data. Additionally, the higher prevalence of healthcare workers in the study population may have resulted in participants having higher rates of mask wearing and handwashing, but also higher risk of infection, thereby degrading any associations between predictor and outcome. As such, these negative results should be interpreted cautiously in the context of the study design, and insufficient power may render negative results (or lack of associations) less informative than the statistically significant relationships (positive results) that have been observed thus far (even if in the absence of a longitudinal cohort with time-updated assessments as described here).

Our study has a number of important limitations to note. While focusing on individual-level behaviours mitigated issues involving compliance compared with studies examining state-level or country-level government mandates, self-report is still a subjective process and still prone to bias based on differing definitions of qualitative words (ie, ‘sometimes’ vs ‘most times’). However, health survey data were ascertained prospectively and time-updated daily and weekly to minimise recall bias, and self-report remains likely the most effective method to ascertain individual-level behaviours. As the study required smartphone ownership and use, it is possible that the COVID-19 Citizen Science Study participants represent a more affluent and more technologically savvy population compared with the general population. Although this would limit generalisability instead of internal validity, our diverse recruitment methods were meant to mitigate risks of sampling bias. The distribution of study participants throughout nearly 100 different countries and every state in the USA provides fairly unprecedented geographical diversity for a study that also ascertains participant-reported behaviours. There are innumerable behaviours that could have been asked about on surveys; we limited our questioning to behaviours previously identified by national and international health organisations and/or those with some biological plausibility as effective means of prevention, such as social distancing, handwashing and use of face masks.

While PCR testing for SARS-CoV-2 relies on a participant’s development of symptoms, index of suspicion and available access to a testing facility, all factors that may have led to under-reporting of all SARS-CoV-2 infections in the study population, the use of these tests to identify SARS-CoV-2 infections ensured that our analyses identified risk factors associated with true infection and increased specificity over traditional methods of symptom reporting. Because identification of predictors was determined by testing for statistical significance, we acknowledge that the effect sizes for some of the identified covariates may be small and of questionable clinical relevance. However, this approach enabled us to be as inclusive as possible without constraining potentially relevant predictors based on preconceived assumptions. Finally, all data in the COVID-19 Citizen Science Study were collected prospectively as an observational study. While this allows for diverse and rapid sampling of a large population to inform global efforts combating the SARS-CoV-2 pandemic, it remains prone to residual and unmeasured confounding.

In conclusion, the COVID-19 Citizen Science Study, in its prospective and time-updated collection of health data, has identified readily modifiable behaviours that may increase one’s individual risk of contracting SARS-CoV-2. Increased number of contacts within 1.83 meters (6 feet), events of 10 or more people and visits to restaurants each independently predicted higher risk of contracting SARS-CoV-2 during the pandemic, while one’s demographics, prior medical comorbidities, and adherence to handwashing and face mask wearing were not significant predictors of SARS-CoV-2. During a resurgence of SARS-CoV-2 and continued strain on local governments to balance transmission risk with restrictions on daily life, our study provides community leaders and members of the public with at least three modifiable health behaviours within an individual’s control that may lower one’s personal risk of contracting SARS-CoV-2 during this pandemic.

Ethics statements

Patient consent for publication

Ethics approval

The study was approved by the University of California, San Francisco Institutional Review Board (#17-21879).




  • Twitter @Robertavrammd

  • Contributors GMM, JO and MP are the principal investigators for the COVID-19 Citizen Science Study and obtained funding for the study. The COVID-19 Citizen Science Study software platform was developed and maintained by NP, SJ, VY, JH, RR, DW, XB, CH and HE. AL, NP, SJ, VY, JH, RR, DW, XB, CH and HE assisted in data collection. AL, EV, JO, NP, SA, SJ, VY, JH, RA, GN, GHT, AB, RR, DW, XB, CH, HE, MP and GMM interpreted the data. AL, EV and GMM wrote the initial manuscript. AL, EV, SJ and GN made the figures. AL, EV, JO, NP, SA, SJ, VY, JH, RA, GN, GHT, AB, RR, DW, XB, CH, HE, MP and GMM provided critical comments during analysis of the data, revised the manuscript and approved the final manuscript for submission.

  • Funding This work was supported by IU2CEB021881-01 and 3U2CEB021881-05S1 from the NIH/NIBIB, to GMM, JO and MP.

  • Map disclaimer The inclusion of any map (including the depiction of any boundaries therein), or of any geographic or locational reference, does not imply the expression of any opinion whatsoever on the part of BMJ concerning the legal status of any country, territory, jurisdiction or area or of its authorities. Any such expression remains solely that of the relevant source and is not endorsed by BMJ. Maps are provided without any warranty of any kind, either express or implied.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
