Can online self-reports assist in real-time identification of influenza vaccination uptake? A cross-sectional study of influenza vaccine-related tweets in the USA, 2013–2017
  1. Xiaolei Huang1,
  2. Michael C Smith2,
  3. Amelia M Jamison3,
  4. David A Broniatowski2,
  5. Mark Dredze4,
  6. Sandra Crouse Quinn3,5,
  7. Justin Cai6,
  8. Michael J Paul1,6
  1. Department of Information Science, University of Colorado, Boulder, Colorado, USA
  2. Department of Engineering Management and Systems Engineering, George Washington University, Washington, District of Columbia, USA
  3. Center for Health Equity, School of Public Health, University of Maryland, College Park, Maryland, USA
  4. Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
  5. Department of Family Science, School of Public Health, University of Maryland, College Park, Maryland, USA
  6. Department of Computer Science, University of Colorado, Boulder, Colorado, USA

Correspondence to Michael J Paul; mpaul{at}


Introduction The Centers for Disease Control and Prevention (CDC) spend significant time and resources tracking influenza vaccination coverage each influenza season using national surveys. Emerging data from social media offer an alternative means of monitoring influenza vaccination coverage at both national and local levels in near real time.

Objectives This study aimed to characterise and analyse the vaccinated population from temporal, demographical and geographical perspectives using automatic classification of vaccination-related Twitter data.

Methods In this cross-sectional study, we continuously collected tweets containing both influenza-related and vaccine-related terms over four consecutive influenza seasons from 2013 to 2017. We created a machine learning classifier to identify relevant tweets, then evaluated the approach by comparing its estimates with data from the CDC’s FluVaxView. We limited our analysis to tweets geolocated within the USA.
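The collection criterion described above (a tweet must contain at least one influenza-related term and at least one vaccine-related term) can be illustrated with a minimal sketch. The term lists and the function names `tokenize` and `matches_both` are hypothetical placeholders chosen for illustration; the actual terms used in the study are not given here, and this filter is not the machine learning classifier itself.

```python
import re

# Hypothetical term lists for illustration only; the study's actual
# keyword lists are not stated in this abstract.
FLU_TERMS = {"flu", "influenza"}
VACCINE_TERMS = {"shot", "shots", "vaccine", "vaccinated", "vaccination"}


def tokenize(text):
    """Lowercase the text and return the set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_both(text):
    """Return True if the text has at least one term from each list."""
    tokens = tokenize(text)
    return bool(tokens & FLU_TERMS) and bool(tokens & VACCINE_TERMS)


if __name__ == "__main__":
    examples = [
        "Got my flu shot today",
        "I think I have the flu",
        "New vaccine news this week",
    ]
    for tweet in examples:
        print(tweet, "->", matches_both(tweet))
```

A keyword filter of this kind only narrows the stream of tweets; deciding whether a matching tweet actually reports receiving a vaccination is the job of the classifier.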

Results We assessed 1 124 839 tweets. We found a strong correlation of 0.799 between monthly Twitter estimates and CDC data, with correlations as high as 0.950 in individual influenza seasons. Our approach obtained geographical correlations of 0.387 at the US state level and 0.467 at the regional level. Finally, we found a higher level of influenza vaccine tweets among female users than among male users, consistent with the results of CDC surveys on vaccine uptake.
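To make the comparison concrete, the following is a minimal sketch of how a correlation between two monthly series could be computed. It assumes a Pearson correlation coefficient, which the abstract does not specify; the function name `pearson` is hypothetical and the numbers in the usage example are made-up placeholders, not study data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up placeholder values: monthly counts of vaccination tweets
    # and monthly survey-based uptake percentages.
    twitter_monthly = [120, 450, 900, 610, 300, 150]
    survey_monthly = [2.0, 9.5, 17.0, 11.0, 5.5, 3.0]
    print(round(pearson(twitter_monthly, survey_monthly), 3))
```

The same function would apply to the geographical comparison by passing per-state or per-region values instead of monthly ones.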

Conclusion Significant correlations between Twitter data and CDC data show the potential of using social media for vaccination surveillance. Temporal variability is captured better than geographical and demographical variability. We discuss potential paths forward for leveraging this approach.

  • health informatics
  • world wide web technology
  • public health
  • epidemiology

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

  • Patient consent for publication Not required.

  • Contributors XH, MCS, DAB, MD, SCQ and MJP contributed to the design of the study. XH, JC, MD and MJP contributed to data collection. XH, MCS, JC, DAB and MJP performed data analysis. XH, AMJ, DAB, SCQ and MJP interpreted the results. All authors contributed to the editing of this manuscript.

  • Funding Preparation of this manuscript was supported in part by the National Institute of General Medical Sciences under award number R01GM114771 to DAB and SCQ and by the National Science Foundation under award number IIS-1657338 to XH and MJP.

  • Competing interests MD and MJP hold equity in Sickweather Inc. MD has received consulting fees from Bloomberg LP, and holds equity in Good Analytics Inc. These organisations did not have any role in the study design, data collection and analysis, decision to publish or preparation of the manuscript. All other authors declare no competing interests.

  • Ethics approval This work was conducted under Johns Hopkins University Homewood IRB No. 2011123: ‘Mining Information from Social Media’, which qualified for an exemption under category 4.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement All Twitter data used in this study are available in the following repository: This repository contains the annotations for training the classifiers, as well as the classifier inferences on the full data set. This also contains the extracted metadata, including demographics and location. In accordance with the Twitter terms of service, raw tweets are not shared, but identifiers are shared which can be used to download the tweets.