Original research
Google Trends-based non-English language query data and epidemic diseases: a cross-sectional study of the popular search behaviour in Taiwan
  1. Yu-Wei Chang1,2,
  2. Wei-Lun Chiang3,4,
  3. Wen-Hung Wang5,6,
  4. Chun-Yu Lin1,5,6,
  5. Ling-Chien Hung5,6,
  6. Yi-Chang Tsai7,
  7. Jau-Ling Suen1,8,9,
  8. Yen-Hsu Chen5,6,10,11
  1. Graduate Institute of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
  2. Department of Laboratory, Taitung Hospital, Ministry of Health and Welfare, Taitung, Taiwan
  3. Pan Media, Taipei, Taiwan
  4. OMNInsight Company Limited, Taipei, Taiwan
  5. Center for Tropical Medicine and Infectious Disease Research, Kaohsiung Medical University, Kaohsiung, Taiwan
  6. Division of Infectious Disease, Department of Internal Medicine, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
  7. Department of Laboratory, Chang-Hua Hospital, Ministry of Health and Welfare, Chang Hua, Taiwan
  8. Research Center of Environmental Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
  9. Department of Medical Research, Kaohsiung Medical University Hospital, Kaohsiung, Taiwan
  10. Department of Internal Medicine, Kaohsiung Municipal Ta-Tung Hospital, Kaohsiung, Taiwan
  11. Department of Biological Science and Technology, College of Biological Science and Technology, National Chiao Tung University, HsinChu, Taiwan

Correspondence to Dr Yen-Hsu Chen; infchen{at}gmail.com

Abstract

Objective This study developed a surveillance system suitable for monitoring epidemic outbreaks and assessing public opinion in non-English-speaking countries. We evaluated whether social media reflects social uneasiness and fear during epidemic outbreaks and natural catastrophes.

Design Cross-sectional study.

Setting Freely available epidemic data in Taiwan.

Main outcome measure We used weekly epidemic incidence data obtained from the Taiwan Centers for Disease Control and online search query data obtained from Google Trends between 4 October 2015 and 2 April 2016. To validate whether non-English query keywords were useful surveillance tools, we estimated the correlation between online query data and epidemic incidence in Taiwan.

Results With our approach, we noted that keywords 感冒 (‘common cold’), 發燒 (‘fever’) and 咳嗽 (‘cough’) exhibited good to excellent correlation between Google Trends query data and influenza incidence (r=0.898, p<0.001; r=0.773, p<0.001; r=0.796, p<0.001, respectively). They also displayed high correlation with influenza-like illness emergencies (r=0.900, p<0.001; r=0.802, p<0.001; r=0.886, p<0.001, respectively) and outpatient visits (r=0.889, p<0.001; r=0.791, p<0.001; r=0.870, p<0.001, respectively). We noted that the query 腸病毒 (‘enterovirus’) exhibited excellent correlation with the number of enterovirus-infected patients in emergency departments (r=0.914, p<0.001).

Conclusions These results suggest that Google Trends can be a good surveillance tool for epidemic outbreaks, even in Taiwan, a non-English-speaking country. Online search activity indicates that people are concerned about epidemic diseases, even if they do not visit hospitals. This prompted us to develop useful tools to monitor social media during an epidemic because such media usage reflects infectious disease trends more quickly than traditional reporting does.

  • health informatics
  • epidemiology
  • public health

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Strengths and limitations of this study

  • This study analysed the association between non-English language queries and epidemic outbreak incidence in a non-English-speaking country.

  • Public opinion during infectious outbreaks was assessed in the study.

  • This study mainly focused on influenza and enterovirus infections, and other seasonal infectious diseases were not evaluated.

  • Confounders such as educational level, age and economic conditions should be considered in the future.

  • More big data are required to comprehensively study ‘infodemiology’.

Introduction

Timeliness of response and public opinion are critical in acute epidemic disease control.1 2 Effective disease surveillance systems and crisis management public relations support public healthcare efforts and the dissemination of accurate health information.3–6 Thus, developing an early warning system for epidemics is critical. However, current epidemic surveillance systems depend on information from laboratory test results, outpatient reports and mortality statistics. Using laboratory results to develop real-time responses involves several weeks of lag before the results are reported.1 Studies have reported that prolonged delays in reporting during epidemic situations hinder efforts to prevent the spread of infectious diseases.1 7 8 Furthermore, inadequate timeliness induces negative public opinion and may cause public relations crises for governments.

With the development of the internet and social media, scientists have used data from Google Trends, health-related tweets and self-established cloud platforms to assess the spread of acute epidemic disease activity and improve individual healthcare.9–13 Estimating infectious disease levels by analysing internet activity enables more sensitive assessments than doing so by studying hospital reports because online activity indicates how concerned people are about epidemic diseases, even when they do not visit hospitals.14–16 Moreover, tracking diseases through internet activity requires less effort than is necessary to evaluate laboratory test results and hospital reports.

Human infectious diseases are often characterised by seasonal cyclicity: each acute infection tends to have a specific seasonal window of occurrence. However, the seasonality of an infectious disease may vary among geographic locations and may differ from that of other diseases within the same location.17 For instance, influenza is the major seasonal disease and remains a serious public health threat in Taiwan. The influenza season in Taiwan usually starts in December and peaks in January to February of the following year.18 In addition, enterovirus infection, a significant cause of neurological disorder and death in children, generally causes outbreaks during the summer months in Taiwan, and epidemics recur with a seasonal pattern.19

Analysis based on the relative intensity of Google keyword searches can provide near real-time data that are particularly useful in epidemic surveillance and control.20 21 Internet data analysis has certain advantages over surveys and provides options for narrowing data by country, time period and language. However, studies on the value of establishing a surveillance model to estimate epidemic diseases in non-English-speaking countries have not reached a consensus.22–24 Few effective surveillance systems for assessing infectious diseases based on internet activity have been developed, despite the ready availability and use of the internet and social media in Taiwan. Therefore, the objective of this study was to assess whether Google Trends data for non-English words, specifically Chinese words, can be used for an epidemic surveillance system and for monitoring public opinion and managing public relations.

Methods

Setting and study period

National surveillance data on influenza (4 October 2015 to 2 April 2016) and enterovirus (1 January 2012 to 29 December 2012) were obtained from the Taiwan Centers for Disease Control (TCDC), which regularly collects and manages epidemiological data from all cities and counties in Taiwan.

Data sources

Epidemiological surveillance data

A survey by the TCDC is employed for national emerging disease surveillance and disease prevention.25 For the influenza survey, epidemic data were collected and categorised by the weekly number of positive influenza tests, the ratio of emergency department patients with influenza-like illness (ILI), the ratio of outpatient department patients with ILI and weekly deaths from pneumonia and ILI. For the enterovirus survey, the ratio of emergency department patients with enterovirus infections was obtained. With respect to ethical considerations, the open data obtained from the TCDC were anonymous and publicly available.

Query data from Google Trends

Query data were obtained from the Google Trends website.26 Google Trends reports normalised results (0–100), in which each value is scaled relative to the maximum value for the particular query during the search interval.3 Using our approach, 10 non-English influenza-related and enterovirus-related search terms were included in the analysis (table 1); these were related to names and symptoms of diseases, medical equipment and drugs. For example, disease names categorised as query terms included ‘common cold’, ‘influenza’ and ‘enterovirus’. Table 1 lists the epidemic-related categories and query terms, and online supplementary table 1 lists these terms in Chinese with English descriptions. We set the location to ‘Taiwan’, the language to ‘Chinese’ and the search time interval to ‘October 2015 to March 2016’ for influenza and ‘January 2012 to December 2012’ for enterovirus, and downloaded query information from Google Trends with all search terms in traditional Chinese.
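The 0–100 scaling described above can be illustrated with a minimal sketch. It assumes that relative intensity is obtained by dividing each weekly value by the peak value in the search interval; Google's actual computation works on sampled search proportions and is not reproduced here. The function name `normalise_to_peak` and the weekly volumes are hypothetical and are not data from this study.

```python
def normalise_to_peak(raw_values):
    """Scale a series so that its maximum becomes 100 (rounded to integers)."""
    peak = max(raw_values)
    if peak <= 0:
        # No search activity at all: every week gets intensity 0.
        return [0 for _ in raw_values]
    return [round(100 * value / peak) for value in raw_values]


# Hypothetical weekly search volumes (illustrative only).
weekly_volume = [120, 300, 600, 450, 150]
relative_intensity = normalise_to_peak(weekly_volume)
print(relative_intensity)  # [20, 50, 100, 75, 25]
```

Because every series is rescaled to its own peak, values from different queries or different search intervals are not directly comparable in absolute terms; only the shape of each curve over time carries information.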

Table 1

Google Trends keywords in this study*

Patient and public involvement

No patients or members of the public were involved in the design and conduct of this study.

Statistical analysis

Initial analysis was conducted by graphically evaluating data trends according to time. Furthermore, a Pearson correlation analysis was conducted using SAS V.9.3 (SAS Institute) to examine the correlation of Google Trends with influenza-related data. One-week lag forecasting analyses were used to assess these relationships temporally (ie, correlation analysis through Google-based search of relative intensity for n weeks and of influenza-related data for n+1 weeks). Correlation coefficients >0.8 indicated excellent correlation, 0.6–0.8 indicated good correlation, 0.4–0.6 indicated moderate correlation and <0.4 indicated poor or no correlation.27–29
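The lagged correlation and the grading scheme described above can be sketched as follows. The study itself used SAS V.9.3; this pure-Python version is only an illustration. The function names (`pearson_r`, `lagged_r`, `grade`) and the example series are hypothetical, and the handling of values exactly on a band boundary (0.8, 0.6, 0.4) is an assumption because the text does not specify it.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_r(search, incidence, lag_weeks=1):
    """Correlate search intensity in week n with incidence in week n + lag_weeks."""
    if lag_weeks == 0:
        return pearson_r(search, incidence)
    return pearson_r(search[:-lag_weeks], incidence[lag_weeks:])


def grade(r):
    """Map a coefficient to the bands used in the text (boundary handling assumed)."""
    if r > 0.8:
        return "excellent"
    if r >= 0.6:
        return "good"
    if r >= 0.4:
        return "moderate"
    return "poor or no correlation"


# Hypothetical series: incidence follows search intensity one week later.
search = [1, 2, 3, 4, 5, 6]
incidence = [9, 2, 4, 6, 8, 10]
print(grade(lagged_r(search, incidence, lag_weeks=1)))  # excellent
```

Dropping the last `lag_weeks` search values and the first `lag_weeks` incidence values keeps the two series aligned, at the cost of shortening the sample by the size of the lag.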

Results

ILI surveillance study

The TCDC routinely provides open government data for epidemiological surveillance that is available for infectious disease control and preventive healthcare studies.25 First, we evaluated the benefits of influenza surveillance using anonymous logs of online search engine queries. To this end, we collected national influenza surveillance data from 4 October 2015 to 2 April 2016. During this interval, an influenza outbreak, a natural disaster and an earthquake occurred in Taiwan. This period was therefore suitable for evaluating whether Google Trends can serve as a surveillance tool for epidemic disease and public opinion. A total of 10 queries related to influenza (table 1) were chosen for use in estimating the correlation between query data from Google Trends and influenza-related data obtained from the TCDC concerning the weekly number of patients with influenza, ratio of emergency department patients with ILI, ratio of outpatient department patients with ILI and weekly deaths from pneumonia and ILI. As shown in figure 1, a peak in the number of patients with influenza was observed in February 2016 (weeks 6–9); simultaneously elevated levels were evident for the three other influenza-related data categories (figures 2–4). Figures 1–4 graphically represent the temporal relationship and illustrate evident increases in the four influenza-related data categories with a simultaneous increase in the relative intensity of Google keywords. From March 2016 (after week 10), the four influenza-related data categories and all keywords decreased simultaneously. Table 2 lists the correlation coefficients between Google keyword search intensity and influenza-related data for ‘no lag’ and ‘1-week lag’. These indicate that non-English keywords, such as 感冒 (‘common cold’, r=0.898, p<0.001), 發燒 (‘fever’, r=0.773, p<0.001) and 咳嗽 (‘cough’, r=0.796, p<0.001), had a high correlation with the weekly number of positive influenza test results. When a 1-week lag was introduced to the forecasting analysis, similar results were observed for these keywords, which again presented good to excellent correlation. Keyword search intensity was also highly correlated with ILI-related medical requests, including the ratio of emergency (or outpatient) department patients with ILI and weekly deaths from pneumonia and ILI. Among all keywords, 感冒 (‘common cold’) exhibited the highest correlation with influenza-related data in both the ‘no lag’ (r=0.898, p<0.001) and ‘1-week lag’ (r=0.900, p<0.001) analyses, indicating excellent correlation. However, the search intensity of symptom keywords, such as 流鼻水 (‘runny nose’, r=0.076–0.263) and 喉嚨痛 (‘sore throat’, r=0.639–0.783), exhibited weaker correlation with influenza-related data (table 2), which indicated that only appropriately chosen non-English (Chinese) keywords reflect influenza levels. Altogether, these results revealed that the search intensity of appropriate non-English keywords (Chinese, such as 感冒, 發燒 and 咳嗽) can reflect the real-time infectious condition of influenza, including positive rates and overall medical requests for ILI.

Figure 1

Temporal comparison of Google Trends search relative intensity and weekly number of positive influenza tests (4 October 2015 to 2 April 2016).

Figure 2

Temporal comparison of Google Trends search relative intensity and the ratio of emergency department patients with influenza-like illness (ILI) (4 October 2015 to 2 April 2016).

Figure 3

Temporal comparison of Google Trends search relative intensity and the ratio of outpatient department patients with influenza-like illness (ILI) (4 October 2015 to 2 April 2016).

Figure 4

Temporal comparison of Google Trends search relative intensity and weekly deaths of patients with pneumonia and influenza-like illness (ILI) (4 October 2015 to 2 April 2016).

Table 2

Pearson correlation coefficient values for the intensity of influenza-related query terms in Taiwan

Enterovirus 71 infection surveillance study

Enterovirus 71 (EN71) was first identified in California in the USA in 1969. Since then, EN71 has been detected worldwide.30 In Taiwan in particular, EN71 repeatedly causes life-threatening outbreaks of hand, foot and mouth disease and neurological disorders in children.31 32 Using our approach, we estimated whether query data from Google Trends can serve as a surveillance tool for EN71 infections in Taiwan. Figure 5 and table 3 indicate that the query 腸病毒 (‘enterovirus’) exhibited an excellent correlation (r=0.914, p<0.001) with the ratio of emergency department patients with EN71 infection. However, search terms such as 水泡 (‘blister’) or 發燒 (‘fever’) exhibited only moderate to poor correlation (r=0.478, p<0.001; r=0.359, p<0.001, respectively).

Figure 5

Temporal comparison of Google Trends search relative intensity and the ratio of emergency department patients with enterovirus infection in Taiwan (1 January 2012 to 29 December 2012).

Table 3

Pearson correlation coefficient values for the intensity of enterovirus-related query terms in Taiwan

Public opinion estimation

Public opinion analysis is critical for acute epidemic disease control. Moreover, public opinion regarding epidemic diseases is influenced by several external factors that can be classified into the categories of culture, media, opinion leaders and major events.33 Keywords covering standard disease names (influenza), medical equipment (extracorporeal membrane oxygenation; ECMO) and drugs (Tamiflu) were selected to estimate public opinion. As figure 6 illustrates, a severe earthquake occurred during the Chinese New Year holiday (week 6; 6 February 2016). Media and opinion leaders focused on the earthquake; therefore, the search intensity of these keywords did not increase with the peak in epidemic diseases (weeks 6–9; February 2016). After the Chinese New Year holiday, media and opinion leaders refocused on influenza, discussing influenza outbreaks, influenza vaccine policies and medical resource logistics, among other topics. These discussions by thought leaders and media affected public opinion and caused a public relations crisis for the Taiwanese government regarding epidemic policy. Thus, keyword search intensity peaked from the end of February to mid-March (weeks 9–12). A 4-week lag appeared between internet query data and epidemic advancement. Figure 6 shows the peak of the keyword 流感 (‘influenza’) in week 11 (early March 2016), with simultaneously elevated levels of the two keywords for medical equipment and drugs. Altogether, these results indicated that appropriate non-English (Chinese) keywords reflect the concerns of media and opinion leaders regarding epidemic diseases.
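One way to quantify a delay such as the 4-week lag described above is to shift the search series against the epidemic series and look for the shift that gives the highest Pearson correlation. The sketch below illustrates that idea only; it is not the procedure reported in this study, and the function names (`pearson_r`, `best_lag`) and both series are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_lag(search, incidence, max_lag):
    """Return (lag, r) for the delay of search behind incidence with the highest r."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair incidence in week n with search intensity in week n + lag.
        shifted_search = search[lag:]
        aligned_incidence = incidence[:len(incidence) - lag]
        scores[lag] = pearson_r(shifted_search, aligned_incidence)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Hypothetical weekly series: the search peak trails the incidence peak by 4 weeks.
incidence = [1, 2, 5, 9, 5, 2, 1, 1, 1, 1, 1, 1]
search = [1, 1, 1, 1, 1, 2, 5, 9, 5, 2, 1, 1]
lag, r = best_lag(search, incidence, max_lag=6)
print(lag)  # 4
```

With short weekly series, large lags leave few paired observations, so the maximum lag should be kept small relative to the length of the series.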

Figure 6

Temporal comparison of Google Trends search relative intensity and weekly number of positive influenza tests (4 October 2015 to 2 April 2016). ECMO, extracorporeal membrane oxygenation.

Discussion

Principal findings

This study confirmed that the Google search intensity of appropriate non-English (Chinese) keywords is a favourable epidemic disease surveillance tool in non-English-speaking countries. Moreover, suitable keywords related to public opinion regarding epidemic diseases, such as disease names (流感, ‘influenza’), medical equipment (葉克膜, ‘ECMO’) and drugs (克流感, ‘Tamiflu’), were useful for estimating public opinion. Online search engine data, such as those of Google Trends, are well suited for disease surveillance in developed countries, which have large populations of internet search users.34 However, few forecasting tools for epidemic diseases are based on online query data, despite the high internet usage in Taiwan. Thus, we assessed whether non-English (Chinese) keywords that appear in Google Trends can be used for an epidemic surveillance system and to monitor public opinion and manage public relations.

These results highlighted the potential use of internet activity related to epidemic diseases to coordinate supplies of medical resources and manage public opinion during influenza outbreaks. To assess online health-related information, Google Trends can combine critical data from a large spectrum of the population with geospatial data to create a surveillance system for a selected geographical area.

Studies have demonstrated the benefits of using social media in infectious disease surveillance in many countries, such as the USA, Japan, South Korea, China, Greece and Italy.16 23 35–38 These findings have revealed suitable queries in various languages for epidemic prediction and clinical studies. Based on our evidence, certain queries showed a higher correlation with epidemic data (eg, common cold, fever and cough in ILI), which may reflect what people are concerned about and their web search behaviours during an epidemic outbreak. To the best of our knowledge, our study is the first to estimate the correlation between infectious diseases and internet searches in Taiwan. However, some possible intrinsic limitations regarding the use of big data in epidemic disease surveillance should be considered. Algorithms and computational techniques on which the analysis is built and relies still need to be carefully refined, tuned and calibrated to avoid the risk of overfitting in big data inference.24 For instance, web users’ educational level, economic situation, and cultural and language backgrounds can influence users’ habits.34 In comparison with previous reports,23 we identified traditional Chinese query terms that were significantly correlated with epidemic data, including ‘common cold’ for influenza (r=0.898, p<0.001) and ‘enterovirus’ for EN71 infection (r=0.914, p<0.001). These findings suggested that an online query-based surveillance system for disease prediction is feasible with queries in the local language of Taiwan (traditional Chinese) but not with queries in simplified Chinese.

Search engines and social media enable people to share information and their experiences during crises, assess message credibility and receive confirmation of information.13 39 Internet data should be incorporated into clinical data for risk and crisis management. Furthermore, internet activity can provide quantifiable assessments of public opinion during disease outbreaks for health authorities, researchers and the media.

The social amplification of risk explains how public risk perception is formed by psychology, mass communication and cultural factors that enhance or attenuate public attention to risk. This study can be extended to quantify social uneasiness and fear during outbreaks and catastrophes and the delivery of information through social media platforms. Moreover, our approach indicates that a surveillance system based on internet activity can be an essential tool for assessing epidemic diseases and public opinion during epidemics and catastrophes in non-English-speaking countries.

Limitations

This study has some limitations. First, our findings mainly focused on ILI and EN71 infections. The forecasting effects of online query data for other seasonal infectious diseases remain unclear. In the future, we will develop more prediction models from internet-derived big data to optimise the predictive accuracy of epidemic surveillance. Second, although our research addressed pandemics that occurred in Taiwan in the past 5 years, the evaluated period was too short to represent long-term conditions well. Our approach provided additional applications for online search data collected by companies such as Google, the most-used search engine in Taiwan. Our study may encourage researchers to use ‘big data’ from social media to track and predict diseases. Third, we only enrolled single-query data for Google Trends, despite this being the main tracking resource in Taiwan. In the future, we will evaluate epidemic predictions in specific regions or language approaches to establish broad benefits for other non-English-speaking countries and use multiple big data sources, including other social media (Facebook, Twitter, Baidu or Yahoo!), local meteorology and resident consumption behaviour, to evaluate whether they provide information for ‘infodemiology’.40 41

Conclusions

Our study demonstrated that non-English (Chinese) keyword Google search intensity is related to epidemic disease levels, as is evident in people’s search behaviour. These results suggested that medical information derived from online resources could be a crucial addition to the current epidemic surveillance system in Taiwan.

References

Footnotes

  • Y-WC and W-LC are joint first authors.

  • Contributors YWC and WLC conceived and designed the project. CYL provided the clinical knowledge. YWC, WHW, LCH, YCT and JLS performed the experimental work. YWC and WLC interpreted the analysed results and drafted the manuscript. YHC is the guarantor of the integrity of the entire study and was responsible for editing and the final review of the paper. All authors have read and approved the final version to be submitted.

  • Funding This study is supported partially by Kaohsiung Medical University Research Center Grant (KMU-TC108B03), Ministry of Science and Technology, Taiwan (MOST 106-2314-B-037-087 and MOST 107-2314-B-037-079 to YHC) and Ministry of Health and Welfare, Taiwan (Project No 10965 to YWC).

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement The data are available upon reasonable request.