## Abstract

**Objectives** Time series models are effective tools for disease forecasting. This study aims to explore the time series behaviour of 11 notifiable diseases in China and to predict their incidence through effective models.

**Settings and participants** The Chinese Ministry of Health started to publish class C notifiable diseases in 2009. The monthly reported case time series of 11 infectious diseases from the surveillance system between 2009 and 2014 was collected.

**Methods** We performed a descriptive and a time series study using the surveillance data. Decomposition methods were used to explore (1) their seasonality expressed in the form of seasonal indices and (2) their long-term trend in the form of a linear regression model. Autoregressive integrated moving average (ARIMA) models were established for each disease.

**Results** Hand, foot and mouth disease ranked first among the diseases studied in both reported cases and deaths. It occurred most often in May and July and increased, on average, by 0.14126/100 000 per month. The incidence models showed good fit, except those for influenza and hydatid disease. Both the hydatid disease and influenza series became white noise after differencing, so no ARIMA model could be fitted for these two diseases.

**Conclusion** Time series analysis of surveillance data is useful for better understanding the occurrence of the 11 types of infectious disease.

### Strengths and limitations of this study

The incidence of 11 notifiable infectious diseases in China from 2009 to 2014 was analysed.

Decomposition methods were used to explore (1) their seasonality expressed in the form of seasonal indices and (2) their long-term trend in the form of a linear regression model.

Except for autoregressive integrated moving average (ARIMA) models for influenza and hydatid disease, the incidence models show good fit.

We could only obtain class C notifiable disease incidence over a period of 6 years because the Chinese Ministry of Health only started to publish these data from 2009. The relatively short length of the series may affect the forecasting efficacy of the time series modelling.

## Background

Infection surveillance in China has improved since 2003, with a web-based infection surveillance system, covering the largest population in the world, replacing the previous system over 10 years ago.1 ,2 This web-based surveillance system can report cases of infection, and more infections than previously, in a timely fashion.3 It potentially saves lives and maintains the health of the whole population. The quality of the surveillance has greatly improved, with the average omission rate decreasing to 13%.4 This surveillance system currently monitors 39 notifiable infectious diseases, which can be divided into three classes.5 ,6 Class A includes plague and cholera, which can cause large epidemics in a very short time.7 There are no reports on time series of class A notifiable diseases, as only a few cases have been reported over several decades. Class B includes infectious diseases that might cause epidemics, such as tuberculosis, syphilis and viral hepatitis.8 We reported the incidence of class B notifiable diseases in our previous study,7 with sexually transmitted diseases, viral hepatitis and tuberculosis being population health challenges.7 Class C includes less severe and less infectious diseases such as hand, foot and mouth disease (HFMD), diarrhoea and influenza.

Various methods have been explored for modelling infection surveillance data over the last few decades, and time series models are commonly used.9 ,10 Decomposition is a typical time series method, aiming to decompose the infection series into seasonal and long-term trend patterns.9 This method has been used to analyse the seasonality and secular trend of class B notifiable infectious diseases in China.7 ,9 Autoregressive integrated moving average (ARIMA) models are among the most widely used infection time series models and have been used to model tuberculosis,11 typhoid fever,12 gonorrhoea13 and hepatitis.14 ARIMA is composed of a differencing process and an autoregressive and moving average (ARMA) model,15 ,16 which views the infection rate at time t as a linear combination of its previous values and the residuals.

The Chinese Ministry of Health has been reporting class C notifiable diseases to the public since 2009. Systematic time series analyses targeting class C notifiable diseases are greatly needed. Therefore, we performed a time series study on the monthly time series data of 11 class C infectious diseases using the decomposition method and the ARIMA model. We described the seasonality and long-term trend of each series and established a time series model for each disease.

## Data and methods

We collected the available time series data on 11 class C infectious diseases, which were reported monthly by the Chinese Center for Disease Control and Prevention from 2009 to 2014. The 11 diseases were HFMD, diarrhoea, influenza, mumps, leprosy, rubella, kala-azar, hydatid disease, typhus, conjunctivitis and filariasis. The data were analysed using the decomposition method and the ARIMA model. All analyses were performed using SAS V.9.3.

### Decomposition method

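The decomposition method was introduced in previous studies.9 The method breaks the time series into seasonal indices and a long-term trend. Let $x_{ik}$ denote the incidence in the k-th month of the i-th year. Then the seasonal index can be calculated in three steps.

Calculate the average value in each period:

$$\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}, \quad k = 1, 2, \ldots, m$$

where n is the number of time points in each period (here, the number of years) and m is the number of periods in a year (m = 12 for monthly data).

Calculate the overall average value:

$$\bar{x} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{k=1}^{m} x_{ik}$$

Calculate the seasonal index:

$$S_k = \frac{\bar{x}_k}{\bar{x}}$$

The ‘deseasonalised’ series becomes: $x_{ik}/S_k$.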
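The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper's analysis was run in SAS, the function names `seasonal_indices` and `deseasonalise` are hypothetical, the series is assumed to start in the first month of a year and to cover whole years, and the toy numbers are invented rather than taken from the surveillance data.

```python
from statistics import mean


def seasonal_indices(series, period=12):
    """Return one seasonal index per month for a series of whole years."""
    if len(series) == 0 or len(series) % period != 0:
        raise ValueError("series must cover a whole number of periods")
    n_years = len(series) // period
    # Step 1: average incidence of each calendar month across the years.
    month_means = [
        mean(series[i * period + k] for i in range(n_years))
        for k in range(period)
    ]
    # Step 2: overall average incidence.
    overall = mean(series)
    # Step 3: seasonal index = monthly average / overall average.
    return [m / overall for m in month_means]


def deseasonalise(series, indices):
    """Divide each observation by the seasonal index of its month."""
    period = len(indices)
    return [x / indices[t % period] for t, x in enumerate(series)]


# Invented toy data: two "years" of four periods each, not surveillance data.
toy = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
toy_indices = seasonal_indices(toy, period=4)
toy_flat = deseasonalise(toy, toy_indices)
```

By construction the indices sum to the number of periods, so an index above 1 marks a month with above-average incidence. A month whose index is zero would make the division in `deseasonalise` fail, so a series with months of no cases at all would need separate handling.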

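After the seasonality is removed, a simple linear regression model is established between the deseasonalised incidence and time t, which can be presented in the following formula:

$$y_t = a + bt + \varepsilon_t$$

where $y_t$ is the deseasonalised incidence at time t, a is the constant, b is the coefficient and $\varepsilon_t$ is the error term.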

The coefficient, constant, R^{2} (coefficient of determination) and p values for the regression model are estimated. The coefficient of the regression gives the average change in incidence per month.
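As a rough sketch of the trend regression, the ordinary least squares estimates of the coefficient, constant and R^{2} can be computed directly. The function name `fit_linear_trend` is hypothetical, time is assumed to be coded as t = 1, 2, ..., and the p value is deliberately left out because it needs a t distribution that the Python standard library does not provide.

```python
def fit_linear_trend(y):
    """Least-squares fit of y_t = constant + coefficient * t for t = 1..n."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    coefficient = sxy / sxx              # average change per month
    constant = y_bar - coefficient * t_bar
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum(
        (yi - (constant + coefficient * ti)) ** 2 for ti, yi in zip(t, y)
    )
    # R^2 is undefined for a perfectly flat series.
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return coefficient, constant, r_squared


# Invented example: a series rising by exactly 0.5 per month.
coef, const, r2 = fit_linear_trend([2.0 + 0.5 * t for t in range(1, 7)])
```

A positive coefficient corresponds to incidence rising on average from month to month, and a negative one to incidence falling.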

### ARIMA model

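The ARIMA model is widely used in infectious disease time series modelling. As described in previous studies,9 ,12 the model can be formed as ARIMA (p, d, q)×(P, D, Q)s, which can be expressed in the following formula:

$$\phi_p(B)\,\Phi_P(B^S)\,\nabla^d \nabla_S^D x_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t$$

where $\nabla^d = (1-B)^d$ and $\nabla_S^D = (1-B^S)^D$, where B is the backward operator ($B x_t = x_{t-1}$), with $\phi_p(B)$, $\theta_q(B)$, $\Phi_P(B^S)$ and $\Theta_Q(B^S)$ being lag polynomials. Here $x_t$ is the incidence at time t and $\varepsilon_t$ is the residual term. p and q are non-negative integers that refer to the orders of the autoregressive and moving average parts of the model, respectively, while P and Q represent the orders of the seasonal autoregressive and moving average parts, respectively. ‘d’ is the level of integration of the series, ‘D’ is the level of seasonal integration, and ‘S’ is the order of seasonality (the seasonal period, 12 for monthly data).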
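The differencing part of an ARIMA (p, d, q)×(P, D, Q)s model can be illustrated on its own. In this minimal sketch, `difference` is a hypothetical helper: applying it with lag 1 corresponds to regular differencing of order d, and applying it with lag S corresponds to seasonal differencing of order D. The numbers are invented.

```python
def difference(series, lag=1, order=1):
    """Apply x_t - x_{t-lag} to the series `order` times."""
    out = list(series)
    for _ in range(order):
        # Each pass shortens the series by `lag` observations.
        out = [out[t] - out[t - lag] for t in range(lag, len(out))]
    return out


# Regular differencing (d = 1 and d = 2) of a quadratic sequence.
squares = [1, 4, 9, 16, 25]
first_diff = difference(squares, lag=1, order=1)
second_diff = difference(squares, lag=1, order=2)

# Seasonal differencing (D = 1, S = 4) removes a repeating seasonal pattern.
pattern = [0, 5, 0, -5]
seasonal = [t + pattern[t % 4] for t in range(12)]
seasonal_diff = difference(seasonal, lag=4, order=1)
```

For a monthly series, d = 1 and D = 1 with S = 12 would be `difference(difference(x, lag=12), lag=1)`, and the ARMA part is then fitted to the differenced series.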

The ARIMA modelling procedure consists of three iterative steps: identification, estimation and diagnostic checking.17 Several ARIMA models may be identified, and the selection of an optimum model is usually based on the minimum Akaike information criterion (AIC) and Schwarz Bayesian criterion (SBC).18
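To make the selection rule concrete, the sketch below uses the standard definitions AIC = -2 ln L + 2k and SBC = -2 ln L + k ln n, where ln L is the maximised log-likelihood, k the number of estimated parameters and n the number of observations used in fitting. The helper `rank_models`, the candidate labels and the log-likelihood values are all hypothetical and are not taken from table 4.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return -2.0 * log_likelihood + 2.0 * n_params


def sbc(log_likelihood, n_params, n_obs):
    """Schwarz Bayesian criterion, which penalises parameters by ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def rank_models(candidates, n_obs):
    """candidates maps a label to (log_likelihood, n_params).

    Returns the label with the lowest AIC and the label with the lowest SBC.
    The two can disagree, in which case a judgement call is still needed.
    """
    best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
    best_sbc = min(candidates, key=lambda m: sbc(*candidates[m], n_obs))
    return best_aic, best_sbc


# Invented log-likelihoods for two hypothetical candidate models.
candidates = {
    "ARIMA(1,0,0)x(0,1,1)12": (-100.0, 2),
    "ARIMA(2,0,1)x(0,1,1)12": (-99.5, 4),
}
best_by_aic, best_by_sbc = rank_models(candidates, n_obs=60)
```

In this invented case the larger model improves the log-likelihood too little to offset its two extra parameters, so both criteria prefer the smaller model.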

Here, the ARIMA models were fitted to the data from 2009 to 2013, and the 2014 values were held out to test forecasting accuracy. Several ARIMA models were fitted, and the final ARIMA model was selected on the basis of the minimum AIC and SBC scores for each disease. The mean absolute percentage error (MAPE) and mean square error (MSE) for the forecasting data (2014) were also calculated using the final ARIMA model.
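The two accuracy measures can be sketched as follows. The function names are hypothetical, and MAPE is returned as a fraction rather than a percentage on the assumption that this matches how the values are reported in the results (for example 0.1638). The numbers are invented.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty, equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Mean square error."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty, equal length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Invented hold-out values and forecasts, not the 2014 surveillance data.
held_out = [100.0, 200.0]
predicted = [110.0, 180.0]
example_mape = mape(held_out, predicted)
example_mse = mse(held_out, predicted)
```

MAPE is scale-free, so it can be compared across diseases with very different incidence levels, whereas MSE is in squared incidence units and is only comparable within one series.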

## Results

First, the general descriptive analysis is presented, followed by the decomposition method and the ARIMA model results.

### Descriptive analysis

Table 1 shows the number of cases and deaths caused by the 11 class C notifiable diseases from 2009 to 2014. The incidence time series of each disease is shown in figure 1. In total, 20 139 572 cases and 3453 deaths were detected in the surveillance system during the 6 years. HFMD ranks first in terms of both reported cases and deaths (figure 2). The proportion of cases attributable to HFMD increased from 48% in 2009 to 68% in 2014, and HFMD accounted for over 80% of deaths each year. The number of diarrhoea cases increased from 2009 to 2013, but fell in 2014. Similarly, mumps cases increased from 2010 to 2012, and fell in 2013 and 2014. The number of influenza cases fell from 2009 to 2010, and increased from 2010 to 2014. There was a conjunctivitis outbreak in September and October 2010. The incidences for these 2 months were deemed to be outliers, and they were thus replaced by the mean incidence (0.2703/100 000) of September 2009 and September 2010, and the mean incidence (0.1398/100 000) of October 2009 and October 2010. The number of rubella cases, hydatid cases and leprosy cases fluctuated each year. The number of typhus cases fell from 2009 to 2014. The number of kala-azar cases increased from 2009 to 2013 and only fell in 2014. Only three sporadic filariasis cases were reported in the 6 years.

### Decomposition

Table 2 and figure 3 present the seasonal indices for each disease. The ranges of seasonal indices of rubella, HFMD, conjunctivitis, mumps and influenza were >1. In general, the occurrence of each disease was greatest during specific months as follows: rubella, April to June (peaked in May); HFMD, April to July (peaked in May); conjunctivitis, July to September (peaked in September). Typhus and hydatid disease did not peak during a specific month but occurred most often from August to October and in December, respectively. Diarrhoea and leprosy, on the other hand, peaked only in August and May, respectively. There was no fixed seasonality for the incidence of influenza: it occurred most often from September to January (autumn and winter in China) in 2009, whereas from 2011 to 2014 it occurred most often from December to April (winter and spring). There was no obvious seasonality for kala-azar disease.

Estimations of the coefficient, constant, R^{2} and p values for the regression model are shown in table 3. The regression models for influenza, mumps and hydatid disease showed no significance (p>0.05), and R^{2} for the leprosy model was low (R^{2}=0.055). Of the class C notifiable diseases, HFMD incidence increased most rapidly, by an average of 0.14126/100 000 per month. The incidence of rubella, typhus and kala-azar decreased after removal of seasonality.

### ARIMA model

The results of the ARIMA estimations, MAPE and MSE for each disease time series are shown in table 4. The final selected ARIMA model (based on minimum AIC and SBC scores) is shown in bold font. The fitting and forecasting performance of each model is shown in figure 4. The incidence of hydatid disease and influenza after differencing was random (white noise), so no ARIMA model could be fitted for these two diseases. The other disease series were well fitted. MAPEs for conjunctivitis, typhus and HFMD were as expected (0.1638, 0.1819 and 0.3497, respectively). Those for mumps, rubella, kala-azar and diarrhoea were slightly high (0.5417, 0.6948, 0.6837 and 0.6838, respectively).
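One common way to check whether a differenced series is white noise is the Ljung-Box Q statistic, Q = n(n+2) Σ r_k²/(n−k) summed over lags k = 1..h, where r_k is the lag-k sample autocorrelation. The text does not state which test was used, so the sketch below is an assumption for illustration only, with hypothetical function names and invented data. Under the white noise hypothesis, Q is compared with a chi-square distribution with h degrees of freedom (5% critical value 3.84 for h = 1).

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given positive lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return num / denom


def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    return n * (n + 2) * sum(
        autocorrelation(series, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Invented, strongly autocorrelated series: it alternates between 1 and -1.
alternating = [1.0, -1.0] * 10
r1 = autocorrelation(alternating, 1)
q1 = ljung_box_q(alternating, max_lag=1)
```

Here Q is far above 3.84, so this toy series is clearly not white noise. A series for which Q stays below the critical value at the lags examined carries no autocorrelation for an ARMA model to capture.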

## Discussion

Infection surveillance is important in infectious disease management and prevention. In this paper, we use the infection surveillance data to show the infection characteristics of the 11 class C notifiable diseases in China. Of these diseases, HFMD is the most serious in terms of both incidence and death rate, which agrees with previous studies.19 ,20 The disease is caused by enterovirus and coxsackievirus, which are very prevalent in children under the age of five. HFMD can cause herpes in the hands, feet and mouth, as well as other complications such as myocarditis, pulmonary oedema and aseptic meningoencephalitis.21 Some severely affected patients may die because of the rapid progress of the disease. From 2009 to 2014, more than 11 million HFMD cases were detected leading to 3086 fatalities. HFMD appears to occur most often from April to July (peaking in May), and increased, on average, by 0.14126/100 000 per month with seasonality removed. Strategies for the control and prevention of HFMD include promoting healthcare education, improving hygiene conditions in hospitals and schools, and strengthening the control of cross-infection.21

Seasonal patterns are a major cornerstone in understanding subtle but drastic effects of climate change on disease dynamics.7 ,22 From the present analysis of surveillance data on China's population, we conclude that rubella, HFMD and diarrhoea most frequently occur in summer, whereas conjunctivitis and typhus are most prevalent during summer and autumn, and hydatid disease incidence peaks in winter. There is no fixed seasonality for influenza incidence, and no obvious seasonality was detected for kala-azar.

When there is substantial heterogeneity among different years, conclusions on seasonal patterns based solely on seasonal indices may not be reliable. This may be the case for conjunctivitis, hydatid disease and influenza, as outbreaks in some years may not have been related to seasonal effects. We calculated the seasonal indices for each year as a comparison (see online supplementary appendix figure A1) by dividing the incidences by the average incidence for the corresponding year. The seasonal patterns for influenza in 2009 were slightly different from other years. The incidence of influenza generally peaked in December, January and March, but in 2009 it peaked in September and November. The incidence of conjunctivitis peaked in August and September in 2010, 2011 and 2014, but peaked in June to August in the other years. Hydatid disease showed strong seasonality with consistent peaks in December for every year analysed.
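The per-year comparison described above, in which each monthly incidence is divided by the average incidence of its own year, can be sketched as below. The function name `yearly_seasonal_indices` is hypothetical, the series is assumed to cover whole years starting in the first month, and the numbers are invented.

```python
def yearly_seasonal_indices(series, period=12):
    """Return one list of indices per year: incidence / that year's mean."""
    if len(series) == 0 or len(series) % period != 0:
        raise ValueError("series must cover a whole number of periods")
    result = []
    for start in range(0, len(series), period):
        year = series[start:start + period]
        year_mean = sum(year) / period
        result.append([x / year_mean for x in year])
    return result


# Invented toy data: two "years" of two periods each; the second year has
# twice the incidence but the same within-year pattern.
toy_years = yearly_seasonal_indices([1.0, 3.0, 2.0, 6.0], period=2)
```

Because each year is scaled by its own mean, years with very different overall incidence become comparable, and a shift in the peak month from one year to another shows up directly.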

The surveillance data used in this study covered different climate zones and different provinces with diversified urbanisation levels. The heterogeneity of seasonality not only exists in different years, but also occurs in different geographic regions. To support the conclusion, we take influenza data as an example and calculate the regional seasonal indices using the monthly data (http://www.phsciencedata.cn/Share/en/index.jsp) of 31 Chinese provinces from 2009 to 2012 (see online supplementary appendix table A1). The seasonal patterns are slightly different among the different provinces.

The long-term patterns of the 11 types of class C infectious disease have also been shown with a linear regression model between the deseasonalised series and time t. The model shows that rubella, typhus and kala-azar decreased after removal of seasonality with the improvement of public health surveillance and management. The regression model is useful for understanding long-term epidemic trends, which can be applied to forecast future incidence, greatly facilitating management of public health resources such as vaccine preparation.7

An ARIMA model has been established for each disease. All of the incidence models show good fitting performance except those for influenza and hydatid disease. Influenza is a well-known typical infectious disease with seasonal trend23 (range 1.05 in table 2); however, the ARIMA model cannot be applied to it. Possible reasons are the heterogeneity among the different epidemic periods and climate zones mentioned above. The incidence series of hydatid disease becomes white noise after differencing, suggesting that the occurrence of the disease is random without seasonal impact.

The forecasting accuracy is not ideal compared with that reported for some other diseases such as typhoid fever12 and syphilis.24 This may be because the relatively short length of the time series reduced the predictive power of the model. This is one of the limitations of the study. More time series data need to be collected for future exploration. Besides collecting more data, relevant secondary variables, such as average monthly temperatures, would provide information about the underlying reasons for seasonal patterns of outbreaks and may enhance future predictions. We collected the average monthly temperature time series data from the National Bureau of Statistics of China and calculated the time series correlation coefficients24 between influenza incidence and average temperature. As this paper mainly focuses on univariate time series analysis, we have placed these results in online supplementary appendix figure A2 as supporting information. A modest correlation (0.33) was observed between the average temperature and the disease series. In a future study, we will collect data on more environment and weather variables to enhance the infection predictions.
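A correlation between two monthly series can be computed at lag 0 or with one series shifted against the other. The text does not say exactly how the coefficients were calculated, so the sketch below assumes a plain Pearson correlation between x at time t and y at time t − lag. The function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(x, y, lag=0):
    """Correlation between x[t] and y[t - lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag:
        x, y = x[lag:], y[:-lag]
    return pearson(x, y)


# Invented series: incidence follows the driver exactly one step later.
driver = [1.0, 2.0, 3.0, 5.0, 9.0]
incidence = [0.0, 1.0, 2.0, 3.0, 5.0]
r_lag1 = lagged_correlation(incidence, driver, lag=1)
```

Scanning a few lags in this way would show whether incidence tracks temperature in the same month or with a delay.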

## Acknowledgments

XZ gratefully acknowledges financial support from the China Scholarship Council. We thank Susann Beier for careful proofreading of the manuscript.

## Footnotes

Contributors XZ, TZ, LZ and FH conceived and designed the experiments. XZ and TZ collected the data and performed the statistical analysis. XZ, FH, ZQ, XL, LZ, YL and TZ participated in drafting the manuscript including data analysis and interpretation of results. All authors read and approved the final manuscript.

Funding TZ was supported by Sichuan University grant ‘the Fundamental Research Funds for the Central Universities’ (grant number 2016SCU11006) and the National Natural Science Foundation of China (grant no.81602935). The research is funded by the National Science and Technology Major Special Project ‘Data mining and analysis of the surveillance data of five syndrome pathogens (grant number 2012ZX10004201-006)’. XZ was supported financially by the China Scholarship Council for his doctoral studies.

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.

Data sharing statement The dataset is available from the corresponding author at scdxzhangtao@163.com.
