Using big data to improve cardiovascular care and outcomes in China: a protocol for the CHinese Electronic health Records Research in Yinzhou (CHERRY) Study

Introduction Data based on electronic health records (EHRs) are rich with individual-level longitudinal measurement information and are becoming an increasingly common data source for clinical risk prediction worldwide. However, few EHR-based cohort studies are available in China. Harnessing EHRs for research requires a full understanding of data linkages, management, and data quality in large data sets, which presents unique analytical opportunities and challenges. The purpose of this study is to provide a framework to establish a uniquely integrated EHR database in China for scientific research. Methods and analysis The CHinese Electronic health Records Research in Yinzhou (CHERRY) Study will extract individual participant data within the regional health information system of an eastern coastal area of China to establish a longitudinal population-based ambispective cohort study for cardiovascular care and outcomes research. A total of 1 053 565 Chinese adults aged over 18 years were registered in the health information system in 2009, and there were 23 394 deaths from 1 January 2009 to 31 December 2015. The study will include information from multiple epidemiological surveys; EHRs for chronic disease management; and health administrative, clinical, laboratory, drug and electronic medical record (EMR) databases. Follow-up of fatal and non-fatal clinical events is achieved through records linkage to the regional system of disease surveillance, chronic disease management and EMRs (based on diagnostic codes from the International Classification of Diseases, tenth revision). The CHERRY Study will provide a unique platform and serve as a valuable big data resource for cardiovascular risk prediction and population management, for primary and secondary prevention of cardiovascular events in China. Ethics and dissemination The CHERRY Study was approved by the Peking University Institutional Review Board (IRB00001052-16011) in April 2016. 
Results of the study will be disseminated through published journal articles, conferences and seminar presentations, and on the study website (http://www.cherry-study.org).

The CHERRY study is among the first in China to establish an EHR-based research platform for investigating a wide range of important issues in the primary and secondary prevention of cardiovascular disease under real-world circumstances.
The CHERRY study is unique in being able to trace a complete life journey from birth certificates, health checks, primary care, hospitals, disease surveillance and, ultimately, death certificates in nearly 1 million adults in a general Chinese population.
Missing data and conflicting data might be the main limitations of any EHR-converted big data research in terms of data quality; however, developments in imputation for missing data within longitudinal cohorts and setting the priority of data sources for conflicting data may offer alternative solutions.
Although the CHERRY study has a relatively large number of participants, it is a regional cohort located in a developed area of China and, as such, will not be nationally representative.

Several critical factors affect the quality of the underlying data when EHRs are used for research, e.g. appropriate approaches to data management and linkages, operational definitions for exposures, ascertainment and adjudication of outcomes, methods for missing data, relevant data analysis, and valid interpretation of clinical or public health integration and utility. These are all essential to generate robust findings. 6

4. How does the regional health system deliver on various CVD prevention practices in terms of improvements in cardiovascular outcomes?

The CHERRY study is a unique, population-based observational research resource, aimed at improving cardiovascular care and outcomes in China. A particular focus will be on cardiovascular risk prediction and population management, providing evidence to improve the primary prevention of cardiovascular events. [...] Since 2009, this regional system has covered nearly all health-related activities of everyone registered for health insurance in the region, from birth to death, including children, adolescents, pregnant women, adults and the elderly.

(Supplemental Material)
Consequently, based on the data sources in this integrated system, the CHERRY study started in 2016, extracting individual participants' data within the system to establish a natural longitudinal population-based ambispective cohort study for cardiovascular care and outcomes research. The study will comprise all participants registered in the system who: (1) were above 18 years of age on 1 January 2009; (2) had complete information on date of birth, sex, and a valid healthcare identifier; and (3) were of Chinese nationality. We chose 1 January 2009 as the date of cohort inception to avoid the integration and preliminary test period of the system and to allow for full coverage of the regional population. Once included in the CHERRY cohort, individuals remain in it until death or termination of the local health insurance (primarily due to moving out of the province). Follow-up is generally continuous in the health information system. CHERRY will update certain important information, such as vital status, clinical outcomes and claims data, for all cohort members annually from the administrative databases. A third party, i.e. Wonders [...] Endpoint Events in Clinical Trials. 10

Principal analyses will be based on events classified according to the 10th revision of the International Classification of Diseases (ICD-10) (Table 2). Attribution of death refers to the primary cause provided by cause-specific mortality from the death certificates in the health information system. Data undergo annual quality assessments. A description of the death certificates was reported previously. 11 In addition, incidence of CVD, hypertension, diabetes or cancer was also extracted from the disease surveillance and management database, where cases were required to be reported for disease management by local GPs once their diagnoses were confirmed. Diagnoses of these diseases made by doctors in all regional hospitals will be automatically sent to the patients' local GPs in the system.
Primary and secondary prevention patients were defined as those without (primary) or with (secondary) a known history of CVD (Table 2).
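
The three eligibility criteria stated above can be sketched as a simple filter. This is only an illustration, not the study's actual extraction code: the `Registrant` record, its field names and the helper functions are hypothetical, and "above 18 years of age" is read here as having reached one's 18th birthday by 1 January 2009, which is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Date of cohort inception, as stated in the protocol.
COHORT_INCEPTION = date(2009, 1, 1)


@dataclass
class Registrant:
    """Hypothetical record layout; the real system's fields are not described."""
    healthcare_id: Optional[str]
    date_of_birth: Optional[date]
    sex: Optional[str]
    nationality: Optional[str]


def age_on(dob: date, ref: date) -> int:
    """Age in completed years on the reference date."""
    years = ref.year - dob.year
    if (ref.month, ref.day) < (dob.month, dob.day):
        years -= 1
    return years


def is_eligible(r: Registrant) -> bool:
    """Apply criteria (1) to (3); the age cut-off interpretation is an assumption."""
    # (2) complete date of birth, sex and a valid healthcare identifier
    if not r.healthcare_id or r.date_of_birth is None or not r.sex:
        return False
    # (3) Chinese nationality
    if r.nationality != "Chinese":
        return False
    # (1) adult at cohort inception
    return age_on(r.date_of_birth, COHORT_INCEPTION) >= 18
```

Records with any missing identifying field are excluded before the age check, so the age calculation never sees an empty date of birth.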
For environmental exposure monitoring data, we will include exposures to major water and air pollutants from 8 environmental monitoring sites in Yinzhou, including heavy metal contamination and particles with aerodynamic diameter <2.5 µm (PM2.5). Both have been associated with increased cardiovascular mortality. 12

[...] 16 and recommendations by other studies. 15 17 In brief, data cleaning will proceed before examining exposure-outcome associations. Descriptive statistics will be used to identify values outside a plausible range, and these outliers will be set to missing. Multiple imputation will be used to impute missing values on the predictors where appropriate. The CVD prediction models will be developed from sex-specific Cox proportional hazards models. The major CVD risk factors in established prediction models, such as the Framingham risk scores, 18 will be retained in our model directly. We will then evaluate whether the predictive capability of the model is improved by inclusion of additional predictors, using measures of discrimination and reclassification. The clinical performance of the models will be assessed by discrimination statistics, calibration chi-square and plots, net reclassification improvement (NRI) and the integrated discrimination improvement (IDI) index. Cross-validation will be used to evaluate internal consistency; when available, the prediction models will also be evaluated for external validation in other independent cohort studies, such as the Fangshan Cohort Study. 19 All statistical analysis will be conducted using the SAS system [...]

[...] Developments in imputation within longitudinal cohorts may offer an alternative solution. Secondly, individual data may conflict across different sources in the health system. For example, multiple records with similar but slightly different times of diagnosis for one subject may be available from different sources owing to varying timing accuracy. The priority of the sources will be set up to resolve conflicting data. Events for one patient within a certain time range will be considered a single event; the allowed time window is disease-specific. Finally, although the study has a relatively large number of subjects, it is a regional study located in a developed area of China. The population is therefore not nationally representative.
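
The rule that records for one patient falling within a disease-specific time window count as a single event, with conflicts resolved by source priority, might be sketched as follows. The window lengths, the source names and the priority order are hypothetical placeholders; the protocol does not specify them here.

```python
from datetime import date, timedelta

# Hypothetical priority order (highest first) and window lengths in days.
SOURCE_PRIORITY = ["disease_surveillance", "emr", "disease_management", "health_insurance"]
WINDOW_DAYS = {"stroke": 28, "myocardial_infarction": 28}


def merge_events(records, disease):
    """Collapse one patient's records for one disease into single events.

    records: list of (diagnosis_date, source) tuples.
    Records within the window of an episode's first record join that episode;
    the reported date and source come from the highest-priority source.
    """
    window = timedelta(days=WINDOW_DAYS[disease])
    rank = {source: i for i, source in enumerate(SOURCE_PRIORITY)}

    episodes = []
    for rec in sorted(records):
        if episodes and rec[0] - episodes[-1][0][0] <= window:
            episodes[-1].append(rec)
        else:
            episodes.append([rec])

    merged = []
    for episode in episodes:
        best = min(episode, key=lambda r: (rank[r[1]], r[0]))
        merged.append({"date": best[0], "source": best[1], "n_records": len(episode)})
    return merged
```

Anchoring the window to the first record of each episode keeps a long chain of closely spaced records from stretching one event indefinitely; a rolling window would be an equally plausible design.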
[...] electronic health records (EHRs) and population-based cohort studies for CVD epidemiology is useful and growing. EHR-based data are rich with individual-level longitudinal measurement information. However, few studies in China to date have successfully assembled a cohort based on EHRs at the population level. Success in linking big data to a population-based cohort could promote advances both in improving cardiovascular outcomes and in facilitating health care services research. 6

Several critical factors affect the quality of EHR-based data for research, e.g., appropriate approaches to data management and linkages, operational definitions for exposures, ascertainment and adjudication of outcomes, methods for handling missing data, and valid interpretation of clinical or public health integration and utility. These are all essential to generating robust findings. 7

[...] Chinese population in nutrition transition?
3. How do the different screening strategies for targeting people at high risk of CVD perform in real-world circumstances?
4. How does the regional health system deliver various CVD prevention practices in terms of cardiovascular outcome improvement?

The CHERRY study is a unique, population-based observational research resource that is aimed at improving cardiovascular care and outcomes in China. A particular focus will be on cardiovascular risk prediction and population management, providing evidence to improve the primary and secondary prevention of cardiovascular events.
The CHERRY study is established to use longitudinal measurements of cardiovascular risk factors and disease prevention strategies in primary care among residents of Yinzhou, China. In practice, the Chinese CVD risk-assessment guidelines recommended a specific risk classification method based on the importance of risk factors on 10-year CVD risk, identified in two major cohort studies in China. 21 However, these cohorts were accrued decades ago and may not reflect the contemporary experience of the Chinese population.
That is, the rapid economic transformation (industrialization, marketization, urbanization, and globalization) in China has contributed to aging populations, unhealthy lifestyles, environmental changes, and epidemiological transitions. 3 In addition, although the recently published China-PAR (Prediction for ASCVD Risk in China) model, 22 which includes traditional CVD risk factors for CVD prediction in the Chinese population, could be a potential tool, it has not been independently validated or implemented in real clinical practice. Disparities in risk factor distributions, baseline survival, and composition of disease subtypes were observed within China. Cutoffs to be used for 5- and 10-year risk predictions in the Chinese population require more evidence from real-world circumstances. The TRIPOD guidelines also recommended that prediction models using real-world data should be developed. 17 We therefore aim to search for an up-to-date CVD risk assessment tool for a Chinese population under the current level of economic development in real-world clinical practice settings. We will also provide evidence for different screening strategies in real-world circumstances. We expect that these data will be useful in [...]

Linking big data from EHRs to a population-based cohort will be a powerful tool for investigating quality of care and improving cardiovascular outcomes. Big data studies in developed countries are generally robust.
For example, CALIBER 4 in the United Kingdom is based on integrated health care databases with both nationwide EHRs in primary care and ongoing national quality registries; Sweden has similar health systems, but its primary care is organized regionally. 23 In addition, regional primary and ambulatory care data are also available for research linkages in the CANHEART study in Ontario, Canada. 5 Unfortunately, large-scale big data research on CVD based on EHRs is currently underrepresented in China. 24 Although cardiovascular disease registries in China, such as the Chinese National Stroke Registry (CNSR), 25 [...] databases. 26 The CHERRY study has been inspired by all these studies, but differs in terms of its outstanding GP-based primary care units and unique integrated information system. To our knowledge, the CHERRY study is among the first in China to establish a research platform by linking big data across primary and secondary care and disease surveillance. In particular, it is unique in its ability to trace a complete lifetime health care journey in one million adults in a general Chinese population. CHERRY uses coded and inherently linked EHRs from primary care, hospitals, disease surveillance, and ultimately, death registries.
In particular, many EHR-based cohorts in developed countries do not have complete information on lifestyle factors (e.g., smoking status, alcohol use, and others), 27 [...]

The study has some limitations. EHR data are known to suffer from a variety of data quality problems. Conflicting data across different sources also exist in CHERRY. In the CALIBER study, 32 the completeness and diagnostic validity of myocardial infarction recording varied across four EHR sources: primary care, hospital care, disease registry, and mortality register. Of patients with non-fatal acute myocardial infarction, 31.0% were recorded in three out of four sources and 63.9% in at least two sources. Each data source missed a substantial proportion (25-50%) of myocardial infarction events. A similar situation occurred in CHERRY. In addition, multiple records with similar but slightly different times of diagnosis for one patient may be recorded from different sources owing to varying timing accuracy. Prioritization of sources in terms of conflicting data will be set up. Disease surveillance was considered the gold standard. Events for one patient within a certain time range will be considered a single event; the allowed time window is disease-specific. Secondly, missing data is one of the main limitations of any [...] number of participants, it is a regional study that is located in a developed area of China.
The study population is therefore not nationally representative.

[...] privacy, confidentiality, and informed-consent issues are being carefully studied by many parties, and solutions are still in progress. 33 As China currently has not set its [...] for information on EHR data for health research purposes, as well as seeking approval by institutional review boards (IRBs) based on international standards. For language and security reasons, foreign researchers are encouraged to apply through their Chinese partners, to facilitate international research collaborations. Although informed consent is not obtained from participants in the system, as their information is routinely collected health data, the administrative data are inherently linked using unique encrypted identifiers to ensure privacy and confidentiality by the third-party company (Wonders Information Co., Ltd.). The CHERRY study has been approved by the Peking University Institutional Review Board (IRB00001052-16011) and the local health authority. Results of the study will be disseminated through published journal articles, conferences and seminar presentations. More details will be published on the study website (http://www.cherry-study.org).

In summary, the CHERRY study has the potential to provide population-based insights into the quality and outcomes of cardiovascular care. CHERRY will serve to decrease the burden of obtaining data in a research-ready format and encourage research collaboration.

Strengths and limitations of this study
The Chinese Electronic health Records Research in Yinzhou (CHERRY) study is a large, natural population-based, observational cohort study linking big data of integrated individual-level electronic health records (EHRs).
The CHERRY study is among the first in China to establish a research platform from EHRs system for investigating a wide range of important issues regarding primary and secondary prevention of cardiovascular disease in a real-world circumstance.
The CHERRY study is unique in its ability to trace the complete lifetime health care journey using birth certificates, health checks, primary care visits, hospitalizations, disease surveillance, and ultimately, death certificates for one million adults in the general Chinese population.
Missing data and conflicting data might be the main limitations of any EHR-converted big data research in terms of data quality; however, imputation for missing data within longitudinal cohorts and setup of the priority of data sources for conflicting data may offer alternative solutions.
Although the CHERRY study has a relatively large number of participants, it is a regional cohort located in a developed area of China and as such, will not be nationally representative.

For peer review only - http://bmjopen.bmj.com/site/about/guidelines.xhtml BMJ Open

1. What are the emerging risk factors, assembling both individual-level and community-level characteristics, for the incidence of major cardiovascular events in this developed area of China?
2. What is the up-to-date and suitable CVD risk assessment model to use for the Chinese population in nutrition transition?
3. How do the different screening strategies for targeting people at high risk of CVD perform in real-world circumstances?
4. How does the regional health system deliver various CVD prevention practices in terms of cardiovascular outcome improvement?

Methods and analysis
The CHERRY study is a unique, population-based observational research resource that is aimed at improving cardiovascular care and outcomes in China. A particular focus will be on cardiovascular risk prediction and population management, [...]

[...] routine primary care services for local general practitioners (GPs). It was then gradually integrated with information on public health surveillance, population screening, disease management, health information system in hospitals and other healthcare services.
Since 2009, this regional system has covered nearly all health-related activities of residents within this region, from birth to death, including children, adolescents, pregnant women, adults and elderly people (Figure 2 and Figure S1). Now 98% of permanent residents in Yinzhou have registered in the health information system with a valid healthcare identifier. [...] owing to moving out of the province). Follow-up is generally continuous in the health information system. CHERRY will update certain important information such as vital status, clinical outcomes, and claims data for all cohort members annually from the administrative databases. A third party, Wonders Information Co., Ltd., was engaged to handle linkage and safe storage of the linked datasets, ensuring privacy protection in the CHERRY study. A description of the CHERRY research cohort in relation to data sources in the administrative health information system is shown in Figure 2 and Table S1. Each source captures a different aspect of a person's lifetime health care journey, as follows.

Socio-demographics
Basic demographic and socioeconomic information of residents stems from the population census and registered health insurance database in the health information system. Key data variables will include date of birth, sex, ethnic group (e.g., Han, Muslim), marital status, education, occupation, and household information such as income, living space, etc.

Longitudinal measurement of cardiovascular risk factors
Local GPs in Yinzhou have built up an impressive scheme of frequent health checks among adults and regular epidemiological surveys as part of routine primary care services over the 10 years since China's healthcare reform was initially launched. 9 According to the New Rural Cooperative Medical Scheme (NRCMS) in China, 10 [...] for the general population within the health check package paid for by employers, as part of employee welfare. Moreover, for adults with a history of hypertension or diabetes or individuals aged 60 years or over, a comprehensive medical examination is scheduled routinely at least once every year, for disease management. In total, 53% of individuals aged 40 years or over had at least one general health check record (e.g., blood pressure measurements) within the system. Detailed information related to CVD risk factors within the examination is extracted to CHERRY, which includes blood measurements of glucose, hemoglobin A1c (HbA1c), and lipid profiles (total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides) as well as urine testing, and so on. Additional information from outpatient and inpatient EMRs will also be supplemented. Laboratory measurements in EMRs of circulating inflammation markers (e.g., homocysteine, C-reactive protein, albumin, and leucocyte count), novel CVD-related markers [e.g., N-terminal pro b-type natriuretic peptide (NT-proBNP)], or cardiovascular imaging information (e.g., progression of coronary artery calcium) are included when available. The core variables for CVD-related factors and longitudinal measurements in the CHERRY study are listed in Table 1.

[...] Revision (ICD-10) (Table 2).
In the CHERRY study, for fatal outcomes, attribution of death refers to the primary cause provided by cause-specific mortality on death certificates in the health information system. Data undergo annual quality assessments. A description of the death certificates has been reported previously. 12 For non-fatal outcomes, multiple sources exist in the system for the outcome definition, i.e., the disease management database (primary care), EMR database (hospital care), health insurance database, and disease surveillance database (disease registry). We define the disease surveillance database as the gold standard. In Yinzhou, CVD, hypertension, diabetes, or cancer cases were required to be reported for disease surveillance and management by local GPs once the diagnoses were confirmed. Diagnoses of these diseases made by physicians in all regional hospitals will be automatically sent to the patients' local GPs in the system. Criteria used for the diagnosis of incident cardiovascular morbidity in each source are described in Table S2. We define an event as "definite" if two or more sources, excluding the health insurance database, report it as a case. An event is defined as "probable" if any source (including the health insurance database) reports it as a case. Cross-validation will be further investigated to improve data quality and diagnostic validity. Primary and secondary prevention patients are defined as those without (primary) or with (secondary) a known history of CVD (Table 2).
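
The "definite" and "probable" definitions for non-fatal events can be written as a small classification rule. The sketch below is illustrative only; the source labels and the function name are hypothetical, and the label "none" for an event reported by no source is an addition for completeness.

```python
# Hypothetical labels for the four sources named in the text.
SOURCES = {
    "disease_surveillance",  # disease registry
    "disease_management",    # primary care
    "emr",                   # hospital care
    "health_insurance",
}


def classify_event(reporting_sources):
    """Return 'definite', 'probable' or 'none' for one candidate event.

    definite: two or more sources other than health insurance report a case.
    probable: any source, including health insurance, reports a case.
    """
    reported = set(reporting_sources)
    unknown = reported - SOURCES
    if unknown:
        raise ValueError(f"unknown source(s): {sorted(unknown)}")

    non_insurance = reported - {"health_insurance"}
    if len(non_insurance) >= 2:
        return "definite"
    if reported:
        return "probable"
    return "none"
```

Every definite event also satisfies the probable definition; the function returns the stricter label so that each event falls into exactly one category.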

Environmental and ecological characteristics
For environmental exposure monitoring data, we will include exposures to major water and air pollutants from eight environmental monitoring sites in Yinzhou, including heavy metal contamination and particles with aerodynamic diameter <2.5 µm (PM2.5). These have been associated with increased cardiovascular mortality. 13 14 In addition, various meteorological conditions, such as air temperature and precipitation, from all weather stations across Yinzhou during the study period will also be available. Previous studies in China have demonstrated that both short-term (days) and longer-term (months or years) variations in temperature increase CVD morbidity and mortality. 15

[...] A total of 1,053,565 Chinese adults aged over 18 years were registered in the health information system. According to sample size requirements for prediction models, 16 [...] December 2015. Thus, the sample size is generally sufficient for the CHERRY study.

Data analysis plan
A detailed data analysis plan will follow the checklist in the Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines 17 and recommendations by other studies. 16 18 In brief, data cleaning will proceed before examining exposure-outcome associations. Descriptive statistics will be used to identify values outside a plausible range, and these outliers will then be set to missing. Multiple imputation will be used to impute missing values for the predictors, where appropriate. CVD prediction models will be developed from sex-specific Cox proportional hazards models. The main CVD risk factors in established prediction models, such as Framingham risk scores, 19 will be retained in our model directly. We will then evaluate whether the predictive capability of the model is improved by the inclusion of additional predictors, using measures of discrimination and reclassification. The clinical performance of the models will be assessed by discrimination C statistics, calibration chi-square and plots, net reclassification improvement (NRI) and the integrated discrimination improvement (IDI) index. Cross-validation will be used to evaluate internal consistency; when available, the prediction models will also be evaluated for external validation in [...] real-world circumstances. The TRIPOD guidelines also recommended that prediction models using real-world data should be developed. 17 We therefore aim to search for an up-to-date CVD risk assessment tool for a Chinese population under the current level of economic development in real-world clinical practice settings. We will also provide evidence for different screening strategies in real-world circumstances.
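As an illustration of the data-cleaning step in the analysis plan, in which values outside a plausible range are set to missing before imputation, a minimal Python sketch follows. The variable names and the numeric ranges are hypothetical placeholders chosen for the example; the protocol does not specify any cut-offs.

```python
# Hypothetical plausible ranges (low, high), inclusive.
# These cut-offs are assumptions for illustration only.
PLAUSIBLE_RANGES = {
    "sbp_mmhg": (60.0, 300.0),
    "total_cholesterol_mmol_l": (1.0, 20.0),
    "bmi_kg_m2": (10.0, 80.0),
}


def set_implausible_to_missing(record, ranges=PLAUSIBLE_RANGES):
    """Return a copy of `record` with out-of-range values replaced by None.

    Variables that have no range defined, and values that are already
    missing (None), are left unchanged. The input record is not modified.
    """
    cleaned = dict(record)
    for name, (low, high) in ranges.items():
        value = cleaned.get(name)
        if value is None:
            continue  # absent or already missing
        if not (low <= value <= high):
            cleaned[name] = None  # treat the outlier as missing
    return cleaned


if __name__ == "__main__":
    raw = {"sbp_mmhg": 900.0, "bmi_kg_m2": 24.5, "sex": "F"}
    print(set_implausible_to_missing(raw))
```

Values set to `None` here would then be handled by the imputation step, so this function only flags implausible measurements and does not attempt to correct them.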
We expect that these data will be useful in [...]

Linking big data from EHRs to a population-based cohort will be a powerful tool for investigating quality of care and improving cardiovascular outcomes. Big data studies in developed countries are generally robust. For example, the CALIBER study 4 in the United Kingdom is based on integrated health care databases with both nationwide EHRs in primary care and ongoing national quality registries; Sweden has similar health systems, but its primary care is organized regionally. 23 In addition, regional primary and ambulatory care data are available for research linkages in the CANHEART study in Ontario, Canada. 5 Unfortunately, large-scale EHR-based big data research on CVD is currently underrepresented in China. 24 Although cardiovascular disease registries in China, such as the Chinese National Stroke Registry (CNSR), 25 [...] databases. 26 The CHERRY study has been inspired by all these studies, but differs in terms of its outstanding GP-based primary care units and unique integrated information system. To our knowledge, the CHERRY study is among the first in China to establish a research platform by linking big data across primary and secondary care and disease surveillance. In particular, it is unique in its ability to trace a complete lifetime health care journey in one million adults in a general Chinese population. CHERRY uses coded and inherently linked EHRs from primary care, hospitals, disease surveillance, and ultimately, death registries.
In particular, many EHR-based cohorts in developed countries do not have complete information on lifestyle factors (e.g., smoking status, alcohol use, and others), 27 [...] of the population has at least one measurement of smoking status or alcohol use in CHERRY (Table 3).

It is known that fewer than 10 potentially modifiable risk factors might account for more than 90% of the population attributable risk of CVD worldwide. 28 29 However, disparities in the effects of individual risk factors on CVD have also been found across different populations. Although the prevalence of CVD is declining in many developed countries with effective risk-lowering strategies for cardiovascular risk factors, such as smoking cessation or salt reduction, the prevalence of CVD in China is still increasing. The current focus on CVD prevention in the latest guidelines emphasizes the use of risk assessment for appropriate prevention strategies aimed at those with a high risk of CVD. 30 This is consistent with the objectives of the CHERRY study.

According to the checklist of the TRIPOD guidelines, 17 [...]

The study has some limitations. EHR data are known to suffer from a variety of data quality problems. Conflicting data across different sources in EHR-based data also exist in CHERRY.
In the CALIBER study, 32 the completeness and diagnostic validity of myocardial infarction recording varied across four EHR sources: primary care, hospital care, disease registry, and mortality register. In total, 31.0% of patients with non-fatal acute myocardial infarction were recorded in three out of four sources and 63.9% in at least two sources. Each data source missed a substantial proportion (25-50%) of myocardial infarction events. A similar situation occurred in CHERRY. In addition, multiple records with similar but slightly different times of diagnosis for one patient may be recorded from different sources owing to varying timing accuracy. A prioritization of sources will be set up to resolve conflicting data, with disease surveillance considered the gold standard. Events for one patient within a certain time range will be considered a single event; the allowed time window is disease-specific. Second, missing data is one of the main limitations of any EHR-converted research platform in terms of data quality. In CHERRY, data completeness varies [e.g., 85.47% of people have at least one record of a body mass index (BMI) measurement and 79.07% have their educational level recorded in the system (Table 3)]. Developments in imputation within longitudinal cohorts may offer an alternative solution. Finally, although the study has a relatively large number of participants, it is a regional study located in a developed area of China. The study population is therefore not nationally representative.

EHR use is becoming routine. Responsible data sharing is currently being defined, with principles established and policies set globally, such as the Health Insurance