Purpose The depth and breadth of clinical data within electronic health record (EHR) systems paired with innovative machine learning methods can be leveraged to identify novel risk factors for complex diseases. However, analysing the EHR is challenging due to the complexity and uneven quality of the data. Therefore, we developed large electronic population-based cohorts with comprehensive harmonised and processed EHR data.
Participants All individuals 30 years of age or older who resided in Olmsted County, Minnesota on 1 January 2006 were identified for the discovery cohort. Algorithms to define a variety of patient characteristics were developed and validated, thus building a comprehensive risk profile for each patient. Patients are followed for incident diseases and ageing-related outcomes. Using the same methods, an independent validation cohort was assembled by identifying all individuals 30 years of age or older who resided in the largely rural 26-county area of southern Minnesota and western Wisconsin on 1 January 2013.
Findings to date For the discovery cohort, 76 255 individuals (median age 49; 53% women) were identified from which a total of 9 644 221 laboratory results; 9 513 840 diagnosis codes; 10 924 291 procedure codes; 1 277 231 outpatient drug prescriptions; 966 136 heart rate measurements and 1 159 836 blood pressure (BP) measurements were retrieved during the baseline time period. The most prevalent conditions in this cohort were hyperlipidaemia, hypertension and arthritis. For the validation cohort, 333 460 individuals (median age 54; 52% women) were identified and to date, a total of 19 926 750 diagnosis codes, 10 527 444 heart rate measurements and 7 356 344 BP measurements were retrieved during baseline.
Future plans Using advanced machine learning approaches, these electronic cohorts will be used to identify novel sex-specific risk factors for complex diseases. These approaches will allow us to address several challenges with the use of EHR.
- health informatics
- statistics & research methods
Data availability statement
Data are available upon reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
By capitalising on the untapped depth and breadth of clinical data available in modern electronic health record (EHR) systems, we can go beyond traditional risk factors and create comprehensive risk profiles for complex diseases.
We created an independent validation cohort of patients from a largely rural area in which to assess the generalisability of our findings from our discovery cohort.
We have biological samples and genomic data in a large subset of patients.
Using innovative machine learning methods will allow us to address several important and challenging questions associated with the use of EHR data.
One limitation of this study is that it may be difficult to develop accurate and transportable EHR phenotype algorithms for some female-specific conditions or procedures.
The wide adoption of electronic health records (EHRs) has led to an unprecedented expansion in the availability of comprehensive longitudinal datasets for research.1 Thus, EHR systems represent an untapped resource for studying life-course biology, multimorbidity and the prediction of complex diseases, such as cardiovascular disease (CVD), dementia, cancers and other ageing-related diseases. However, deriving data from EHRs is challenging and requires extensive harmonisation and processing guided by content experts.
As opposed to research cohort data sources that typically measure a limited set of factors, EHRs capture the full range of clinical data. However, analysing EHR data can be challenging due to the complex and uneven nature of clinical documentation and data quality.2 Hallmark challenges for leveraging EHR data in predictive modelling include high degrees of data sparsity, incompleteness, noise and biases.3 4 Furthermore, changing and evolving EHR systems within and between institutions add another layer of complexity. Thus, data extraction, cleaning, harmonisation, interpretation, management and analyses are major challenges for efficient EHR-based clinical studies.
However, recent advances in data science and machine learning aim to address the uneven nature of clinical documentation intrinsic to EHR data. For example, EHR-based deep learning methods have been proposed for handling missing data imputation,5 as well as for extracting high-level patient data patterns for prediction algorithms.6 Furthermore, extensive effort has been dedicated to developing advanced clinical data processing (eg, natural language processing (NLP) technologies) and data management methodologies (eg, ontology-based approaches) to facilitate EHR-based clinical studies.7 8 Importantly, NLP methods allow for the ascertainment of risk factors recorded in the medical history section of clinical notes that predate EHR systems and/or occurred at another medical centre. Moreover, EHR phenotyping algorithms incorporating multiple data types may be accurate, scalable and transportable.9 10 Thus, our goal is to capitalise on the depth and breadth of clinical data within the EHR systems to revolutionise risk prediction and to optimise personalised care for every patient.
To achieve this goal, we assembled a longitudinal cohort of adult patients in a geographically defined area in southeastern Minnesota, a state in the Upper Midwest region of the USA, to serve as the discovery cohort. Comprehensive EHR data over a 15-year period were ascertained, allowing for complete ascertainment of risk factor profiles. Thus, we have the ability to move beyond traditional risk factors to include reproductive factors, age of risk factor onset and a broader spectrum of clinical tests, diagnoses and patient-provided information (PPI). Importantly, we have also created a validation cohort of patients from a largely rural area in which to assess the generalisability of our findings and models. With detailed and rich EHR data, these population-based cohorts can be used for a wide range of studies, including but not limited to studying novel disease associations (risk factors), clusters of disease or creating sex-specific risk scores for disease prediction.
Our study uses the resources of the Rochester Epidemiology Project (REP).11 12 In brief, the REP is a records-linkage system which allows retrieval of nearly all healthcare utilisation and outcomes of residents living in Olmsted County, the home of Mayo Clinic.12 Thus, the REP captures and updates comprehensive EHR-derived phenotypic data within this population, and is uniquely positioned to characterise longitudinal disease trajectories and outcomes in communities. The electronic indexes of the REP include demographic information, diagnostic and procedure codes, healthcare utilisation data, outpatient drug prescriptions, results of laboratory tests and information about smoking, height, weight and body mass index (BMI).
Starting in 2010, the REP population expanded to include an additional 26-county region in southern Minnesota and western Wisconsin. The REP now includes medical record data from many sources of care across the region including the two largest providers of care in these areas (ie, Mayo Clinic, Mayo Clinic Health System clinics and hospitals, and Olmsted Medical Center and its affiliated clinics).11 The expansion of the REP from 1 to 27 counties in the Upper Midwest has increased the size of the population fivefold, and its adoption of innovative electronic platforms is an important asset for following our cohorts. The expanded population now offers breadth and depth of data for a large sample size, thereby providing a powerful resource for more precise risk prediction. Importantly, the recent REP expansion markedly increased the proportion of persons living in rural areas to 50%.11 Additionally, the REP region has age, sex and ethnic characteristics similar to those of the entire Upper Midwest region of the USA.11 12
Our cohorts, updated under the auspices of the expanded REP, will offer a singular opportunity to address the disproportionate burden of disease experienced by rural populations. Rural disparities have been recently underscored by the Centers for Disease Control and Prevention and the American Heart Association, which have called for studies to understand and address these disparities.13 14
All individuals 30 years of age or older who resided in Olmsted County, Minnesota on 1 January 2006 were identified for the discovery cohort (figure 1). An age cut-off of 30 was selected because ageing-related diseases are infrequent in children and adults aged 18–29. Additionally, traditional risk factors are not routinely screened in this younger population. The Mayo Clinic EHR began phasing in during the 1990s, and primary care and most specialty departments were added by 2000. Thus an index date of 1 January 2006 was selected to allow a sufficient time period for the electronic ascertainment of patient characteristics and risk factors from the EHR, for the identification of prevalent disease and for more than a decade of follow-up to assess incident or secondary events. Similarly, all individuals 30 years of age or older who were residing in the other 26 counties in southern Minnesota and western Wisconsin on 1 January 2013 were identified for the validation cohort. This region has EHR history beginning in 2010; thus, the index date of 1 January 2013 allows an ample window of time for the collection of data and follow-up.
This study was approved by the Mayo Clinic and Olmsted Medical Center Institutional Review Boards.
Baseline data were collected from 2001 to 2005 for the discovery cohort and from 2010 to 2012 for the validation cohort (figure 1). Follow-up for outcomes is ongoing for both cohorts.
Our data processing, management and algorithm development are detailed below. All data were collected via the REP, unless indicated otherwise.
Date of birth, sex, race and ethnicity were obtained. Within the REP, race is classified per the US Census: White, Black, Asian, American Indian or Alaskan Native, Native Hawaiian or Pacific Islander. Categories of ‘Other and mixed’ and ‘Unknown’ are also included.11 Ethnicity is classified per the US Census: Hispanic or non-Hispanic.
Baseline clinical measurements
The median heart rate per calendar day was used for analyses. The most recent daily median heart rate among all heart rate measurements for a person during the baseline data collection period was considered the baseline heart rate. All daily values were retained to assess associations with heart rate variability and outcomes.
All systolic blood pressure (SBP) and diastolic blood pressure (DBP) measurements with values of 0 and values that were not whole numbers were excluded. For each measurement the following criteria were applied:
If the measurement was <1000, it was kept as is.
If the measurement was between 1000 and 9999, it was assumed that it was recorded as a two digit SBP and two digit DBP and split apart.
If the measurement was ≥10 000, then it was assumed that it was recorded as a three digit SBP and a two digit DBP and split apart.
All SBP >300 and all DBP >200 were excluded.
Furthermore, some measurements had a time recorded as 00:00 as well as a real time on the same day. When this occurred, the measurement with time=00:00 was dropped. The median SBP and the median DBP per day were calculated. Any day on which the median DBP was greater than or equal to the median SBP was deleted. The most recent daily median SBP and DBP among all measurements for a person during the baseline data collection period were considered the baseline blood pressure (BP). All daily values were retained to assess BP variability and outcomes.
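The BP cleaning rules above can be sketched in code. The following is a minimal illustration, not the study's implementation: the function names, the `(datetime, sbp, dbp)` record layout and the choice to discard a whole reading when either component fails a check are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, time
from statistics import median


def split_fused(value):
    """Split a fused reading such as 12080 into (120, 80).

    Values below 1000 are kept as is and returned as (value, None).
    """
    if value < 1000:
        return value, None
    # Both fused layouts (2+2 digits and 3+2 digits) store a two-digit DBP
    # in the last two digits, so one divmod covers both rules.
    return divmod(value, 100)


def baseline_bp(readings):
    """readings: iterable of (datetime, sbp, dbp).

    Returns (date, median_sbp, median_dbp) for the most recent valid day,
    or None when no valid day remains.
    """
    by_day = defaultdict(list)
    for stamp, sbp, dbp in readings:
        # Exclude zeros and values that are not whole numbers.
        if sbp == 0 or dbp == 0 or sbp != int(sbp) or dbp != int(dbp):
            continue
        # Exclude implausible values (assumption: the pair is dropped together).
        if sbp > 300 or dbp > 200:
            continue
        by_day[stamp.date()].append((stamp.time(), sbp, dbp))

    daily = {}
    for day, rows in by_day.items():
        # Drop 00:00 entries only when a real time exists on the same day.
        timed = [row for row in rows if row[0] != time(0, 0)]
        rows = timed or rows
        med_sbp = median(row[1] for row in rows)
        med_dbp = median(row[2] for row in rows)
        # Delete days on which the median DBP is >= the median SBP.
        if med_dbp < med_sbp:
            daily[day] = (med_sbp, med_dbp)

    if not daily:
        return None
    latest = max(daily)
    return (latest, *daily[latest])


example = baseline_bp([
    (datetime(2005, 3, 1, 0, 0), 200, 100),   # 00:00 placeholder, dropped
    (datetime(2005, 3, 1, 9, 30), 120, 80),
    (datetime(2005, 3, 1, 15, 0), 130, 84),
])
```

The same daily-median pattern would apply to heart rate, without the splitting and DBP/SBP consistency steps.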
Height, weight and BMI
All heights and weights per person were extracted within ±5 years of the index date for the discovery cohort and within ±3 years of the index date for the validation cohort. Using a published method as a guide,15 heights <111.8 cm or >228.6 cm and weights <24.9 kg or >453.6 kg were excluded. For those with more than one height, any height values that met both of the following conditions were excluded: (1) the absolute difference between that particular height and the average height was greater than the SD and (2) the SD was >2.5% of the average height. For those with more than one weight, any weight that met either of the following conditions was excluded: (1) the range was >22.7 kg and the absolute difference between that specific weight and the average weight was >70% of the range or (2) the SD was >20% of the average weight and the absolute difference between that particular weight and the average weight was greater than the SD. Heights and weights during the baseline period were retained and all possible BMI combinations were calculated (weight (kg)/height (m)²). The median BMI was calculated and considered the baseline BMI. BMI values <12 or >70 kg/m² were excluded.
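As an illustration of the height, weight and BMI rules, a minimal sketch follows. Several details are assumptions rather than facts from the text: the sample SD (`statistics.stdev`) is used, the mean and SD are computed after the range exclusions, the BMI limits are applied to each height and weight combination before taking the median, and the extraction and baseline windows are not modelled. The function names are hypothetical.

```python
from itertools import product
from statistics import mean, median, stdev


def clean_heights(heights_cm):
    """Apply the range check, then the SD-based outlier rule."""
    kept = [h for h in heights_cm if 111.8 <= h <= 228.6]
    if len(kept) < 2:
        return kept
    avg, sd = mean(kept), stdev(kept)
    # Exclude a height only when BOTH conditions hold.
    return [h for h in kept if not (abs(h - avg) > sd and sd > 0.025 * avg)]


def clean_weights(weights_kg):
    """Apply the range check, then the two alternative outlier rules."""
    kept = [w for w in weights_kg if 24.9 <= w <= 453.6]
    if len(kept) < 2:
        return kept
    avg, sd = mean(kept), stdev(kept)
    spread = max(kept) - min(kept)

    def is_outlier(weight):
        dev = abs(weight - avg)
        rule_1 = spread > 22.7 and dev > 0.7 * spread
        rule_2 = sd > 0.2 * avg and dev > sd
        return rule_1 or rule_2

    return [w for w in kept if not is_outlier(w)]


def baseline_bmi(heights_cm, weights_kg):
    """Median BMI over all combinations of cleaned heights and weights."""
    bmis = [
        w / (h / 100) ** 2
        for h, w in product(clean_heights(heights_cm), clean_weights(weights_kg))
    ]
    # Assumption: implausible BMI values are removed before the median.
    bmis = [b for b in bmis if 12 <= b <= 70]
    return median(bmis) if bmis else None
```

For example, `baseline_bmi([170, 170, 170, 170, 120], [70, 72, 71, 140])` discards the 120 cm height and the 140 kg weight before computing the median.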
Smoking and tobacco use status
All prior smoking responses through the index date per person were ascertained. First, the most recent response per person was identified. If current smoker was indicated then the baseline smoking status was set to current user. Likewise, if the most recent self-report listed former smoker, then the baseline smoking status was set to former smoker. Finally, if self-report indicated never/not currently, then all prior responses were reviewed. If former smoker was indicated, then smoking status was set accordingly. Otherwise, smoking status was listed as never smoker. The same algorithm was used for tobacco use status.
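The smoking status rule can be expressed as a short function. This sketch follows the text literally; the status labels and the `(date, status)` record layout are hypothetical, and the text does not say how an earlier "current" response is treated when the most recent response is never/not currently, so that case falls through to "never" here.

```python
from datetime import date


def baseline_smoking_status(responses):
    """responses: list of (date, status), status in {'current', 'former', 'never'}.

    'never' stands for the never/not currently response.
    Returns None when there are no responses on or before the index date.
    """
    if not responses:
        return None
    ordered = sorted(responses)  # oldest first
    latest = ordered[-1][1]
    if latest in ("current", "former"):
        return latest
    # Most recent response is never/not currently: review all prior responses.
    prior = [status for _, status in ordered[:-1]]
    return "former" if "former" in prior else "never"


status = baseline_smoking_status([
    (date(2003, 5, 1), "former"),
    (date(2005, 8, 1), "never"),
])
```

The same function could be reused for tobacco use status by passing tobacco responses instead of smoking responses.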
All International Classification of Diseases, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) diagnosis codes during the baseline period were identified and extracted from the REP electronic indexes. Diagnoses were classified according to the Clinical Classifications Software (CCS), developed at the Agency for Healthcare Research and Quality.16 CCS is a tool for clustering patient diagnoses and procedures into a manageable number of clinically meaningful categories. Additionally, we used the list of 20 chronic conditions recommended by the US Department of Health and Human Services for studying multimorbidity, as defined by ICD-9 and ICD-10 codes.17 18
Procedure history was defined by identifying and extracting all Current Procedural Terminology (CPT) and ICD-9 and ICD-10 procedure codes during the baseline period. Procedures were classified according to the CCS, as described above.
Gynaecological surgeries often predate EHR systems or occurred at another medical centre; thus, we applied NLP techniques to extract mentions of them from the medical history sections of the clinical narratives of the Mayo Clinic EHR. A rule-based algorithm collects these concepts to classify the gynaecological surgery status of each patient into one of six mutually exclusive categories: ‘no surgery’, ‘bilateral oophorectomy only’, ‘hysterectomy and bilateral oophorectomy’, ‘unilateral oophorectomy only’, ‘hysterectomy and unilateral oophorectomy’ and ‘hysterectomy only’. An expansion of this process to the Olmsted Medical Center EHR is planned.
Female reproductive factors
For the women in the discovery cohort, data were extracted from the following Mayo Clinic Rochester sources: Breast Diagnostic and Cancer Clinic Questionnaire from 2005, Mammography Questionnaire from 2003 to 2005, Mammography database from 2004 to 2005 and the Current Visit Information form from 2001 to 2005. For women in the validation cohort, information in these sources, when available, will be extracted prior to index date and will be augmented with NLP.
Age at menarche
The minimum age at menarche and the most recently reported age at menarche were determined. For the women in whom the minimum did not equal the most recently reported age at menarche, the median of all reports (rounded down to a whole number) was used.
Age at birth of first child
The minimum age at birth of first child and the most recently reported age at birth of first child were determined. For the women in whom the minimum did not equal the most recently reported age at birth of first child, the median of all reports (rounded down to a whole number) was used.
Number of pregnancies and number of live births
The maximum reported number of pregnancies and the most recently reported number of pregnancies were determined. For the women in whom the maximum does not equal the most recently reported number of pregnancies, the median of all reported number of pregnancies (rounded up to a whole number) was used.
Similarly, the maximum reported number of live births and the most recently reported number of live births were determined. For the women in whom the maximum does not equal the most recently reported number of live births, the median of all reported number of live births (rounded up to a whole number) was used.
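The reconciliation of repeated self-reports (ages use the minimum and round down; counts use the maximum and round up) can be sketched as one helper. The function name and the `kind` argument are hypothetical, and reports are assumed to be ordered from oldest to most recent.

```python
import math
from statistics import median


def reconcile(reports, kind):
    """Reconcile repeated self-reports into one value.

    reports: values ordered oldest to newest.
    kind: 'age' (menarche, first birth) or 'count' (pregnancies, live births).
    """
    if not reports:
        return None
    most_recent = reports[-1]
    if kind == "age":
        extreme, rounder = min(reports), math.floor
    elif kind == "count":
        extreme, rounder = max(reports), math.ceil
    else:
        raise ValueError("kind must be 'age' or 'count'")
    # When the extreme agrees with the most recent report, use it directly.
    if extreme == most_recent:
        return most_recent
    # Otherwise fall back to the median of all reports, rounded as described.
    return rounder(median(reports))
```

For instance, `reconcile([12, 13], "age")` gives 12 (median 12.5 rounded down), whereas `reconcile([3, 2], "count")` gives 3 (median 2.5 rounded up).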
If a woman ever previously reported breastfeeding her child, then breastfeeding status was set to yes. Otherwise, if all prior reports of breastfeeding were no, then breastfeeding status was set to no.
If a woman ever previously reported menopause, then menopausal status was set to yes. Otherwise, if all prior reports of menopause were no, then menopausal status was set to no.
Between-field checks/corrections were performed. For women who reported an age at birth of their first child, but number of pregnancies=0, both fields were set to missing. For women who reported 0 pregnancies and >0 live births both fields were set to missing. For women who reported breastfeeding, but number of pregnancies=0, both fields were set to missing.
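A minimal sketch of the ever-reported rule and the between-field checks is given below. The field names are hypothetical, and evaluating every check against the original record before blanking any field is an assumption; the text does not specify the order in which the checks were applied.

```python
def ever_yes(reports):
    """True if any report is yes, False if all are no, None if no reports."""
    if not reports:
        return None
    return any(reports)


def apply_between_field_checks(record):
    """Return a copy of record with inconsistent field pairs set to None.

    Expected keys (hypothetical): 'pregnancies', 'live_births',
    'age_first_birth', 'breastfed'.
    """
    cleaned = dict(record)
    if record.get("pregnancies") == 0:
        conflicts = []
        if record.get("age_first_birth") is not None:
            conflicts.append("age_first_birth")
        if (record.get("live_births") or 0) > 0:
            conflicts.append("live_births")
        if record.get("breastfed") is True:
            conflicts.append("breastfed")
        if conflicts:
            # Both sides of each inconsistent pair are set to missing.
            for field in conflicts + ["pregnancies"]:
                cleaned[field] = None
    return cleaned
```

A record with 0 pregnancies and 2 live births would therefore come back with both fields missing, while its other fields are left untouched.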
Preterm birth and pregnancy complications
Preterm birth and pregnancy complications including gestational diabetes, gestational hypertension, preeclampsia and eclampsia are identified by diagnosis codes.
All Mayo Clinic ECG quantitative data, narratives and impressions were extracted during the baseline period. Quantitative variables collected include heart rate, P wave, PR interval, QRS interval, QT interval, QT calculated (Bazett) and QT calculated (Fridericia). In addition, raw wave forms for all ECGs are available.
Echocardiography data were retrieved through the Mayo Clinic Echocardiography database during the baseline period. Methods from prior work were used.19 Ejection fraction, interventricular septum thickness end diastole, left atrial (LA) volume end systole, LA volume index end systole, left ventricular (LV) internal dimension end diastole, LV internal dimension end systole, mitral valve systolic effective regurgitant orifice, LV mass and LV mass index values were averaged when multiple measurements were performed. E/A and E/e′ were calculated using the corresponding values. The most severe descriptor word (severe, moderate–severe, moderate, mild–moderate, mild, trivial or none) was used to define aortic regurgitation, aortic stenosis, mitral regurgitation, mitral stenosis, pulmonary regurgitation, pulmonary stenosis, tricuspid regurgitation and tricuspid stenosis. ECG rhythms including atrial fibrillation, atrial flutter and sinus rhythm were ascertained from the echocardiogram. LV size descriptor other than normal (ie, borderline, left, mild, mild–moderate, moderate, moderate–severe or severe) was classified as enlarged. Non-missing values for LV filling pressure were considered increased. LV diastolic dysfunction category (normal, grade 1, grade 1A, grade 2, grade 3, grade 3–4 and grade 4) was collected. Finally, LV wall motion score index was ascertained and when ‘no’ was indicated the score was set to 1 (normal, ie, no regional wall motion abnormalities).
All prescriptions during the baseline period were electronically ascertained. Medications were organised according to the National Drug File Reference Terminology (NDF-RT) classifications. For each NDF-RT class, a variable was created to indicate whether each person had received a prescription for that class in the 1 year prior to index.
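The per-class medication indicator might look like the sketch below. The class labels are placeholders rather than real NDF-RT identifiers, and treating "1 year" as 365 days with the index date itself excluded is an assumption.

```python
from datetime import date, timedelta


def class_indicators(prescriptions, index_date, classes):
    """prescriptions: iterable of (date, drug_class) for one person.

    Returns {drug_class: bool}, flagging classes prescribed in the
    year before the index date.
    """
    window_start = index_date - timedelta(days=365)
    seen = {
        drug_class
        for day, drug_class in prescriptions
        if window_start <= day < index_date
    }
    return {drug_class: drug_class in seen for drug_class in classes}


flags = class_indicators(
    [(date(2005, 6, 1), "class_a"), (date(2004, 6, 1), "class_b")],
    index_date=date(2006, 1, 1),
    classes=["class_a", "class_b"],
)
```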
All laboratory values were extracted from the electronic laboratory system that started in 1992. Laboratory tests were mapped to Logical Observation Identifiers Names and Codes (LOINC), which is the most widely used classification system for laboratory tests. Tests are often reported in more than 1 unit of measure and LOINC provides a unique code for each.20
Qualitative test results were harmonised such that they conformed to a uniform set of unique outcomes. For example, there are 51 unique tests for ABO blood type and Rh factor available in REP within the time period with different textual representations of the same result (eg, B POS, B POSTIVE, B, POS, POSITIVE). During harmonisation two variables were created, ABO Type (possible values of A, B, AB and O) and Rh Type (possible values of negative or positive).
Results such as ‘not performed’, ‘invalid results’, ‘unable to calculate’, etc. were dropped. The midpoint value was retained for all results reported as a range (eg, 0–2=1).
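The harmonisation steps above can be illustrated with two small parsers. This is a sketch under stated assumptions: the token rules (any word starting with POS or NEG sets the Rh type), the list of dropped strings and the function names are illustrative and are not the study's actual mapping tables.

```python
import re

# Result strings that carry no usable value (from the examples in the text).
DROPPED = {"not performed", "invalid results", "unable to calculate"}
# Matches ranges such as "0-2" or "0–2" (hyphen or en dash).
RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)\s*$")


def parse_blood_type(text):
    """Return (abo_type, rh_type); either may be None if not stated."""
    tokens = re.findall(r"[A-Z]+", text.upper())
    abo = next((t for t in tokens if t in {"A", "B", "AB", "O"}), None)
    rh = None
    for token in tokens:
        # Prefix matching tolerates misspellings such as "POSTIVE".
        if token.startswith("POS"):
            rh = "positive"
        elif token.startswith("NEG"):
            rh = "negative"
    return abo, rh


def parse_numeric_result(text):
    """Return a float, the midpoint of a range, or None if unusable."""
    cleaned = text.strip().lower()
    if cleaned in DROPPED:
        return None
    match = RANGE.match(cleaned)
    if match:
        low, high = map(float, match.groups())
        return (low + high) / 2
    try:
        return float(cleaned)
    except ValueError:
        return None
```

With these rules, "B POS", "B POSTIVE" and "B, POS" all map to ABO Type B and Rh Type positive, and a result reported as 0–2 is stored as 1.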
Patient provided information
PPI from Current Visit Information forms, which patients are asked to complete annually at Mayo Clinic, was extracted for the discovery cohort. Sociodemographic data were retrieved including: educational attainment, employment status, relationship status and with whom the patient currently lives. Functional status data were also retrieved including: does the patient have difficulty eating, dressing, using the toilet, bathing or getting in and out of bed; does the patient have difficulty climbing two flights of stairs, does he/she have home care assistance available if needed, is he/she breathing device dependent, is he/she mobility device dependent and does the patient use dentures or hearing aids. The most recent response during the baseline period for each item was retained for baseline.
A modified Katz Index21 was calculated with the following activities of daily living (ADLs): eating, dressing, using the toilet, bathing or getting in and out of bed. Patients received one point for each ADL that they could perform without difficulty; thus scores could range from 0 (low independence) to 5 (high independence).
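The scoring rule can be written down directly. In this sketch the ADL keys are hypothetical labels for the five questionnaire items, and returning no score when any item is unanswered is an assumption, since the text does not describe how missing items were handled.

```python
ADLS = ("eating", "dressing", "toileting", "bathing", "transferring")


def modified_katz(difficulty):
    """difficulty: dict mapping each ADL to True if the patient reports difficulty.

    Returns a score from 0 (low independence) to 5 (high independence),
    or None if any of the five items is missing.
    """
    if any(difficulty.get(adl) is None for adl in ADLS):
        return None
    # One point for each ADL performed without difficulty.
    return sum(not difficulty[adl] for adl in ADLS)


score = modified_katz({
    "eating": False,
    "dressing": False,
    "toileting": False,
    "bathing": True,
    "transferring": True,
})
```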
Family history of disease
All family history content was retrieved from the ‘family history’ section of unstructured clinical notes. An NLP pipeline (MedTagger) was used to extract mentions of family members.22 Disease mentions were extracted using the MetaMap API, which used the Unified Medical Language System (UMLS) dictionary 2018AA.23 24 UMLS concepts were further mapped to CCS codes. Relationships between family members and diseases were extracted using a combination of semantic rules and distance-based rules.
There are two sources of stored biological specimens on a subset of the discovery and validation cohorts. First, the Mayo Clinic Biobank is an institutional resource comprised of over 56 000 volunteers who donated biological specimens, and provided risk factor data, access to EHR data, and consent to participate in additional studies.25 Biological samples collected on each participant include DNA (median 183 µg), 4 mL serum, 12 mL plasma and an aliquot of frozen white blood cells. The second source of biological samples is the Cardiovascular Disease Repository (CaDRe). CaDRe is a collection of samples (ie, serum, plasma, DNA, buffy coat) collected historically and prospectively from patients with myocardial infarction (MI), coronary artery bypass graft (CABG) surgery, percutaneous coronary interventions (PCI), heart failure and atrial fibrillation in the Olmsted County population.26–30 Currently, approximately 13 000 persons in the discovery cohort and approximately 9000 participants from the validation cohort are participants in at least one of the above mentioned studies.
In 2019, Mayo Clinic formalised a partnership with Regeneron Pharmaceuticals called Project Generation. As part of this collaboration, exome sequencing and genome-wide association data are being generated for all participants of the Mayo Clinic Biobank and CaDRe, which includes approximately 13 000 persons from the discovery cohort and 9000 participants from the validation cohort. Although we do not have these data for everyone in the cohorts, the genomic data available can be used for ancillary studies.
Follow-up and outcomes
Patients are followed after their index date to assess disease and ageing-related outcomes. Below are details of specific outcomes that we have collected thus far.
MIs collected for a long-standing surveillance study were used for this project.27 Residents admitted to Olmsted County hospitals with a troponin T level of 0.03 ng/mL or higher were identified.27 MIs were validated using standard epidemiologic criteria which integrate cardiac pain, ECG changes and elevated biomarkers.31 The presence or absence of a change (rise or fall) between any two troponin T measurements was defined by a difference of at least 0.05 ng/mL, which is greater than the level of imprecision of the assay at all concentrations.32 Circumstances that might invalidate biomarker values were recorded.33
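The two troponin T criteria can be sketched as simple checks. The function names are hypothetical, and rounding the difference to three decimal places (to avoid floating-point artefacts at the 0.05 ng/mL boundary) assumes that assay values are reported to no more than three decimals.

```python
def meets_screening_threshold(values_ng_ml, threshold=0.03):
    """True if any troponin T value is at or above the screening threshold."""
    return any(value >= threshold for value in values_ng_ml)


def has_troponin_change(values_ng_ml, min_delta=0.05):
    """True if any two measurements differ by at least min_delta (rise or fall)."""
    if len(values_ng_ml) < 2:
        return False
    # The largest pairwise difference is max - min, so checking that one
    # difference is equivalent to checking every pair.
    spread = round(max(values_ng_ml) - min(values_ng_ml), 3)
    return spread >= min_delta
```

For example, measurements of 0.03 and 0.09 ng/mL meet both criteria, whereas 0.03 and 0.05 ng/mL meet the screening threshold but do not constitute a change.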
Up to three ECGs per episode were coded using the Minnesota Code Modular ECG Analysis System.34 According to the algorithm, MIs were classified as definite, probable, suspect or no infarction.31 35 Only incident (first-ever) cases were included in the cohort.
PCI and CABG surgery
Data were extracted from the Mayo Clinic Coronary Artery Percutaneous Intervention (PCI) registry. Because Mayo Clinic is the sole provider of coronary angiography in Olmsted County, a complete retrieval is possible via the database. By contrast, CPT codes were used to identify PCI in the validation cohort. For both cohorts, CABG surgery was identified using CPT codes.
Minnesota death certificate and National Death Index Plus data were ascertained. CVD death is defined as underlying cause of death code ICD-9 390–459 and ICD-10 I00–I99.36
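The CVD death definition translates into a range check on the underlying cause of death code. This sketch assumes codes arrive as strings with or without a decimal point and that the ICD revision is known for each record; the function name is hypothetical.

```python
def is_cvd_death(code, revision):
    """True if the underlying cause code is ICD-9 390-459 or ICD-10 I00-I99."""
    code = code.strip().upper()
    if revision == 10:
        # Any code in chapter I followed by two digits (I00 to I99).
        return len(code) >= 3 and code[0] == "I" and code[1:3].isdigit()
    if revision == 9:
        # Compare the three-digit category; E and V codes are not numeric.
        head = code.split(".")[0]
        return head.isdigit() and len(head) >= 3 and 390 <= int(head[:3]) <= 459
    raise ValueError("revision must be 9 or 10")
```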
A stroke algorithm was trained on an atrial fibrillation (AF) cohort.26 First occurrences of ischaemic stroke, transient ischaemic attack and haemorrhagic stroke after incident AF from 1 January 2000 through 31 March 2015 were identified using diagnostic codes and were validated by trained nurse abstractors who manually reviewed the clinical notes. The algorithm includes diagnosis and procedure codes electronically extracted via the REP indexes and stroke-related keywords. The algorithm was trained using random forest models, and the resulting algorithm assigned different weights (importance) to different features (ICD, CPT and keywords). The algorithm identifies stroke incidence dates with a precision of 0.900, recall of 0.918 and F-score of 0.909 in the general population.37
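As a consistency check on the reported performance, the F-score can be recomputed from precision and recall. This assumes the reported F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the text does not state explicitly.

```python
def f_score(precision, recall):
    """Balanced F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported = round(f_score(0.900, 0.918), 3)
```

Under that assumption the recomputed value rounds to 0.909, in agreement with the reported figure.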
Patient and public involvement
Patients or the public were not involved in the design, conduct, reporting or dissemination plans of this study.
Findings to date
We identified 76 255 individuals (median age 49; 53% women) 30 years of age or older, residing in Olmsted County on 1 January 2006 (table 1) for the discovery cohort. A total of 9 644 221 laboratory results; 9 513 840 diagnosis codes; 10 924 291 service/procedure codes; 1 277 231 outpatient prescriptions; 966 136 heart rate measurements and 1 159 836 BP measurements were retrieved during the baseline time period. A total of 71 222 (93%) patients had at least one clinical contact during the baseline period. The five most prevalent conditions in this cohort overall were hyperlipidaemia, hypertension, arthritis, depression and cardiac arrhythmias (table 1).
Women were slightly older than men (50 vs 49 years old) and were less likely to have a diagnosis of hyperlipidaemia, coronary artery disease, diabetes, chronic kidney disease and substance abuse (table 1). Conversely, women were more likely to be diagnosed with chronic obstructive pulmonary disease (COPD), arthritis, osteoporosis, asthma, cancer, depression, anxiety, dementia and schizophrenia.
In preliminary analyses, individuals in the discovery cohort without CVD (n=70 826) were followed from index date through 30 September 2017 for CVD-related outcomes: 1353 MIs, 1476 PCIs, 602 CABG, 912 strokes and 1770 CVD-related deaths occurred.
We identified 333 460 individuals 30 years of age or older residing in the 26-county region of southern Minnesota and western Wisconsin (median age 54; 52% women; table 2) on 1 January 2013. To date, this validation cohort includes a total of 48 587 189 laboratory results; 19 926 750 diagnosis codes; 24 843 462 services/procedures; 7 083 721 outpatient prescriptions; 10 527 444 heart rate measurements and 7 356 344 BPs during the baseline time period. A total of 303 479 (91%) patients had at least one clinical contact during the baseline period. Overall, the five most prevalent conditions were hyperlipidaemia, hypertension, arthritis, diabetes and depression (table 2).
Similar to the discovery cohort, women were slightly older than men (55 vs 54 years old). Women were less likely to have a diagnosis of hypertension, hyperlipidaemia, coronary artery disease, cardiac arrhythmias, heart failure, diabetes, stroke, chronic kidney disease and substance abuse, and were more likely to be diagnosed with COPD, arthritis, osteoporosis, asthma, cancer, depression, anxiety, dementia and schizophrenia (table 2).
Strengths and limitations
By leveraging harmonised and processed EHR data for clinical and translational research, our methods have several strengths. We are capitalising on the untapped depth and breadth of clinical data available in modern EHR systems in order to comprehensively identify risk factors of diseases, thus overcoming the inherent limitation of relying on a relatively small number of risk factors as is common in prospective research cohorts. We are using a foundational model that goes beyond traditional risk factors to include reproductive factors, age at onset of risk factors and a broad spectrum of clinical tests and diagnoses. Importantly, we have also created an independent validation cohort of patients from a largely rural area in which to assess the generalisability and transportability of our findings and models from the discovery cohort. Furthermore, in a large subset of patients, we have biological samples and genomic data. Thus, by developing and extending EHR algorithms for population research, these cohorts include a wide range of sex-specific and other important risk factors or phenotypes occurring throughout the lifespan. Furthermore, we are identifying barriers and determining best practices for implementing study results from one type of medical practice to another.
Future use of innovative machine learning methods, such as gradient boosting machine and deep learning, will allow us to address several important and challenging questions associated with the use of EHR data such as how to efficiently (1) deal with missing values, (2) assess and use a large number of variables without over-fitting, (3) learn from non-linear relationships in the data and (4) design time-to-event models. In a community EHR environment, missing values will be frequent and will, in many cases, be informative. For example, the fact that a particular test was not ordered can itself be predictive. Traditional modelling approaches, such as linear or Cox regression, do not explicitly handle missing data, and this is one reason that risk modelling has traditionally been confined to prospective research cohorts.
The biggest limitation in the utilisation of these techniques is the difficulty of developing accurate and transportable EHR phenotype algorithms for female-specific variables that are difficult to phenotype (eg, adverse pregnancy outcomes and gynaecological surgeries). Likewise, there can be challenges with determining the correct combination of gynaecological surgeries (eg, unilateral/bilateral oophorectomy with/without hysterectomy) and their timing with regard to hormone therapy. By contrast, we do not foresee issues related to identifying male-specific factors, because these conditions are diagnosis based and thus available in the EHR. Finally, we did not collect information regarding usage of over-the-counter medications or supplements and multi-vitamins.
There are some additional limitations for the validation cohort in the 26 counties of southern Minnesota and western Wisconsin. Preliminary information indicates that EHR data will be more limited for this population. In particular, PPI, including family history of disease and difficulty climbing stairs, is not routinely available electronically. In addition, because the EHR data are only available from 2010 forward, historic data on reproductive and gynaecological factors are more limited. However, this real-world validation step will assess the performance of both the phenotype algorithms used to determine risk factor status and the prediction models that include such information when available. If inclusion of historic health information significantly improves the models, we will have evidence that such information should be routinely collected during healthcare visits to adequately assess disease risk. In the future, collection of historic health information may then be incorporated into clinical practice to improve disease risk assessment. Furthermore, some healthcare encounters that occurred outside of the REP were not captured. Although coverage varies by county, the REP captures approximately 100% of the Olmsted County population compared to the US Census, whereas coverage of the 26-county population is approximately 60%.11
Finally, the availability of biological samples and genome-wide and exome sequence information on a large sample of cohort participants is a strength. However, those with biological samples and genomic information were not selected randomly from the population; therefore, they are not representative of the discovery or validation cohort.
With detailed and rich EHR data and innovative machine learning methods, the population-based cohorts described herein can be used for a wide range of studies, including but not limited to studies of novel disease associations, defining clusters of disease or creating risk scores for disease prediction.
Data availability statement
Data are available upon reasonable request.
We thank Ellen Koepsell, RN and Mary Roberts for their study support.
Contributors SJB and NBL jointly conceived the study. SJB, NBL, YZ, SM, HL, PAD and JMK handled the data management and analyses. SMM and SJB drafted the manuscript. SMM, JLS, HL, NBL, SM, PYT, JEO, WAR, VMM, TMT, CGN, VLR, YZ, PAD, JMK and SJB critically revised the manuscript for important intellectual content and approved the manuscript.
Funding This work was supported by grants from the National Heart, Lung and Blood Institute (R01 HL136659, R01 HL59205 and R01 HL72435) and the American Heart Association (11SDG7260039) and was made possible by the Rochester Epidemiology Project, Rochester, Minnesota (R01 AG034676) from the National Institute on Aging. The funding sources played no role in the design, conduct, or reporting of this study. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.