Protocol
Using electronic health records to enhance surveillance of diabetes in children, adolescents and young adults: a study protocol for the DiCAYA Network
  1. Annemarie G Hirsch1,
  2. Sarah Conderino2,
  3. Tessa L Crume3,
  4. Angela D Liese4,
  5. Anna Bellatorre3,
  6. Stefanie Bendik2,
  7. Jasmin Divers5,
  8. Rebecca Anthopolos2,
  9. Brian E Dixon6,7,
  10. Yi Guo8,
  11. Giuseppina Imperatore9,
  12. David C Lee2,
  13. Kristi Reynolds10,
  14. Marc Rosenman11,12,
  15. Hui Shao13,
  16. Levon Utidjian14,
  17. Lorna E Thorpe2
  18. The DiCAYA Study Group
    1. 1Department of Population Health Sciences, Geisinger, Danville, Pennsylvania, USA
    2. 2Department of Population Health, New York University Grossman School of Medicine, New York, New York, USA
    3. 3Department of Epidemiology, Lifecourse Epidemiology of Adiposity and Diabetes (LEAD), University of Colorado - Anschutz Medical Campus, Aurora, Colorado, USA
    4. 4Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, South Carolina, USA
    5. 5Department of Foundations of Medicine, New York University Long Island School of Medicine, Mineola, New York, USA
    6. 6Department of Epidemiology, Fairbanks School of Public Health, Indiana University, Indianapolis, Indiana, USA
    7. 7Center for Biomedical Informatics, Regenstrief Institute Inc, Indianapolis, Indiana, USA
    8. 8Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
    9. 9Division of Diabetes Translation, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
    10. 10Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, California, USA
    11. 11Department of Pediatrics, Ann & Robert H. Lurie Children's Hospital of Chicago, and Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
    12. 12Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
    13. 13Department of Pharmaceutical Outcomes and Policy, University of Florida College of Pharmacy, Gainesville, Florida, USA
    14. 14Department of Biomedical and Health Informatics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
    1. Correspondence to Dr Annemarie G Hirsch; aghirsch{at}geisinger.edu

    Abstract

    Introduction Traditional survey-based surveillance is costly, limited in its ability to distinguish diabetes types and time-consuming, resulting in reporting delays. The Diabetes in Children, Adolescents and Young Adults (DiCAYA) Network seeks to advance diabetes surveillance efforts in youth and young adults through the use of large-volume electronic health record (EHR) data. The network has two primary aims, namely: (1) to refine and validate EHR-based computable phenotype algorithms for accurate identification of type 1 and type 2 diabetes among youth and young adults and (2) to estimate the incidence and prevalence of type 1 and type 2 diabetes among youth and young adults and trends therein. The network aims to augment diabetes surveillance capacity in the USA and assess performance of EHR-based surveillance. This paper describes the DiCAYA Network and how these aims will be achieved.

    Methods and analysis The DiCAYA Network is spread across eight geographically diverse US-based centres and a coordinating centre. Three centres conduct diabetes surveillance in youth aged 0–17 years only (component A), three centres conduct surveillance in young adults aged 18–44 years only (component B) and two centres conduct surveillance in components A and B. The network will assess the validity of computable phenotype definitions to determine diabetes status and type based on sensitivity, specificity, positive predictive value and negative predictive value of the phenotypes against the gold standard of manually abstracted medical charts. Prevalence and incidence rates will be presented as unadjusted estimates and as race/ethnicity, sex and age-adjusted estimates using Poisson regression.

    Ethics and dissemination The DiCAYA Network is well positioned to advance diabetes surveillance methods. The network will disseminate EHR-based surveillance methodology that can be broadly adopted and will report diabetes prevalence and incidence for key demographic subgroups of youth and young adults in a large set of regions across the USA.

    • adolescent
    • epidemiology
    • diabetes & endocrinology
    • health informatics
    • paediatric endocrinology
    This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

    STRENGTHS AND LIMITATIONS OF THIS STUDY

    • Electronic health record-based surveillance systems offer a potential opportunity to obtain more efficient and timely information on disease prevalence and incidence than is obtained from traditional disease surveillance.

    • The Diabetes in Children, Adolescents and Young Adults (DiCAYA) Network’s large and diverse population will facilitate the estimation of diabetes prevalence and incidence by diabetes subtype for key demographic subgroups of youth and young adults in a large set of regions across the USA.

    • The diversity of clinical centres in the DiCAYA Network allows for the development and dissemination of surveillance methodology that is generalisable to a variety of settings with access to electronic health record data.

    • Because electronic health record data are limited to individuals affiliated with the reporting health systems and these populations may differ from the general population, the DiCAYA Network is testing bias adjustment and denominator selection approaches.

    • A limitation of using electronic health record data for surveillance is that the data were collected for a different purpose (ie, clinical care and billing) and thus lack the rigour and standardisation of traditional research data.

    Introduction

    More than 529 million people worldwide and 37 million people in the USA have diabetes.1–4 Among children and adolescents in the USA, diabetes is now the third most common chronic disease.5 Although prevalence and incidence have recently stabilised in the adult population, diabetes continues to increase among youth, with patterns varying by race and ethnicity.6–9 People with early-onset diabetes face higher risk of chronic kidney disease, myocardial infarction and stroke at younger ages than those who develop diabetes later in life.10–13 Surveillance of diabetes is thus a critical function for public health authorities, in understanding the changing epidemiology of diabetes, guiding prevention strategies, allocating resources to at-risk communities and informing health policies for different age groups.

    Achieving timely and valid estimates of diabetes is challenging. National estimates of diabetes in adults in the USA are based on surveys, but survey-based methods have been challenged by declining response rates and growing concerns regarding non-response bias.14 These methods can also be limited in their ability to identify diabetes type and to produce reliable estimates in children and adolescents.15 For example, the prevalence of diabetes by type using the National Health and Nutrition Examination Survey data is based on age of diagnosis and insulin use, an approach that is susceptible to type misclassification as patterns and treatment of type 1 and type 2 diabetes (T1D and T2D) change.15 Therefore, data on trends in incidence and prevalence of diabetes by type among young adults are poorly understood. Among youth, survey-based approaches also generate less accurate estimates given the lower disease burden in this age group.15

    To address the limitations of traditional disease surveillance approaches, the Centers for Disease Control and Prevention (CDC) developed specialised surveillance efforts, including the SEARCH for Diabetes in Youth (SEARCH) study in 2000 and the Diabetes in Young Adults study in 2017 to establish diabetes registries using active case-finding surveillance efforts from networks of health systems.16 17 These initiatives have provided critical findings on the epidemiology of diabetes by type in children and young adults. SEARCH teams also piloted and validated new methods for improving the timeliness, efficiency and sustainability of surveillance of youth-onset diabetes using electronic health records (EHRs).18 Other federally funded consortiums, including the Surveillance, Prevention and ManagEment of Diabetes Mellitus Study19 and the Veterans Affairs Diabetes Epidemiology Cohort,20 have developed EHR-based approaches for identifying adults with diabetes, though the methods did not differentiate by diabetes subtype. Findings of these and other studies21 suggested that EHR-based surveillance had promise, but further refinement of methods across broader geographical areas was needed.

    In 2020, CDC and the National Institute of Diabetes and Digestive and Kidney Diseases jointly funded the Diabetes in Children, Adolescents and Young Adults (DiCAYA) Network through 2025. The DiCAYA Network aims to advance the efficiency, flexibility, sustainability and transportability of diabetes surveillance efforts in youth and young adults through the use of large-volume EHR data. The DiCAYA Network was established through a competitive, open request for proposals process involving scientific review from CDC. Geographically diverse sites around the USA were selected to work together as a network to jointly develop and evaluate innovative approaches to surveillance of diabetes in the target populations. The premise behind DiCAYA was that EHR-based surveillance holds promise for being relatively low cost, as no additional efforts for prospective data collection are required. Importantly, EHR systems can provide timely surveillance data, as data are collected in real time as people interact with the healthcare system, and case identification can be automated. EHR systems also offer large population sizes that overcome the sample size challenges of monitoring relatively rare diseases. The DiCAYA Network will conduct network-wide diabetes surveillance and test bias-adjustment methods, with the goal of informing future EHR-based surveillance strategies at the national level. The network will disseminate EHR-based surveillance methodology that can be broadly adopted and will report diabetes prevalence and incidence in youth and young adults by subtype, race/ethnicity, sex and age. This paper describes the DiCAYA Network, its structure and the methods that will be used to conduct EHR-based diabetes surveillance in youth and young adults in the USA. The protocol represents the work of multiple public health researchers and practitioners, all of whom aim to collectively advance surveillance of diabetes using EHR systems, applying methods that can be replicated by other institutions.

    Methods/design

    Network overview

    The DiCAYA Network is spread across eight US-based centres and a Coordinating Centre (CoC), with three centres conducting surveillance in youth aged 0–17 years only (component A), three centres conducting surveillance in young adults aged 18–44 years only (component B) and two centres conducting surveillance in both youth and young adults (components A and B) (figure 1). Component A sites include OneFlorida+ (OFL), PEDSnet and University of South Carolina (UofSC). Component B sites include Geisinger, Indiana University along with the Regenstrief Institute (IU/Regenstrief) and Kaiser Permanente Southern California (KPSC). Two centres are both component A and B sites, Lurie Children’s Hospital (Lurie Children’s) and University of Colorado Denver (CO). The CoC is housed at New York University (NYU) Langone Health, with researchers at NYU Long Island School of Medicine and NYU Grossman School of Medicine. The Network is composed of three types of centres—geographical-based centres (CO and UofSC), membership-based centres (KPSC) and health system centres (PEDSnet, Geisinger, IU/Regenstrief, Lurie Children’s and OFL). Geographical-based centres represent well-delineated geographical and administrative areas and are designed to cover the entire states of Colorado and South Carolina. The membership-based centre, KPSC, is an integrated healthcare delivery system that combines health coverage and care delivery. Members of the health plan prepay for and access all aspects of healthcare from the same system, while health system centres represent healthcare delivery systems that deliver care to patients, with a range of payers (including the uninsured), who may not receive all aspects of their care from a single healthcare delivery system. Membership-based and health system centres access data from their given EHR data repository or from existing National Patient-Centered Outcomes Research Network (PCORnet) clinical research networks (CRNs). 
Geographical-based centres receive and integrate independent EHR data streams from all major health systems, with augmentation of records from medical claims data within their respective states (see table 1). Collectively, these centres cover approximately 36 million patients, although the exact patient numbers will only be available when overlap in patient population across centres is determined.

    Figure 1

    Map of counties included in the DiCAYA network by clinical centre. DiCAYA, Diabetes in Children, Adolescents and Young Adults.

    Table 1

    Centre descriptions

    The DiCAYA Network has two primary aims: (1) to refine and validate EHR-based computable phenotype algorithms for accurate identification of incident and prevalent T1D and T2D among two age groups, youth (<18 years of age) and young adults (18 to <45 years of age), according to age, sex, race/ethnicity and geography and (2) to estimate the incidence and prevalence of T1D and T2D among youth and young adults and trends therein between 2018 and 2024, according to age, sex, race/ethnicity and geography. The protocol for achieving aims 1 and 2 is described in detail below. Ultimately, the network aims to augment diabetes surveillance capacity in the USA and assess performance of EHR-based surveillance with respect to appropriate surveillance performance metrics (eg, simplicity, data quality, completeness, acceptability, accuracy, representativeness and timeliness).22

    Development of computable phenotype definitions

    Aim 1 encapsulates the foundational research needed for the DiCAYA Network to assess the validity of a set of computable phenotype definitions, derived from data that can be processed from EHR systems, to determine diabetes status and type among youth and young adults. The performance of the computable phenotype definitions will be assessed by measuring sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of the phenotypes against the gold standard of a sample of manually abstracted medical charts. Performance will be assessed for a number of computable phenotype definitions from the literature that can be implemented with the available EHR data, leading to refinement and ultimately the identification of one or more valid computable phenotypes. This process will be harmonised across centres, facilitated by a standardised REDCap (Research Electronic Data Capture) abstraction form and manual of procedures.

    Applying computable phenotype definitions proceeds through a sequence of steps, based on previously published methods18 (figure 2). All individuals with any indication in the EHR of possible diabetes in each centre’s source populations (termed ‘wide net’) are identified by applying the following criteria during the time window (the index surveillance year for state-based and membership-based centres, index surveillance year and prior 2 years for health system-based centres) in the respective age group (0–17, 18–44 years): (1) ≥1 haemoglobin A1c ≥6.5%; (2) ≥1 fasting glucose ≥126 mg/dL; (3) ≥1 random plasma glucose ≥200 mg/dL; (4) ≥1 diabetes-related diagnosis code from an inpatient or outpatient encounter (online supplemental file 1) or (5) ≥1 prescribed, administered or dispensed medication that is typically indicated for the treatment of diabetes (online supplemental file 2). The wide net was designed to have maximum sensitivity, to avoid missing any true diabetes cases.18
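
As a rough illustration of the wide net criteria listed above, the following Python sketch flags a patient who meets at least one of the five criteria. The record layout (`PatientRecord` and its field names) is hypothetical and is not the network's data model; the sketch also assumes the records have already been restricted to the relevant time window and age group, and that the diagnosis and medication flags were derived from the code lists in the supplemental files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EHR data in the time window."""
    hba1c_pct: List[float] = field(default_factory=list)          # haemoglobin A1c results, %
    fasting_glucose_mgdl: List[float] = field(default_factory=list)
    random_glucose_mgdl: List[float] = field(default_factory=list)
    has_diabetes_dx_code: bool = False      # >=1 diabetes-related diagnosis code
    has_diabetes_medication: bool = False   # >=1 diabetes medication record


def in_wide_net(rec: PatientRecord) -> bool:
    """Return True if the patient meets at least one wide net criterion."""
    return (
        any(v >= 6.5 for v in rec.hba1c_pct)                # criterion 1
        or any(v >= 126 for v in rec.fasting_glucose_mgdl)  # criterion 2
        or any(v >= 200 for v in rec.random_glucose_mgdl)   # criterion 3
        or rec.has_diabetes_dx_code                         # criterion 4
        or rec.has_diabetes_medication                      # criterion 5
    )
```

Because the criteria are combined with a logical OR, a single qualifying laboratory value, code or medication is enough for inclusion, which is what makes the wide net highly sensitive.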

    Figure 2

    Computable phenotype (CP) for diabetes (DM) flow chart.

    Next, the primary computable phenotype for presumed diabetes will be applied to the wide net population. The computable phenotype is defined as those with at least one diabetes diagnosis code (International Classification of Diseases (ICD)-10-CM: E08–E11, E13) (online supplemental file 1) within the given time windows, based on a method that has been previously used in a cohort of individuals with youth-onset diabetes.23 While an individual could have a code for gestational diabetes in the EHR, such a code alone would not be sufficient for classification as presumed diabetes. Consistent with prior literature, diabetes type (type 1, type 2, other) will be defined based on the proportion of diabetes type-specific diagnosis codes (type 1, type 2 or other) among total diabetes codes, using plurality to assign type.23 In ties, type 1 is given preference over type 2, and type 2 is given preference over other.
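
The plurality rule and tie-breaking order described above can be sketched as follows. This is a minimal illustration, not the network's implementation: it assumes codes are supplied as ICD-10-CM strings, groups them by their three-character category (E10 as type 1, E11 as type 2, and E08, E09 and E13 as other), and ignores any code outside E08–E11 and E13, including gestational diabetes codes. The function name `assign_diabetes_type` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

# ICD-10-CM category -> diabetes type used for counting
TYPE_BY_CATEGORY = {
    "E10": "type 1",
    "E11": "type 2",
    "E08": "other",
    "E09": "other",
    "E13": "other",
}

# Tie-breaking order: type 1 over type 2, type 2 over other
PRIORITY = ["type 1", "type 2", "other"]


def assign_diabetes_type(codes: Iterable[str]) -> Optional[str]:
    """Assign diabetes type by plurality of type-specific diagnosis codes.

    Returns None when no qualifying diabetes code is present, ie, the
    patient does not meet the presumed diabetes definition.
    """
    counts = Counter()
    for code in codes:
        dtype = TYPE_BY_CATEGORY.get(code.strip().upper()[:3])
        if dtype is not None:
            counts[dtype] += 1
    if not counts:
        return None
    top = max(counts.values())
    # Walk the priority list so that ties resolve in the stated order.
    for dtype in PRIORITY:
        if counts[dtype] == top:
            return dtype
    return None
```

For example, a patient with one E10 code and one E11 code would be assigned type 1 under the tie rule, whereas a patient with only a gestational diabetes code would not be classified as presumed diabetes.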

    To calculate diabetes incidence rates, computable phenotypes to distinguish newly diagnosed diabetes cases from existing cases will also be defined and validated. New diabetes cases will be defined as those who met the presumed diabetes case definition (ie, those who had at least one diagnosis code for diabetes) for the first time in the index surveillance year (ie, no prior diabetes diagnosis). We will assess whether also to require a record of an earlier healthcare encounter, from before the date of diabetes diagnosis, when determining incidence in the healthcare systems or geographical surveillance areas. We will make this assessment using sensitivity analyses, chart reviews and data available from the EHR (eg, each individual’s first-ever encounter date; each individual’s first diabetes diagnosis code date; and each individual’s last healthcare encounter date before the first diabetes diagnosis date).
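
A minimal sketch of the basic incident case rule described above, assuming each patient's diabetes diagnosis code dates are available as a list; the function name is hypothetical. It does not implement the optional requirement for an earlier healthcare encounter, which the protocol leaves to be assessed.

```python
from datetime import date
from typing import List


def is_incident_case(diagnosis_dates: List[date], index_year: int) -> bool:
    """True if the earliest recorded diabetes diagnosis code falls in the index year."""
    if not diagnosis_dates:
        return False  # no diabetes code at all: not a presumed case
    return min(diagnosis_dates).year == index_year


# Example: first code in 2020, so incident for 2020 but not for 2021.
example_dates = [date(2020, 3, 1), date(2021, 1, 5)]
incident_2020 = is_incident_case(example_dates, 2020)
incident_2021 = is_incident_case(example_dates, 2021)
```

A rule of this kind can misclassify prevalent cases as incident when a patient's earlier diagnoses were recorded outside the available data, which is the concern the proposed sensitivity analyses address.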

    All working computable phenotype definitions will be iteratively refined over the course of the project based on results of the validation and refinement study analyses. We will conduct a manual chart review on a subset of patients who meet the wide net criteria. Sensitivity, specificity, PPV and NPV for the working computable phenotypes will be calculated among the wide net patients for whom true diabetes status has been determined through the chart review. Given that the wide net includes individuals with any evidence of diabetes (ie, medication, laboratory measures and diagnoses), we will be able to compare multiple phenotype algorithms (including those that use medication or laboratory-based criteria) using wide net patients with completed chart reviews. Sensitivity will be calculated as the true positives (classified as having diabetes by chart review and by computable phenotype) divided by all those classified as having diabetes by manual review (true positives plus false negatives, ie, those classified as having diabetes by manual review but not by computable phenotype). Specificity will be calculated as the true negatives (classified as not having diabetes by chart review and by computable phenotype) divided by all those classified as not having diabetes by manual review (true negatives plus false positives, ie, those classified as having diabetes by computable phenotype but not by manual review). PPV will be calculated as the true positives divided by all those classified as having diabetes by computable phenotype (true positives plus false positives). Finally, NPV will be calculated as the true negatives divided by all those classified as not having diabetes by computable phenotype (true negatives plus false negatives). These performance measures will be calculated overall and by subgroup (age, sex and race/ethnicity).
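
The four performance measures defined above can be computed from paired classifications as in the following sketch. It assumes two equal-length lists of booleans, one from chart review (treated as truth) and one from the computable phenotype; the function name and the choice to return `None` for an empty denominator are illustrative assumptions.

```python
from typing import Dict, Optional, Sequence


def validity_metrics(
    chart_review: Sequence[bool], phenotype: Sequence[bool]
) -> Dict[str, Optional[float]]:
    """Sensitivity, specificity, PPV and NPV of a phenotype against chart review."""
    if len(chart_review) != len(phenotype):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(chart_review, phenotype))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)

    def ratio(num: int, den: int) -> Optional[float]:
        # None signals that the measure is undefined for this sample.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Subgroup estimates follow by applying the same function to the records within each age, sex or race/ethnicity stratum. Note that PPV and NPV computed on a chart review sample depend on how that sample was drawn (here, stratified with oversampling), so this unweighted sketch describes the sample rather than the full wide net population.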

    In preparation for the validation and refinement study, the network calculated the necessary sample sizes for the chart reviews using separate component A and component B detectable effect sizes for differences in sensitivity estimates across computable phenotypes, defining the overall detectable effect as the maximum of the two detectable effect sizes. The use of the wide net for validation reduces the number of charts that we would have to manually review to identify a sufficient number of true diabetes cases for measurement of sensitivity. In order to compare computable phenotypes from the literature, a correlation of 0.707 (R2=0.5) between phenotypes was assumed.24–26 To achieve an overall 80% power to detect small differences in sensitivity and specificity, the network will perform manual chart review on approximately 2600 wide net patients per component. This sample will be allocated across centres using a minimum of 400 individuals per centre, with the remaining sample proportional to the size of the centre’s wide net population, up to a maximum of 750 individuals. Samples will be selected using stratified random sampling, with oversampling by race/ethnicity to achieve 10% Asian and Pacific Islander and 20% black individuals in the total sample (table 2) to facilitate evaluation of validity by race. Validity will also be assessed by age, sex and ethnicity.
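
One way to read the allocation rule above is sketched below. This is an assumption-laden illustration rather than the network's procedure: each centre receives the 400-chart minimum, the remainder of the component's sample is shared in proportion to wide net size, and the result is capped at 750. The centre labels are hypothetical, and the sketch does not redistribute charts freed up when a centre hits the cap.

```python
from typing import Dict


def allocate_charts(
    wide_net_sizes: Dict[str, int],
    total: int = 2600,
    floor: int = 400,
    cap: int = 750,
) -> Dict[str, int]:
    """Allocate chart reviews across centres: floor, proportional remainder, cap."""
    remaining = total - floor * len(wide_net_sizes)
    if remaining < 0:
        raise ValueError("total is too small to give every centre the minimum")
    pool = sum(wide_net_sizes.values())
    return {
        centre: min(cap, floor + round(remaining * size / pool))
        for centre, size in wide_net_sizes.items()
    }


# Hypothetical wide net sizes for the five centres in one component.
example = allocate_charts({"A": 1000, "B": 1000, "C": 2000, "D": 3000, "E": 3000})
```

With five centres per component, the 400-chart minimum accounts for 2000 of the approximately 2600 charts, leaving roughly 600 to be distributed by wide net size.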

    Table 2

    Chart review sampling assignments for the computable phenotype refinement analysis

    To accomplish aim 2, estimation of incidence and prevalence of T1D and T2D among youth and young adults, the computable phenotypes developed in aim 1 will first be applied to each centre’s EHR sample to define the numerators. Details on the EHR data used at each centre are outlined in table 1. Next, denominators for prevalence and incidence estimates will be defined using different methods based on type of centre. For geographical-based centres that draw from all major health systems in their respective states, the state civilian, non-institutionalised population will serve as the denominator, as determined using the 2020 US census data and the CDC National Center for Health Statistics’ race-bridged post-census estimates of resident US population. Denominators for health system centres will be generated in two ways: (1) using utilisation data and (2) using US census population-based data. To capture the health system utilisation denominators, these centres will first define the number of unique patients with at least one health system encounter during a 3-year window that includes the index year and the two previous years, a time window selected based on national estimates of frequency of healthcare utilisation in the target age groups.27 Based on patients’ latest addresses at the start of the index year, health system centres will generate the coverage for each county represented in their utilisation data, defined as the number of unique patients divided by the appropriate population size for the county (including in subpopulations by sex, race/ethnicity and age). The network will evaluate different inclusion criteria for counties under surveillance, including coverage level and geographical contiguity/propinquity to the health system. Population-based denominators for the selected counties will be defined using an average of 3 years (index year and two prior years). 
Finally, for the membership-based centre, the number of members of the health plan as of 1 January of the index year, determined through administrative databases, will serve as the denominator.
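
For health system centres, the county coverage calculation described above reduces to a ratio per county. The sketch below assumes two dictionaries keyed by county identifier; the function names and the `min_coverage` threshold are hypothetical, since the protocol states that inclusion criteria for counties are still to be evaluated.

```python
from typing import Dict, List


def county_coverage(
    patients_by_county: Dict[str, int], population_by_county: Dict[str, int]
) -> Dict[str, float]:
    """Coverage = unique patients in the utilisation window / county population."""
    return {
        county: patients_by_county.get(county, 0) / population
        for county, population in population_by_county.items()
        if population > 0
    }


def select_counties(coverage: Dict[str, float], min_coverage: float) -> List[str]:
    """Counties meeting an illustrative coverage threshold (an assumption here)."""
    return sorted(c for c, value in coverage.items() if value >= min_coverage)
```

The same calculation can be repeated within sex, race/ethnicity and age strata by supplying stratum-specific patient counts and population sizes.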

    For all centres, relevant inclusion and exclusion criteria will be applied to ensure that the denominators represent the at-risk population defined in the numerator. Preliminary analyses compared the population living in the DiCAYA counties to the rest of the USA by age, sex, race, ethnicity and socioeconomic status (SES). Initial results show that distributions by age and sex are comparable, but non-Hispanic black and Hispanic people are over-represented, and non-Hispanic white people are under-represented (table 3).

    Table 3

    Comparison of population demographics within counties included in the DiCAYA Network versus all US counties

    Unadjusted prevalence and incidence rates overall and for demographic or geographical subgroups will be estimated as the ratio of the numerator divided by the appropriate denominator. The prevalence will be expressed as the number of diabetes cases per 1000 individuals in a defined period. Incidence rates will be expressed as the number of newly diagnosed cases in a calendar year per 100 000 individuals. Incidence estimates will be provided for calendar years 2018 through 2024, and prevalence estimates will be provided for select calendar years from 2018 to 2024. Skew-corrected inverted score tests for binomial distribution will be used to compare two rates and compute 95% CIs. The prevalence and incidence rates will be presented as unadjusted estimates and as race/ethnicity, sex and age-adjusted estimates using Poisson regression. Analyses will be run separately for components A and B.
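
The unadjusted estimates described above are simple ratios scaled to the reporting unit, as in the sketch below. The interval function is a plain Wilson score interval, included only to illustrate a score-based interval for a binomial proportion; it is not the skew-corrected inverted score method named in the protocol. All function names are hypothetical.

```python
import math
from typing import Tuple


def prevalence_per_1000(cases: int, denominator: int) -> float:
    """Diabetes cases per 1000 individuals."""
    return 1000 * cases / denominator


def incidence_per_100k(new_cases: int, denominator: int) -> float:
    """Newly diagnosed cases in a calendar year per 100 000 individuals."""
    return 100_000 * new_cases / denominator


def wilson_interval(cases: int, denominator: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (no skewness correction)."""
    p = cases / denominator
    z2 = z * z
    scale = 1 + z2 / denominator
    centre = (p + z2 / (2 * denominator)) / scale
    half = z * math.sqrt(p * (1 - p) / denominator + z2 / (4 * denominator**2)) / scale
    return max(0.0, centre - half), min(1.0, centre + half)
```

To report an interval on the same scale as the rate, the two bounds can be multiplied by 1000 or 100 000 as appropriate.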

    Bias correction and estimation

    A key limitation to using EHR data for population health surveillance is the potential for patient populations to be non-representative of the general target population of inference. For example, EHRs have greater coverage among women and children, and those who frequent health systems tend to be more ill than the general population.19 In the analysis of non-probability samples such as EHRs, two main methodological frameworks may be used to estimate population quantities.28–30 In the quasi-randomisation framework, pseudoinclusion probabilities are estimated based on covariates available for all population units and used to correct for selection bias. In contrast, in the superpopulation modelling approach, a statistical model is assumed for the outcome of interest in the non-probability sample and applied to the target population. Both quasi-randomisation and superpopulation modelling rely to varying degrees on auxiliary data from external surveys or administrative sources. Multilevel regression with poststratification (MLRP) is a variation on superpopulation modelling that has often been used in political science. Using MLRP with a highly non-representative survey sample from the Xbox gaming platform, Wang et al31 were able to predict the outcome of the 2012 US presidential election. Although MLRP may be conducted in a frequentist or in a Bayesian setting, the latter may be well suited to handle issues of data sparsity. The DiCAYA Network will implement MLRP to produce estimates of incidence and prevalence. In sensitivity analysis, we will explore existing Bayesian hierarchical models that have been used in survey samples32–34 and will compare them to common survey methods such as raking and poststratification35 that help correct for non-representativeness.

    As sensitivity analyses, the network will also apply additional bias-adjustment methods, including propensity score weighting, poststratification, empirical Bayesian hierarchical modelling and geospatial small-area estimation.32 36 Through the bias correction methods, the network will generate estimates of prevalence and incidence rates of diabetes (overall, regardless of type, and separately for type 1 and type 2) in youth and young adults by various demographic characteristics, including race/ethnicity, age and sex.
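
The poststratification step that underlies several of these methods can be illustrated with a short sketch. It assumes that an estimate (for example, prevalence) is already available for each demographic cell, whether from a multilevel model or from raw cell rates, and that the target population count for each cell is known from an auxiliary source such as the census. The multilevel regression itself is not shown, and the function name and cell labels are hypothetical.

```python
from typing import Dict, Hashable


def poststratify(
    cell_estimates: Dict[Hashable, float], cell_population: Dict[Hashable, int]
) -> float:
    """Population-weighted average of cell-level estimates."""
    total = sum(cell_population[cell] for cell in cell_estimates)
    if total <= 0:
        raise ValueError("target population must be positive")
    weighted = sum(
        cell_estimates[cell] * cell_population[cell] for cell in cell_estimates
    )
    return weighted / total


# Two illustrative cells: the target population is 60% cell "a" and 40% cell "b",
# regardless of how those groups are represented in the EHR sample.
overall = poststratify({"a": 0.002, "b": 0.004}, {"a": 600, "b": 400})
```

The weights come from the target population rather than the EHR sample, so groups that are over-represented among patients do not dominate the overall estimate.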

    Patient and public involvement

    Patients and the public were not involved in the development of this protocol.

    Ethics and dissemination

    The DiCAYA Network is well positioned to lead a critical advancement in surveillance of diabetes and diabetes types in youth and young adults. While validation studies are needed, EHR-based surveillance systems offer an opportunity to mount more efficient systems than methods used for traditional disease surveillance. The use of existing data and potentially automated case ascertainment (ie, computable phenotypes) make EHR-based surveillance timely and flexible, critical features of a public health surveillance system.21 DiCAYA’s large and diverse population will facilitate the estimation of diabetes prevalence and incidence by type for key demographic subgroups of youth and young adults in a large set of geographical regions across the USA. Through dissemination of the methods and results, we will inform future strategies for conducting nationwide EHR-based surveillance of diabetes.

    Each DiCAYA centre and the CoC received approval from their local institutional review boards for this protocol. To facilitate network-wide analyses at the CoC, each centre executed a data use agreement with the CoC that permits the sharing of EHR data elements with the CoC (online supplemental file 3). Data transfers between centres and the CoC are conducted via a secure file transfer protocol. The CoC manages these data centrally on a secure central platform. Centres have access to their own individual-level data and aggregate data from the network.

    There are a number of potential limitations to using EHR data for population health surveillance. First, EHR data are limited to individuals affiliated with the reporting health systems, and these populations may differ from the general population for a variety of reasons, including services received, health insurance coverage and health insurance types.37 EHRs include data on the subset of the population that seeks care, potentially biasing EHRs toward greater coverage of women, children and individuals who are more ill.37 Moreover, the fragmentation of healthcare in the USA implies that not all health conditions of a given individual are reflected in the EHR under study. DiCAYA’s geographical-based centres are likely less vulnerable to this limitation, given their state-wide coverage and use of multiple data sources. The membership-based centre is also less vulnerable, given that all aspects of each member’s healthcare and services are captured in the EHR, and members have a unique medical record number that does not change if members leave and rejoin the health plan. DiCAYA will deploy MLRP and Bayesian hierarchical modelling28 33 to minimise some of these biases. For a target population of interest, MLRP can combine data in EHRs with rich information from auxiliary sources like the census. MLRP may be especially useful when it is reasonable to hypothesise that selection into the EHR sample is not associated with missed outcomes from excluded individuals after accounting for observed information (ie, missing at random29). However, the performance of MLRP may be sensitive to issues of model misspecification. A Bayesian framework may be conducive to the more complex scenario when the selection process is missing not at random29 and the underlying health status is plausibly associated. 
Second, while the use of computable phenotypes based on discrete and easily extractable EHR data (eg, diagnosis codes, laboratory values and medications) is essential for large-scale and efficient surveillance, this approach does not leverage potentially informative free-text clinician notes. Therefore, in developing our computable phenotypes, we will assess their performance as compared with manual chart review, and some sites will explore the use of natural language processing of free-text data from the EHR to automate identification of key variables for diagnosing diabetes cases. Third, our proposed primary computable phenotype may result in some misclassification of disease, particularly among individuals who were initially misdiagnosed (eg, diagnosed as T2D before T1D was ultimately diagnosed or vice versa). Sensitivity of computable phenotypes using diagnoses alone (ie, not laboratory or medication evidence) may also be limited. Prior work has demonstrated, for example, that including laboratory results increases the sensitivity of diabetes computable phenotypes.38 We will examine the extent of misclassification as well as the sensitivity, specificity, PPV and NPV of all working computable phenotype definitions, and we anticipate that the computable phenotypes will be iteratively refined over time based on the results of the validation analyses. Lastly, an inherent limitation of using EHR data is that the data were collected for a different purpose (ie, clinical care, billing and operations) and thus lack the rigour and standardisation of traditional research data.
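
The validation measures named above can be computed from a two-by-two comparison of computable phenotype results against chart review. The following is a minimal sketch with made-up counts. It treats manual chart review as the reference standard, and the function names and data layout are illustrative assumptions rather than the network's validation code.

```python
def confusion_counts(phenotype_flags, chart_review_flags):
    """Tally agreement between the computable phenotype and chart review.

    Both inputs are equal-length sequences of booleans, one entry per patient;
    chart review is treated as the reference standard (an assumption here).
    """
    tp = fp = fn = tn = 0
    for predicted, actual in zip(phenotype_flags, chart_review_flags):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def validation_metrics(tp, fp, fn, tn):
    return {
        "sensitivity": ratio(tp, tp + fn),  # true cases the phenotype finds
        "specificity": ratio(tn, tn + fp),  # non-cases the phenotype rules out
        "ppv": ratio(tp, tp + fp),          # flagged patients who are true cases
        "npv": ratio(tn, tn + fn),          # unflagged patients who are non-cases
    }


# Hypothetical validation sample of 1000 reviewed charts.
phenotype = [True] * 90 + [True] * 10 + [False] * 5 + [False] * 895
chart_review = [True] * 90 + [False] * 10 + [True] * 5 + [False] * 895

counts = confusion_counts(phenotype, chart_review)
metrics = validation_metrics(*counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

PPV and NPV depend on the prevalence of diabetes in the validation sample, so values from an enriched chart-review sample would not carry over directly to the full surveillance population.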

    There are also some limitations inherent to the DiCAYA Network, as constructed. First, the DiCAYA Network is designed to provide prevalence and incidence estimates on populations in care who have been screened for, treated for and/or diagnosed with diabetes, potentially representing populations with higher SES than the general population.39 This could result in prevalence estimates that are lower than in the general population, given the higher prevalence of diabetes40 among individuals with lower SES. Second, while DiCAYA has more geographical coverage than prior specialised surveillance or pilot efforts, there remains limited coverage in the Northeast, Northwest, North Central and Midwest regions of the USA. Conversely, in some parts of the USA, DiCAYA centres serve overlapping geographies, potentially leading to overlapping patient populations. For centres with overlapping geographies, we will conduct sensitivity analyses to determine the impact of removing data from overlapping centres. In addition, while SEARCH provided important insights into the great burden of diabetes among American Indian youth,41 DiCAYA will be limited in its ability to conduct surveillance in the American Indian population. Finally, completeness of case ascertainment is a key characteristic of surveillance that can be assessed when there is a second, independent source of cases.42 In the DiCAYA Network, only the state-based sites in South Carolina (component A) and Colorado (components A/B) have the data sources required to assess completeness for a select number of years, capitalising on their prior SEARCH infrastructure and within-state network design to allow for complete coverage of complementary healthcare utilisation across hospital systems within the state over time.
A subset of DiCAYA’s health system centres gather data from multiple health systems through participation in a health information exchange network or CRNs (eg, INSIGHT CRN), which may facilitate more complete case ascertainment than a single health centre serving part of the population in a geographical area.43
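
One common way to assess completeness with a second, independent source of cases is two-source capture-recapture. The protocol text does not state which estimator the sites will use, so the sketch below is an assumption: it applies Chapman's estimator to hypothetical case identifiers, presumes that cases can be linked across the two sources, and presumes that the sources are independent and cover the same closed population.

```python
def chapman_estimate(n_a, n_b, n_both):
    """Chapman's two-source capture-recapture estimate of the true case count.

    n_a, n_b: cases found by each source; n_both: cases found by both.
    """
    return (n_a + 1) * (n_b + 1) / (n_both + 1) - 1


def ascertainment_completeness(source_a, source_b):
    """Share of the estimated true cases found by at least one source."""
    matched = source_a & source_b
    observed = source_a | source_b
    estimated_total = chapman_estimate(len(source_a), len(source_b), len(matched))
    return len(observed) / estimated_total


# Hypothetical linked case identifiers from two sources:
# 199 cases in source A, 149 in source B, 119 found by both.
ehr_cases = set(range(0, 199))
second_source_cases = set(range(80, 229))

estimated_total = chapman_estimate(
    len(ehr_cases),
    len(second_source_cases),
    len(ehr_cases & second_source_cases),
)
completeness = ascertainment_completeness(ehr_cases, second_source_cases)
print("estimated true cases:", estimated_total)
print("completeness:", round(completeness, 3))
```

If the two sources tend to capture the same patients, the estimated total is too low and completeness is overstated, so the independence assumption deserves scrutiny in any real application.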

    Despite these limitations, the DiCAYA Network offers several important strengths. The diversity of clinical centres that comprise the DiCAYA Network allows for the development of surveillance methodology that is generalisable to a variety of settings with access to EHR data. As of 2019, about three-quarters of office-based physicians and nearly all non-federal acute care hospitals in the USA had adopted a certified EHR system.44 The WHO reported that 47% of countries had national EHR systems, as of 2016.45 By developing and validating computable phenotypes across a range of settings in distinct geographies, DiCAYA will publish automated approaches to case ascertainment of T1D and T2D that can be replicated broadly, in the USA and abroad. These EHR-based approaches can be adapted by other healthcare systems and applied to common data models46 (eg, PCORnet,47 Observational Medical Outcomes Partnership48), which would facilitate rapid dissemination of the DiCAYA surveillance protocol. Similarly, the methods for identifying surveillance denominators from patient populations will be applicable to a range of healthcare delivery settings and chronic disease outcomes.

    In addition to delivering surveillance estimates and advancing surveillance methodology, the work of the DiCAYA Network will identify large cohorts of youth and young adults with incident and prevalent diagnosed T1D and T2D on whom the network has access to longitudinal EHR data. The deidentified data, code and other materials used in this study are available for use from the DiCAYA Network for ancillary studies or in collaboration with DiCAYA Network sites, pending review and approval by the network’s Publications and Presentations Committee. The data available on these cohorts can inform future research on risk factors for diabetes onset and complications in these age groups. Importantly, these demographically and geographically diverse cohorts can be used to conduct future research on racial, ethnic and geographical disparities of diabetes in youth and young adults.

    Ethics statements

    Patient consent for publication

    Acknowledgments

    The research reported in this publication was conducted using data from the DiCAYA (Assessing the Burden of Diabetes by Type in Children, Adolescents, and Young Adults) Network. DiCAYA is funded by the Centers for Disease Control and Prevention (CDC) (DP20-001) and the National Institute of Diabetes and Digestive and Kidney Diseases to modernize diabetes surveillance efforts using electronic health record data and advanced statistical analysis. This study includes data from the following institutions: University of South Carolina, University of Colorado Denver, Children’s Hospital of Philadelphia, University of Florida, Lurie Children’s Hospital, Kaiser Permanente Southern California, Geisinger, and Indiana University-Purdue University Indianapolis. The research reported in this study was also conducted using PEDSnet and the OneFlorida+ Clinical Research Network and several other clinical research networks in the project led by Lurie Children’s Hospital. PEDSnet, A Pediatric Learning Health System, includes data from the following PEDSnet institutions: Children’s Hospital Colorado, Children’s Hospital of Philadelphia, Cincinnati Children’s Hospital Medical Center, Lurie Children’s Hospital, Nationwide Children’s Hospital, Nemours Children’s Health, and Seattle Children’s Hospital. PEDSnet is a Partner Network Clinical Data Research Network in PCORnet, the National Patient-Centered Clinical Research Network, an initiative funded by the Patient-Centered Outcomes Research Institute (PCORI). OneFlorida+ is a collaboration among researchers, clinicians and patients in Florida, Georgia, and Alabama to create an enduring infrastructure for a wide range of health research, including pragmatic clinical trials, comparative effectiveness research, implementation science studies, observational research, and cohort discovery. 
Network partners include the University of Florida, Florida State University, the University of Miami, the University of South Florida, Emory University in Atlanta, and the University of Alabama at Birmingham, along with the six universities’ affiliated health systems and practices. Other partners include AdventHealth (Orlando), Tallahassee Memorial HealthCare, Tampa General Hospital, Bond Community Health (Tallahassee), Community Health IT (Kennedy Space Center), Nicklaus Children’s Hospital (Miami), Capital Health Plan (Tallahassee), Bendcare (Boca Raton, Florida) and the Florida Agency for Health Care Administration, which oversees the Florida Medicaid Program. OneFlorida+ is also a network partner of PCORnet. The Lurie Children’s Hospital DiCAYA project is composed of health care institutions from multiple clinical research networks: The INSIGHT Clinical Research Network (CRN) is a collaborative initiative which integrates clinical and social determinants of health data from over 15 million patients within New York City’s leading health systems. As a member of PCORI’s PCORnet, INSIGHT operates as one of the largest and most diverse CRNs. The INSIGHT network infrastructure is built to convene and meaningfully engage with patients, caregivers, families, researchers, health system leaders, clinicians, and funders to ensure a universal focus on patient-centered research and health equity. This work is supported in part by the PCORI PCORnet grant to the INSIGHT Clinical Research Network (grant # RI-CORNELL-01-MC). The approach described in this manuscript was developed in partnership with Research Action for Health Network (REACHnet), funded by PCORI (PCORI Award RI-LPHI-01-MC). REACHnet is a partner network in PCORnet, which was developed with funding from PCORI. The Accelerating Data Value Across a National Community Health Center Network (ADVANCE) Clinical Research Network is a member of PCORnet. 
ADVANCE is a multicenter collaborative led by OCHIN (not an acronym) in partnership with Fenway Health, Health Choice Network, and Oregon Health & Science University. The Chicago Area Patient-Centered Outcomes Research Network (CAPriCORN) is a partnership between healthcare and research institutions, patients, patient advocates, clinicians, community-based organisations (CBOs), and non-profits committed to enabling and delivering patient-centered clinical research. A Patient Community Advisory Committee (PCAC) has worked with CAPriCORN since its inception to elevate the patient voice in research. CAPriCORN’s mission is to develop, test, and implement clinical research in order to improve health care quality, health outcomes, and health equity for the diverse populations of Chicagoland and the surrounding states. Johns Hopkins is a member site of the PaTH CRN. PaTH is a Partner Network in PCORnet, the National Patient-Centered Clinical Research Network. PCORnet has been developed with funding from the Patient-Centered Outcomes Research Institute (PCORI). PaTH’s participation in PCORnet was funded through PCORI Award (RI-PITT-01-PS1).

    References

    Supplementary materials

    Footnotes


    • Collaborators The DiCAYA Study Group includes: Children’s Hospital of Philadelphia/PEDSnet: Charles Bailey, MD, PhD (MPI); Christopher Forrest, MD, PhD (MPI); Levon Utidjian, MD; Mitch Maltenfort, PhD; Amy Shah, MD (for CCHMC/PEDSnet); Eneida A. Mendonca, MD, PhD (for CCHMC/PEDSnet); G. Todd Alonso, MD (for Colorado/PEDSnet); Sara Deakyne-Davies, MPH (for Colorado/PEDSnet); Tim Bunnell, PhD (for Nemours/PEDSnet); Anne Kazak, PhD (for Nemours/PEDSnet); Melody Kitzmiller, BS (for NCH/PEDSnet); Manmohan Kamboj, MD (for NCH/PEDSnet); Dimitri Christakis, MD, MPH (for SCH/PEDSnet); Daksha Ranade, MPH, MBA (for SCH/PEDSnet). Geisinger: Annemarie G. Hirsch, PhD, MPH (PI); Joseph J. DeWalle, BS; H. Lester Kirchner, PhD; Meredith Lewis, MS; Dione G. Mercer, BS, BA; Cara M. Nordberg, MPH; Amy Poissant, BS; Brian S. Schwartz, MD, MS. Indiana University/Regenstrief: Brian E. Dixon, PhD, MPA (PI); Shaun Grannis, MD, MS; Seho Park, MA, PhD, MS; Katie Allen (for Regenstrief Institute); Anna Roberts, MS (for Regenstrief Institute); Nimish Valvi, PhD, MPH (for Regenstrief Institute); Jeff Warvel (for Regenstrief Institute); Ashley Wiensch, MPH (for Regenstrief Institute); Tamara Hannon, MD (for Indiana University). Kaiser Permanente of Southern California: Kristi Reynolds, PhD, MPH (PI); John Chang, MPH; Eva Lustigova, MPH; Don McCarthy, MA; Matthew T. Mefford, PhD; Rong Wei, MA; Hui Zhou, PhD. Lurie Children’s Hospital: Marc Rosenman, MD (PI); Lu Zhang, PhD; George Lales, MS; Anthony Wong, PhD; Allison Zelinski, MS (for Children’s Hospital of Philadelphia); Yuan Luo, PhD (for Northwestern University); Mark Weiner, MD (for Weill Cornell); Pedro Rivera, MS (for OCHIN); Thomas Carton, PhD, MS (for Louisiana Public Health Institute); Elizabeth Nauman, MPH, PhD (for Louisiana Public Health Institute); Harold P. Lehmann, MD, PhD (for Johns Hopkins University); Victor W. Zhong, PhD (for Shanghai Jiao Tong University). NYU Langone Health: Jasmin Divers, MS, PhD (MPI); Lorna E. Thorpe, MPH, PhD (MPI); Meredith Akerman, MS; Rebecca Anthopolos, DrPH; Stefanie Bendik, MPH; Sarah Conderino, MPH; Andrew Fair, ScM, MS; Jessica Guillaume, MPH; Shahidul Islam, DrPH, MPH; Alan Jacobson, MD; David C. Lee, MD; Chinyere Okpara, MS; Anand Rajan, MPH; Andrea Titus, PhD. University of Colorado Denver: Tessa Crume, PhD, MPH (MPI); Dana Dabelea, MD, PhD (MPI); Theresa Anderson, MS; Anna Bellatorre, PhD, MA; Rebecca Conway, PhD, MPH; Toan Ong, PhD; Jack Pattee, PhD; Shawna Burgett, PhD; Bethelhem (Betty) Shiferaw, MPH. University of Florida: Jiang Bian, PhD (MPI); Yi Guo, PhD (MPI); Hui Shao, MD, PhD (MPI); Elizabeth A. Shenkman, PhD (MPI); Sarah J. Bost, MSLS; William T. Donahoo, MD; William R. Hogan, MD, PhD; Piaopiao Li; Tianchen Lyu, MS; Mattia Prosperi, PhD; Yonghui Wu, PhD. University of South Carolina: Angela D. Liese, PhD, MPH (MPI); Bo Cai, PhD, MSc; Lisa Knight, MD, MBA; Caroline Rudisill, MSc, PhD; Jessica Stucker, MSW; Deborah Bowlby, MD (for Medical University of South Carolina); Jihad S. Obeid, MD (for Medical University of South Carolina); Elaine Apperson, MD (for Prisma Health); Alex Ewing, PhD, MPH (for Prisma Health). Centers for Disease Control and Prevention, Division of Diabetes Translation: Giuseppina Imperatore, Meda Pavkov, Deborah B. Rolka.

    • Contributors All members of The DiCAYA Study Group contributed to the conception and design of this protocol; AGH, SC, SB and LET drafted the manuscript; AGH, SC, TLC, ADL, AB, SB, JD, RA, BED, YG, GI, DCL, KR, MR, HS, LU and LET substantively revised the manuscript. AGH and SC contributed equally and are co-first authors.

    • Funding This work was supported by the Centers for Disease Control and Prevention and the National Institute of Diabetes and Digestive and Kidney Diseases. U18DP006521 Children’s Hospital of Philadelphia, U18DP006512 University of Florida, U18DP006509 Geisinger, U18DP006500 Indiana University–Purdue University at Indianapolis, U18DP006513 University of South Carolina, U18DP006506 Kaiser Foundation Hospitals, U18DP006693 Lurie Children’s, U18DP006694 Lurie Children’s, U18DP006517 University of Colorado Component-A, U18DP006518 University of Colorado Component-B, U18DP006510 NYU Long Island School of Medicine.

    • Disclaimer The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. The content of this publication is solely the responsibility of the authors and does not necessarily represent the views of PCORI or of other organisations participating in, collaborating with, or funding REACHnet or PCORnet.

    • Map disclaimer The inclusion of any map (including the depiction of any boundaries therein), or of any geographic or locational reference, does not imply the expression of any opinion whatsoever on the part of BMJ concerning the legal status of any country, territory, jurisdiction or area or of its authorities. Any such expression remains solely that of the relevant source and is not endorsed by BMJ. Maps are provided without any warranty of any kind, either express or implied.

    • Competing interests None declared.

    • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

    • Provenance and peer review Not commissioned; externally peer reviewed.

    • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.