
Improving the geographical precision of rural chronic disease surveillance by using emergency claims data: a cross-sectional comparison of survey versus claims data in Sullivan County, New York
  1. David C Lee1,2,
  2. Justin M Feldman2,
  3. Marcela Osorio1,
  4. Christian A Koziatek1,
  5. Michael V Nguyen1,
  6. Ashwini Nagappan1,
  7. Christopher J Shim3,
  8. Andrew J Vinson1,
  9. Lorna E Thorpe2,
  10. Nancy A McGraw4
  1. 1 Ronald O. Perelman Department of Emergency Medicine, New York University School of Medicine, New York City, New York, USA
  2. 2 Department of Population Health, New York University School of Medicine, New York City, New York, USA
  3. 3 California Northstate University College of Medicine, Elk Grove, California, USA
  4. 4 Sullivan County Public Health Services, Liberty, New York, USA
  1. Correspondence to Dr David C Lee; David.Lee{at}


Objectives Some of the most pressing health problems are found in rural America. However, the surveillance needed to track and prevent disease in these regions is lacking. Our objective was to perform a comprehensive health survey of a single rural county to assess the validity of using emergency claims data to estimate rural disease prevalence at a sub-county level.

Design We performed a cross-sectional study of chronic disease prevalence estimates using emergency department (ED) claims data versus mailed health surveys designed to capture a substantial proportion of residents in New York’s rural Sullivan County.

Setting Sullivan County, a rural county ranked second-to-last for health outcomes in New York State.

Participants Adult residents of Sullivan County aged 25 years and older who responded to the health survey in 2017–2018 or had at least one ED visit in 2011–2015.

Outcome measures We compared age and gender-adjusted prevalence of hypertension, hyperlipidaemia, diabetes, cancer, asthma and chronic obstructive pulmonary disease/emphysema among nine sub-county areas.

Results Our county-wide mailed survey obtained 6675 completed responses for a response rate of 30.4%. This sample represented more than 12% of the estimated 53 020 adults in Sullivan County. Using emergency claims data, we identified 34 576 adults from Sullivan County who visited an ED at least once during 2011–2015. At a sub-county level, prevalence estimates from mailed surveys and emergency claims data correlated especially well for diabetes (r=0.90) and asthma (r=0.85). Other conditions were not well correlated (range: 0.23–0.46). Using emergency claims data, we created more geographically detailed maps of disease prevalence using geocoded addresses.

Conclusions For select conditions, emergency claims data may be useful for tracking disease prevalence in rural areas and providing more geographically detailed estimates. For rural regions lacking robust health surveillance, emergency claims data can inform how to geographically target efforts to prevent chronic disease.

  • epidemiology
  • public health
  • health services administration and management

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Strengths and limitations of this study

  • Validates the use of emergency claims data to perform geographically detailed surveillance in rural settings.

  • Provides a standard for estimating disease prevalence at a local level by performing a county-wide mailed survey.

  • Limited by the accuracy of diagnosis codes found in claims data and is more accurate for conditions likely to be captured during emergency visits.

  • Has the potential to improve rural health surveillance by using existing data to track the burden of chronic diseases.

Introduction


In New York State, Sullivan County has been ranked 61 out of 62 by the County Health Rankings based on the rates of premature death and quality of life (poor overall, physical, mental health and low birth weights) just behind Bronx County in New York City.1 Located just 2 hours northwest of New York City, Sullivan County is rural and more than 70% of its residents are white. Like many rural areas of America, Sullivan County has faced significant economic challenges, along with disparities in healthcare access.2 3 Although some of the most pressing health problems can be found in rural America, its public health institutions lack the timely data needed to provide geographically detailed chronic disease surveillance.4 5 Nationwide health surveys, such as the Behavioral Risk Factor Surveillance System (BRFSS), often have inadequate coverage of these rural regions, and efforts to use models to extrapolate estimates of disease prevalence have questionable validity.6 7

In recent years, there has been increasing interest in using alternative sources of data to track chronic disease prevalence.8–10 Approaches using claims data and electronic health records have emerged among the potential options.11 These data are collected routinely by state agencies and may provide a cost-effective, ready-to-analyse alternative to expensive and time-intensive traditional survey methods.12 For instance, one in five Americans report having visited an emergency department (ED) in the past year, which provides a 20% population sample with a single year of data.13 14 However, these approaches need to be validated before widespread dissemination because, unlike surveys, they are not random population samples and may therefore not be representative.

Several challenges make estimation of chronic disease prevalence in rural areas difficult. The Centers for Disease Control and Prevention (CDC) has started to use modelling approaches with Bayesian and spatial smoothing with BRFSS data to estimate county-level disease prevalence in rural areas.15 However, few traditional health surveys have been performed in these areas with sample sizes adequate for sub-county level area estimation.16 17 In addition, there are similarly sparse data on the sociodemographic composition of these rural regions (as evidenced by wide confidence intervals for sub-county estimates of race and ethnicity). Furthermore, ZIP codes, county borders and other geographical units are less likely to align in rural areas, limiting the possibility of attributing aggregated data to specific regions.18 Finally, certain traditional survey techniques used to refine estimates based on underlying demographic characteristics (eg, statistical weighting, adjustment or stratification), which are often performed with Census data, cannot be readily applied in rural areas, especially if the data are insufficient.19

The goal of this study was to perform a comprehensive health survey of a single rural county in the USA. We report the results of a geographically distributed health survey delivered by mail to households within Sullivan County, New York. We then compare the disease prevalence estimates obtained from these surveys with a novel method that uses emergency claims data to identify areas with a higher burden of chronic disease.14

Methods


Study design

We administered a brief health survey by mail throughout Sullivan County during Fall 2017 and Spring 2018 to a random sample of residential addresses. We used survey data to estimate the age and gender-standardised prevalence of several chronic diseases at a sub-county level. We also estimated disease prevalence using the comprehensive, all-payer New York Statewide Planning and Research Cooperative System (SPARCS) claims database. Our alternative measure was the proportion of ED patients with ≥1 diagnostic code for a given disease on ≥1 emergency visit during the period 2011–2015.14 In each method, residents with an address located at a nursing or correctional facility were excluded to estimate prevalence for the non-institutionalised population. This study was approved by NYU School of Medicine’s Institutional Review Board.

Mailed health surveys

To generate a sampling frame for our mailed health survey, we obtained point and parcel data for all mailing addresses in Sullivan County from the New York State GIS Clearinghouse ( This data source was selected because it contained property class and land use data. Addresses were filtered to include any residential listing not marked as seasonal housing. We also included commercial addresses listed as apartments. This list of mailing addresses was then refined using an address verification service ( to select valid, non-vacant mailable addresses.21 As a substantial proportion of residents do not receive delivered mail in Sullivan County, we also queried the address verification service to find all valid, mailable PO boxes in the county. The final sampling frame consisted of 39 084 households located across 56 ZIP codes within Sullivan County.

Given the sparse population in some areas of the rural county, less-populated ZIP codes were oversampled to maximise the geographical coverage of the survey over the entire county. To do so, we used a quota sampling strategy. We mailed surveys to a random sample of 750 households for each ZIP code in our sampling frame. In ZIP codes with fewer than 750 households, all households were mailed a survey. Each health survey consisted of questions that first confirmed residence within Sullivan County and age over 18 years, and then asked a brief selection of health and demographic questions derived from the BRFSS (see online supplementary file 1 for mailed health survey).22 For households with multiple residents, we asked that only one adult respond to the survey. We mailed a survey to 24 141 or 62% of the households in our sampling frame. Survey respondents were offered a $10 gift card for participation and a stamped return envelope was enclosed in the surveys. The Sullivan County Public Health Department also made local news outlets aware of the survey and fielded phone calls from local residents to confirm that the survey was legitimate.
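The sampling rule described above (up to 750 randomly selected households per ZIP code, with every household taken in smaller ZIP codes) can be illustrated with a minimal Python sketch. It is not the study's actual procedure: the function name `quota_sample`, the toy sampling frame and the fixed random seed are assumptions made for illustration. The sketch also returns the design weight (the inverse probability of selection) that the Statistical analysis section applies to survey responses.

```python
import random


def quota_sample(frame, quota=750, seed=0):
    """Draw up to `quota` households per ZIP code.

    frame: dict mapping a ZIP code to a list of household identifiers
           (a hypothetical stand-in for the verified address list).
    Returns a dict mapping ZIP code -> (sampled households, design weight).
    """
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    plan = {}
    for zip_code, households in frame.items():
        n_households = len(households)
        if n_households == 0:
            continue  # nothing to sample in an empty ZIP code
        if n_households <= quota:
            sampled = list(households)  # small ZIP code: mail every household
        else:
            sampled = rng.sample(households, quota)  # simple random sample
        # Design weight = 1 / P(selection) = households in frame / households sampled
        weight = n_households / len(sampled)
        plan[zip_code] = (sampled, weight)
    return plan


# Toy frame with made-up ZIP labels and household counts
frame = {
    "zipA": [f"A{i}" for i in range(3000)],
    "zipB": [f"B{i}" for i in range(200)],
}
plan = quota_sample(frame)
```

In this toy frame, households in the larger ZIP code each represent four households in the frame, whereas households in the smaller ZIP code represent only themselves, which is how oversampling of sparsely populated ZIP codes is later corrected.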


Emergency claims data

Using the SPARCS all-payer claims database, we identified all adult patients who had visited an ED located at a general acute care hospital in New York State between 2011 and 2015. We included all patients with a PO box or home address located within the borders of Sullivan County. Patients with more than one ED visit, either at the same hospital or at different hospitals, were counted as a single observation by collapsing multiple visits using unique identifiers from SPARCS. The result was a listing of unique Sullivan County residents who had accessed emergency care at least once during the 5-year period.

Study outcomes

Our primary outcome was the prevalence of chronic disease at a sub-county level as identified by our mailed health survey or estimated using emergency claims data. In our mailed survey, respondents were asked if they had ever been diagnosed with hypertension, hyperlipidaemia, cancer, diabetes, asthma, chronic obstructive pulmonary disease (COPD) or emphysema. In our analysis of emergency claims data, all available primary and secondary diagnosis codes across visits were scanned by individual for the presence of ≥1 diagnosis code during ≥1 ED visit for these same conditions. The codes from the International Classification of Diseases (ICD-9 and ICD-10) used were: hypertension (401–405 or I10–I16), hyperlipidaemia (272 or E78), diabetes (250 or E10–E11), cancer (140–239 or C00–C96), asthma (493 or J45), and COPD/emphysema (491–492 or J43–J44). Thus, prevalence was estimated as a proportion: the number of unique ED patients with each of the listed conditions divided by the total number of unique ED patients.
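As an illustration of the claims-based case definition above, the following Python sketch flags each condition from the listed ICD-9 and ICD-10 categories and computes prevalence as the share of unique ED patients with at least one matching code on at least one visit. The data layout (tuples of patient identifier, ICD version and code list), the function names and the example codes are assumptions for illustration and do not reflect the SPARCS file format or the authors' Stata code.

```python
# Code ranges as listed in the text, matched on the 3-character category.
ICD9_RANGES = {
    "hypertension": (401, 405),
    "hyperlipidaemia": (272, 272),
    "diabetes": (250, 250),
    "cancer": (140, 239),
    "asthma": (493, 493),
    "copd_emphysema": (491, 492),
}
ICD10_RANGES = {
    "hypertension": ("I10", "I16"),
    "hyperlipidaemia": ("E78", "E78"),
    "diabetes": ("E10", "E11"),
    "cancer": ("C00", "C96"),
    "asthma": ("J45", "J45"),
    "copd_emphysema": ("J43", "J44"),
}


def conditions_for_code(code, version):
    """Return the set of study conditions matched by one diagnosis code."""
    category = code.strip().upper()[:3]
    found = set()
    if version == 9:
        if not category.isdigit():
            return found  # ICD-9 E- and V-codes fall outside the listed ranges
        number = int(category)
        for condition, (low, high) in ICD9_RANGES.items():
            if low <= number <= high:
                found.add(condition)
    elif version == 10:
        for condition, (low, high) in ICD10_RANGES.items():
            if low <= category <= high:  # string comparison on the category
                found.add(condition)
    else:
        raise ValueError("version must be 9 or 10")
    return found


def ed_prevalence(visits):
    """visits: iterable of (patient_id, icd_version, [diagnosis codes])."""
    by_patient = {}
    for patient_id, version, codes in visits:
        flags = by_patient.setdefault(patient_id, set())  # collapse repeat visits
        for code in codes:
            flags |= conditions_for_code(code, version)
    if not by_patient:
        raise ValueError("no patients supplied")
    total = len(by_patient)
    return {
        condition: sum(condition in flags for flags in by_patient.values()) / total
        for condition in ICD9_RANGES
    }


# Made-up example: four unique patients, one of whom has two visits
visits = [
    ("p1", 9, ["250.00", "401.9"]),
    ("p1", 10, ["J45.909"]),
    ("p2", 10, ["S52.5"]),
    ("p3", 9, ["786.50"]),
    ("p4", 10, ["E11.9"]),
]
prevalence = ed_prevalence(visits)
```

The sketch requires the ICD version to be supplied with each visit because a code beginning with "E" means an external cause in ICD-9 but an endocrine or metabolic disease in ICD-10.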

Statistical analysis

To generate the sub-county areas in our analysis, we first grouped ZIP codes based on the Census-defined subdivisions (ie, town borders) within Sullivan County. ZIP codes were assigned to these subdivisions based on the largest area of overlap given that ZIP code boundaries do not exactly align with town borders.18 After grouping ZIP codes into these 15 subdivisions, we found that 10 of these subdivisions had fewer than 2000 households that received a mailed survey and were thus unlikely to obtain the minimum 500 survey responses, a benchmark set by the CDC for obtaining acceptably narrow CIs for prevalence estimation (online supplementary figure 1).19 Therefore, these less populated subdivisions were systematically merged with each other based on proximity and population density to form four sub-county areas with a sufficient number of sampled households. The result was nine sub-county areas made of five subdivisions with adequate sampling and four areas combining neighbouring subdivisions to attain adequate sampling (see online supplementary table 1 for more details of aggregating ZIP codes into subdivisions and then sub-county areas).


In aggregating prevalence estimates between ZIP codes to create the sub-county areas, we used two weighting approaches. For the mailed survey, we applied design weights (the inverse probability of selection from the sampling frame) to account for our oversampling of less-populated ZIP codes. For the emergency claims data, we weighted ZIP code prevalence estimates by the inverse of the proportion of adults captured in the claims data (the total number of unique ED patients divided by the Census estimate of adults aged 25 years and older for each ZIP code in Sullivan County) to account for known differences in ED usage based on proximity to the nearest hospital.23 Prevalence estimates using both methods were then standardised to the overall age and gender distributions in Sullivan County from the 5-year 2012–2016 American Community Survey (ACS).24 We then calculated Pearson correlation coefficients comparing the prevalence estimates obtained using the two methods at the sub-county level. By convention, the strength of correlation was graded as very strong (0.80–1.00), strong (0.60–0.79), moderate (0.40–0.59), weak (0.20–0.39) and very weak (0.00–0.19).
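The weighting and standardisation steps in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Stata code: the function names, stratum labels, reference shares and records are hypothetical, and the ED weight is read here as the Census adult count divided by the number of unique ED patients in a ZIP code.

```python
def ed_zip_weight(unique_ed_patients, census_adults):
    """Inverse of the share of a ZIP code's adults seen in ED claims (assumed reading)."""
    if unique_ed_patients <= 0:
        raise ValueError("ZIP code has no ED patients")
    return census_adults / unique_ed_patients


def standardised_prevalence(records, reference_shares):
    """Directly standardise prevalence to a reference age and gender distribution.

    records: iterable of (stratum, has_condition, weight), where weight is the
             survey design weight or the ED ZIP-code weight.
    reference_shares: dict mapping stratum -> share of the reference
             population (shares should sum to 1).
    """
    weighted_cases = {}
    weighted_total = {}
    for stratum, has_condition, weight in records:
        weighted_total[stratum] = weighted_total.get(stratum, 0.0) + weight
        if has_condition:
            weighted_cases[stratum] = weighted_cases.get(stratum, 0.0) + weight
    result = 0.0
    for stratum, share in reference_shares.items():
        if weighted_total.get(stratum, 0.0) == 0.0:
            raise ValueError(f"no observations in stratum {stratum!r}")
        stratum_prevalence = weighted_cases.get(stratum, 0.0) / weighted_total[stratum]
        result += share * stratum_prevalence  # weight by the reference share
    return result


# Hypothetical two-stratum example with made-up weights and shares
records = [
    ("women 65+", True, 1.0),
    ("women 65+", False, 1.0),
    ("men 25-44", True, 2.0),
    ("men 25-44", False, 2.0),
    ("men 25-44", False, 2.0),
    ("men 25-44", False, 2.0),
]
reference_shares = {"women 65+": 0.4, "men 25-44": 0.6}
adjusted = standardised_prevalence(records, reference_shares)
```

In this toy example the stratum prevalences are 0.50 and 0.25, so the standardised estimate is 0.4 × 0.50 + 0.6 × 0.25 = 0.35, compared with a weighted crude prevalence of 0.30. Standardisation of this kind is what allows an older survey sample and a younger ED sample to be compared on the same footing.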

Geographical analysis

We also performed geographically detailed surveillance using the larger sample of Sullivan County residents identified in emergency claims data. For the subset of patients with a geocodable home address, we calculated unadjusted disease prevalence among their 100 nearest neighbours identified in the population of unique ED patients. We then interpolated rasters from this point data using the inverse squared distance technique. Chronic disease prevalence maps were generated from these unadjusted prevalence estimations for diabetes, asthma and hypertension, with categories based on SDs from the mean. For comparison, these maps were also created based on the 200 nearest neighbours to assess the influence of changing this parameter.
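The two steps in this paragraph, a nearest-neighbour prevalence estimate at each patient location followed by inverse squared distance interpolation, can be sketched in plain Python. The study used ArcGIS for this work, so the code below is only an illustrative stand-in: the function names, planar coordinates and toy data are assumptions, the patient's own record is assumed to be excluded from their neighbour set (the text does not say), and the brute-force neighbour search would be slow for tens of thousands of patients.

```python
import math


def knn_prevalence(points, k=100):
    """points: list of (x, y, has_condition) in projected (planar) coordinates.

    Returns a list of (x, y, prevalence among the k nearest other patients).
    """
    if len(points) < 2:
        raise ValueError("need at least two patients")
    estimates = []
    for i, (xi, yi, _) in enumerate(points):
        # Distance from patient i to every other patient, nearest first
        neighbours = sorted(
            (math.hypot(xi - xj, yi - yj), flag)
            for j, (xj, yj, flag) in enumerate(points)
            if j != i
        )[:k]
        cases = sum(1 for _, flag in neighbours if flag)
        estimates.append((xi, yi, cases / len(neighbours)))
    return estimates


def idw(samples, x, y, power=2.0):
    """Inverse distance weighted value at (x, y); power=2 gives inverse squared distance."""
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exactly on a sample point: use its value directly
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Toy data: four patients on a line, prevalence from the 2 nearest neighbours
patients = [(0, 0, True), (1, 0, False), (2, 0, True), (10, 0, False)]
local = knn_prevalence(patients, k=2)
# Interpolate a raster cell value at an unsampled location
cell_value = idw(local, 5.0, 0.0)
```

Raising `k` (for example, from 100 to 200, as in the sensitivity comparison described above) averages over a wider neighbourhood, which smooths the map at the cost of blurring small clusters.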

Statistical analyses were performed using Stata V.14.2 (Statacorp, 2015). Geographical analysis and mapping were performed using ArcGIS Desktop V.10.5.1 (ESRI; Redlands, California, USA, 2017).

Results


Mailed survey responses

Of the 24 141 surveys that we mailed to addresses within Sullivan County, approximately 20% were returned to sender even after using an address verification service (online supplementary figure 2). Of the 7241 survey responses received, 216 were missing key demographic information or were otherwise incomplete, 248 were not residents of Sullivan County and 22 were located at a nursing or correctional facility. In addition, only 80 respondents were aged 18–24 years old, which we deemed too few for inclusion in the study. Therefore, we limited study results to adults aged 25 years and older. Using the AAPOR RR2 definition for mail surveys of unnamed persons, our response rate was 30.4%.25

Population characteristics

The county-wide mailed survey received valid responses for 6675 adults or 12.6% of the adult population 25 years and older in Sullivan County. Using 5 years of emergency claims data, we were able to identify 65.2% of the Census-estimated adult population 25 years and older in Sullivan County. In comparison with ACS 2012–2016 Census estimates, survey respondents were notably older (42.5% vs Census estimate of 23.9% aged 65 years and older). In comparison, the population of unique ED patients was slightly younger (39.2% vs Census estimate of 33.0% aged 25–44 years old). A higher proportion of survey respondents were women (60.7% vs Census estimate of 49.2%). Also, a higher proportion of survey respondents were non-Hispanic white (88.7% vs Census estimate of 73.0%). However, the sex and race/ethnicity distributions of the unique ED patient population were similar to Census estimates (table 1).

Table 1

Demographic comparisons among Census estimates and data sources

Prevalence estimates adjusted for age and gender

The county-wide prevalence estimate using emergency claims data was higher than that from the mailed survey for diabetes, but lower for asthma (table 2). The correlation by sub-county area was very strong for these two conditions at r=0.90 (95% CI: 0.60 to 0.98) and r=0.85 (95% CI: 0.44 to 0.97), respectively. For all other conditions, the county-wide prevalence estimates using emergency claims data were lower than those from the mailed survey. The sub-county correlations for these conditions were moderate for hypertension (r=0.46, CI: −0.30 to 0.86) and COPD/emphysema (r=0.42, CI: −0.34 to 0.85), and weak for cancer (r=0.39, CI: −0.37 to 0.84) and hyperlipidaemia (r=0.23, CI: −0.51 to 0.78). Graphs of these correlations, found in online supplementary figure 3, demonstrate the variability between prevalence estimates, especially for conditions with poor correlation by sub-county area. Maps of survey-based prevalence estimates for diabetes, asthma and hypertension in the sub-county areas analysed are displayed in figure 1.
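With only nine sub-county areas, the CIs around these correlations are necessarily wide. The Python sketch below shows one common way to obtain such intervals, the Fisher z-transformation, along with the grading scale given in the Statistical analysis section. The paper does not state how its CIs were computed, so the use of the Fisher method is an assumption, as are the function names; with r=0.90 and n=9 it yields roughly 0.59 to 0.98, close to the reported interval given that r is rounded.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r from n paired observations (Fisher z method)."""
    if n <= 3:
        raise ValueError("need more than three observations")
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def grade(r):
    """Strength of correlation using the cut-offs stated in the Methods."""
    size = abs(r)
    if size >= 0.80:
        return "very strong"
    if size >= 0.60:
        return "strong"
    if size >= 0.40:
        return "moderate"
    if size >= 0.20:
        return "weak"
    return "very weak"


# Reported diabetes correlation across the nine sub-county areas
low, high = fisher_ci(0.90, 9)
```

The interval for hypertension (r=0.46) computed the same way spans zero, which is why a "moderate" grade at this sample size should be read with caution.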

Table 2

Age and gender-adjusted county-level disease prevalence and correlation at a sub-county level

Figure 1

Sub-county estimates of adjusted disease prevalence based on mailed survey responses.

ED surveillance

Among the 34 567 unique patients identified from emergency claims data, 76% had a geocodable home address, 20% were PO box only and 4% were not geocodable but had a ZIP code located fully within Sullivan County. Using the 100 nearest neighbours among patients with a geocodable home address, we estimated unadjusted prevalence at the geocoded location of each patient and created interpolated rasters to provide more geographically detailed maps of diabetes, asthma and hypertension prevalence (figure 2). These maps were able to identify localised clusters of disease throughout the county with greater geographical detail.

Figure 2

Geographically detailed estimates of unadjusted disease prevalence based on emergency claims data.

Discussion


The intensity of health problems experienced by residents living in rural areas of the country underscores the need for improving our methods of health surveillance.2 3 Our study findings demonstrate a novel solution that uses emergency claims data to estimate chronic disease prevalence at a sub-county level. These estimates are important for identifying key hotspots of disease, which may reveal previously unexplored risk factors that increase disease burden in rural America and guide efforts to prevent chronic disease in specific geographical areas that experience the worst health outcomes.26 Current health surveillance techniques rely on traditional methods such as telephone-based surveys. Not only are these methods costly and time-intensive, but their response rates have also dropped dramatically over the past two decades, from around 36% to 9%, owing to shifts in phone use.27 28 The sample size of a large national health survey such as the BRFSS is inadequate for generating precise estimates of disease prevalence even at the county level for much of rural America, which is why the CDC has started to use alternative estimation methods to impute prevalence among rural counties.29

Recent efforts to provide greater geographical coverage have focused on approaches that use the data in adequately sampled areas and statistical models to extrapolate disease estimates for poorly sampled areas largely based on sociodemographic factors.16 But many of these techniques have not been validated, and in the few instances when they have been compared, these approaches do not always work as well as expected.6 7 Our mailed health survey found that adjusted diabetes prevalence in Sullivan County was 12.7%. This estimate is much higher than the CDC’s most recent estimate of 9.5% in 2015, which is based on a modelling approach. For a given area, these modelling approaches can be especially imprecise when used to estimate disease prevalence in areas with low response rates, which includes many rural regions.

Other efforts to advance health surveillance methods have experimented with the use of claims data and electronic health records to provide estimates of disease prevalence. A recent study demonstrated that emergency claims data could be used to estimate chronic disease prevalence in New York City, and this approach was validated with results obtained from an annually performed citywide health survey. In this urban study, it was found that conditions including diabetes, hypertension and asthma had correlations of 0.86, 0.88 and 0.77, respectively, when analysed among 34 sub-county areas.14

With our novel method of using emergency claims data to estimate chronic disease burden, we identified health records for a substantial majority of all adults in Sullivan County using 5 years of emergency claims data. Furthermore, the demographic patterns among this population of unique ED patients were much closer to Census estimates than our county-wide mailed survey. Under-representation of certain demographic groups, especially minorities, is a common problem of traditional survey methods that can be adjusted for, as long as geographically matched sociodemographic data exists.30 In rural areas where Census estimates for race and ethnicity often have wide CIs, emergency claims data may provide an alternative population sample that closely mirrors the underlying population in a given region.

For some conditions such as diabetes and asthma, we found very strong correlation between the two estimation methods for sub-county disease prevalence. For the other conditions studied, the strength of correlation was weaker. This may be attributable to disease-specific differences in the validity of both ED claims data and self-reported survey data. Prior research has shown that, for both data sources, validity is routinely higher for diabetes and asthma but lower for other conditions such as hyperlipidaemia, with low sensitivity (ie, under-reporting) being the reason for poor correlation.8–10 It should be noted that although some conditions such as COPD are a frequent primary diagnosis for a patient’s ED visit, COPD may not be frequently accounted for as one of the secondary diagnoses, which are included in this ED-based surveillance approach.

Emergency claims data are already widely collected around the country, can capture a large population sample and in some areas include address data that can be used to precisely identify where patients live. By geocoding these addresses, more precise health surveillance can provide detailed maps of disease burden. This granular level of geographical detail is important because localised hotspots of disease might otherwise be hidden as they are averaged out by neighbouring areas of low disease prevalence. However, some important caveats should be understood before employing these methods. There is some variation in how accurately some hospitals capture chronic disease conditions using diagnosis codes.31 In addition, for some parts of rural America, mail is only delivered to PO boxes; therefore, the more geographically detailed maps of disease prevalence based on geocoded data may not be accurate in these regions where mail is not delivered directly. Furthermore, our study found substantial variability in prevalence estimates for conditions that may not be well captured by emergency claims data. More research may be needed to determine the best approaches for estimating disease prevalence in rural areas.

Limitations


Since surveys did not ask respondents to report household size, single-adult households are likely over-represented. Furthermore, we did not specify a method of randomly selecting an adult in households with multiple residents, which may have contributed to bias in our sample. However, we estimated age and gender-standardised rates to the overall population in Sullivan County, which may partially reduce this bias. While our adjustment methods reduced age and gender-specific non-response bias, we were unable to standardise by race and ethnicity due to the very small proportion of minorities in several areas of Sullivan County. Given that minorities often have higher rates of chronic disease and tend to have lower response rates, our mailed survey may have underestimated disease prevalence. Although the groups that frequently seek emergency care and those who respond to surveys tend to diverge, there may have been some sort of parallel bias that accounted for the correlations or disease prevalence identified in our study. Our method that used emergency claims data to estimate disease prevalence is subject to many of the limitations associated with the use of administrative data. Fidelity of coding some variables including race, ethnicity and diagnosis codes can vary by hospital and may impact resulting disease prevalence estimates. Also, these claims data are often available about a year after they have been filed; thus, there is some lag in reporting. In this study, emergency claims data were collected for 2011–2015, whereas the county-wide survey was performed in 2017–2018.

Conclusions


We found that for select conditions, emergency claims data may be useful for tracking disease prevalence in rural areas and may provide more geographically precise estimates. Given the infrastructure already in place to collect these data, efforts could be focused on collecting more accurate diagnosis codes and more detailed geographical data. This approach could potentially help match the limited public health resources of a rural county to the geographical areas with the highest burden of disease.


The content of the study reflects the views of the authors and not the official position of the Sullivan County Public Health Department nor the NYU School of Medicine.



  • Contributors Study conception and design: DCL, LET and NAM; acquisition of the data: DCL, MO, MVN, AN and AJV; analysis and interpretation of the data: DCL, JMF, MO, CAK, CJS, AJV and LET; drafting of the manuscript: DCL; critical revision of the manuscript for intellectual content: JMF, MO, CAK, MVN, AN, CJS, AJV and LET. DCL is the guarantor of this work and had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

  • Funding This work was supported by the New York State Health Foundation grant number 16-04083.

  • Map disclaimer The depiction of boundaries on this map does not imply the expression of any opinion whatsoever on the part of BMJ (or any member of its group) concerning the legal status of any country, territory, jurisdiction or area or of its authorities. This map is provided without any warranty of any kind, either express or implied.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval This study was approved by NYU School of Medicine’s Institutional Review Board.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available upon reasonable request.
