Objectives Using free-text clinical notes and reports from hospitalised patients, determine the performance of natural language processing (NLP) ascertainment of Framingham heart failure (HF) criteria and phenotype.
Study design A retrospective observational study design of patients hospitalised in 2015 from four hospitals participating in the Atherosclerosis Risk in Communities (ARIC) study was used to determine NLP performance in the ascertainment of Framingham HF criteria and phenotype.
Setting Four ARIC study hospitals, each representing an ARIC study region in the USA.
Participants A stratified random sample of hospitalisations occurring during 2015, identified using a broad range of International Classification of Disease, ninth revision, diagnostic codes indicative of an HF event, was drawn for this study. A randomly selected set of 394 hospitalisations was used as the derivation dataset and the remaining 406 hospitalisations were used as the validation dataset.
Intervention Use of NLP on free-text clinical notes and reports to ascertain Framingham HF criteria and phenotype.
Primary and secondary outcome measures NLP performance as measured by sensitivity, specificity, positive-predictive value (PPV) and agreement in ascertainment of Framingham HF criteria and phenotype. Manual medical record review by trained ARIC abstractors was used as the reference standard.
Results Overall, performance of NLP ascertainment of Framingham HF phenotype in the validation dataset was good, with 78.8%, 81.7%, 84.4% and 80.0% for sensitivity, specificity, PPV and agreement, respectively.
Conclusions By decreasing the need for manual chart review, our results on the use of NLP to ascertain Framingham HF phenotype from free-text electronic health record data suggest that validated NLP technology holds the potential for significantly improving the feasibility and efficiency of conducting large-scale epidemiologic surveillance of HF prevalence and incidence.
- health informatics
- cardiac epidemiology
- heart failure
Data availability statement
Data are available upon reasonable request. All raw data for the study will be available upon request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
The article describes the first study to evaluate performance of natural language processing (NLP) using free-text clinical notes and reports stored in electronic health records to ascertain Framingham heart failure phenotype in multiple regionally dispersed hospitals in the USA with different health systems.
NLP performances (sensitivity, specificity, positive-predictive value and agreement) are assessed with the reference standard being manual extraction of prespecified information by trained and certified abstractors, using a highly standardised protocol, with quality assurance programmes in place that monitored accuracy, completeness and repeatability of the process.
The NLP programme used open-source software (clinical Text Analysis and Knowledge Extraction System and Python).
A limitation to the study is that it only includes a subset of hospitalised patients at risk for acute decompensated heart failure based on diagnostic codes (International Classification of Disease, ninth revision) and therefore is not representative of the general hospitalised population.
Since the passage of the Health Information Technology for Economic and Clinical Health Act in 2009,1 the use of electronic health records (EHRs) in hospital settings has become nearly ubiquitous. Although only approximately 9% of hospitals were using EHRs in 2008, by 2020 adoption among hospitals was approaching 100%.2 This creates unprecedented opportunities for researchers to automate the process of extracting clinical phenotypes from patient medical records through electronic search methods.
Scientific organisations and experts promote leveraging electronic data as beneficial to the future of research, public health surveillance and quality improvement initiatives.3 The Working Group on Epidemiology and Population Sciences established by the National Heart, Lung and Blood Institute identified e-epidemiology as a strategic priority for research, with recommendations for studies to ‘determine the validity, reliability and scalability of electronic tools for data collection’.4 Clinical phenotypes can be efficiently and accurately extracted from EHRs through the application of algorithms integrating structured data elements such as diagnostic codes, clinical laboratory data and medication lists.5 Less well-studied is the use of natural language processing (NLP) of free-text clinical notes stored in EHRs for the ascertainment of complex clinical phenotypes and syndromes.
We focused this study on the use of NLP for the ascertainment of heart failure (HF), a leading cause of hospital admissions and mortality among older adults in the USA.6 HF is a complex clinical syndrome characterised by the heart’s inability to supply blood flow sufficient to meet the needs of the body. It is estimated to affect 5.7 million American adults and its prevalence is expected to rise to 8.4 million by 2030.7 Reflecting the heterogeneous nature of HF syndromes, there is no universally accepted diagnostic schema for HF that adequately classifies all patients across this syndrome’s pathophysiology, ranging from HF with reduced left ventricular ejection fraction (LVEF) to HF with preserved LVEF (diastolic dysfunction). Signs and symptoms of HF may differ from patient to patient and clinical judgement is typically required to establish a diagnosis of HF for a given patient. The goal of this study is to determine the extent to which accurate EHR-based extraction of Framingham HF criteria phenotypes and HF event classification8 can be performed in an automated fashion from clinical notes. We sampled inpatient EHRs at four geographically dispersed hospitals with disparate healthcare systems for automated processing, and benchmarked performance against an established, standardised protocol of record abstraction and classification.7
From 2005 through 2014, the Atherosclerosis Risk in Communities (ARIC) study9 conducted community surveillance of HF hospitalisations, classified according to the Framingham schema,8 for residents aged 55–84 years in four regions in the USA.10 11 To produce annual event rates of HF, eligible hospitalisations from a sample of discharges from acute care hospitals located in ARIC study communities were manually abstracted and events classified according to the presence of the Framingham HF classification criteria.8 A hospitalisation was considered eligible for inclusion based on specific primary or secondary International Classification of Disease, ninth revision, Clinical Modification codes (HF: 428; rheumatic heart disease: 398.91; hypertensive heart disease with congestive heart failure: 402.0, 402.11 or 402.91; hypertensive heart disease and renal failure with HF: 404.01, 404.03, 404.13, 404.91 or 404.93; acute cor-pulmonale: 415.0; chronic pulmonary heart disease, unspecified: 416.9; other primary cardiomyopathies: 425.4; acute oedema of lung, unspecified: 518.4; dyspnoea and respiratory abnormalities: 786.0). Extraction of prespecified information was performed manually by trained and certified abstractors, using a highly standardised protocol, with quality assurance programmes in place that monitored accuracy, completeness and repeatability of the process.12 A stratified random sample of these hospitalisations occurring during 2015 in four ARIC study hospitals in different ARIC study regions was drawn for the study. A randomly selected set of 394 records was employed as the derivation dataset; the remainder was set aside as the validation set (table 1, N=406). There were no statistically significant differences in patient demographics between the derivation and validation datasets.
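The code-based eligibility screen described above can be illustrated with a minimal Python sketch. It is not the ARIC sampling procedure: the function name is hypothetical, and treating each listed code as a prefix (so that ‘428’ also matches ‘428.0’) is an assumption made for illustration.

```python
# ICD-9-CM codes as listed in the text. Treating each entry as a prefix
# (e.g. "428" also matches "428.0" or "428.22") is an assumption of this sketch.
ELIGIBLE_ICD9_PREFIXES = (
    "428",                                           # heart failure
    "398.91",                                        # rheumatic heart disease
    "402.0", "402.11", "402.91",                     # hypertensive heart disease with HF
    "404.01", "404.03", "404.13", "404.91", "404.93",
    "415.0",                                         # acute cor pulmonale
    "416.9",                                         # chronic pulmonary heart disease
    "425.4",                                         # other primary cardiomyopathies
    "518.4",                                         # acute oedema of lung
    "786.0",                                         # dyspnoea and respiratory abnormalities
)


def is_eligible(discharge_codes):
    """Return True if any primary or secondary code matches an eligible prefix."""
    for code in discharge_codes:
        normalised = code.strip()
        if any(normalised.startswith(prefix) for prefix in ELIGIBLE_ICD9_PREFIXES):
            return True
    return False
```

For instance, a discharge carrying codes 401.9 and 428.0 would be flagged as eligible under this sketch, whereas one carrying only 401.9 and 250.00 would not.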
Patient and public involvement
HF is a leading cause of hospital admissions and mortality among older adults in the USA.6 It is estimated to affect 5.7 million American adults and its prevalence is expected to rise to 8.4 million by 2030.7 Therefore, the study outcomes are likely to be a high priority for patients. However, patients were not directly involved in the study design, conduct or outcomes of the research project.
The primary goal of this study is to determine the accuracy with which EHR-based NLP algorithms can be used to (1) extract Framingham HF criteria variables (table 2) from free-text clinical notes and (2) ascertain the HF phenotype according to the Framingham schema.8 As shown in table 2, HF is present if at least two major Framingham criteria are met, or one major and two minor criteria are met. The study also seeks to assess NLP performance reproducibility in ascertainment of Framingham HF phenotype across the four study hospitals.
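The classification rule stated above (at least two major criteria, or one major and two minor criteria) can be written as a short Python sketch. The function name is hypothetical, and the caller decides which criteria count as major or minor; in the study that assignment comes from table 2.

```python
def has_framingham_hf(major_present, minor_present):
    """Apply the rule from the text to the criteria found present.

    major_present / minor_present: iterables of criterion names judged
    present (and not negated) for one hospitalisation.
    """
    n_major = len(set(major_present))
    n_minor = len(set(minor_present))
    # HF is present with >=2 major criteria, or >=1 major plus >=2 minor.
    return n_major >= 2 or (n_major >= 1 and n_minor >= 2)
```

As an example, one major criterion together with two minor criteria satisfies the rule, whereas any number of minor criteria without a major criterion does not.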
Data manually extracted by certified ARIC abstractors following a standardised protocol13 were used as the reference standard to assess the EHR-based performance of NLP. We used the derivation dataset of 394 records from the four study hospitals to develop the NLP algorithms to extract Framingham HF criteria variables. Once the NLP algorithms were optimised, we assessed NLP performance using a separate validation dataset of 406 unique patient records (table 1).
Figure 1 summarises the study design in which analysis of free-text clinical notes stored in EHRs was compared with manually abstracted Framingham HF phenotype criteria variables (reference standard) from hospitalisations occurring in 2015 at four study hospitals enrolled in the ARIC study (table 1). EHR clinical note types used for the analysis included emergency department notes, hospital admission notes, discharge summaries and imaging studies, when available. A structured data element (>4.5 kg weight change during the hospitalisation) was also included.
Extracting HF phenotype criteria from clinical notes in EHRs using NLP
We developed an NLP system using the open-source Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)14 and the Python15 programming language. cTAKES is an NLP programme specifically designed to analyse free-text clinical notes. It includes specific modules for clinical concept coding and negation status. Concept coding from the Unified Medical Language System16 was used to identify HF phenotype criteria (table 2), such as ‘paroxysmal nocturnal dyspnoea’, and associate them with standardised concept unique identifiers (CUI), such as ‘C1956415’, that can easily be referenced in a Python programme. The cTAKES programme also assigns a negation status to each concept it identifies in the electronic clinical text. For example, the HF criterion ‘no paroxysmal nocturnal dyspnoea’ is processed by cTAKES by first assigning the CUI ‘C1956415’ to the ‘paroxysmal nocturnal dyspnoea’ concept and then assigning a negation flag to the concept if it identifies predefined negation terms, such as ‘no’ or ‘denies’, associated with the concept. Our study required an additional layer of Python code to identify non-standard documentation of HF criteria (eg, the abbreviation ‘PND’ for ‘paroxysmal nocturnal dyspnoea’), as well as augmented negation so that HF signs and symptoms not described as new or worsening were also negated. Figure 2 shows an overview of the NLP pipeline used to extract Framingham HF phenotype criteria from free-text clinical notes stored in study hospital EHRs. For details of the NLP programme, see online supplemental appendix 1.
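To make the idea of a Python layer on top of concept coding concrete, the following is a minimal, self-contained sketch. It does not call cTAKES and is not the study's programme: the lookup table, the Mention record and the function name are hypothetical, and only the CUI and the two negation terms are taken from the text.

```python
import re
from dataclasses import dataclass

# Surface forms mapped to the CUI quoted in the text. The table itself is
# hypothetical; a real system would cover every criterion in table 2.
SURFACE_FORMS = {
    "paroxysmal nocturnal dyspnoea": "C1956415",
    "paroxysmal nocturnal dyspnea": "C1956415",
    "pnd": "C1956415",  # non-standard abbreviation mentioned in the text
}
NEGATION_TRIGGERS = ("no", "denies")  # the two examples given in the text


@dataclass
class Mention:
    cui: str
    text: str
    negated: bool


def find_mentions(sentence):
    """Find known surface forms and flag those preceded by a negation trigger.

    Crude by design: any trigger earlier in the same sentence negates the
    mention, which over-negates sentences such as "no fever but has PND".
    """
    mentions = []
    lowered = sentence.lower()
    for form, cui in SURFACE_FORMS.items():
        for match in re.finditer(r"\b" + re.escape(form) + r"\b", lowered):
            preceding = lowered[:match.start()]
            negated = any(
                re.search(r"\b" + re.escape(trigger) + r"\b", preceding)
                for trigger in NEGATION_TRIGGERS
            )
            mentions.append(Mention(cui, sentence[match.start():match.end()], negated))
    return mentions
```

Under these assumptions, ‘No PND.’ yields one negated mention carrying the same CUI as the spelled-out term, while ‘Reports paroxysmal nocturnal dyspnoea’ yields one mention that is not negated.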
We computed sensitivity, specificity and positive-predictive value (PPV) as performance metrics to compare EHR-based HF phenotype criteria with the reference standards (manual review by trained ARIC chart abstractors). Using EHR-based NLP Framingham HF phenotype ascertained criteria (table 2), we then calculated the presence or absence of the HF phenotype according to the Framingham8 HF schema for the study population, and compared results with Framingham HF phenotype calculated using manually abstracted Framingham HF criteria from the ARIC study (reference standard). χ2 and Fisher’s exact tests on weighted proportions were used to calculate 95% CIs and p values for EHR-based NLP performance characteristics. All analyses were performed using SAS V.9.4 and Stata/SE V.15.0 software.
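For readers less familiar with these metrics, the sketch below computes them from a 2×2 table of NLP result against the reference standard. It uses simple unweighted counts, whereas the study analysed weighted proportions in SAS and Stata, so it illustrates the definitions only; the function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def performance_metrics(tp, fp, fn, tn):
    """Metrics for NLP (test) against manual abstraction (reference).

    tp: NLP positive, reference positive    fp: NLP positive, reference negative
    fn: NLP negative, reference positive    tn: NLP negative, reference negative
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "agreement": _ratio(tp + tn, tp + fp + fn + tn),
    }
```

With hypothetical counts of tp=80, fp=20, fn=10 and tn=90, PPV is 80/100 and agreement is 170/200. Returning None for an empty denominator is a design choice of this sketch for criteria that NLP never flags.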
EHR performance for extraction of Framingham HF phenotype criteria variables
Table 3 shows the performance of EHR-based NLP abstraction of Framingham HF phenotype criteria from free-text clinical notes, compared with manual chart abstraction for the validation data (see online supplemental appendix 2 for results using derivation data). Cardiomegaly and dyspnoea on exertion showed the best performance at PPV 96.7% and 94.5%, respectively. Conversely, hepatojugular reflux and S3 gallop had the lowest PPVs (0.0% and 11.8%, respectively). A major factor in the poor performance was the low frequency of these variables in the patient sample, 0 and 5 occurrences for hepatojugular reflux and S3 gallop, respectively. Pulmonary oedema demonstrated the best sensitivity (91.7%) and hepatomegaly demonstrated the best specificity (99.0%). See online supplemental appendix 3 for performance of NLP in ascertaining Framingham HF phenotype criteria variables for each study hospital.
NLP performance in the ascertainment of the Framingham HF phenotype from EHR data
Overall, performance of EHR-based ascertainment of Framingham8 HF phenotype in the validation dataset was good, with 78.8%, 81.7%, 84.4% and 80.0% as sensitivity, specificity, PPV and agreement metrics, respectively (table 4).
Performance of NLP-based ascertainment of Framingham HF phenotype across study hospitals
Figure 3 shows EHR-based performance in the ascertainment of Framingham HF phenotype for each of the four study hospitals. There was good reproducibility of NLP performance and no meaningful differences in NLP performance across hospitals for the three performance measures of sensitivity, specificity and agreement (all 95% CIs overlap between hospitals for each performance measure).
Here, we report on the derivation and validation of an open-source software NLP application that uses EHR data to ascertain HF according to the established Framingham schema in patients hospitalised in dispersed regions of the USA. EHR-based identification of the Framingham HF phenotype had good performance characteristics (sensitivity: 78.8%, specificity: 81.7%, PPV: 84.4% and agreement: 80.0%) and was reproducible across the four study hospitals.
Several studies have investigated the use of billing codes and lab results to ascertain Framingham HF phenotype in inpatient settings within single healthcare systems.17 18 To our knowledge, this is the first study to describe the performance of EHR-based NLP tools to ascertain Framingham HF phenotype in inpatients from multiple geographically diverse hospitals from different healthcare systems. Our results compare favourably with studies using ICD-9, diagnosis related group (DRG) codes and lab results to ascertain Framingham HF phenotype. Using ICD-9 and DRG codes, Presley et al17 ascertained Framingham HF phenotype for hospitalised patients in the Veterans Administration (VA) healthcare system. The VA study demonstrated sensitivity of 45.1%, specificity of 99.4% and a PPV of 89.7% for Framingham HF phenotype in a population that was homogeneous with respect to gender (98.8% male). Using ICD-9 codes, HF medications and lab results, Tison et al18 ascertained Framingham HF phenotype for inpatients within a single healthcare system in Minnesota. Of the multiple algorithms evaluated in that study, the one with the highest PPV (86.5%) had a sensitivity of only 41.6%.
Our study adds to the growing body of evidence which suggests that NLP has the potential to improve the cost-effectiveness and timeliness of phenotyping in clinical and epidemiological studies by reducing the need for manual chart abstraction.
In this first step towards the development of a robust protocol for EHR-based NLP surveillance of hospitalised HF patients, we designed a prototype system that had good performance in ascertaining Framingham HF phenotype that was reproducible across four hospitals selected to be geographically dispersed. Underlying this reproducibility, however, was considerable effort required to harmonise a single NLP algorithm that accurately and consistently performed well (figure 3) across the four hospitals.
Evaluation of our results revealed several lessons learnt in the extraction of HF phenotype criteria. First, having complete sets of clinical note and report types from hospitals likely had a significant impact on performance. Our study used NLP to process emergency department notes, admission notes, discharge summaries and imaging study reports. Given the notable lack of standardisation of note type nomenclature across hospitals, we found significant variability between the four study hospitals in the nomenclature used to identify specific clinical note types. For example, participating hospitals designated discharge summaries as ‘Discharge Summaries’, ‘Discharge PN’ or ‘PMNDIS’, and in some instances the discharging physician’s name was concatenated with ‘Discharge PN’ (eg, ‘Smith Discharge PN’). Properly capturing phenotypes and clinical outcomes from EHRs requires overcoming a lack of standardised nomenclature, variability in standards for defining and recording data elements, and uncertain collection of longitudinal information or data across settings of care. In contrast, these are all features embedded in the standardised community surveillance registry that systematically gathers data entered by many clinicians in numerous hospitals, and that served as the benchmark to validate our HF phenotype identification and event classification from EHRs. As is typically the case for dedicated registries, ARIC’s data element extraction from records is performed by trained abstractors according to specific definitions, standardised procedures and specialised forms, leading to highly reliable and valid information under quality control monitoring. Data in EHRs, by contrast, are captured in the process of patient care by various members of the clinical team, for purposes other than event ascertainment or analysis.
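The note type harmonisation problem described above can be sketched as a small normalisation step in Python. The patterns cover only the four discharge summary labels quoted in the text; the canonical label ‘discharge_summary’, the fallback ‘unknown’ and the function name are hypothetical.

```python
import re

# Patterns for the discharge summary labels quoted in the text. A real
# mapping would need one such list per note type and per hospital.
DISCHARGE_PATTERNS = [
    re.compile(r"^discharge summar(y|ies)$", re.IGNORECASE),
    # Optional single leading token covers labels such as "Smith Discharge PN".
    re.compile(r"^(\S+\s+)?discharge pn$", re.IGNORECASE),
    re.compile(r"^pmndis$", re.IGNORECASE),
]


def canonical_note_type(raw_label):
    """Map a hospital-specific note label to a canonical note type."""
    label = " ".join(raw_label.split())  # trim and collapse whitespace
    if any(pattern.match(label) for pattern in DISCHARGE_PATTERNS):
        return "discharge_summary"
    return "unknown"
```

Routing unmatched labels to an explicit ‘unknown’ category, rather than dropping them silently, makes it easier to spot note types that a hospital names in an unanticipated way.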
Although several efforts exist to establish common data models for EHR data,19 20 such models are not yet in widespread use and standardised definitions when documenting patient care are uncommon.
The second lesson learnt from the study was the challenge in optimising NLP performance to accurately determine negation for Framingham HF phenotype criteria variables documented in clinical notes. We observed multiple instances in which clinicians documented negative HF signs and symptoms in a list, phrased as ‘patient denies cough, fever, abdominal pain, chest pain, dyspnoea’. In such lists, it was often difficult to accurately assess whether an HF criterion appearing late in the list was still negated by ‘denies’. Similarly, formatting of negation terms often varied by clinician and hospital and included terms such as ‘no’, ‘denies’, ‘negative’, ‘neg’, ‘(−)’, ‘−’ and ‘patient does not report’, among other idiosyncratic terminology. Another challenge was establishing negation when clinicians described conditions in discharge summaries under which it was appropriate for patients to take a given medication. For example, ‘use albuterol inhaler four times daily as needed for dyspnoea’. In this case, the ‘dyspnoea’ Framingham HF criterion should be negated because the patient is not currently experiencing dyspnoea; the symptom is mentioned only as the condition under which the medication should be used.
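The two negation problems described above, scope across a comma-separated list and conditional medication instructions, can be illustrated with a rule-based Python sketch. The cue lists come from the examples in the text; the scope-breaking words, the function name and the overall approach are assumptions for illustration and do not reproduce the study's rules.

```python
import re

# Negation cues listed in the text; "(−)" and "−" use the minus sign (U+2212)
# as printed there. A bare "−" cue would be far too permissive in real notes.
NEGATION_CUES = ["patient does not report", "negative", "denies", "neg", "no", "(−)", "−"]
CONDITIONAL_CUES = ["as needed for"]  # from the albuterol example in the text
# Hypothetical scope breakers: a cue stops applying once one of these appears.
SCOPE_BREAKERS = re.compile(r"\b(but|however|except)\b|[.;]")


def is_negated(sentence, symptom):
    """Return True if the first mention of symptom falls under a negation cue."""
    lowered = sentence.lower()
    position = lowered.find(symptom.lower())
    if position == -1:
        return False
    preceding = lowered[:position]
    # Conditional mentions ("as needed for dyspnoea") are treated as negated.
    if any(cue in preceding for cue in CONDITIONAL_CUES):
        return True
    for cue in NEGATION_CUES:
        pattern = re.escape(cue)
        if cue[0].isalnum():
            pattern = r"\b" + pattern + r"\b"  # whole-word match for word cues
        for match in re.finditer(pattern, preceding):
            # The cue carries across a list unless a scope breaker intervenes.
            if not SCOPE_BREAKERS.search(preceding[match.end():]):
                return True
    return False
```

Under these assumptions, ‘dyspnoea’ is negated in both example sentences quoted above, but not in ‘denies cough but reports dyspnoea’, where ‘but’ ends the scope of ‘denies’.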
There are limitations to our study results. The study population represents a sample of hospitalised patients selected for the likelihood of having congestive HF based on ICD-9-CM codes (the prevalence of Framingham HF was 52.0% for NLP and 55.8% for manual chart abstraction). However, this limitation can be mitigated by automated screening of patients using the same ICD-9-CM codes before using NLP ascertainment of Framingham HF criteria. Generalisability of study findings to other populations has not been tested. Furthermore, among the metrics used to assess NLP performance, estimated PPV is influenced by the prevalence of the condition. Lastly, and not unexpectedly, PPV was poor for Framingham HF criteria that occurred infrequently in the patient population. For example, hepatojugular reflux (n=0/406), hepatomegaly (n=3/406), S3 gallop (n=5/406) and PND (n=27/406) had PPVs of 0.0%, 33.3%, 11.8% and 33.3%, respectively (table 3). Nonetheless, because of their low prevalence in the study population, these criteria likely had a relatively small impact on the determination of the Framingham HF phenotype prevalence.
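The dependence of PPV on prevalence noted above follows from Bayes' theorem and can be shown with a short Python sketch. The function name is hypothetical. Sensitivity (78.8%), specificity (81.7%) and the manually abstracted prevalence (55.8%) are the values reported in this article; the 10% prevalence is a hypothetical value chosen only to show the direction of the effect.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive-predictive value implied by sensitivity, specificity and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Reported validation values: close to the PPV of 84.4% reported in the article
# (the published figures are weighted, so exact agreement is not expected).
ppv_study = ppv_from_prevalence(0.788, 0.817, 0.558)

# Hypothetical lower-prevalence population: same test, much lower PPV.
ppv_low_prevalence = ppv_from_prevalence(0.788, 0.817, 0.10)
```

Holding sensitivity and specificity fixed, the implied PPV falls substantially as prevalence falls, which is why the PPV estimated in a code-enriched sample should not be assumed to carry over to an unselected hospitalised population.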
The means to assess the population burden of HF and the impact of medical interventions and public health policies on these metrics are limited, and largely rely on efforts by professional organisations such as the American Heart Association21 drawn from various NIH-supported observational studies. Our data suggest that NLP has good performance characteristics in determining Framingham HF phenotype in hospitals from four distinct regions of the country. Such estimates do not substitute for comprehensive population data, nor are they regionally (or nationally) representative, and they do not lend themselves to estimation of population burden metrics or temporal trends. A 2011 report from the Institute of Medicine recommended that a national surveillance programme, funded through the Affordable Care Act, be put in place,22 but questions persist about the feasibility of community surveillance that can efficiently incorporate EHR capabilities to produce accurate estimates of disease burden and to monitor trends in cardiovascular diseases. To accomplish this, such surveillance should be able to link EHR resources to population denominators, harmonise diverse EHRs and implement information extraction tools of known validity and portability, while safeguarding patient privacy and remaining robust to changes in diagnostic fashion, technologies and coding practices. Such challenges need careful attention to realise the potential of EHR-enabled community surveillance. The alternative, the current inability to monitor population burden and trends, represents a significant impediment to the ability to gauge the impact of healthcare and public health initiatives on the burden of, and trends in, the most prominent contributors to morbidity, mortality and healthcare expenditures in the USA.
Importantly, the lack of community surveillance programmes encumbers the progress in the understanding of and in reducing health disparities in the incidence of the major cardiovascular health events and their outcomes.23 Regional epidemiologic surveillance programmes, such as ARIC’s, indicate that during the years 2005–2012, annual rates of incident hospitalised HF increased in all race–gender groups, but markedly so for black women. Ongoing HF surveillance efforts are therefore needed to identify vulnerable population subgroups and develop effective prevention strategies.
Future directions for our project include developing a user-friendly interface to adjust NLP algorithms based on institution-specific patterns in documentation of negations, as well as investigating the use of machine-learning technology to optimise performance of the current rule-based NLP system.24 Specifically, our goal is to approach 100% sensitivity while optimising specificity and PPV.
In conclusion, by decreasing the need for manual chart review, our results on the use of NLP to ascertain Framingham HF phenotype from free-text EHR data suggest that validated NLP technology holds the potential for significantly improving the feasibility and efficiency of conducting large-scale epidemiologic surveillance of HF prevalence and incidence.
This study was approved by the University of North Carolina Institutional Review Board (IRB Study ID#: 19-3462). All participants gave informed consent before taking part in the study.
The authors thank the staff and participants of the ARIC study for their important contributions.
Contributors All named authors are responsible for the reported research. CRM, EW, WR, SH, GH and AMK-N: substantially contributed to the conception of the project, acquisition of data, interpretation of data for the study and writing of the manuscript. CRM, SJ and HY: contributed significantly to data analysis for the study.
Funding The Atherosclerosis Risk in Communities study has been funded in whole or in part with federal funds from the National Heart, Lung and Blood Institute, National Institutes of Health, Department of Health and Human Services, under contract nos (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700005I and HHSN268201700004I).
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.