
Original research
Simplified data science approach to extract social and behavioural determinants: a retrospective chart review
Andrew Teng, Adam Wilcox
Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, USA
Correspondence to Andrew Teng; akteng{at}uw.edu

Abstract

Objectives We aim to extract a subset of social factors from clinical notes using common text classification methods.

Design Retrospective chart review.

Setting We collaborated with a local level I trauma hospital located in an underserved area that has a housing unstable patient population of about 6.5% and extracted text notes related to various social determinants for acute care patients.

Participants Notes were retrospectively extracted from 43 798 acute care patients.

Methods We solely use open source Python packages to test simple text classification methods that can potentially be easily generalisable and implemented. We extracted social history text from various sources, such as admission and emergency department notes, over a 5-year timeframe and performed manual chart reviews to ensure data quality. We manually labelled the sentiment of the notes, treating each text entry independently. Four different models with two different feature selection methods (bag of words and bigrams) were used to classify and predict housing stability, tobacco use and alcohol use status for the extracted clinical text.

Results From our analysis, we found overall positive results and metrics in applying open-source classification techniques; the accuracy scores were 91.2%, 84.7%, 82.8% for housing stability, tobacco use and alcohol use, respectively. There were many limitations in our analysis including social factors not present due to patient condition, multiple copy-forward entries and shorthand. Additionally, it was difficult to translate usage degrees for tobacco and alcohol use. However, when compared with structured data sources, our classification approach on unstructured notes yielded more results for housing and alcohol use; tobacco use proved less fruitful for unstructured notes.

  • health informatics
  • social medicine
  • history (see medical history)
  • biotechnology & bioinformatics

Data availability statement

No data are available. The data used are unable to be shared due to patient privacy, confidentiality, and US healthcare laws.


This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.


Strengths and limitations of this study

  • From our analysis, we can first see that text classifiers are promising when applied to extracted clinical notes for housing stability, tobacco use and alcohol use status.

  • Additionally, we found that structured data sources, such as diagnosis codes and intake surveys, vary and may not be the most holistic approach to understanding housing stability, tobacco use and alcohol use.

  • Our simplified approach has shown that open source simple text classifiers can be used to predict text sentiment for social and behavioural determinants and can supplement current structured sources to provide a more complete social history for patients.

  • However, even with a few limitations with our approach, we believe that this workflow can help inform clinicians and provide an easily implementable snapshot on patient social history.

Introduction

Most data can be generally categorised as structured or unstructured, where structured data can consist of items such as vital signs and lab results and unstructured data can consist of items such as text notes or images.1 Although structured data can generally be easier to extract and analyse, unstructured data can potentially provide an array of information not present or easily identifiable in structured data. As healthcare institutions expand data collection to include non-clinical features, more unstructured data surrounding behavioural health and social determinants of health (SDoH) are becoming readily available. Furthermore, there has been growing interest around Medicaid patients, as SDoH can drive up to 80% of health outcomes, especially within this patient demographic.2 Therefore, SDoH and REAL (Race, Ethnicity and Language) data are now being used for secondary analysis, as recent research has indicated a correlation between SDoH and health outcomes and an increasing need to research health disparities across populations.3

SDoH and REAL can include housing stability, access to jobs and healthcare services, education level, language and socioeconomic conditions.4 These indicators are descriptors of different societies and are useful as predictors of health outcomes and the uptake of health interventions.5 Because they can potentially be powerful indicators of health, many institutions are now starting to collect and analyse SDoH and REAL information, whether through text notes or standardised coding, such as International Classification of Diseases (ICD).6 Additionally, SDoH can provide health teams with a more holistic understanding of a patient's condition.7 However, there are challenges with SDoH intake as there is no standardised SDoH screening tool in the Electronic Health Record (EHR) across institutions8; additionally, coding schemes like ICD can prove to be unreliable in secondary analysis, as coding can oversimplify symptoms and diagnoses, leading to coding uncertainties, and coding errors may arise from unintentional mistakes or even upcoding.9 10 Additionally, certain SDoH data may be more complete than others due to reimbursement incentives or other priorities.11 Past research has shown that hospital readmissions are highly influenced by patient health status and SDoH and suggests that clinical staff and researchers should consider SDoH when assessing readmission risk.12

The 2018–2019 King County Community Health Needs Assessment (CHNA) reported the results from a health need assessment survey given to residents to identify regional perceived healthcare issues. It was determined that housing affordability and housing stability were major challenges dominating overall health.13 Mental health was also highlighted as a challenge for healthcare providers; mental illness can be caused by depression, schizophrenia and alcohol and substance-related disorders.13 The CHNA reported that adults in the lowest income tier were about 15 times more likely to experience severe psychological distress compared with their high-income counterparts. Additionally, it is noted that part of the region had continued challenges with adult smoking rates.13 Locally, it is estimated that there are at least 22 000 homeless individuals in King County and more than 12 000 people in the Seattle region, a 4% increase over the previous year.14 Housing instability is associated with various health inequalities, such as shorter life expectancy, higher morbidity and increased usage of acute hospital services, ‘as the social determinants of homelessness and health inequities are often intertwined, and long-term homelessness further exacerbates poor health’.15 It is, therefore, important to treat housing stability and other SDoH as a combined health issue to aid in improving health outcomes in clinical settings. Although some research has shown that patients who experience housing instability are more likely to die following admission for severe sepsis than those with insurance,16 other research indicates that the effects of health inequalities are still unclear and need further investigation.17 Additionally, various behavioural habits, including tobacco and alcohol use, although they may not directly be considered SDoH, can impact health decisions and outcomes. For example, one study found that participants who drank alcohol and reported tobacco use consumed more foods that were higher in fat and sugar and low in vitamins and minerals, as well as foods that they considered to be less healthy and prepared in a less healthy way.18

Within our region, it has been noted in recent years that the smoking rate is around 13%; however, the rate among Black/African-Americans or individuals with multiple races is double the rate among white adults and four times higher than that among Asian adults. Additionally, it was reported that, when compared with high-income households, low-income households were three times more likely to be smokers.13 19 Drug and alcohol use also shared similar metrics; within the region, ‘drug and alcohol-caused deaths was 22% higher among Blacks and four times greater among American Indian/Alaskan Native than among non-Hispanic Whites’ and alcohol use represented 4.97/100 000 deaths locally in 2015.20 21 Therefore, it may be important to look at social determinants and health behaviours, together known as social and behavioural determinants of health (SBDH), to better understand the patient population.18

Recent technological advances in machine learning and artificial intelligence have shown great potential in providing a pathway for informaticians and clinicians to better understand unstructured data.

Within the clinical setting, there have been numerous approaches in adopting natural language processing (NLP) to aid with processing unstructured clinical text notes. Common uses of NLP include extracting diagnoses and chief reports as well as grouping of information for quality improvement. There are various NLP methods that can be used in the clinical setting, such as automatic tagging of conditions or variables of interest, sentiment classification or even text extraction. Various open source NLP and ontological tools, such as Automated Retrieval Console, Apache clinical Text Analysis and Knowledge Extraction System (Apache cTAKES), MetaMap and HITEx, Unified Medical Language System (UMLS) Metathesaurus and BioPortal have been used to aid with text extraction or classification.22–24 On the other hand, less complex classification methods have been used as well to identify specific groups of patients, risk assessment or aid in validating structured annotation.25–27 A recent scoping review found that although practitioners collect a variety of SBDH data at point of care through EHR, the overall use of automated technology is limited to date.28

With the idea of implementing an easily generalisable approach to classify selected social factors, we extracted both unstructured and structured data sources related to SBDH from a local hospital to identify and generate a framework to automatically extract and classify SBDH from text notes. We focused on housing stability status, tobacco use and alcohol use. These three social factors were chosen due to their direct impact on health outcomes and the local public health impact14–18 and presence in the EHR. To tackle challenges associated with SBDH extraction from unstructured text notes, we aimed to create a generalisable framework using low barrier open-source tools that are commonly used in the data science field. Because notes and stylistic choices can be institution and location specific, we sought not to create a model that is generalisable but rather a simplified method that could be potentially easily implemented using common off the shelf NLP and data science tools.

Methods

Study design and overview

A high-level overview of our workflow is seen in figure 1. We retrospectively extracted patient data from the acute care setting at a level I trauma centre and academic teaching hospital with the aim to create a general and easily applicable workflow to extract and classify SBDH factors from clinical notes. We applied a two-pronged approach. First, we collected unstructured data from a subset of patients over a 1-year timespan (group A) to create and test the text classification model. Second, we collected structured and unstructured data from a subset of patients over a 5-year timespan (group B) to apply the best model created from group A and compare results between the two data types. We performed automatic classification and scoring of patients via various NLP classification methods on three social factors: (1) housing stability, (2) tobacco use and (3) alcohol use. Our general workflow for housing stability is seen in figure 2; a similar approach was also used for tobacco and alcohol use.

Figure 1

High-level overview of the workflow process.

Figure 2

Text extraction, classification and scoring workflow. ED, emergency department.

Study population

Data were extracted from Harborview Medical Center, a 413-bed academic hospital whose patient population comes mostly from Washington, but also from a wider five-state area.29 In 2014, there were 17 121 inpatient admissions, where 19% of the patients belonged to a racial or ethnic minority and 37% of patients were enrolled in Medicaid.29 30 Additionally, in 2015, the non-US born population was estimated to be around 21% in Seattle, highlighting the potential diversity that could be found within this patient population.30

Data sources, extraction and validation

We extracted both structured and unstructured data sources related to housing stability, tobacco use and alcohol use using Structured Query Language (SQL) queries called directly from an integrated Python-based Jupyter Notebook:

  1. Structured data sources include billing and diagnostic/ICD 9 and 10 codes, questionnaire or Epic SmartForm responses, address fields (location), problem list (ICD 9), patient encounters, clinical events (actual encounters of care) and discharge/disposition location.

  2. Unstructured data sources consisted of text notes from the emergency department (ED), admission (admit) notes, social work and ambulance notes.

Discharge notes were not explored as they were not recorded in the same subdivided format as the admit and ED notes, making selective text extraction of SBDH difficult. From our initial list of patient identifiers over a 1-year timespan from group A, we performed manual EHR validation of a random subset of 50 patients to validate the completeness of the clinical notes and confirm the location of social history and social factors in clinical notes. Extensive research and conversations with an internal data analyst confirmed the location of these topics (housing, tobacco use and alcohol use) within structured data sources.

Data cleaning

After confirmation, clinical notes were extracted for both groups A and B. The notes were cleaned (eg, symbols removed, converted to lowercase) prior to classification and analysis in the Python Jupyter notebook via Natural Language Toolkit (NLTK). Our general text extraction and cleaning workflow can be seen in figure 3. However, housing stability notes and tobacco or alcohol use notes were stylistically and grammatically different, and both sets needed distinct additional cleaning steps. Housing stability notes that contained the phrase ‘not homeless’ were converted via regex to say ‘housed’ instead. Additionally, for housing stability, a concept dictionary was created to substitute local facility names with a more general concept (eg, ‘Union Gospel Mission’ was converted to ‘shelter’). This was done to explore how the algorithms handle proper nouns.
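The cleaning steps described above can be sketched in a few lines. This is a minimal illustration that uses only Python's built-in `re` module rather than NLTK; the function name `clean_note` and the dictionary `CONCEPTS` are hypothetical, and only the ‘Union Gospel Mission’ entry comes from the text.

```python
import re

# Hypothetical concept dictionary: local facility names mapped to a general concept.
# Only this first entry is taken from the text; a real list would be site-specific.
CONCEPTS = {
    "union gospel mission": "shelter",
}


def clean_note(text: str) -> str:
    """Lowercase a note, normalise housing phrases and strip symbols."""
    text = text.lower()
    # Rewrite the negated phrase first so 'homeless' does not survive as a token.
    text = re.sub(r"\bnot\s+homeless\b", "housed", text)
    # Replace facility names with their general concept.
    for name, concept in CONCEPTS.items():
        text = re.sub(r"\b" + re.escape(name) + r"\b", concept, text)
    # Drop symbols and punctuation, then collapse repeated whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_note("Pt is NOT homeless; stays w/ family."))
# prints: pt is housed stays w family
print(clean_note("Currently staying at Union Gospel Mission."))
# prints: currently staying at shelter
```

Replacing ‘not homeless’ before tokenisation matters for a bag-of-words model, which would otherwise see the token ‘homeless’ and lose the negation.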

Figure 3

Text extraction and cleaning process. Additional steps were performed for notes when classifying text related to tobacco and alcohol use to extract negative sentiment doubles or triples. ROS, Review of Systems.

For text notes in group B, we performed an additional concept extraction step. Tobacco use and alcohol use notes often contained incomplete (lacking the subject, predicate, object format) triples or doubles (eg, ‘Denies smoking, drinking, drugs’). Due to their incomplete sentence structures, common NLP tools to parse, extract and classify triples, such as Stanford CoreNLP, were not suitable as these tools rely on having all three parts of the triple present. These notes related to tobacco and alcohol use, therefore, underwent an additional step that performed a separate relation extraction that would first identify a negative sentiment word (eg, denies), then individually extract the following SBDH-related objects in the list by commas or conjunctions (eg, and, or), and then label, or reclassify if necessary, the negative sentiment to all components of the list. Our process is seen in the left side of figure 3. If the regex extraction of negative lists resulted in a different result from the text classification prediction, the regex extraction would overwrite the end result prior to scoring. Once these steps were performed, the data were considered clean and suitable for classification.
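The negated-list step can be illustrated with a short sketch. It assumes a small hypothetical vocabulary (`NEGATION_CUES`, `TERM_TO_FACTOR`) and assumes that a negated factor is overwritten with label 0 (no use); the exact patterns used in the study are not given in the text.

```python
import re

# Hypothetical vocabularies; real ones would be built from local notes.
NEGATION_CUES = ("denies", "denied", "negative for")
TERM_TO_FACTOR = {
    "smoking": "tobacco", "tobacco": "tobacco", "cigarettes": "tobacco",
    "drinking": "alcohol", "alcohol": "alcohol", "etoh": "alcohol",
}

# A negation cue followed by a list that runs to the end of the clause.
NEGATED_LIST = re.compile(
    r"\b(?:" + "|".join(NEGATION_CUES) + r")\b\s+([^.;]+)", re.IGNORECASE
)


def negated_factors(note: str) -> set[str]:
    """Return the social factors that appear inside a negated list."""
    found = set()
    for match in NEGATED_LIST.finditer(note):
        # Split the list on commas and on the conjunctions 'and' / 'or'.
        items = re.split(r",|\band\b|\bor\b", match.group(1), flags=re.IGNORECASE)
        for item in items:
            for word in item.lower().split():
                if word in TERM_TO_FACTOR:
                    found.add(TERM_TO_FACTOR[word])
    return found


def apply_override(predictions: dict[str, int], note: str) -> dict[str, int]:
    """Overwrite classifier labels with 0 (no use) for every negated factor."""
    updated = dict(predictions)
    for factor in negated_factors(note):
        updated[factor] = 0
    return updated


note = "Denies smoking, drinking, drugs."
print(apply_override({"tobacco": 2, "alcohol": 1}, note))
# prints: {'tobacco': 0, 'alcohol': 0}
```

Because the rule output replaces the classifier output whenever the two disagree, the negation cue reaches every item in the list and not only the word that directly follows it.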

Model building

Cleaned text from group A was used to generate and test the classification models. These notes were split 70/30 into validation and testing sets. We applied four different common NLP text classification models to the testing sets (via scikit-learn): multinomial naïve Bayes, support vector machine, logistic regression and random forest. Default parameters and a bag-of-words approach were used. The best-performing model by accuracy was then chosen and applied to the larger corpus, group B, with notes from patients in group A removed, to avoid overfitting and classification bias. This process was performed for housing, tobacco use and alcohol use.
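A comparison of this kind might look like the following sketch with scikit-learn. The notes and labels are invented for illustration, `compare_models` is a hypothetical helper, and the choice of `SVC` for the support vector machine is an assumption, since the text does not say which implementation was used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy notes (0 = housing stable, 1 = housing unstable).
notes = [
    "lives with wife in own home",
    "housed in apartment with family",
    "lives alone in house",
    "stable housing with parents",
    "rents apartment with roommate",
    "currently homeless staying at shelter",
    "living in car no stable housing",
    "homeless sleeps outside",
    "staying at shelter downtown",
    "evicted last month now homeless",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def compare_models(notes, labels, seed=0):
    """Fit four default classifiers on bag-of-words counts; return test accuracy."""
    x_fit, x_test, y_fit, y_test = train_test_split(
        notes, labels, test_size=0.3, random_state=seed, stratify=labels
    )
    models = {
        "multinomial naive Bayes": MultinomialNB(),
        "support vector machine": SVC(),
        "logistic regression": LogisticRegression(),
        "random forest": RandomForestClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        # CountVectorizer with default settings yields unigram bag-of-words counts.
        pipeline = make_pipeline(CountVectorizer(), model)
        pipeline.fit(x_fit, y_fit)
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
    return scores


scores = compare_models(notes, labels)
best = max(scores, key=scores.get)  # model with the highest test accuracy
print(best, scores[best])
```

Passing `ngram_range=(1, 2)` to `CountVectorizer` would add bigram features to the same pipeline. With a toy corpus this small the accuracies carry no meaning; the sketch only shows the shape of the comparison.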

Scoring generation

In order to create a simple method of identifying patients who are experiencing social instability, we created a scoring metric based on the classified notes. After applying the optimum model by accuracy to the entire corpus of extracted text notes, housing stability, tobacco use and alcohol use scores were generated. Patient identifiers were mapped by patient location and those who were not in the acute care setting during this timeframe were removed. Three different scoring approaches were used to describe these social factors: (1) predictions were averaged by patient encounter, then averaged by patient identifier, (2) predictions were averaged by year, then by patient identifier and (3) predictions were averaged by year, where each year then had a weight where the most recent year had the highest weight and the furthest year had the lowest weight (eg, predictions from 2019 were weighted by a factor of 5 and predictions from 2015 were weighted by a factor of 1). This scoring generation process was then repeated on our structured data for all three social factors and the results were compared and analysed. Structured data were also extracted for our list of patients in group B.
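The three scoring approaches can be sketched with the standard library alone. The row layout and function names below are hypothetical, and normalising the weighted score by the sum of the weights is an assumption; the text states only that more recent years receive higher weights.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (patient_id, encounter_id, year, predicted_label).
rows = [
    ("p1", "e1", 2015, 1),
    ("p1", "e1", 2015, 1),
    ("p1", "e2", 2019, 0),
    ("p1", "e3", 2019, 0),
]


def grouped_means(rows, key_index):
    """Average predictions per (patient, key); key is encounter (1) or year (2)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[key_index])].append(row[3])
    return {group: mean(values) for group, values in groups.items()}


def score_by_group(rows, key_index):
    """Approaches 1 and 2: average within each group, then across groups per patient."""
    per_patient = defaultdict(list)
    for (patient, _), value in grouped_means(rows, key_index).items():
        per_patient[patient].append(value)
    return {patient: mean(values) for patient, values in per_patient.items()}


def score_year_weighted(rows, first_year=2015):
    """Approach 3: yearly averages weighted 1 (first year) up to 5 (fifth year)."""
    totals = defaultdict(float)
    weights = defaultdict(float)
    for (patient, year), value in grouped_means(rows, 2).items():
        weight = year - first_year + 1
        totals[patient] += weight * value
        weights[patient] += weight
    # Dividing by the weight sum keeps the score on the same scale as the labels.
    return {patient: totals[patient] / weights[patient] for patient in totals}


print(score_by_group(rows, 1))    # by encounter: (1 + 0 + 0) / 3
print(score_by_group(rows, 2))    # by year: (1 + 0) / 2
print(score_year_weighted(rows))  # weighted: (1*1 + 5*0) / (1 + 5)
```

In this toy case the patient was labelled unstable in 2015 and stable in 2019, so the year-weighted score (about 0.17) falls well below the unweighted yearly score (0.5), reflecting the more recent notes.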

Patient and public involvement

No patients were involved. The retrospective exploration is a part of a larger study and was approved by the University of Washington Institutional Review Board #STUDY00006723. Patient data elements, including encounter identifiers, race, age and notes with SBDH, were extracted directly from the data warehouse and stored on encrypted computers and were not distributed or shared outside of the secured and closed environment. No patient identifiers or names were stored in this analysis.

Results

Characteristics of study subjects

Clinical notes (ED, admit, social work and ambulance) between 2015 and 2019 were extracted and included, forming group B. Notes from the first 200 patients were included in group A and notes from 147 457 patients were included in group B. During the same time frame, 61 767 patients were in acute care. After extraction and model prediction, the patient notes were cross-referenced with inpatient location and only notes from those who were in acute care were retained, for a total of 43 798 patients from 2015 to 2019. The patient demographics of this final subset were 63% (n=27 575) men, 37% (n=16 223) women, 88.2% (n=38 634) not Hispanic or Latino, 10.5% (n=4609) Hispanic or Latino and 1.3% (n=555) unknown or not answered. Further descriptive statistics are found in table 1.

Table 1

Population demographics

Data attributes

Table 2 illustrates the amount of data for each corresponding extraction level, specifically for housing status. We first started with extracting text from the ED and admit notes, forming group A, which consisted of 50 000 rows or text entries and covered 3200 unique patients, over a 1-year time frame. From there, we manually labelled housing stability concepts in a binary fashion, where 0 indicated housing stability and 1 indicated any level of housing instability, regardless of severity. As manual labelling can be a labour-intensive process, only the first 6000 text rows were reviewed, covering 218 unique patients. However, within these first 6000 rows, numerous notes did not contain text that alluded to housing status or were empty due to patient condition. Therefore, only 1785 out of the 6000 rows were labelled, covering 200 unique patients, where 995 (55.7%) were labelled as housing stable and 790 (44.3%) were labelled as housing unstable. We also found that 5.7% of the entries within this subset were duplicates or copy-forward entries. The same workflow was performed for labelling tobacco and alcohol use. However, only 1108 rows were labelled for tobacco use and 1220 rows for alcohol use, where in both cases, 0 indicated no use, 1 indicated rare/previous/occasional use and 2 indicated current use, regardless of degree. Tobacco use resulted in 446 (40.3%) labels for no use, 129 (11.6%) labels for rare/previous/occasional use and 533 (48.1%) labels for current use. Similarly, alcohol use resulted in 595 (48.8%) labels for no use, 185 (15.2%) labels for rare/previous/occasional use and 440 (36.1%) labels for current use.

Table 2

Extracted data amounts for housing status

Model performance

Four different common text classifiers, mentioned in the Methods section, were applied to the manually labelled group A data. The statistical metrics, including accuracy, precision and recall, are seen in tables 3 and 4. The accuracies between the classifiers and each classification technique for housing stability were overall fairly high ranging from 84.36% to 92.18%. The accuracies for tobacco and alcohol use were lower, ranging from 70.87% to 84.68% for tobacco use and 69.95% to 82.79% for alcohol use. Additionally, for each top performing model, the most influential words for text classification, for each social factor, are seen in table 5. The best-performing classification models were selected for each social factor and were used to apply the model to our entire corpus in group B.
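For reference, the three reported metrics can be computed from predicted and true labels as in the sketch below. The labels are invented and `classification_metrics` is a hypothetical helper; precision and recall are calculated for one chosen label, which is one of several conventions and may differ from how the tables were produced.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy over all labels, plus precision and recall for one label."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    predicted_pos = sum(p == positive for _, p in pairs)
    actual_pos = sum(t == positive for t, _ in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Precision: of the notes predicted positive, how many were truly positive.
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        # Recall: of the truly positive notes, how many were found.
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }


# Invented labels for illustration (0 = housing stable, 1 = housing unstable).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred, positive=1)
print(metrics)
# prints: {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```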

Table 3

Accuracies among text classifiers

Table 4

Best-performing classifier detailed metrics

Table 5

Word or phrase importance ranking

Scoring results and comparison

After classifying text for housing stability, tobacco use and alcohol use for patients in group B, we applied a scoring metric scheme, described in the Methods section. We generated three different scores that were calculated and weighted differently based on time. Our final score weighs more recent note entries and their resulting classification score higher than notes from previous years as social factors and their influence can change over time. Using the same process, we extracted and scored housing stability, tobacco use and alcohol use with structured data sources and compared the results with the unstructured process.

Housing stability

Using notes, we classified 839 patients as housing unstable, a score above 0.5, and 21 370 patients as housing stable, a score of 0.5 and below. In total, we classified 22 209 patients with this text classification workflow, which covered 50.71% of the acute care patients within the same timeframe. When compared with structured data sources, only 791 (1.81%) additional patients were found.

Tobacco use

We classified 4911 patients as currently using tobacco, regardless of amount or degree (1.5–2) using text notes. We classified 1480 patients as having rare/occasional/past use of tobacco (0.5–1.5), and 7139 patients as not using tobacco (0–0.5). In total, we classified 13 530 patients with this text classification workflow, which covered 30.9% of the acute care patients within the same timeframe. When compared with structured data sources, 17 935 (40.9%) additional patients were captured.

Alcohol use

We classified 2738 patients as currently using alcohol, regardless of amount or degree (1.5–2) using text notes. We classified 4050 patients as having rare/occasional/past use of alcohol (0.5–1.5), and 13 885 patients as not drinking alcohol (0–0.5). In total, we classified 20 673 patients with this text classification workflow, which covered 37% of the acute care patients within the same timeframe. When compared with structured data sources, no additional patients were found.
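The score ranges used in the three subsections above can be written as simple threshold functions. The function names are hypothetical, and because the text does not say which range owns the boundary values 0.5 and 1.5 for tobacco and alcohol, assigning them to the lower range is an assumption.

```python
def housing_category(score: float) -> str:
    """Above 0.5 is housing unstable; 0.5 and below is housing stable."""
    return "unstable" if score > 0.5 else "stable"


def use_category(score: float) -> str:
    """Map a 0-2 tobacco or alcohol score to a use category.

    Boundary values (0.5 and 1.5) are assigned to the lower range here,
    which is an assumption and not something the text specifies.
    """
    if score <= 0.5:
        return "no use"
    if score <= 1.5:
        return "rare/occasional/past use"
    return "current use"


def coverage(classified: int, cohort: int) -> float:
    """Percentage of the cohort that received a classification."""
    return round(100 * classified / cohort, 2)


# Housing: 839 unstable + 21 370 stable out of 43 798 acute care patients.
print(coverage(839 + 21370, 43798))
# prints: 50.71
print(housing_category(0.5), "|", use_category(1.6))
# prints: stable | current use
```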

Discussion

Our approach to a simple text classification method for various SDoH has shown positive results. The selected classification models were chosen as they were the most commonly used classification models when researching text classification techniques. Furthermore, these models were robust enough to curtail the need for more complex machine learning-based text classification methods, which may be harder to interpret in the clinical space as the weights and decisions can be obfuscated by the black box nature of these more complex classification methods. In general, linear models are fast to train, can work well with sparse data and offer interpretability.31 Additionally, recent research has also suggested that more complex machine learning approaches may not yield statistically significant improvements in predictive power to justify the time and effort necessary to implement and test these more complex methods. Although promising, more advanced methods of NLP, such as convolutional neural networks (CNNs), may not offer enough improvement in accuracy to justify trading away the transparency of rule-based approaches. In fact, Yao et al found that the F1 scores for a CNN implemented via TensorFlow did not improve significantly for features of interest when compared with logistic regression and support vector machine implementations.32 Finally, generalisable methods to create institution-specific models can be better for the healthcare system as a whole as each institution records clinical information with variances.

Although SBDH information and other social factors can be indicative of overall health, collection of SBDH heavily relies on clinical staff to screen and document SBDH. Furthermore, it also assumes that patients will respond accurately and truthfully. Various financial incentives from the federal level have propelled collection of social factors, such as tobacco use and tobacco cessation. However, other social factors, which can be equally as important, such as alcohol use, are not incentivised to be captured; rather only more severe instances are incentivised, such as alcohol dependence or alcohol addiction or disorder.11 33 Due to this discrepancy, we found that structured data sources were less reliable, and that text classification aided in detailing a patient more holistically.

Our text classification of unstructured data relied solely on ED, admit, social work and ambulatory notes as our parsing and extraction method could only work with notes in a certain format with the social history heading. Social factors and other social history could also be recorded in other locations but were not compatible with our approach. Furthermore, social work and ambulatory notes were used for housing status only and were only extracted if the notes contained a word or phrase related to housing instability. This approach was used as the notes were typically stored in a more unstructured format compared with the ED and admit notes; there were no section headers. The lack of section headers made the notes more difficult to extract, and the notes would often contain verbiage that would interfere with the simple text classification approach that we used. Therefore, we decided to extract notes that contained words relating to housing instability. Additionally, tobacco and alcohol use notes had stylistic and grammatical challenges. These social factors were often grouped together in incomplete triples (eg, ‘denies drinking, smoking, illicit drug use’). The classification algorithms often had trouble propagating the negative connotation to all components of the triple. Therefore, we used regex to specifically extract these triples and classify the note based on the presence of words related to tobacco or alcohol. Without this additional data cleaning or manipulation step, the negative sentiment in a list would not have been applied to all elements within the list, but rather only the first element. In our example of ‘denies smoking, drinking, drugs’, the negative sentiment of ‘denies’ would have only been applied to smoking as smoking immediately follows ‘denies’. However, with our additional concept extraction step, the negative sentiment of ‘denies’ is now also applied to ‘drinking’ and ‘drugs’. These results would then override the text classification algorithm, if there was a discrepancy. Therefore, the scoring metrics for these cases would not necessarily reflect the accuracy or performance of our scoring method.

It was interesting to find that tobacco use was recorded significantly more often in structured data sources compared with alcohol use and housing stability. However, because tobacco use is a Centers for Medicare and Medicaid Services (CMS) core quality measure, it can be expected that this feature is more available in structured form as patients are often asked about it directly on intake forms, screeners or during cessation treatment.11 Furthermore, the Joint Commission created the Tobacco Performance Measure Set, which comprises three standardised performance measures addressing tobacco screening and cessation counselling: (1) tobacco use screening of patients 18 years and over, (2) tobacco use treatment, including counselling and medication during hospitalisation and (3) tobacco use treatment management plan at discharge. CMS began using these performance measures in 2016.34 Because alcohol consumption is not a recommended CMS core quality measure for adults, the amount of data regarding alcohol use is not complete in structured form as it may not be consistently collected during intake procedures.

Past research has consistently pointed towards SBDH impacting patient health and outcomes. However, collection of SBDH can be a major limiting factor in the ability to model and integrate these data. There has not been a standardised collection process for SBDH data across the institution, whether it is recorded through notes or electronic forms. Additionally, many times, SBDH data may not be asked due to patient condition or it might not be updated regularly. Providers and healthcare institutions should strive to collect SBDH data more regularly even if the data fields are not empty as SBDH status can change. These intake procedures should be present and not optional; currently, only language preference must be completed due to translation laws in place. Additionally, educating patients to use patient portals and update information via these portals can provide more current SBDH information. However, we should note that vulnerable populations would most likely not be the primary audience to use this feature, and this is the subpopulation that arguably needs more attention.

Limitations

Our study has numerous limitations. There were two distinct areas in our workflow that required manual attention: (1) EHR review and (2) labelling of features. Manual EHR review was performed to ensure that the notes contained social history information in a consistent location prior to widespread text extraction. We initially validated this with a random set of 10 patients, but later expanded our validation to 25 patients. We felt that obtaining consistent results across the 25 patients indicated a high level of confidence. Manual labelling of features was time-consuming and taxing. Only one author performed the feature labelling; having multiple team members label the notes would likely provide better and more consistent classification.
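
If more than one team member labelled the same notes, agreement between labellers could be quantified before any model is trained. The following is a minimal sketch, not part of the study's workflow, that computes Cohen's kappa in plain Python. The function name `cohen_kappa` and the example labels are hypothetical and are not study data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers rating the same notes."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: share of notes given the same label by both labellers.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each labeller's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical housing labels for six notes (illustrative only, not study data).
labeller_1 = ["stable", "unstable", "stable", "stable", "unstable", "stable"]
labeller_2 = ["stable", "unstable", "stable", "unstable", "unstable", "stable"]
kappa = cohen_kappa(labeller_1, labeller_2)
print(round(kappa, 3))
```

A kappa near 1 would suggest the labelling scheme is applied consistently, whereas a low value would suggest the label definitions need tightening before the labels are used for training.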

Although we aimed to create a generalisable workflow, this approach is still limited by local customisations arising from unique nuances in note-taking language. Patients can withhold information about their social challenges, which makes text classification less reliable because the incoming data are incorrect. Our approach relies on the patient having been seen within the healthcare system at some point in the past 5 years. It would not be applicable to those who are new to the institution or those who are not immediately identifiable. Classification levels for unstructured notes are not concrete, as descriptive wording is also not concrete and can vary (eg, ‘patient was a former smoker’, ‘patient quit last week’, ‘patient is an occasional smoker’, etc). Structured data sources can add a more concrete sense to the classification. Copy-forward entries made up 5.7% of the entries present, as data collection of social factors may not always be appropriate (eg, patient is inebriated, in an altered mental state, etc). We did not incorporate outside ontologies, such as UMLS or MetaMap, as we were interested in creating a simple text classification approach that did not need to rely on outside entities. Furthermore, we believe that these ontologies would not have significantly improved our approach given the social factors (housing, alcohol, tobacco) that were investigated. Although they can be minimised, limitations and risks related to biased models, biased data and data privacy will always be present when applying NLP to clinical notes.35
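
The effect of variable wording on the two feature types used in this study (bag of words and bigrams) can be illustrated with a short sketch. This is a simplified, standard-library illustration under our own assumptions and is not the code used in the study; the helper names `tokenize` and `ngram_counts` are hypothetical, and the two notes are the example phrases quoted above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a note and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngram_counts(text, n):
    """Count contiguous n-grams (n=1 gives bag of words, n=2 gives bigrams)."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

notes = [
    "patient was a former smoker",
    "patient is an occasional smoker",
]
bow = [ngram_counts(t, 1) for t in notes]
bigrams = [ngram_counts(t, 2) for t in notes]

# Features the two notes have in common under each representation.
shared_words = set(bow[0]) & set(bow[1])
shared_bigrams = set(bigrams[0]) & set(bigrams[1])
print(sorted(shared_words), sorted(shared_bigrams))
```

In this toy case the bag-of-words features overlap on 'patient' and 'smoker', while the bigram features keep 'former smoker' and 'occasional smoker' distinct, which is one reason usage degrees can be hard to separate with single-word features alone.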

Community needs are constantly changing as the health of the community is not static. Currently, the King County CHNA has identified obesity, healthcare access, insurance status and drug use as other potential SBDH information to explore. These data types would be stored in different areas of the EHR and within different notes. It would be interesting to see whether the workflow presented here could be applied and generalised to meet the needs of other SBDH data. Although we aimed to create a simplified framework to extract SBDH data from clinical notes, more complex methods, such as convolutional neural networks and more advanced NLP techniques such as part-of-speech tagging, may be worth exploring as they may help improve the accuracy and precision of the classification. As more notes become available for patients, it will also be important to keep in mind the potential bias of having more notes present from sicker patients and to evaluate ways to reduce this bias.

We sourced data from only one medical centre. Patients might have had encounters or other visit types in neighbouring hospitals and healthcare systems in the region. The lack of data sharing between institutions prevents holistic collection of SBDH data. Data completeness is vitally important to the quality and accuracy of models that are dependent on big data. Poor data quality and completeness lead to lower utilisation, and the lack of data can potentially lead to mistakes in the decision-making process; additionally, since there is no single or standardised source for SBDH data, the diversity of data and complexity of the associated data structures increase the difficulty of, and create bottlenecks for, data integration.36 The lack of a standardised methodology to collect and store all SBDH data will limit the potential of this research field. Additionally, SBDH factors are constantly changing for patients as their behaviours can change depending on their circumstances. Being able to aggregate these data and create adaptable models is crucial as these features are never static. Furthermore, public health and outreach services fluctuate over time. Creating a method or using an application programming interface (API) to update the list of community shelters and other places for homeless services would be necessary to maintain an accurate understanding of a patient's housing status.

Conclusion

Our analysis shows that text classifiers are promising when applied to extracted clinical notes for housing stability, tobacco use and alcohol use status. Additionally, we found that structured data sources, such as diagnosis codes and intake surveys, vary and may not be the most holistic approach to understanding housing stability, tobacco use and alcohol use. Our simplified approach has shown that simple open-source text classifiers can be used to predict text sentiment for social determinants and can supplement current structured sources to provide a more complete social history for patients. Even with the limitations of our approach, we believe that this workflow can help inform clinicians and provide an easily implementable snapshot of patient social history.
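
How note-derived labels might supplement structured sources can be sketched as follows. This is an illustrative example under our own assumptions rather than the study's implementation: the record layout, the field names `structured` and `notes`, and the rule of preferring the structured value are all hypothetical, and the four records are made-up data.

```python
def combined_status(record):
    """Prefer the structured value; fall back to the note-derived label.

    Preferring structured data is an assumption made for this sketch.
    """
    structured = record["structured"]
    return structured if structured is not None else record["notes"]

def coverage(values):
    """Share of patients with a non-missing status."""
    values = list(values)
    return sum(v is not None for v in values) / len(values)

# Hypothetical tobacco-use records (illustrative only, not study data).
records = [
    {"structured": "current", "notes": "current"},
    {"structured": None, "notes": "former"},
    {"structured": None, "notes": None},
    {"structured": "never", "notes": None},
]

structured_cov = coverage(r["structured"] for r in records)
notes_cov = coverage(r["notes"] for r in records)
combined_cov = coverage(combined_status(r) for r in records)
print(structured_cov, notes_cov, combined_cov)
```

In this toy example each source alone covers half of the patients, while the combined view covers three quarters, which mirrors the idea of using classified notes to fill gaps left by structured fields.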

Data availability statement

No data are available. The data used cannot be shared due to patient privacy, confidentiality and US healthcare laws.

Ethics statements

Patient consent for publication

Ethics approval

This study does not involve human participants.

Acknowledgments

Sally Lee, Abdelhak Abdou, Marion Granich, David Carlbom

References

Footnotes

  • Correction notice This article has been corrected since it was first published. The areas redacted in the previous version have now been added.

  • Contributors AT performed the data extraction, tool building and analysis. AW provided guidance and verification when needed. AT is the guarantor.

  • Funding This work was supported by the U.S. Department of Health and Human Services, National Library of Medicine Training Grant T15LM007442.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.