Objective To investigate the distribution and content of quoted text within electronic health records (EHRs), using a previously developed natural language processing tool to generate a database of quotations.
Design χ2 tests and logistic regression were used to assess the profile of patients receiving mental healthcare for whom quotations exist. K-means clustering, using pre-trained word embeddings developed on general discharge summaries and on psychosis-specific mental health records, was used to group one-word quotations into semantically similar clusters, which were then labelled by human subjective judgement.
Setting EHRs from a large mental healthcare provider serving a geographic catchment area of 1.3 million residents in South London.
Participants For analysis of distribution, 33 499 individuals receiving mental healthcare on 30 June 2019 in South London and Maudsley. For analysis of content, 1587 unique lemmatised words, each appearing a minimum of 20 times in the database of quotations created on 16 January 2020.
Results The strongest individual indicator of quoted text was inpatient care in the preceding 12 months (OR 9.79, 95% CI 7.84 to 12.23). The next strongest indicator was ethnicity, with those from a black background more likely to have quoted text than those from a white background (OR 2.20, 95% CI 2.08 to 2.33). Both associations were attenuated slightly in the adjusted model. The early psychosis intervention word embeddings yielded subjectively labelled categories pertaining to: mental illness, verbs, negative sentiment, people/relationships, mixed sentiment, aggression/violence and negative connotation.
Conclusions The findings that inpatients and those from a black ethnic background more commonly have quoted text raise important questions around where clinical attention is focused and whether this may point to any systematic bias. Our study also shows that word embeddings trained on early psychosis intervention records are useful in categorising even small subsets of the clinical records represented by one-word quotations.
- mental health
- health informatics
Data availability statement
Data are available upon reasonable request. Data must remain within the SLaM firewall and any requests to access the data can be addressed to email@example.com.
This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 International (CC BY 4.0) licence, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.
Strengths and limitations of this study
- This study used a large sample (n=33 499) of all patients receiving mental healthcare on 30 June 2019 for logistic regression, comparing the characteristics of groups with and without a quotation.
- Pre-trained word embeddings, each derived from a large electronic mental health record corpus of around 23 million words, were used to categorise one-word quotations.
- 27% (9118) of records for those with quotations on 30 June 2019 contained variables with missing values and were therefore not included in the adjusted logistic regression analysis.
- Investigator subjective judgement was used to determine the category labels of clusters and, consequently, the optimum number of clusters.
Mental health electronic health records (EHRs) routinely capture a wealth of unstructured information detailing a patient’s clinical journey including symptoms, observed behaviours, contextual factors, assessments, interventions and outcomes within the free-text fields of case notes and correspondence.1 The majority of studies using this information have focused on these clinical constructs.2 However, the EHR is also a narrative account written from the perspective of healthcare professionals.3 Within this account, clinicians often add exact quotes from patient testimony and other parties, for example, as evidence for their diagnosis or other decisions.4 5 Quoted text in the EHR has the potential to give insight into the types of information recorded by clinicians and into the patient voice, although as secondary information filtered through the lens of the clinician, reflecting both the focus of the encounter and the individual clinician’s reporting style.6 7 This is of particular interest in two respects.
First, due to the lack of standardisation of clinical reporting,8 it is unknown to what extent there is coherence in clinician testimony as represented by quoted text and how this relates to outcomes for patients. In this respect, many previous studies, rather than examining quoted speech directly, have looked at instances of ‘referencing’, where the source of text is assigned to a third party using the ‘he/she says’ construct. In one study, there was a greater relative frequency of third-person pronoun use in a group of veterans who had died from suicide, in contrast to a comparison service-user group,9 and another study found an increase in referencing when doctors communicated negative news to patients.10 On the other hand, a previous study by our group found that the frequency of quoted text in the EHR did not change in the time period nearer a suicide attempt, indicating that clinicians did not change the frequency of directly reporting patient testimony even when patients’ symptoms may have worsened.7
Second, incorporating the patient voice in the EHR has become a growing area of interest11 12 with data from new studies indicating that inclusion of electronic patient-reported outcomes (ePROs) is associated with improved levels of patient care in areas such as cancer treatment.13 Given this context, the examination of the content of quoted text pre-existing in the EHR at least represents the beginnings of wider inclusion of the patient voice while ePROs are under development. However, little or no research has been carried out to date on such text.
As a precursor to understanding the content of quoted text, it is important to understand the patient populations from whom quotations are taken. It is currently unknown, for example, which patients are more likely to have quoted text in their EHR, or if there are variations between different demographic or diagnostic groups. Therefore, building on our previous work to ascertain quoted text at scale in the full EHR, the first objective of this study was to ascertain a fine-grain understanding of the distribution of quoted text within patients receiving secondary mental healthcare by undertaking analysis of frequencies by key demographic and clinical characteristics. Second, we sought to investigate the content of the quoted text itself. Again, to the best of our knowledge, the content of quoted text in EHRs is largely unknown, limiting conclusions that can be drawn. For example, it is unknown whether quotes predominantly relate to psychopathological terms that clinicians have been trained to note down, or whether they cover other indications of patient experience outside medical terminology.
Due to the large volume of data, we opted to approach the problem using natural language processing (NLP) methods and to apply k-means clustering, an unsupervised method, to the extracted text, with the aim of classifying the quotations. NLP has increasingly been used to extract clinically relevant information such as symptoms and medication from EHRs,14 and as part of the work to investigate quoted text frequencies in relation to self-harm, we had already developed an algorithm to identify and extract these text strings from a large mental healthcare EHR.7 One way of representing words is via word embeddings, where semantically similar words have similar numerical vector representations.15 These vectors are generated by applying machine learning models over specific text corpora. Previous studies on the same mental healthcare platform as our data have generated word embeddings to identify terminology around general symptoms of mental illness16 and, more specifically, psychosis.17 Focusing on one-word quotations in the first instance, as the most common form, we sought to address the deficiency of information on content and to investigate the extent to which these pre-trained embeddings might be used to classify the quotations identified from a mental healthcare EHR.
The South London and Maudsley (SLaM) National Health Service (NHS) Foundation Trust provides comprehensive, near-monopoly mental healthcare services to a geographic catchment of around 1.3 million residents in four boroughs of south London, as well as some national specialist services. The mental health records used in this study were assembled using SLaM’s clinical record interactive search (CRIS) platform, which currently accesses mental healthcare records for over 500 000 patients, rendering them de-identified and accessible for research use.18
The distribution of individuals with quoted text
An overview of the methodology is given in figure 1. The first objective of the study was to describe the distribution of those with quoted text at the database creation date, 16 January 2020, on which there were 365 555 total active patients in SLaM from whom quotations would potentially arise. The date of the first instance of quoted text for a patient was used as the index date to determine the variable values. Additionally, for comparison purposes, we extracted the same variables for all active SLaM patients on a single index date, 30 June 2019, and compared the people with and without quoted text to see if there were any differences.
All variables were derived from structured text fields at index date. The basic demographic variables were gender, ethnicity and age. For analytical purposes, we summarised ethnicity into five groups, as follows: white European (British, Irish, any other white background), black (African, white and black African, Caribbean, white and black Caribbean, any other black background), Asian (Bangladeshi, Indian, Pakistani, Chinese, any other Asian background, white and Asian), other (any other mixed background, any other ethnic group) and missing. Other variables included were: the Index of Multiple Deprivation (IMD) Score, most recent primary diagnosis, whether any inpatient care was received in the preceding 12 months (binary yes/no), the number of community face-to-face contacts in the preceding 12 months and the year of first patient referral to SLaM.
The IMD Score19 represents the socioeconomic status of neighbourhoods in the UK by combining various economic, social and housing indicators. It is based on 2011 national census data and was calculated from the patient’s most recent address at index date. The distribution of national percentile scores for given neighbourhoods has most commonly been categorised by tertiles in previous CRIS analyses20–22; the same categories were therefore applied here for consistency (0≤x≤20, 20<x≤30, 30<x≤93), with lower-scored groups indicating greater deprivation. The most recent primary diagnosis was determined from the International Classification of Diseases, 10th revision (ICD-10) code assigned to the patient at index date. In this analysis, the groups were represented by the first letter and/or digit of the ICD-10 code, resulting in the following categories: F0x, F1x, F2x, F3x, F4x, F5x, F6x, F7x, F8x, F9x, Zx, Any other letter x and Not recorded.
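As an illustration, the variable derivations described above might be implemented as in the following sketch. This is not the study code: the function names are hypothetical, and the handling of missing codes is an assumption.

```python
# Illustrative sketch (not the authors' code) of deriving the categorical
# variables described above. Tertile boundaries and the ICD-10 grouping
# rule follow the text; everything else is a hypothetical stand-in.

def imd_tertile(score):
    """Map a national IMD percentile score to the tertiles used in the paper
    (lower-scored groups indicate greater deprivation)."""
    if 0 <= score <= 20:
        return "0-20 (most deprived)"
    elif 20 < score <= 30:
        return "20-30"
    elif 30 < score <= 93:
        return "30-93 (least deprived)"
    return "missing"

def icd10_group(code):
    """Collapse an ICD-10 code to the broad categories used in the analysis,
    e.g. 'F20.0' -> 'F2x', 'Z63.0' -> 'Zx'."""
    if not code:
        return "Not recorded"
    code = code.upper()
    if code.startswith("F") and len(code) > 1 and code[1].isdigit():
        return f"F{code[1]}x"
    if code.startswith("Z"):
        return "Zx"
    return "Any other letter x"

print(imd_tertile(15))       # -> 0-20 (most deprived)
print(icd10_group("F20.0"))  # -> F2x
```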
The content of quoted text
The second objective of our study was to investigate the content of quoted text. Previous work by our team involved the development of an NLP tool to identify text occurring within quotation marks in the EHR, using regular expression matching, on a sample of patients hospitalised for a suicide attempt.7 The details of the application have been described previously,7 but, in summary, the algorithm yielded a precision of 0.92 and a recall of 0.93 against a manually annotated gold standard. As one-word quotations formed the largest proportion by word count, these 199 384 instances (27% of the total) were chosen as the focus of this particular investigation.
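A minimal sketch of the core idea, extracting text enclosed in quotation marks via a regular expression, is shown below. The published tool is considerably more sophisticated (for example, filtering html artefacts), so this pattern is illustrative only and would not reproduce its precision and recall.

```python
import re

# Illustrative sketch of quotation extraction by regular expression
# matching. Matches text between straight or curly double quotes; the
# actual published algorithm handles many more cases.
QUOTE_RE = re.compile(r'["\u201c]([^"\u201d]+)["\u201d]')

def extract_quotes(note: str):
    """Return all strings enclosed in double quotation marks."""
    return QUOTE_RE.findall(note)

note = 'Patient stated "I feel hyper" and reported \u201cvoices\u201d at night.'
print(extract_quotes(note))  # -> ['I feel hyper', 'voices']
```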
Data and statistical analysis were performed using standard Python (V.3.6.8) libraries. To analyse the distribution of individuals with quoted text, first the characteristics for all individuals with quoted text at the database creation date of 16 January 2020 were calculated. Where an individual had more than one instance of quoted text, the date of the first quotation was used for variable extraction.
Second, in the sample of patients receiving mental healthcare on 30 June 2019, χ2 tests were used to test the associations between each individual variable and the presence of quoted text. χ2 tests were calculated with missing values excluded and, for ordinal variables (age group, IMD, number of face-to-face community contacts, year of first SLaM referral), as a linear trend with one degree of freedom. Logistic regression was then used to analyse whether individual variables increased the likelihood of the presence of quotations and whether this was attenuated when all variables were incorporated in a single model. The logistic regression analyses were conducted using complete cases only.
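As an illustration of the unadjusted association tests described above, the sketch below computes a Pearson χ2 statistic and an unadjusted odds ratio with a Wald 95% CI from a 2×2 table of exposure (e.g. recent inpatient care) against the presence of quoted text. The counts are entirely made up, not the study data, and the study's actual software pipeline is not shown here.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald 95% CI (computed on the log scale)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: with-quote/without-quote for exposed and unexposed groups
print(chi2_2x2(90, 10, 400, 500))
print(odds_ratio_ci(90, 10, 400, 500))
```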
To analyse the content of quoted text, the one-word quotations (199 384) were first extracted from the quotations database, lower-cased and lemmatised (converted to their base forms) using the WordNetLemmatizer from the NLTK package.23 This resulted in 25 873 unique lemmatised words. Of these, the most frequent quotations, classified as those appearing at least 20 times, were compiled into a list for further analysis, giving 1587 words. These were then mapped into vector format via the word embeddings and used in the clustering algorithm. We used the pre-trained word embeddings generated by Viani and colleagues,17 trained specifically on CRIS records pertaining to (1) general discharge summaries (23.6 million words), reflecting all mental health disorder records (not restricted to psychosis), and (2) early psychosis intervention services (23.3 million words) across all mental health services. These were trained using a gensim24 implementation of Word2Vec with the Continuous Bag of Words model. We felt this was a suitable approach because these embeddings have been trained specifically on CRIS records, so they are more likely to reflect the semantic similarity of words in the context of our mental health platform. For example, words such as ‘hyper’ are used loosely in general terminology but have specific meanings in the clinical context.
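The lower-case/lemmatise/threshold step can be sketched as follows. To keep the example self-contained, a tiny lookup table stands in for NLTK's WordNetLemmatizer (which the study actually used), and the quote list is invented.

```python
from collections import Counter

# Sketch of the preprocessing described above: lower-case, lemmatise,
# then keep only lemmas appearing at least 20 times. STUB_LEMMAS is a
# hypothetical stand-in for nltk.stem.WordNetLemmatizer.
STUB_LEMMAS = {"voices": "voice", "feelings": "feeling"}

def lemmatise(word):
    word = word.lower()
    return STUB_LEMMAS.get(word, word)

def frequent_lemmas(one_word_quotes, min_count=20):
    """Return the sorted lemmas occurring at least `min_count` times."""
    counts = Counter(lemmatise(q) for q in one_word_quotes)
    return sorted(w for w, c in counts.items() if c >= min_count)

quotes = ["Voices"] * 15 + ["voices"] * 10 + ["scared"] * 5
print(frequent_lemmas(quotes))  # -> ['voice']
```

In the study itself, each surviving lemma would then be looked up in the pre-trained embedding model to obtain its vector for clustering.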
The idea behind the clustering methodology is to use an unsupervised algorithm to group together semantically similar words as represented by their vector forms. K-means was chosen as it offers high accuracy and speed on large datasets and provides good data segmentation.25 We used the scikit-learn implementation of k-means, which by default runs the algorithm ten times with different centroid seeds to minimise the impact of clusters forming around local minima. Several methods are available for selecting the optimum number of clusters. We initially sought to use the ‘elbow method’, plotting the inertia for 1–20 clusters.26 However, as no obvious ‘elbow’ was apparent, we opted for silhouette analysis, which examines the separation distance between k-means clusters.26 27 The higher the average silhouette coefficient (maximum value 1), the better separated the clusters and the better the representation of the data; in statistical terms, the optimum number of clusters is therefore the one with the highest average silhouette score over a range of candidate values. This was determined by plotting the average silhouette scores for 1–20 clusters and examining the highest values. Our objective was to classify the content of one-word quotations in a way that added meaning to each group; it was therefore necessary to use subjective judgement to assess whether the statistically optimum number of clusters provided distinct, meaningful groups. To assess the usefulness of the k-means clustering in assigning semantically similar words to the same group, the investigator examined the 20 most frequent words in each cluster and, using subjective judgement (the lead investigator’s initial decision followed by consensus in the research group), assigned a descriptive label to each cluster.
If at least one cluster could be given a homogeneous label, the process was complete and this number of clusters was deemed optimum. Otherwise, the number of clusters with the next highest average silhouette score was assessed in the same way, and so on, until the investigator could assess at least one of the clusters as a homogeneous group.
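The silhouette-based selection of k described above can be sketched as follows, using toy two-dimensional vectors in place of the 1587 word-embedding vectors. This is an illustrative reconstruction under those assumptions, not the study code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: three tight, well-separated blobs stand in for the
# word-embedding vectors of the one-word quotations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(30, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

# Fit k-means for a range of candidate k and record the average
# silhouette score for each; n_init=10 mirrors scikit-learn's default
# of re-running with different centroid seeds.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The statistically optimum k is the one with the highest average score;
# in the study this choice was then checked by subjective labelling.
best_k = max(scores, key=scores.get)
print(best_k)  # with three well-separated blobs, silhouette favours k = 3
```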
We did not directly incorporate patient and public involvement (PPI) into this particular study, but the SLaM Biomedical Research Centre (BRC) Case Register used in the analysis was developed with extensive PPI and is overseen by a committee that includes service-user representatives.
The previously developed tool7 to identify quoted text from a sample of CRIS records was run over the whole CRIS database, which contained 365 555 records on 16 January 2020. After removal of blank quotes and those representing ‘html’ tag data, 728 505 quotations were identified from CRIS, relating to 53 476 individuals. The quotations were further categorised in terms of word count. The mean number of words in a quotation was 9, median 3, SD 25 and range 1–309, indicating a wide distribution in the size of quotations with a positive skew. Details for the volumes of quotations by word length are described in table 1.
Characteristics of patients with at least one instance of quoted text are displayed in table 2. Characteristics of the total number of patients in receipt of SLaM care on the 30 June 2019 index date are further shown in table 3, alongside the subset of patients with quoted text. Quoted text was more common in male patients, in those from black compared with white ethnic groups and in those living in more deprived neighbourhoods. In terms of clinical variables, quoted text was more common in those with schizophrenia, schizotypal and delusional disorders (F2x), those receiving inpatient care and those with higher numbers of community face-to-face contacts in the preceding 12 months.
The unadjusted and adjusted logistic regression results are presented in table 4. The presence of any inpatient care in the preceding 12 months is the strongest individual indicator of quoted text, with those receiving such care nearly 10 times more likely to have quoted text than those without. In terms of ethnicity, black individuals are approximately twice as likely as white Europeans to have instances of quoted text, although this is attenuated by the presence of other variables in the adjusted model. In comparison to the reference category, F2x (schizophrenia, schizotypal and delusional disorders), other primary diagnoses are very unlikely to produce instances of quoted text. Additionally, gender, age group and IMD have very little effect on the presence of quoted text in the adjusted model.
The optimum number of clusters suggested by silhouette analysis (see figure 2) for the discharge summary word embeddings was 2. This yielded two clusters, which appeared to distinguish between a group referring to sentiment (negative and positive) and a miscellaneous group. As the investigator observed that both groups appeared to contain mixed rather than distinct categories, the next highest silhouette score was examined and this yielded four clusters, which are displayed in table 5. This shows that group 0 is miscellaneous with no obvious descriptive category label, while the other groups appear to contain words related to mental illness, sentiment and verbs. The optimum number of clusters using the early intervention word embeddings was 9 (see figure 3), as shown in table 6. These groups appeared to contain more clearly differentiable categories, relating to mental illness, verbs (two groups), negative sentiment, people/relationships, mixed sentiment, aggression/violence and negative connotation.
To the best of our knowledge, this is the first study to describe the distribution of quoted text and the content of one-word length quotations from a clinical record database, the size of which adds strength to our findings. In a sample of SLaM patients at census date 30 June 2019, those with any inpatient care in the preceding 12 months are most likely to have quoted text in the clinical record, even after adjusting for other variables. Ethnicity was the next most pertinent factor, with quoted text appearing more commonly for those with a black ethnic background, but this was attenuated slightly in the adjusted model. Individuals with schizophrenia, schizotypal and delusional disorders (F2x) were much more likely to have quoted text than those with any other primary diagnosis, although primary diagnosis in general had little effect in the adjusted model. This study also found that one-word quotations were clustered into more distinctive categories using the early intervention word embeddings in comparison to those generated from discharge summaries. This resulted in nine groups which could be subjectively labelled as follows: mental illness, verbs (two groups), negative sentiment, people/relationships, mixed sentiment, aggression/violence and negative connotation.
As described, the relevant contexts for this study are the increasing volume of data now being routinely collected in EHR resources, and the growing awareness of the potential utility for such data to support research and improved clinical practice and/or service configuration, alongside the fact that EHRs reflect primarily a clinician’s perspective and authorship. Although quotations in the text remain filtered by that perspective, they do at least provide the beginnings of a ‘patient voice’ in the EHR while systems for direct patient input to the health record are developed. Given the lack of information on quoted text, even basic information such as the frequency of its recording and the characteristics of patients and/or services/contexts in which it is recorded, we sought to compile some preliminary data on distribution and on the content of single-word quotations as the most common type observed. This drew on earlier published work to ascertain such quotations automatically and at scale across the EHR through NLP.7
Our finding that any inpatient care in the preceding 12 months is the strongest indicator of quoted text may be because those hospitalised receive more frequent clinical observation than outpatients, leading to a greater volume of clinical notes from which quotations may arise. Further, there may be a greater focus on quoting text as evidence for decision making and medical defence practice,4 5 given that inpatients are more likely to be suffering from the most severe mental health conditions. On a similar basis, a greater frequency of quoted text from those of black ethnicity may be explained by higher levels of psychosis being present in this group in comparison to white ethnic backgrounds28–30, with the consequence that these individuals may receive a greater clinical focus.
The finding that the early psychosis intervention word embeddings produce more distinct categories in the data makes sense in the context of around 44% of all quotations in CRIS arising from patients with a recent primary diagnosis of F1x—mental and behavioural disorders due to psychoactive substance use, F2x—schizophrenia, schizotypal and delusional disorders or F3x—mood (affective) disorders, as all these have psychosis as a possible symptom. Therefore, word embeddings trained on records most similar to those from which the quotations are derived are likely to produce the best results in the clustering process. It is interesting to note that aside from categories related to mental illness and sentiment, this study has uncovered other more unexpected areas where clinicians may quote their patients, in terms of aggression/violence, people/relationships and verbs, indicating an emphasis on the circumstances of a patient’s experience rather than purely symptomatology.
Strengths and limitations
This study has several strengths. First, it examines a novel area by focusing on the distribution and content of quotations within the EHR rather than the full record, giving insight into the information clinicians may quote beyond clinical terminology. Furthermore, the large sample sizes for the analyses of both distribution and content strengthen the findings. Additionally, the word embeddings used to represent the one-word quotations were trained on millions of words that are highly relevant, having been derived specifically from mental health records on the same platform.
The findings of our study need to be taken with several limitations in mind. One limitation of our study is that categories applied may be heterogeneous, for example, the ethnic groups selected. Another limitation is that 27% of the sample data were incomplete cases and therefore were not included in the adjusted logistic regression analysis. Another consideration is that data for logistic regression were examined at one point in time, so unknown confounders may be present in the data, such as previous service use for a different mental health disorder. In terms of investigating the content of quoted text, one key limitation is that the labelling of groups found by clustering is subjective and based on the assessment of the researcher. Another key limitation is that what is found in the text is dependent on what the clinician notes down; this will be subject to training and individual preferences and biases. Additionally, attribution of the speaker is not determined by the algorithm although the majority of quotations were from patients.7 Further, as we chose to investigate one-word quotations as a first step, the meaning derived from words in terms of clustering is limited without context. Therefore, further studies should examine longer strings of quotations to gauge a better understanding of content. Additionally, further studies could use contextual word-vector representations. Under this methodology, words are assigned vector representations based on the surrounding contextual words, to give a better idea of how a specific word is used in a particular context.
Despite limitations, this is an important study as the first of its kind to investigate the profile of patients and the areas of patient experience that are highlighted in quoted speech within the clinical record. The successful creation of a database across all CRIS to identify quoted speech is a first step in making this data available for future studies. The findings that inpatients and those from a black ethnic background more commonly have quoted text raise important questions around where clinical attention is focused and whether this may point to any systematic bias. Our study also shows that word embeddings trained on early psychosis intervention records are useful in categorising small subsets of the clinical records represented by one-word quotations.
Patient consent for publication
CRIS, as a data resource for secondary analysis, has IRB approval from the Oxford Research Ethics Committee C (reference 18/SC/0372).
Contributors RS and LJ conceived the study design with advice from SV. LJ wrote the paper and analysed the data. RS provided supervisory guidance. All authors provided critical input for the paper and approved the submission. RS is the author acting as guarantor for the study.
Funding LJ, SV and RS are part-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at the South London and Maudsley NHS Foundation Trust and King’s College London. RS is additionally funded by a Medical Research Council (MRC) Mental Health Data Pathfinder Award to King’s College London; an NIHR Senior Investigator Award; the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) at King’s College Hospital NHS Foundation Trust. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.
Competing interests RS has received research support in the last 5 years from Janssen, GSK and Takeda.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.