Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system
  1. Beata Fonferko-Shadrach1,
  2. Arron S Lacey1,2,
  3. Angus Roberts3,
  4. Ashley Akbari2,
  5. Simon Thompson2,
  6. David V Ford2,
  7. Ronan A Lyons2,
  8. Mark I Rees1,4,
  9. William Owen Pickrell1
  1. Neurology and Molecular Neuroscience Group, Institute of Life Science, Swansea University Medical School, Swansea University, Swansea, UK
  2. Health Data Research UK, Data Science Building, Swansea University Medical School, Swansea University, Swansea, UK
  3. Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  4. Faculty of Medicine and Health, University of Sydney, Sydney, Australia

  Correspondence to Dr William Owen Pickrell; w.o.pickrell@swansea.ac.uk

Abstract

Objective Routinely collected healthcare data are a powerful research resource but often lack detailed disease-specific information that is collected in clinical free text, for example, clinic letters. We aim to use natural language processing techniques to extract detailed clinical information from epilepsy clinic letters to enrich routinely collected data.

Design We used the general architecture for text engineering (GATE) framework to build an information extraction system, ExECT (extraction of epilepsy clinical text), combining rule-based and statistical techniques. We extracted nine categories of epilepsy information in addition to clinic date and date of birth across 200 clinic letters. We compared the results of our algorithm with a manual review of the letters by an epilepsy clinician.

Setting De-identified and pseudonymised epilepsy clinic letters from a Health Board serving half a million residents in Wales, UK.

Results We identified 1925 items of information with overall precision, recall and F1 score of 91.4%, 81.4% and 86.1%, respectively. Precision and recall for epilepsy-specific categories were: epilepsy diagnosis (88.1%, 89.0%), epilepsy type (89.8%, 79.8%), focal seizures (96.2%, 69.7%), generalised seizures (88.8%, 52.3%), seizure frequency (86.3%, 53.6%), medication (96.1%, 94.0%), CT (55.6%, 58.8%), MRI (82.4%, 68.8%) and electroencephalogram (81.5%, 75.3%).

Conclusions We have built an automated clinical text extraction system that can accurately extract epilepsy information from free text in clinic letters. This can enhance routinely collected data for research in the UK. The information extracted by ExECT, such as epilepsy type, seizure frequency and neurological investigation results, is often missing from routinely collected data. We propose that our algorithm can bridge this data gap, enabling further epilepsy research opportunities. While many of the rules in our pipeline were tailored to extract epilepsy-specific information, our methods can be applied to other diseases and can also be used in clinical practice to record patient information in a structured manner.

  • natural language processing
  • epilepsy
  • validation
  • information extraction

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 International (CC BY 4.0) licence, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

Strengths and limitations of this study

  • This study presents a novel method to automatically extract detailed, structured epilepsy information from unstructured clinic letters.

  • The method is based on open-source natural language processing technology.

  • The performance was validated using 200 previously unseen epilepsy and general neurology clinic letters.

  • The generalisability of the algorithm to population-level data and other diseases is limited at present but is possible with further work.

Introduction  

Epilepsy is a common neurological disease with significant co-morbidity. Although advances have been made in understanding the aetiology, treatment and co-morbidity of epilepsy, significant uncertainties still exist. Research using routinely collected data offers an opportunity to explore these uncertainties. Recent studies have shown, for example, an increased onset of psychiatric disorders and suicide before and after epilepsy diagnosis,1 no association between anti-epileptic drug use during pregnancy and stillbirth,2 and an increased risk of premature mortality in people with epilepsy.3

Epilepsy research using routinely collected data currently tends to use sources such as primary care health records or hospital discharge summaries. The main disadvantage of these sources is that they do not contain detailed epilepsy information, for example, epilepsy subtype/syndrome, epilepsy cause, seizure type or investigation results. This limits the quality and type of epilepsy research questions that can be answered successfully. Almost all patient encounters with hospital specialists in the UK are documented in clinic letters to primary care doctors, other healthcare professionals and patients. Clinic letters have been written electronically for decades and offer a wealth of disease-specific information to enhance routinely collected data for research. Although detailed disease (epilepsy) information is found in clinic letters, they are usually written in an unstructured or semi-structured format, making it difficult to automatically extract useful information.

Natural language processing (NLP) technology can be used to analyse human language and offers a potential solution for automated information extraction from unstructured letters.4 NLP is increasingly being used for healthcare information extraction applications; for example, to extract symptoms of severe mental illness and adverse drug events from psychiatric health records,5 6 to identify patients with non-epileptic seizures7 and for the early identification of patients with multiple sclerosis.8

In this project, our objective was to develop and validate an NLP application to extract detailed epilepsy information from unstructured clinic letters, with the primary aim of using this information to enhance epilepsy research using routinely collected data.

Materials and methods

Study population

We used manually de-identified and pseudonymised hospital clinic letters to build and test the algorithm. The letters were provided by the paediatric and neurology departments of a local general hospital. Members of the clinical team manually changed patient details, clinician details and any names occurring within the text before the letters were made available to researchers. We used 40 letters for training purposes to build rule sets, and a validation set of 200 letters to test the accuracy of the algorithm. The training set was randomly selected and included 24 adult (16 epilepsy, 8 general neurology) and 16 paediatric neurology letters. The validation set contained letters from various outpatient clinics (145 adult epilepsy, 37 paediatric epilepsy and 18 general neurology) from new patient and follow-up appointments, written by eight different clinicians.

Algorithm construction

We used the general architecture for text engineering (GATE) framework with its biomedical named entity linking pipeline (Bio-YODIE) (figure 1).9 We created an automated clinical text extraction system for epilepsy, ExECT (extraction of epilepsy clinical text), which used Bio-YODIE and our own customisations to map clinical terms to Unified Medical Language System (UMLS) concepts.10 The UMLS is a set of files and software, developed by the US National Library of Medicine, which combines information from over 200 health vocabularies with over 3.6 million concepts and 13.9 million unique concept names.11 UMLS uses concept unique identifiers (CUIs) to identify senses (or concepts) associated with words and terms.12 Bio-YODIE applies several strategies to assign the correct UMLS sense to terms in the text, and, where necessary, disambiguates against several possible meanings for the same term. These strategies include term frequency, patterns of co-occurrence with other terms and measures of context similarity.
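
To make the mapping step concrete, the following is a minimal, hypothetical sketch of dictionary-based concept linking with naive context disambiguation, written in Python for illustration. Bio-YODIE itself is a GATE/Java pipeline that uses the full UMLS tables and richer statistics; the candidate concepts and context cues below are illustrative placeholders, not real Bio-YODIE data.

```python
# A minimal, hypothetical sketch of term-to-concept linking with naive
# context disambiguation. The CUI names and cue words are illustrative
# placeholders, not real UMLS or Bio-YODIE data.
CANDIDATES = {
    "seizure": [
        ("CUI_EPILEPTIC_SEIZURE", "Epileptic seizure"),
        ("CUI_NON_EPILEPTIC_EVENT", "Non-epileptic event"),
    ],
}

CONTEXT_CUES = {
    "CUI_EPILEPTIC_SEIZURE": {"tonic", "clonic", "epilepsy", "eeg"},
    "CUI_NON_EPILEPTIC_EVENT": {"psychogenic", "dissociative"},
}

def link_term(term, context_tokens):
    """Pick the candidate concept whose context cues best match the sentence."""
    candidates = CANDIDATES.get(term.lower())
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: len(CONTEXT_CUES.get(c[0], set()) & context_tokens),
    )

print(link_term("seizure", {"tonic", "clonic", "epilepsy"}))
# -> ('CUI_EPILEPTIC_SEIZURE', 'Epileptic seizure')
```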

Figure 1

ExECT pipeline to extract clinic date, date of birth and nine categories of epilepsy information from clinic letters. We used the GATE architecture and modified versions of the Bio-YODIE, SLaM and ConText plugins with custom dictionaries and JAPE (Java Annotation Patterns Engine) rules. Bio-YODIE, biomedical named entity linking pipeline; POS Tagger, Part-Of-Speech Tagger; D.O.B, date of birth; ExECT, extraction of epilepsy clinical text; GATE, general architecture for text engineering; SLaM, South London and Maudsley.

We supplemented Bio-YODIE’s UMLS lookups with a set of custom gazetteers (native dictionaries used within GATE). We used some custom gazetteers to embed context into extracted UMLS concepts, that is, phrases to determine present, past or future tense, or terms to describe levels of certainty expressed in clinical opinion. For example, in the phrase ‘… could be consistent with simple partial seizures’, simple partial seizures are annotated with a certainty level indicated by the word ‘could’. We used five levels of certainty ranging from 1 (definitely not) to 5 (definitely) (see table 1 for more details). Variables with certainty levels 4 or 5 were considered to be positive findings and those with levels 1–3 to be negative findings. We used other gazetteers for specific vocabulary or colloquial terminology used by patients and clinicians when describing symptoms. Some were necessary to deal with the rigidity of the UMLS terminology, especially in relation to investigation findings, such as electroencephalogram (EEG) and MRI results. For example, the EEG abnormality indicated in the phrase ‘EEG with spike and wave activity’ would not be matched with a UMLS lookup, but our EEG results gazetteer would annotate ‘spike and wave’ as an abnormal EEG result, assigning it a UMLS CUI for an abnormal EEG outcome.
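
As a rough illustration of how a certainty gazetteer can qualify an extracted concept, the sketch below assigns one of the five certainty levels based on hedging phrases. The cue phrases and their level assignments are assumptions made for illustration, not the project’s actual gazetteer entries.

```python
import re

# Hypothetical hedging cues mapped onto the five-level certainty scheme
# (1 = definitely not ... 5 = definitely); real gazetteer entries differ.
CERTAINTY_CUES = [
    (re.compile(r"\bdefinitely not\b", re.I), 1),
    (re.compile(r"\b(unlikely|doubt)\b", re.I), 2),
    (re.compile(r"\b(could|may|might) be\b", re.I), 3),
    (re.compile(r"\b(probably|likely)\b", re.I), 4),
]

def certainty_level(sentence):
    """Return the certainty level implied by hedging phrases (default 5)."""
    for pattern, level in CERTAINTY_CUES:
        if pattern.search(sentence):
            return level
    return 5  # no hedging found: treat as definite

sentence = "... could be consistent with simple partial seizures"
level = certainty_level(sentence)
print(level, "positive finding" if level >= 4 else "negative finding")
```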

Table 1

Details on the categories of extracted information and criteria for manual review which were used as algorithm development guidelines

We used and customised the South London and Maudsley (SLaM) GATE application to extract prescription information.13 We deployed the ConText algorithm to detect negation of extracted terms, for example, ‘this person does not have epilepsy’, and to identify normal test results, such as in ‘The EEG was not abnormal’.14 Finally, we used the JAPE scripting language to define rules based on varying combinations of UMLS and custom lookups to extract eight broad categories of information (see table 1 for more information). In total, we created 46 separate gazetteers and over 89 JAPE rule files to annotate the variables of interest, establish context and remove certain annotations from the output.
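
A toy version of the negation step, using the two example sentences above, might look like the following. The real pipeline uses the ConText algorithm, so the trigger list and the simple ‘trigger appears before the term’ scope rule here are simplifying assumptions.

```python
import re

# Hypothetical negation triggers; ConText uses a curated trigger list
# and more careful scope rules than this sketch.
NEGATION_TRIGGERS = re.compile(r"\b(no|not|does not have|denies|never had)\b", re.I)

def is_negated(sentence, term):
    """Crude ConText-style check: a negation trigger appears before the term."""
    idx = sentence.lower().find(term.lower())
    if idx == -1:
        return False
    return bool(NEGATION_TRIGGERS.search(sentence[:idx]))

print(is_negated("this person does not have epilepsy", "epilepsy"))  # True
print(is_negated("The EEG was not abnormal", "abnormal"))            # True
```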

An illustration of the ExECT GATE pipeline used to extract all items of interest is shown in figure 1. The source code is available at https://github.com/arronlacey/ExECT. ExECT was built using GATE V.8.4.1.

Measuring performance

We ran ExECT on a validation set of 200 previously unseen, de-identified, clinic letters. We compared the items of information extracted by ExECT with those extracted by manual review. The review was performed by an epilepsy clinician (WOP) who was blinded to the algorithm results until the review was complete. We used pre-defined criteria for the manual review of information items (see table 1). The core research team (BF-S, ASL and WOP) reviewed every disagreement between the manual review and ExECT, and a consensus was obtained from the group on the correct annotation based on our pre-defined guidelines (see table 1). We measured performance on both a per item and a per letter basis.

The per item test compared every mention of an information item in each letter. Frequently, there were several items in a particular category. For example, a letter could contain the following phrases, all of which confirm an epilepsy diagnosis: ‘diagnosis: temporal lobe epilepsy’, ‘… frequent complex partial seizures consistent with a diagnosis of temporal lobe epilepsy…’ and ‘Given that X has temporal lobe epilepsy, the best treatment is…’

The main purpose of ExECT is to enrich routinely collected data sets with epilepsy information, and for this purpose a per letter score is potentially a more useful measure of its performance. For example, a letter may confirm temporal lobe epilepsy three times, but only one mention of temporal lobe epilepsy is required to correctly classify that person’s epilepsy. In this context, extracting only one mention of temporal lobe epilepsy is as useful as extracting all three. In the per letter test, we, therefore, aggregated multiple mentions within a category in each letter to a binary decision based on ExECT’s ability to extract at least one true positive mention. In the above example, if ExECT had correctly identified only one of the three mentions of temporal lobe epilepsy, we would have scored it as having a recall of 100% on a per letter basis but only 33% (1/3) on a per item basis. We used a similar approach with seizure frequency, clinic date and date of birth; multiple mentions were counted in the per item method, but only one true mention (in the absence of contradictory information) was required for a true positive result in the per letter method. For the medication annotation, in the per letter approach, only a full list of the drugs prescribed with their respective doses was considered to be a positive outcome.
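
The difference between the two schemes can be made concrete with a short sketch, assuming simple lists of gold-standard and extracted mentions for one letter; the function names are ours, for illustration only.

```python
from collections import Counter

def per_item_recall(gold, extracted):
    """Fraction of gold-standard mentions that were extracted (multiset match)."""
    if not gold:
        return 1.0
    extracted_counts = Counter(extracted)
    matched = sum(min(count, extracted_counts[mention])
                  for mention, count in Counter(gold).items())
    return matched / len(gold)

def per_letter_recall(gold, extracted):
    """1.0 if at least one gold-standard mention was extracted, else 0.0."""
    return 1.0 if any(mention in extracted for mention in gold) else 0.0

gold = ["temporal lobe epilepsy"] * 3   # three mentions in one letter
extracted = ["temporal lobe epilepsy"]  # only one mention extracted

print(per_item_recall(gold, extracted))    # 0.33 (1/3)
print(per_letter_recall(gold, extracted))  # 1.0
```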

Patient and public involvement

This research was carried out without specific patient or public involvement in the design or interpretation of results. Patients and members of the public did not contribute to the writing or editing of this manuscript.

Analysis and statistical tests

We used precision, recall and F1 score to measure the accuracy of ExECT. Precision is the proportion of instances extracted by the algorithm that are true, recall is the proportion of true instances that the algorithm extracted, and the F1 score is the unweighted harmonic mean of precision and recall: (2 × precision × recall)/(precision + recall).
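
Equivalently, as a small helper over true positive (TP), false positive (FP) and false negative (FN) counts (illustrative code, not part of ExECT):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# For example, 90 true positives, 10 false positives and 20 false negatives:
print(precision_recall_f1(90, 10, 20))  # (0.9, 0.818..., 0.857...)
```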

Results

We identified 1925 items in 11 categories across 200 letters. See table 2 for a summary of the performance of ExECT in identifying these items of information and table 3 for the evaluation of the results.

Table 2

Information extracted from 200 epilepsy clinic letters: number of items and number of letters with items extracted.

Table 3

The per item and per letter accuracy of ExECT when extracting epilepsy information from a validation set of 200 clinic letters, where an information item is defined as a single item in any category identified by the human annotator (see the Materials and methods section for more details)

Discussion

We developed an application capable of extracting a range of detailed epilepsy information from unstructured epilepsy and general neurology clinic letters, in order to enrich routinely collected data for research. ExECT reliably extracted epilepsy information from 200 clinic letters, written by different clinicians, with an overall precision, recall and F1 score of 91%, 81% and 86%, respectively, on a per item basis. ExECT performed best in extracting clinic date and date of birth (F1 scores of 98% and 99%), given that these fields consist of fixed-format dates, which are easier to extract. In terms of epilepsy-specific information, ExECT performed best for medication (F1=95%), epilepsy diagnosis (89%), epilepsy type (85%) and focal seizure types (81%). These items are frequently mentioned and presented in a relatively standard format; for example, medication is usually stated as drug name-strength-unit-frequency, and diagnosis appears at the top of letters in structured lists or in text with clear references to the patient.

ExECT was less accurate in identifying CT (F1=57%), MRI (75%) and EEG results (78%), seizure frequency (66%) and generalised seizure terms (66%). These items occasionally did not map completely to UMLS terms and had a more varied format in the clinic letters. For example, UMLS contains terms such as ‘EEG with irregular generalised spike and wave complexes’; however, there were often a variety of words between EEG and the associated result, for example, ‘EEG was found to show generalised spike and wave complexes’. Consequently, we created custom gazetteers for specific terms such as ‘spike and wave’ and wrote JAPE rules to associate these result terms with a nearby ‘EEG’ mention, improving the performance of our algorithm. While this approach allows for variations seen in our training set, previously unseen variations in the validation set could not be captured. Similarly, the reporting of seizure frequency is highly varied, for example, ‘she had five seizures since March last year’ or ‘one or two focal seizures every evening’.
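
A simplified example of the kind of pattern needed to bridge intervening words between ‘EEG’ and a result phrase is sketched below. The word window and result vocabulary are assumptions made for illustration; the actual system uses gazetteers combined with JAPE rules within GATE rather than standalone regular expressions.

```python
import re

# Illustrative pattern: 'EEG' followed within a few words by a result
# term from a (here, tiny) abnormality vocabulary.
EEG_RESULT = re.compile(
    r"\bEEG\b(?:\W+\w+){0,6}?\W+(spike and wave|sharp waves|slowing)",
    re.IGNORECASE,
)

for text in [
    "EEG with spike and wave activity",
    "EEG was found to show generalised spike and wave complexes",
]:
    match = EEG_RESULT.search(text)
    print(match.group(1) if match else None)  # 'spike and wave' for both
```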

We achieved higher scores for precision, recall and F1 score (95%, 87% and 91%, respectively) on a per letter basis. A lower recall rate for medication on a per letter basis was due to the scoring method, where only a complete list of all medications was considered a true positive result. For example, if one medication was missing from a list of four, this would be a negative result on a per letter basis. This led to an increase in false negative scores compared with the per item approach. We propose that a per letter measure for categories containing multiple mentions, such as confirmation of epilepsy, provides a practical way to summarise information from clinic letters. Additionally, a per person measure (results summarised over several letters) could be used to determine epilepsy status, as there will normally be several letters per person over a period of time.

Strengths

We used a gold standard data set of de-identified clinic letters to build and test ExECT, from which we accurately extracted novel epilepsy information for research. We can now iteratively develop ExECT over larger sets of clinic letters and use it to extract detailed epilepsy information for research on a population-level basis. We can also develop our algorithm for other diseases and potential clinical applications, for example, efficiently extracting relevant clinical information from historical letters to aid clinicians. Our system uses UMLS terminologies including the ability to map findings to CUI codes. This can be powerful in curating structured data sets that can be easily linked with high agreement to other coding systems, for example, SNOMED-CT.15

We used the open-source GATE framework, which provides useful built-in applications and user-developed NLP plugins such as Bio-YODIE and the SLaM medication application, to develop our algorithm. This undoubtedly made the process easier and quicker than other potential methods and provided a useful model for future similar information extraction applications.

Weaknesses

We used a relatively small number of letters sourced from one health board. Abertawe Bro Morgannwg University Health Board is responsible for planning and providing healthcare services to approximately half a million people in southwest Wales. This limited the number of writing styles and letter structures available to validate our algorithm, given that manually de-identifying letters was resource intensive. The generalisability of our algorithm may, therefore, be limited. However, we have made efforts to extract information from the main body of text within clinic letters rather than relying on the letter structure alone.

It is difficult to account for the variability of the language used to express patient information in clinic letters. Some items of information, such as seizure frequency and investigations, require many complex rules where patterns are hard to predict. Further work could focus on employing machine learning methods to complement a rule-based approach; however, this would require a significant amount of time to annotate the large number of documents required for such a task. All disagreements between ExECT and manual annotation were reviewed by the research team as a whole, but we only used one clinician to review the letters, which might have added bias to how the validation set was annotated.

Comparison with other studies

NLP is being increasingly used for clinical information extraction purposes.4 The i2b2 project used the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Health Information Text Extraction (HITEx) to extract the following phenotypes with positive predictive value (precision) and sensitivity (recall): Crohn’s disease (98%, 64%), ulcerative colitis (97%, 68%), multiple sclerosis (MS) (94%, 68%) and rheumatoid arthritis (89%, 56%).16 A recent study on patients with MS, identified from electronic healthcare records, used NLP techniques to extract MS-specific attributes with high positive predictive value and sensitivity, namely, Expanded Disability Status Scale (97%, 89%), Timed 25 Foot Walk (93%, 87%), MS subtype (92%, 74%) and age of onset (77%, 64%).17 Another study used the BioMedICUS NLP system on clinic letters (available at www.mtsamples.com) to determine whether sentences containing disease and procedure information were attributable to a family member, achieving an overall precision, recall and F1 score of 91%, 94% and 92%, respectively.18

To our knowledge, there are only a few published studies on clinical epilepsy information extraction systems. Cui et al developed the rule-based epilepsy data extraction and annotation (EpiDEA) system, which extracts epilepsy information from epilepsy monitoring unit discharge summaries. EpiDEA achieved an overall precision, recall and F1 score of 94%, 84% and 89%, respectively, when extracting EEG pattern, past medications and current medication from 104 discharge summaries from Cleveland, Ohio, USA.19 Cui et al also developed the rule-based phenotype extraction in epilepsy (PEEP) pipeline.20 PEEP extracted the epileptogenic zone, seizure semiology, lateralising sign, and interictal and ictal EEG pattern with an overall precision, recall and F1 score of 93%, 93% and 92%, respectively, in a validation set of 262 epilepsy monitoring unit discharge summaries from Cleveland, Ohio, USA. Sullivan et al used a machine learning-based NLP pipeline to identify a rare epilepsy syndrome from discharge summaries and EEG reports in Phoenix, Arizona, USA, and achieved a precision, recall and F1 score of 77%, 67% and 71%, respectively.21 The majority of these studies used discharge summaries, which are generally more structured than the clinic letters used in our study; unstructured clinic letters therefore present a greater challenge for NLP.

Conclusion

Using the GATE framework and existing applications, we have developed an automated clinical text extraction system, ExECT, which can accurately extract epilepsy information from free text in clinic letters. This can enhance routinely collected data for epilepsy research in the UK. The types of information extracted using our algorithm, such as epilepsy type, seizure frequency and neurological investigation results, are often missing from routinely collected data. We propose that our algorithm can be used to fill this data gap, enabling further epilepsy research opportunities. While many of the rules in our pipeline were tailored to extracting epilepsy-specific information, the methods employed could be generalised to other disease areas and used in clinical practice to record patient information in a structured manner.

Future work

We are developing ExECT to extract other epilepsy variables including age of seizure onset and co-morbidities. In addition, we aim to deploy ExECT in hospital information systems to enhance the availability of structured clinical data for clinicians.

References

  1.
  2.
  3.
  4.
  5.
  6.
  7.
  8.
  9.
  10.
  11.
  12.
  13.
  14.
  15.
  16.
  17.
  18.
  19.
  20.
  21.

Footnotes

  • BF-S and ASL contributed equally.

  • Contributors WOP, ASL and BF-S were responsible for study design. ASL and BF-S developed the platform with assistance from AR and WOP. WOP, ASL and BF-S validated the algorithm and drafted the initial manuscript. AA, ST, DVF, RAL and MIR provided senior support and supervision, secured the funding and research infrastructure for the project. All authors reviewed and critically revised the manuscript.

  • Funding We acknowledge the support from the Farr Institute @ CIPHER. The Farr Institute @ CIPHER is supported by a 10-funder consortium: Arthritis Research UK, the British Heart Foundation, Cancer Research UK, The Economic and Social Research Council, The Engineering and Physical Sciences Research Council, The Medical Research Council, The National Institute of Health Research, The Health and Care Research Wales (Welsh Assembly Government), The Chief Scientist Office (Scottish Government Health Directorates) and The Wellcome Trust (MRC Grant No: MR/K006525/1). We also acknowledge the support from the Brain Repair and Intracranial Neurotherapeutics (BRAIN) Unit which is funded by the Health and Care Research Wales. The work has also been supported by the Academy of Medical Sciences (Starter grant for medical sciences, WOP), The Wellcome Trust, The Medical Research Council, British Heart Foundation, Arthritis Research UK, and the Royal College of Physicians and Diabetes UK (SGL016\1069). AR was supported by the National Institute for Health Research (NIHR). This paper represents independent research partly funded by the NIHR Biomedical Research Centre at South London, the Maudsley NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

  • Competing interests WOP has received clinical research fellow salary support in the form of an unrestricted grant from UCB Pharma and has undertaken work commissioned by the biopharmaceutical company. We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

  • Ethics approval This research was conducted with anonymised and de-identified routinely collected clinic letters and therefore specific ethical approval was not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement We do not have a data sharing agreement for the data; however, we are exploring ways of obtaining patient consent and endeavour to produce a minimum dataset for cross-platform testing.

  • Author note MIR and WOP were equal senior authors.

  • Patient consent for publication Not required.