Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system
  1. Beata Fonferko-Shadrach1,
  2. Arron S Lacey1,2,
  3. Angus Roberts3,
  4. Ashley Akbari2,
  5. Simon Thompson2,
  6. David V Ford2,
  7. Ronan A Lyons2,
  8. Mark I Rees1,4,
  9. William Owen Pickrell1
  1. 1Neurology and Molecular Neuroscience Group, Institute of Life Science, Swansea University Medical School, Swansea University, Swansea, UK
  2. 2Health Data Research UK, Data Science Building, Swansea University Medical School, Swansea University, Swansea, UK
  3. 3Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  4. 4Faculty of Medicine and Health, University of Sydney, Sydney, Australia
  1. Correspondence to Dr William Owen Pickrell; w.o.pickrell{at}


Objective Routinely collected healthcare data are a powerful research resource but often lack detailed disease-specific information that is collected in clinical free text, for example, clinic letters. We aim to use natural language processing techniques to extract detailed clinical information from epilepsy clinic letters to enrich routinely collected data.

Design We used the general architecture for text engineering (GATE) framework to build an information extraction system, ExECT (extraction of epilepsy clinical text), combining rule-based and statistical techniques. We extracted nine categories of epilepsy information in addition to clinic date and date of birth across 200 clinic letters. We compared the results of our algorithm with a manual review of the letters by an epilepsy clinician.
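The comparison against manual review can be illustrated with a minimal sketch. It assumes each extracted item can be represented as a (letter, category, value) tuple and that an item counts as correct only when it exactly matches a clinician annotation. The function name, the tuple layout and the toy items are hypothetical and are not taken from ExECT or from the study data.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a gold standard.

    Both arguments are iterables of hashable items. An extracted item is a
    true positive only if the identical item is in the gold standard.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy items: (letter id, category, value).
system_items = {
    ("letter_01", "medication", "lamotrigine"),
    ("letter_01", "epilepsy_type", "focal"),
    ("letter_02", "medication", "levetiracetam"),
}
clinician_items = {
    ("letter_01", "medication", "lamotrigine"),
    ("letter_02", "medication", "levetiracetam"),
    ("letter_02", "investigation", "MRI normal"),
    ("letter_02", "seizure_frequency", "2 per month"),
}

p, r = precision_recall(system_items, clinician_items)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here two of the three system items match the clinician's annotations (precision 2/3), and two of the four clinician items were found (recall 1/2). Exact matching is a simplifying assumption; a real evaluation may accept partial or overlapping matches.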

Setting De-identified and pseudonymised epilepsy clinic letters from a Health Board serving half a million residents in Wales, UK.

Results We identified 1925 items of information with overall precision, recall and F1 score of 91.4%, 81.4% and 86.1%, respectively. Precision and recall for epilepsy-specific categories were: epilepsy diagnosis (88.1%, 89.0%), epilepsy type (89.8%, 79.8%), focal seizures (96.2%, 69.7%), generalised seizures (88.8%, 52.3%), seizure frequency (86.3%, 53.6%), medication (96.1%, 94.0%), CT (55.6%, 58.8%), MRI (82.4%, 68.8%) and electroencephalogram (81.5%, 75.3%).
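As a check on the reported figures, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below applies this standard definition to the overall precision and recall quoted above; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the Results.
overall_f1 = f1_score(0.914, 0.814)
print(f"{overall_f1:.1%}")  # prints 86.1%
```

The result agrees with the reported overall F1 score of 86.1%.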

Conclusions We have built an automated clinical text extraction system that can accurately extract epilepsy information from free text in clinic letters. This can enhance routinely collected data for research in the UK. Information extracted with ExECT, such as epilepsy type, seizure frequency and neurological investigations, is often missing from routinely collected data. We propose that our algorithm can bridge this data gap, enabling further epilepsy research opportunities. While many of the rules in our pipeline were tailored to extract epilepsy-specific information, our methods can be applied to other diseases and can also be used in clinical practice to record patient information in a structured manner.

  • natural language processing
  • epilepsy
  • validation
  • information extraction

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See:


  • BF-S and ASL contributed equally.

  • Contributors WOP, ASL and BF-S were responsible for study design. ASL and BF-S developed the platform with assistance from AR and WOP. WOP, ASL and BF-S validated the algorithm and drafted the initial manuscript. AA, ST, DVF, RAL and MIR provided senior support and supervision, secured the funding and research infrastructure for the project. All authors reviewed and critically revised the manuscript.

  • Funding We acknowledge the support from the Farr Institute @ CIPHER. The Farr Institute @ CIPHER is supported by a 10-funder consortium: Arthritis Research UK, the British Heart Foundation, Cancer Research UK, The Economic and Social Research Council, The Engineering and Physical Sciences Research Council, The Medical Research Council, The National Institute of Health Research, The Health and Care Research Wales (Welsh Assembly Government), The Chief Scientist Office (Scottish Government Health Directorates) and The Wellcome Trust (MRC Grant No: MR/K006525/1). We also acknowledge the support from the Brain Repair and Intracranial Neurotherapeutics (BRAIN) Unit which is funded by the Health and Care Research Wales. The work has also been supported by the Academy of Medical Sciences (Starter grant for medical sciences, WOP), The Wellcome Trust, The Medical Research Council, British Heart Foundation, Arthritis Research UK, and the Royal College of Physicians and Diabetes UK (SGL016\1069). AR was supported by the National Institute for Health Research (NIHR). This paper represents independent research partly funded by the NIHR Biomedical Research Centre at South London, the Maudsley NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

  • Competing interests WOP has received clinical research fellow salary support in the form of an unrestricted grant from UCB Pharma and has undertaken work commissioned by the biopharmaceutical company. We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

  • Ethics approval This research was conducted with anonymised and de-identified routinely collected clinic letters and therefore specific ethical approval was not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement We do not have a data sharing agreement for the data; however, we are exploring ways of obtaining patient consent and endeavour to produce a minimum dataset for cross-platform testing.

  • Author note MIR and WOP were equal senior authors.

  • Patient consent for publication Not required.
