Article Text

Download PDFPDF

Accessing primary care Big Data: the development of a software algorithm to explore the rich content of consultation records
  1. J MacRae1,
  2. B Darlow2,
  3. L McBain2,
  4. O Jones3,
  5. M Stubbe2,
  6. N Turner4,
  7. A Dowell2
  1. 1Patients First, Wellington, New Zealand
  2. 2Department of Primary Health Care and General Practice, University of Otago, Wellington, New Zealand
  3. 3Compass Health Wellington Trust, Wellington, New Zealand
  4. 4Department of General Practice and Primary Care, University of Auckland, Auckland, New Zealand
  1. Correspondence to Professor A Dowell; tony.dowell{at}


Objective To develop a natural language processing software inference algorithm to classify the content of primary care consultations using electronic health record Big Data and subsequently test the algorithm's ability to estimate the prevalence and burden of childhood respiratory illness in primary care.

Design Algorithm development and validation study. To classify consultations, the algorithm is designed to interrogate clinical narrative entered as free text, diagnostic (Read) codes created and medications prescribed on the day of the consultation.

Setting Thirty-six consenting primary care practices from a mixed urban and semirural region of New Zealand. Three independent sets of 1200 child consultation records were randomly extracted from a data set of all general practitioner consultations in participating practices between 1 January 2008–31 December 2013 for children under 18 years of age (n=754 242). Each consultation record within these sets was independently classified by two expert clinicians as respiratory or non-respiratory, and subclassified according to respiratory diagnostic categories to create three ‘gold standard’ sets of classified records. These three gold standard record sets were used to train, test and validate the algorithm.

Outcome measures Sensitivity, specificity, positive predictive value and F-measure were calculated to illustrate the algorithm's ability to replicate judgements of expert clinicians within the 1200 record gold standard validation set.

Results The algorithm was able to identify respiratory consultations in the 1200 record validation set with a sensitivity of 0.72 (95% CI 0.67 to 0.78) and a specificity of 0.95 (95% CI 0.93 to 0.98). The positive predictive value of algorithm respiratory classification was 0.93 (95% CI 0.89 to 0.97). The positive predictive value of the algorithm classifying consultations as being related to specific respiratory diagnostic categories ranged from 0.68 (95% CI 0.40 to 1.00; other respiratory conditions) to 0.91 (95% CI 0.79 to 1.00; throat infections).

Conclusions A software inference algorithm that uses primary care Big Data can accurately classify the content of clinical consultations. This algorithm will enable accurate estimation of the prevalence of childhood respiratory illness in primary care and resultant service utilisation. The methodology can also be applied to other areas of clinical care.


This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Statistics from

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.