Original research
Evaluation of freely available data profiling tools for health data research application: a functional evaluation review
  1. Ben Gordon (1),
  2. Clara Fennessy (1),
  3. Susheel Varma (1),
  4. Jake Barrett (1),
  5. Enez McCondochie (2),
  6. Trevor Heritage (2),
  7. Oenone Duroe (2),
  8. Richard Jeffery (2),
  9. Vishnu Rajamani (2),
  10. Kieran Earlam (3),
  11. Victor Banda (4),
  12. Neil Sebire (1)

Affiliations
  1. Central Team, Health Data Research UK, London, UK
  2. Inspirata Ltd, Tampa, Florida, USA
  3. Cystic Fibrosis Trust, London, UK
  4. Neonatal Data Analysis Unit, Imperial College London Neonatal Medicine Research Group, London, UK

Correspondence to Dr Neil Sebire; neil.sebire{at}hdruk.ac.uk

Abstract

Objectives To objectively evaluate freely available data profiling software tools using healthcare data.

Design Data profiling tools were evaluated for their capabilities using publicly available information and data sheets. Following this initial assessment, several tools underwent further detailed evaluation for application to healthcare data, using a synthetic dataset of 1000 patients and associated data structured according to a common health data model. Tools were scored on their functionality with this dataset.

Setting Improving the quality of healthcare data for research use is a priority. Profiling tools can assist by evaluating datasets across a range of quality dimensions. Several freely available software packages offer profiling capabilities, but healthcare organisations often have limited data engineering capability and expertise.

Participants 28 profiling tools, of which 8 underwent evaluation on a synthetic dataset of 1000 patients.

Results Of 28 potential profiling tools initially identified, 8 showed high potential for applicability with healthcare datasets based on available documentation, of which two performed consistently well for these purposes across multiple tasks including determination of completeness, consistency, uniqueness, validity, accuracy and provision of distribution metrics.
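To make the quality dimensions named above more concrete, the following is a minimal Python sketch of how completeness, uniqueness, validity and a simple distribution metric might be computed on tabular records. It is an illustration only and does not reproduce the method of any tool evaluated in the study. The records, the field names (`patient_id`, `sex`, `birth_date`), the allowed value set and the metric definitions are all assumptions made for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical synthetic records; field names and values are illustrative only.
RECORDS = [
    {"patient_id": "P001", "sex": "F", "birth_date": "1980-02-14"},
    {"patient_id": "P002", "sex": "M", "birth_date": None},
    {"patient_id": "P002", "sex": "X", "birth_date": "1975-13-01"},
    {"patient_id": "P004", "sex": "F", "birth_date": "1990-07-30"},
]

ALLOWED_SEX = {"F", "M"}  # assumed value set for this sketch


def present_values(rows, field):
    """Return the values of a field that are neither missing nor empty."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]


def completeness(rows, field):
    """Fraction of rows in which the field is populated."""
    return len(present_values(rows, field)) / len(rows)


def uniqueness(rows, field):
    """Fraction of populated values that occur exactly once."""
    values = present_values(rows, field)
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def validity(rows, field, check):
    """Fraction of populated values that pass the supplied check."""
    values = present_values(rows, field)
    if not values:
        return 0.0
    return sum(1 for v in values if check(v)) / len(values)


def is_iso_date(value):
    """True if the value parses as a YYYY-MM-DD calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def distribution(rows, field):
    """Frequency count of populated values for a field."""
    return Counter(present_values(rows, field))


if __name__ == "__main__":
    print("completeness(birth_date):", completeness(RECORDS, "birth_date"))
    print("uniqueness(patient_id):", uniqueness(RECORDS, "patient_id"))
    print("validity(sex):", validity(RECORDS, "sex", ALLOWED_SEX.__contains__))
    print("validity(birth_date):", validity(RECORDS, "birth_date", is_iso_date))
    print("distribution(sex):", dict(distribution(RECORDS, "sex")))
```

Dedicated profiling tools typically add cross-field consistency checks, reference data lookups and reporting on top of column-level measures like these.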

Conclusions Numerous freely available profiling tools are serviceable for potential use with health datasets, of which at least two demonstrated high performance across a range of technical data quality dimensions when tested with a synthetic health dataset and a common data model. The appropriate choice of tool depends on factors including the underlying organisational infrastructure and the level of data engineering and coding expertise, but freely available tools exist that can help profile health datasets for research use and inform curation activity.

  • information management
  • health informatics
  • information technology

Data availability statement

Data are available on reasonable request. Links for all tools tested, as well as documentation from the review are available here: https://github.com/HDRUK/data-utility-tools.

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors BG, SV and NS conceived the study. EM, TH, OD, RJ and VR developed the methodology further, evaluated the tools and provided the initial results. KE and VB tested the tools on their own datasets and provided feedback on results. NS, BG, CF and JB prepared and drafted the manuscript. The guarantor of the content is NS.

  • Funding This work was supported by Medical Research Council capital funding (August 2019). There is no grant number associated with capital fund awards.

  • Competing interests EM, TH, OD, RJ and VR were employed by Inspirata at the time of the work but were contracted by HDR UK to carry out this work independently on behalf of HDR UK.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.