Article Text

Original research
Can synthetic data be a proxy for real clinical trial data? A validation study
  1. Zahra Azizi1,
  2. Chaoyi Zheng2,
  3. Lucy Mosquera2,
  4. Louise Pilote3,4,
  5. Khaled El Emam5,6
  6. GOING-FWD Collaborators
    1. 1Center for Outcomes Research and Evaluation, Faculty of Medicine, McGill University, Montreal, Québec, Canada
    2. 2Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
    3. 3Medicine, McGill University, Montreal, Québec, Canada
    4. 4Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, Québec, Canada
    5. 5Electronic Health Information Laboratory, Children’s Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
    6. 6School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
    1. Correspondence to Dr Khaled El Emam; kelemam{at}ehealthinformation.ca

    Abstract

    Objectives There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.

    Setting Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method.

    Participants There were 1543 patients in the control arm that were included in our analysis.

    Primary and secondary outcome measures Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets.

    Results Analysis results were similar between the real and synthetic datasets. The univariate distributions were within 1% of difference on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1).

    Conclusions The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets.

    Trial registration number NCT00079274.

    • epidemiology
    • health informatics
    • information management
    • information technology
    • statistics & research methods

    Data availability statement

    Data may be obtained from a third party and are not publicly available. The dataset can be obtained by registering at Project Data Sphere (https://www.projectdatasphere.org/).

    http://creativecommons.org/licenses/by-nc/4.0/

    This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

    Statistics from Altmetric.com

    Request Permissions

    If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.

    Data availability statement

    Data may be obtained from a third party and are not publicly available. The dataset can be obtained by registering at Project Data Sphere (https://www.projectdatasphere.org/).

    View Full Text

    Supplementary materials

    • Supplementary Data

      This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

    • Supplementary Data

      This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

    Footnotes

    • Twitter @kelemam

    • Collaborators GOING‑FWD Collaborators

      Co-Principal Investigators

      Louise Pilote; McGill University Health Center and McGill University, Canada Colleen M. Norris; University of Alberta, Canada Valeria Raparelli; University of Ferrara, Italy

      Site Principal Investigators

      Alexandra Kautzky-Willer; Medical University of Vienna, Austria Karolina Kublickiene; Karolinska Institutet, Sweden Maria Trinidad Herrero; Universidad de Murcia, Spain

      Co-Investigators

      Karin Humphries; University of British Columbia, Canada Monica Parry; Lawrence S. Bloomberg Faculty of Nursing, Canada Ruth Sapir-Pichhadze; McGill University Health Center and McGill University, Canada Michal Abrahamowicz; McGill University Health Center and McGill University, Canada Khaled El Emam; University of Ottawa, Canada Simon Bacon; Concordia University, Canada Peter Klimek; Medical University of Vienna, Austria Jennifer Fishman, McGill University, Canada.

    • Contributors ZA and KEE contributed to designing the study, performing the analysis and writing the paper. LM contributed to designing the study, implemented some of the code used to perform the analysis and writing the paper. CZ contributed to designing the study, data analysis and writing the paper, and implemented some of the code used to perform the analysis. LP contributed to designing the study and writing the paper.

    • Funding This work was partially funded by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada (grant number RGPIN-2016-06781), by Replica Analytics Ltd and by the GOING FWD Consortium. The GOING-FWD Consortium is funded by the GENDER-NET Plus ERA-NET Initiative (project reference number GNP-78). ZA is funded by the Canadian Institutes of Health Research (GNP-161904).

    • Competing interests This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the Children’s Hospital of Eastern Ontario Research Institute. KEE is cofounder and has equity in this company. LM and CZ are data scientists employed by Replica Analytics Ltd.

    • Provenance and peer review Not commissioned; externally peer reviewed.

    • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.