Protocol
Developing an ethical framework-guided instrument for assessing bias in EHR-based Big Data studies: a research protocol
  1. Shan Qiao (1),
  2. George Khushf (2),
  3. Xiaoming Li (3),
  4. Jiajia Zhang (4),
  5. Bankole Olatosi (5)
  1. Health Promotion Education and Behavior, University of South Carolina, Columbia, South Carolina, USA
  2. Department of Philosophy, University of South Carolina, Columbia, South Carolina, USA
  3. Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA
  4. Department of Epidemiology and Biostatistics, Arnold School of Public Health, South Carolina College of Pharmacy - University of South Carolina Campus, Columbia, South Carolina, USA
  5. Health Services, Policy and Management, University of South Carolina Arnold School of Public Health, Columbia, South Carolina, USA
  1. Correspondence to Dr Bankole Olatosi; olatosi{at}mailbox.sc.edu

Abstract

Introduction The emergence of Big Data health research has exponentially advanced the fields of medicine and public health but has also faced many ethical challenges. One of the most worrying but still under-researched of these ethical issues is the risk of potential biases in data sets (eg, electronic health records (EHR) data) as well as in the data curation and acquisition cycles. This study aims to develop, refine and pilot test an ethical framework-guided instrument for assessing bias in Big Data research using EHR data sets.

Methods and analysis Ethical analysis and instrument development (ie, the EHR bias assessment guideline) will be implemented through an iterative process composed of literature/policy review, content analysis and interdisciplinary dialogues and discussion. The ethical framework and EHR bias assessment guideline will be iteratively refined and integrated with preliminary summaries of results in a way that informs subsequent research. We will engage data curators, end-user researchers, healthcare workers and patient representatives throughout all iterative cycles using various formats including in-depth interviews of key stakeholders, panel discussions and charrette workshops. The developed EHR bias assessment guideline will be pilot tested in an existing National Institutes of Health (NIH) funded Big Data HIV project (R01AI164947).

Ethics and dissemination The study was approved by Institutional Review Boards at the University of South Carolina (Pro00122501). Informed consent will be provided by the participants in the in-depth interviews. Study findings will be shared with key stakeholders, presented at relevant workshops and academic conferences, and published in peer-reviewed journals.

  • MEDICAL ETHICS
  • HIV & AIDS
  • PUBLIC HEALTH

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Strengths and limitations of this study

  • This study will advance our understanding of bias and equity issues in Big Data research and develop an ethical framework and a guideline for assessing bias in electronic health records (EHR)-based Big Data studies.

  • This study will combine the perspectives of ethics and data science and will integrate evidence from the literature with qualitative data through an integrative process.

  • The developed ethical framework and EHR bias assessment guideline will be pilot-tested within an ongoing EHR project.

  • The in-depth interviews will be conducted among the key stakeholders in South Carolina, but this may not reflect the full range of insights from researchers and key stakeholders engaged in Big Data health research in other social contexts.

Introduction

The emergence of Big Data health research, characterised by very large electronic health records (EHR) data sets and computational technologies such as artificial intelligence (AI) and machine learning (ML),1 has exponentially advanced the fields of medicine and public health by enabling a better understanding of the social determinants of health; the discovery of novel treatments; and the mapping of the underlying mechanisms, markers and progression of disease.2–6 While widely used in diagnosis, clinical decision-making and personalised medicine, AI/ML, as a collection of data-driven technologies, has raised a novel set of ethical challenges, including respect for patient autonomy; adequate consent; identifiability and privacy protection; and data ownership, sharing and reuse.7–10 In alignment with the FAIR principles for scientific data (ie, Findable, Accessible, Interoperable, Reusable),11 the usage of EHR data in biomedical and behavioural research should be guided by a sound ethical framework with steps taken to minimise unintended harm that could result from Big Data health research.12–14

Current policy and ethical guidelines for Big Data research, however, lag behind the technological progress being made in the healthcare field.15–17 One of the ethical challenges encountered by Big Data research using EHR data is how to assess potential biases in its data curation, acquisition, linkage and integration.18–21 For example, when data are predominately obtained from a single group, based on race/ethnicity, country of origin or socioeconomic status, the research can help over-represented populations, while not benefitting, and even potentially harming under-represented populations (group harm).22–24 In addition to unrepresentative data, Big Data research may challenge equity through AI/ML algorithms trained using biased data (eg, data with a large number of missing/incomplete records).25 26 The biased results can perpetuate existing health disparities and may even automate structural discrimination resulting in group harm.27–29

Despite increased concerns about potential biases in EHR data and data acquisition processes, very little Big Data research using EHR data reports biases in data or in data acquisition and/or mining as an indicator of research quality.30 This limitation largely results from several knowledge gaps: (1) lack of an ethical framework as a theoretical ground for studying bias in EHR data and/or the necessary mitigation strategies; (2) lack of standardised measurement instruments or guidelines to assess the extent to which biases intentionally and/or accidentally emerge from the multiple steps of the Big Data curation cycle; and (3) lack of effective interdisciplinary collaboration that engages ethics experts, professional data curators, data management experts, data repository administrators, healthcare workers and state agencies in discussions addressing this ethical challenge.

To address the existing knowledge gaps in the ethical development of Big Data health research, the main goal of our study is to develop an ethical framework-guided instrument for assessing biases in EHR data with the following aims: (1) To develop an ethical framework for unbiased and inclusive Big Data research which will guide the development of an instrument in this study as well as future work in developing ethical principles and standards for Big Data health research; (2) to create and modify the instrument (ie, EHR bias assessment guideline) to assess potential biases in EHR studies; and (3) to pilot test the EHR bias assessment guideline for its applicability in an ongoing NIH-funded Big Data HIV project (R01AI164947; see online supplemental appendix 1 for a brief project description).

Methods and analysis

Conceptual framework of the study

The blueprint of our study is summarised in a conceptual framework (figure 1). Ethical analysis and EHR bias assessment guideline development will be implemented through an iterative process composed of literature/policy review, conceptual analysis and interdisciplinary dialogues and discussion. The ethical framework and EHR bias assessment guideline will be iteratively refined and integrated with preliminary summaries of results in a way that informs subsequent research. We will engage data curators, end-user researchers, healthcare workers and patient representatives throughout all iterative cycles using various formats including in-depth interviews of key stakeholders, panel discussions and charrette workshops. An initial conceptual analysis of bias issues in Big Data research, based on literature/policy review, will inform the development of the ethical framework and EHR bias assessment guideline. The rich evidence on realistic constraints and potential actions, drawn from in-depth interviews, interdisciplinary dialogues and community charrettes with diverse key stakeholders, will form the pragmatic ‘reality’ stratum. The dialogues and integration across multiple iterative cycles will lead to refined versions of the ethical framework and the EHR bias assessment guideline. We will then pilot test and finalise this guideline in one ongoing Big Data project. The project is planned to be implemented from August 2022 to August 2023.

Figure 1

Conceptual framework of the study.

Literature/policy review

Literature/policy review will be conducted as the basis for developing the initial ethical framework and a metric tool. We plan to search at least six databases (PsycINFO, SocINDEX, PhilPapers, CINAHL, PubMed and Web of Science) using search terms such as ‘Big Data’, ‘data mining’, ‘algorithms’, ‘bias’, ‘ethic*’, ‘electronic *record’, ‘inclusive’, ‘equity’, ‘equality’ and ‘*justice’. We will also include common qualifiers for health inequalities, such as gender, race, ethnicity, socioeconomic status and stigma, to produce relevant search results because they are more specific when identifying sources of bias. These terms will be combined using Boolean logic. The inclusion criteria will be (1) papers published in English; (2) papers related to Big Data research; and (3) papers focused on ‘bias’ or ‘equity’ issues. That is, we will include all studies about biases in EHR studies using the Big Data approach, regardless of whether or not the studies explicitly investigate the relationship between bias and equity. To obtain a broader understanding of this ethical challenge related to Big Data research, no restrictions will be placed on the discipline of the papers or on the type of methodology. To examine current legal/ethical frameworks and principles used in guiding data-driven research, especially EHR-based studies, we will search and review relevant guidance, laws and regulations7 (see data sources in table 1).
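As an illustration of how the listed search terms might be combined using Boolean logic, the following is a minimal sketch. The grouping of terms into three concept blocks, the block names and the helper functions are assumptions made for illustration only; they are not the authors' actual search strategy, and wildcard and phrase syntax differs between databases.

```python
# Hypothetical grouping of the protocol's example search terms into concept
# blocks. The grouping itself is an assumption, not the authors' strategy.
CONCEPT_BLOCKS = {
    "big_data": ["Big Data", "data mining", "algorithms"],
    "records": ["electronic *record"],
    "ethics": ["bias", "ethic*", "inclusive", "equity", "equality", "*justice"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(blocks: dict[str, list[str]]) -> str:
    """OR the terms inside each block, then AND the blocks together."""
    groups = [
        "(" + " OR ".join(quote(term) for term in terms) + ")"
        for terms in blocks.values()
    ]
    return " AND ".join(groups)


query = build_query(CONCEPT_BLOCKS)
print(query)
```

The resulting string would still need to be adapted to each database's own syntax (for example, some databases do not support leading wildcards such as ‘*justice’).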

Table 1

Data resources of the policy review

All selected literature/documents will be reviewed using the thematic synthesis method with three steps: coding of text ‘line-by-line’, development of ‘descriptive themes’ and the generation of ‘analytical themes’.31 Research assistants with training in ethics and experience in Big Data research will conduct the literature/policy review under supervision. Any disagreement about paper screening and information synthesis will be resolved by team discussion and decided by the principal investigator (PI).

Conceptual analysis

Grounded in the results of the literature/policy review, we will conduct a conceptual analysis with the aim of clarifying the concepts of ‘bias’ and ‘equity’ associated with Big Data research so that we can develop an ethical framework to guide the identification and measurement of bias. Rodgers’s well-established method of conceptual analysis will be employed, which presents an inductive, dynamic view of a phenomenon.32 Data extracted from the relevant literature will be categorised as: (1) defining attributes (characteristics of the concept); (2) antecedents; and (3) consequences. Verbatim statements from each article will be tabulated. An inductive analysis of ‘bias’ and ‘equity’ will produce descriptive themes. We will identify which types of bias in EHR data curation, acquisition and processing are key issues from an ethical perspective. These biases will be a focus of the EHR bias assessment guideline to be developed.

In-depth interviews

Since key stakeholders may have diverse opinions and perspectives about the ‘unbiased’ Big Data study, we believe that in-depth interviews will be an appropriate approach to collecting qualitative data, which will offer opportunities for one-on-one, in-depth conversations with minimum influence of others on the interviewee.

The key stakeholders of the ongoing Big Data project include but are not limited to: (1) two key state partners: the South Carolina Department of Health and Environmental Control (SC DHEC) and the South Carolina Revenue and Fiscal Affairs Office (SC RFA), who have been actively involved in our research as partners since 2017 through multiple EHR-based Big Data research projects; and (2) a functional Stakeholder Advisory Board (SAB), which includes five to seven members representing the relevant stakeholders (eg, SC DHEC, HIV treatment and care physicians and people living with HIV/AIDS (PLWH)).

We will purposively recruit about 20 participants among the key stakeholders for the interviews, including data scientists and research staff (n=5), healthcare workers in HIV clinics (n=5), representatives of relevant state agencies (n=5) and patient representatives (ie, people living with HIV) (n=5) in SC. The research team will contact and recruit the participants based on recommendations by the SAB. Written consent will be obtained prior to the interviews. A semi-structured qualitative interview guide will be created by researchers who have extensive experience working with key stakeholders. The questions will be tailored for different types of participants but will generally focus on perceptions and understandings of bias and equity in the context of Big Data research, the criteria of an unbiased and quality Big Data study, potential causes of bias in EHR-based studies, the challenges and barriers to conducting unbiased Big Data studies and the possible solutions or policies for addressing these problems. Additional topics will be included as appropriate and as informed by the ethical framework. With appropriate consent, the interviews will be audio-recorded. Interviewers will take field notes during the interviews to serve as a complementary data source. Each interview will take 1 hour and will be led by trained interviewers in a private room or, on request, via online conferencing.

All interviews will be transcribed verbatim and entered into NVivo software by research staff. Preliminary coding will begin by reading and rereading five transcripts. A codebook will then be developed to include both deductive (ie, the themes drawn from the conceptual framework) and inductive (ie, the new themes emerging from the interviews) codes. Two research staff will independently code each of the transcripts using the codebook. Any coding disagreements will be discussed and resolved using a consensus model of team-based coding. Quote excerpts and coding memos will be developed according to themes. Representative and verbatim quotes will be selected to illustrate key findings.33 34
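The step of surfacing coding disagreements for consensus discussion can be sketched as follows. This is a minimal illustration only: the protocol uses NVivo for coding, and the segment identifiers, code names and the `find_disagreements` helper below are all hypothetical, invented to show the idea of comparing two coders' independent output.

```python
def find_disagreements(coder_a, coder_b):
    """Return segments whose code sets differ, with the codes unique to each coder.

    Each argument maps a transcript segment ID to the set of codes that one
    coder applied to that segment.
    """
    disagreements = {}
    for segment in sorted(set(coder_a) | set(coder_b)):
        codes_a = set(coder_a.get(segment, ()))
        codes_b = set(coder_b.get(segment, ()))
        if codes_a != codes_b:
            disagreements[segment] = {
                "only_a": sorted(codes_a - codes_b),
                "only_b": sorted(codes_b - codes_a),
            }
    return disagreements


# Toy data for illustration only; not drawn from any real transcript.
coder_a = {"T01-S1": {"data_access", "missing_data"}, "T01-S2": {"equity"}}
coder_b = {"T01-S1": {"data_access"}, "T01-S2": {"equity"}, "T01-S3": {"trust"}}

to_discuss = find_disagreements(coder_a, coder_b)
for segment, diff in to_discuss.items():
    print(segment, diff)
```

Segments on which both coders agree are omitted, so the output is a short agenda for the team discussion rather than a measure of agreement.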

Interdisciplinary dialogue through charrette workshops

Charrette workshops with scholars and key stakeholders will be organised to promote interdisciplinary dialogues and discussions about the framing of ethical issues and the development of the EHR bias assessment guideline.

As a community engagement strategy recommended by the National Minority AIDS Council, a charrette is a collaborative planning process that purposefully brings together the expertise of community and academic research partners to strengthen partnerships, engage stakeholders and make decisions regarding translational research.35 We will invite ethics experts, professional data curators, data management experts, data repository administrators, healthcare workers, representatives of relevant state agencies and PLWH through the SAB of the ongoing Big Data project. To obtain a wider healthcare perspective, we will also invite experts in other health conditions and/or from a broader academic network of Big Data health studies, leveraging the Big Data Health Science Center at our institution. We will assemble a panel of 10–15 experts for a 1-day workshop to discuss the draft EHR bias assessment guideline. Panellists will receive the guideline draft 2 weeks prior to the charrette and be asked to review and provide feedback on its content, structure and format. The charrette will be held in a University of South Carolina conference room or conducted on the Zoom platform using the ‘breakout discussion room’ function, depending on the logistics. The charrette will begin with a review of the charrette goals and an explanation of the procedures for the day. It will be highlighted at the very beginning of the workshop that there are no ‘right’ or ‘wrong’ answers and that principles of respect and openness during the dialogues create a safe and comfortable environment for discussion. Panellists will be divided into groups of 3–4. The research team will co-lead each of the small group discussions. Each group will discuss the same sets of questions that are based on the charrette objective (eg, feedback on the tool, strengths and weaknesses, additional content), and a co-leader will record the primary points on a discussion board.

After completing the small group discussions, the full group will reconvene and a representative from each group will present their findings; other members will ask questions for points of clarification, and additional information will be added to the discussion board if needed. The discussion notes will become the primary data source. Field notes, with observational and interpretive elements, will be taken during the charrette by two research staff. At the end of the charrette, the research team will engage in a process of critical reflection regarding the group and develop combined reflection notes based on these conversations.

Ethical framework and guideline development: an integrative process

The development of the ethical framework and EHR bias assessment guideline will be an integrative process informed by all the knowledge and qualitative evidence obtained through literature/policy review, in-depth interviews and charrette workshops. Specifically, the literature/policy reviews will advance our understanding of the landscape of ethical development and relevant topics and debates about using EHR data in healthcare research. The interdisciplinary communication and discussions among key stakeholders will further help us to identify the key types of biases in EHR-based studies that have ethical implications for core values such as ‘social bias’, ‘equity’ and ‘justice’. The initial ethical framework and EHR bias assessment guideline will be refined through multiple iterative cycles in which preliminary summaries of results (based on literature/policy review and qualitative studies) inform research in subsequent steps, until the research team believes that the refined versions of the ethical framework and the EHR bias assessment guideline comprehensively reflect and integrate key issues from both the ‘normative’ stratum and the ‘reality’ stratum.

Grounded in the reviews and the ethical framework, we will focus on the ‘ideal’, normative stratum of the EHR bias assessment guideline, that is, ‘what should be’. The rich qualitative evidence from key stakeholders, including front-line data curators, healthcare providers, end-user researchers and patients, can be used to build up the ‘reality’ stratum of the EHR bias assessment guideline, that is, ‘what realistic constraints’ and ‘what could be’, to ensure that this assessment guideline is applicable and reasonable in real-world practice. The lived experiences, reflections and lessons from data curators, management experts and repository administrators will assist us in criteria/standards selection and adaptation. The research team will actively participate in the discussions to integrate the two strata of the metric tool and ultimately develop an EHR bias assessment guideline that is informed by the ethical framework and also rooted in the realities of EHR data curation, acquisition, processing and usage in health fields.

Extant literature suggests that strategies for resolving potential biases in EHR studies are context dependent. It is not unusual for an approach that addresses or adjusts for one type of bias to cause another type of bias, owing to the complexities of the clinical data set and the healthcare system. Therefore, the EHR bias assessment guideline will not aim to provide comprehensive approaches or methods for resolving bias, or to cover all types of biases. Rather, it will be more like a checklist of potential key biases in EHR studies arising from data acquisition, data integration and data processing. It will focus on key bias issues from the ethical perspective (eg, related to equity), and will follow the format of several widely accepted instruments for assessing the quality of clinical trials.36–39 Scores will reflect the level of concern about how well the researchers identify and assess the ethically important biases in key steps of the data curation, acquisition and analysis cycle. In this way, the guideline will assist Big Data researchers in knowing, assessing, reporting and resolving potential biases in EHR data.
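To make the checklist idea concrete, the following is a minimal sketch of how item-level concern ratings might be recorded and summarised by stage. Everything specific here is an assumption for illustration: the three-level concern scale, the rule of summarising each stage by its highest concern, the item wording and the names `ChecklistItem` and `summarise` are hypothetical and do not represent the guideline the study will develop.

```python
from dataclasses import dataclass

# Assumed three-level concern scale; the actual scoring scheme is not yet defined.
CONCERN_ORDER = {"low": 0, "some": 1, "high": 2}


@dataclass
class ChecklistItem:
    stage: str     # e.g. "acquisition", "integration" or "processing"
    question: str  # hypothetical item wording
    concern: str   # one of the keys of CONCERN_ORDER


def summarise(items):
    """Report, for each stage, the highest concern level among its items."""
    summary = {}
    for item in items:
        if item.concern not in CONCERN_ORDER:
            raise ValueError(f"unknown concern level: {item.concern!r}")
        current = summary.get(item.stage)
        if current is None or CONCERN_ORDER[item.concern] > CONCERN_ORDER[current]:
            summary[item.stage] = item.concern
    return summary


# Hypothetical ratings for illustration only.
items = [
    ChecklistItem("acquisition", "Are under-represented groups described?", "some"),
    ChecklistItem("acquisition", "Is the extent of missing records reported?", "high"),
    ChecklistItem("integration", "Are linkage failures compared across groups?", "low"),
]
summary = summarise(items)
print(summary)
```

Summarising by the highest concern is only one possible design choice; a developed instrument could equally report every item or use a different aggregation rule.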

Pilot testing the EHR bias assessment guideline

The pilot test using the EHR data in the existing Big Data project will focus on (1) assessing whether the criteria/standards of the EHR bias assessment guideline are applicable and reasonable in the specific setting and scenario of the project; (2) adapting and refining the content and format of the EHR bias assessment guideline to ensure that this instrument is valid and reliable when used in a real-world practice setting and that the language is precise, accurate and easy to follow; and (3) identifying any additional criteria or items that are needed for the existing instrument based on the complex data curation of the project. The pilot test will be based on data sources in the existing project, including study protocols and all of the relevant documents (eg, data dictionaries, contracts with the SC RFA) of the ongoing Big Data project; meeting records of quarterly SAB meetings since the beginning of the project; minutes of research team meetings of the Big Data project; and preliminary data from the ongoing Big Data project if available. Through reviewing the research documents of the ongoing Big Data project, we will assess the potential bias in EHR data curation and processing using the EHR bias assessment guideline. We will discuss the findings with the research team and SAB to contextualise the findings (eg, the level of bias) in the real-world settings of acquiring and using EHR data in the ongoing Big Data project to address research questions on viral suppression. Finally, we will hold multiple group discussions with the research team and stakeholders of the ongoing Big Data project, key informants who participated in previous phases of this study (eg, in-depth interviews), as well as external experts to go through the EHR bias assessment guideline and findings and collect feedback and suggestions for refinement. We will use an iterative process with interactive strategies, whereby notes of discussions taken during each meeting will be triangulated with notes from the document reviews in finalising the EHR bias assessment guideline.

Patient and public involvement

Key stakeholders of the proposed study will be involved in the design, conduct and reporting of our research. We will actively reach out to patients and the public in the dissemination of our findings.

Ethics and dissemination

The study was approved by the Institutional Review Board at the University of South Carolina (Pro00122501). Informed consent will be provided by the participants in the in-depth interviews. Study findings will be shared with key stakeholders, presented at relevant workshops and academic conferences, and published in peer-reviewed journals.

Discussion

Our study has several strengths. First, the EHR bias assessment guideline will be informed by an ethical framework. An assessment guideline informed by an ethical framework will integrate ethical principles and technical realities and thus promote both ethical development and EHR data application in healthcare fields. The dialogues and communications between ethics and data science will increase the awareness of ethical challenges among key stakeholders. Second, we have an interdisciplinary team that includes ethics and data science experts, social scientists, state-wide data repository managers and HIV experts and clinicians with a proven history of working collaboratively on publications and grant applications. This team will be able to comprehensively understand the bias issue rather than isolate questions of bias ‘in’ data sets. Third, we will apply sequential use of multimethod data collection (literature review, in-depth interviews, community charrettes via a workshop) and analysis strategies in a participatory manner, which allows for unique mixed-methods findings. The engagement of diverse key stakeholders in the data collection will ensure that multiple voices from various communities (healthcare providers, healthcare agencies, government, patients and researchers) will be incorporated and given priority. Fourth, we will also invite external experts and key informants to take part in the EHR bias assessment guideline development and pilot testing to obtain a wider perspective of public health beyond the HIV-related project.

However, our study has several limitations. First, in-depth interviews will be conducted among the key stakeholders in South Carolina and this may not reflect the insights of researchers and key stakeholders engaged in Big Data health research in other social contexts. Second, although we try to broaden our perspectives by engaging experts and key stakeholders in various public health areas, the pilot test will be applied in an ongoing HIV project. Therefore, future studies may need to further refine and modify the ethical framework and EHR bias assessment guideline so they can be adapted to more disease conditions.

Despite these limitations, the outputs of our study will advance our understanding of bias and equity issues in Big Data research, develop an ethical framework and an EHR bias assessment guideline for assessing bias in EHR-based Big Data studies, and thus lead to and inform a more nuanced assessment and exploration of bias in practice for ethical development of Big Data health research beyond the existing Big Data project. The guideline can be reused as an assessment instrument to detect and quantify bias, which may contribute to improving awareness and exploration of this critical ethical challenge. The ethical framework may also provide insights and guidance for addressing bias issues in Big Data using other types of data beyond EHR.

Ethics statements

Patient consent for publication

Acknowledgments

The authors thank the SC Department of Health and Environmental Control, the SC Revenue and Fiscal Affairs Office and other SC agencies as key stakeholders of the proposed study. The project is funded by the National Institute of Allergy and Infectious Diseases (NIH/NIAID) under award number R01AI164947.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors SQ, XL, BO and GK conceptualised the study and SQ wrote the first draft. XL, JZ and GK participated in writing sections of the original proposal. All authors critically reviewed and edited the manuscript. BO completed the Institutional Review Board approvals. BO and JZ secured the funding.

  • Funding Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIH/NIAID) under award number R01AI164947.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.