Curating a knowledge base for individuals with coinfection of HIV and SARS-CoV-2: a study protocol of EHR-based data mining and clinical implementation
  1. Chen Liang1,2,
  2. Sharon Weissman2,3,
  3. Bankole Olatosi1,2,
  4. Eric G Poon4,
  5. Michael E Yarrington4,
  6. Xiaoming Li2,5
  1. 1Department of Health Services Policy and Management, University of South Carolina, Columbia, South Carolina, USA
  2. 2Big Data Health Science Center, University of South Carolina, Columbia, South Carolina, USA
  3. 3Department of Internal Medicine, University of South Carolina, Columbia, South Carolina, USA
  4. 4Department of Medicine, Duke University, Durham, North Carolina, USA
  5. 5Department of Health Promotion Education and Behavior, University of South Carolina, Columbia, South Carolina, USA
  1. Correspondence to Dr Chen Liang; cliang{at}


Introduction Despite a higher risk of severe COVID-19 disease in individuals with HIV, the interactions between SARS-CoV-2 and HIV infections remain unclear. To delineate these interactions, multicentre Electronic Health Records (EHR) hold promise to provide full-spectrum and longitudinal clinical data, demographics and sociobehavioural data at the individual level. Presently, a comprehensive EHR-based cohort for the HIV/SARS-CoV-2 coinfection has not been established; EHR integration and data mining methods tailored for studying the coinfection are urgently needed yet remain underdeveloped.

Methods and analysis The overarching goal of this exploratory/developmental study is to establish an EHR-based cohort for individuals with HIV/SARS-CoV-2 coinfection and perform large-scale EHR-based data mining to examine the interactions between HIV and SARS-CoV-2 infections and systematically identify and validate factors contributing to the severe clinical course of the coinfection. We will use a nationwide EHR database in the USA, namely, National COVID Cohort Collaborative (N3C). Ultimately, collected clinical evidence will be implemented and used to pilot test a clinical decision support prototype to assist providers in screening and referral of at-risk patients in real-world clinics.

Ethics and dissemination The study was approved by the institutional review boards at the University of South Carolina (Pro00121828) as a non-human subject study. Study findings will be presented at academic conferences and published in peer-reviewed journals. This study will disseminate urgently needed clinical evidence for guiding clinical practice for individuals with the coinfection at Prisma Health, a collaborating healthcare system.

  • COVID-19
  • HIV & AIDS
  • Health informatics

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This study will be among the first that systematically integrates HIV viral suppression status, antiretroviral therapy (ART) adherence, vaccination, sociobehavioural and social determinants of health with full-spectrum clinical characteristics for individuals with HIV/SARS-CoV-2 coinfection.

  • Our methods can explain the role of temporal dependency among patients’ underlying conditions, comorbidities, ART adherence, vaccine exposure and received therapeutics in individuals’ heterogeneous responses to the coinfection.

  • Our methods support real-time prediction of coinfected individuals’ clinical outcomes, disease progression, prognosis, and risk factors of adverse events.

  • The proposed methods are highly innovative in that they are designed to extract temporal sequences and temporal properties of every clinical event from Electronic Health Records and are fully capable of embedding the temporal data into machine learning models.

  • The study only includes pilot validity and usability testing of the proposed clinical decision support prototype due to limited time for an exploratory/developmental study.


The COVID-19 pandemic has cast a heavy burden on individuals with HIV infection. Based on data from 15 522 hospitalised patients with the coinfection of HIV and SARS-CoV-2 from 24 countries, a recent WHO report confirmed for the first time that HIV is a key risk factor for severe COVID-19.1 The severity of COVID-19 in individuals with HIV is correlated with certain comorbidities (eg, type 2 diabetes mellitus, cardiovascular diseases, obesity, chronic obstructive pulmonary diseases, chronic kidney diseases and some cancers), some of which are more prevalent in people living with HIV (PLWH). Low CD4+ T-cell count (eg, <200 cells/µL2 or <500 cells/µL3), unsuppressed viral load and prolonged antiretroviral therapy (ART) exposure are associated with a severe clinical course. These clinical facts are further complicated by disrupted HIV healthcare services (eg, access to HIV testing, ART and distribution of pre-exposure prophylaxis and post-exposure prophylaxis).4

Despite a generally high risk of severe COVID-19 clinical course in PLWH, the interactions between SARS-CoV-2 and HIV infections remain unclear. First, several contradictory findings suggested the predominant role of comorbidities in severity of COVID-19 regardless of HIV infection.5–8 Second, risk factors for the severe clinical course of the coinfection are undetermined because individuals with the same or similar severity level of COVID-19 show different clinical characteristics.4 Third, the role of ART adherence and HIV viral suppression status in the context of COVID-19 exposure is undetermined. These unsolved problems are attributable to several data and methodological gaps. For example, most existing studies are based on small-sample and single-centre cohorts. Temporal sequences and patterns of clinical events (eg, underlying conditions, comorbidities, diagnoses, ART-related visits and treatments) are understudied, which diminishes the opportunities for understanding the aetiology of multifaceted HIV-associated comorbidities, their natural history and their interactions with the current coinfection. Critical data components such as adherence to HIV treatment, viral suppression, social determinants of health (SDOH), COVID-19 vaccination and sociobehavioural patterns (eg, substance use/dependence) are closely related to disparities in HIV and SARS-CoV-2 infections but are understudied, in part due to the challenges in Electronic Health Records (EHR) data integration and phenotyping. EHR hold promise to provide full-spectrum and longitudinal clinical data, demographics and sociobehavioural data at the individual level. However, we currently do not have a comprehensive EHR-based cohort for individuals with HIV/SARS-CoV-2 coinfection; EHR integration and data mining tailored for studying the coinfection are urgently needed but are not yet developed.

The overarching goal of this exploratory/developmental study is to establish an EHR-based cohort for individuals with HIV/SARS-CoV-2 coinfection and perform large-scale EHR-based data mining to examine the interactions between HIV and SARS-CoV-2 infections and systematically identify and validate factors contributing to the severe clinical course of the coinfection. Ultimately, collected clinical evidence will be implemented and used to pilot test a clinical decision support (CDS) prototype to assist providers in screening and referral of at-risk patients in real-world clinics. We will approach this goal by pursuing the following tasks. First, we will extract comprehensive phenotypic traits (ie, clinical characteristics, demographics, sociobehavioural patterns) and their temporal series and patterns from structured and unstructured EHR in the National COVID Cohort Collaborative (N3C).9 To extract and model temporal series and patterns of phenotypic traits, we will incorporate biomedical ontologies to develop a graphical model of EHR. Second, we will examine patterns and sequences of phenotypic traits for their predictive ability in clinical outcomes and disease prognosis. Major phenotypic traits to be examined include demographics, underlying conditions, comorbidities, CD4+ counts, viral suppression, ART procedures and medications, laboratory results for immune components and viral presence, treatments (eg, procedures and medications), SDOH, and sociobehavioural patterns. We will develop machine learning models to explore real-time predictive associations between these phenotypic traits and poor clinical outcomes and prognosis, including outcomes of the acute phase of COVID-19 and postacute sequelae of SARS-CoV-2 infection (PASC).10 Third, we will develop and pilot test a CDS prototype that delivers collected clinical evidence to providers through the Epic EHR system at Prisma Health.
Predictive associations generated from the second task will be presented for providers to assist in screening patients at high risk of severe COVID-19 course. Outcomes to be measured include (1) the rate of identification and referral of individuals at high risk of poor clinical outcomes, (2) the rate of successful referral and clinical actions and (3) system usability. The proposed study protocol will result in (1) a comprehensive knowledge base that details risk factors of severe clinical outcomes and disease prognosis in individuals with HIV/SARS-CoV-2 coinfection and (2) a prototype CDS that can identify patients at high risk and provide actionable clinical decisions. This work will provide time-sensitive public health implications: clinical evidence for interactions between HIV and SARS-CoV-2 infections is desperately needed. This proposed EHR-based data mining offers a rapid and empirically grounded approach to collecting such evidence and to informing the design of prospective clinical trials that can focus on inflammatory pathways, biophysiological evidence of the coinfection and sociobehavioural determinants.

Methods and analysis

Data description

We will use EHR from N3C. As of August 2022, N3C has aggregated 15.2 million patients (5.8 million COVID-19 patients) from 50 states.11 The EHR are normalised by the Observational Health Data Sciences and Informatics (OHDSI)’s Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM).12 EHR variables (individual level) span demographics, encounters, medical/social history, diagnoses, procedures, medication prescriptions, medication exposure (eg, vaccines), laboratory tests/results, etc. Clinical notes have already been annotated in the CDM.13 N3C has epidemiological and community data (population level) including temporal COVID-19 burden, vaccination, viral variance, health systems data and geospatial data. N3C has prepandemic data since 2018 and peripandemic data to the present. All data are deidentified and updated every 2 weeks. We have the highest level of data access, which allows using patients’ residential ZIP codes and dates of clinical events. As of August 2021, N3C has at least 13 000 adults who have been diagnosed with HIV and/or have had laboratory-confirmed HIV. A pilot study shows that patients with coinfection have a higher risk of hospitalisation and mortality as compared with COVID-19 patients without HIV infection.14

EHR data modeling

We will remodel the EHR data extracted from N3C because individuals’ longitudinal health records are stored at distributed locations in EHR and many clinical events do not have explicit and/or complete temporal information. Raw EHR data as such are of little value for answering questions such as why individuals with the coinfection present different clinical outcomes and disease prognosis.

We will locate relevant phenotypic traits from N3C by curating OMOP CDM concept sets, a procedure called electronic phenotyping. Using these concept sets, we will then retrieve and integrate individuals’ phenotypic traits from N3C. Finally, we will retrieve temporal information for clinical events and establish a graphical model15 to represent clinical events and their temporal information.

EHR phenotyping

Phenotyping is the process of identifying cohorts and variables from raw EHR.16 Because we use OMOP CDM-normalised EHR, phenotyping is the process of finding ‘OMOP concepts’ that correspond to specific cohorts (eg, all patients with ART-related visits) or variables (eg, myocardial infarction).17 An OMOP concept is a unique identifier that is mapped to diverse medical codes that have the same semantic meaning but may be from different medical nomenclatures and EHR systems.

We will use standard phenotyping procedures. The logical procedures are as follows: (1) We will identify existing OMOP concept sets (available in OHDSI’s Atlas system and N3C) that can be used with minor revision. (2) If no appropriate concept sets are available, we will follow the generic phenotyping procedures16 18 and curate OMOP CDM concepts from the Athena vocabulary repository, which allows retrieval of OMOP CDM concepts.19 (3) We will validate the revised or newly developed concept sets by using EHR chart review that is performed independently by two domain experts.16
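As a rough illustration of how a concept set turns heterogeneous source codes into a cohort, the sketch below maps a few source codes to a shared concept and then selects patients by concept set. The concept IDs, the mapping table, the records and the function name `phenotype` are placeholders invented for this example; they are not real OMOP identifiers and not the study's concept sets.

```python
# Illustrative mapping from source vocabulary codes to one standard concept.
# The concept IDs are placeholders, not real OMOP identifiers.
SOURCE_TO_CONCEPT = {
    ("ICD10CM", "B20"): 9000001,
    ("SNOMED", "86406008"): 9000001,
    ("ICD10CM", "I21.9"): 9000002,
}

# A concept set groups the concepts that define one cohort or variable.
CONCEPT_SETS = {
    "hiv_infection": {9000001},
    "myocardial_infarction": {9000002},
}


def phenotype(records, concept_set):
    """Return sorted IDs of patients with at least one record in the concept set."""
    wanted = CONCEPT_SETS[concept_set]
    return sorted({
        r["patient_id"]
        for r in records
        if SOURCE_TO_CONCEPT.get((r["vocabulary"], r["code"])) in wanted
    })


# Hypothetical raw records from different coding systems.
records = [
    {"patient_id": 1, "vocabulary": "ICD10CM", "code": "B20"},
    {"patient_id": 2, "vocabulary": "SNOMED", "code": "86406008"},
    {"patient_id": 3, "vocabulary": "ICD10CM", "code": "I21.9"},
    {"patient_id": 1, "vocabulary": "ICD10CM", "code": "I21.9"},
    {"patient_id": 4, "vocabulary": "ICD10CM", "code": "Z00.00"},
]

hiv_cohort = phenotype(records, "hiv_infection")
mi_cohort = phenotype(records, "myocardial_infarction")
```

Patients 1 and 2 fall into the same cohort even though their records use different vocabularies, which is the property the concept abstraction is meant to provide.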

Data retrieval and integration

To link external epidemiological and community-level data with individual-level EHR data, we will use individual patients’ residential ZIP/county as the reference. We will use data imputation algorithms to impute missing values because population-to-individual data integration inevitably creates missing values. For existing missing values in EHR, we will infer missing values based on semantic relationships of OMOP CDM concepts. For geographical locations, we will use cities, counties and states to infer locations at appropriate levels. For the rest of the missing values, we will use multiple imputation methods. Specifically, we will use selection models or pattern-mixture models for systematic missingness. We will also selectively use mean/median imputation, principal component analysis, singular value decomposition, k-nearest neighbour, least squares, expectation maximisation and random forest.20 21 Data retrieval and integration will be implemented using SQL, R, Python and PySpark, as appropriate.
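The linkage and imputation steps can be sketched in a few lines. The example below joins a community-level variable onto patient rows by ZIP code and then applies median imputation, which is only one of the methods listed above. All records, column names and function names are hypothetical, and a production pipeline would more likely run in SQL or PySpark.

```python
from statistics import median

# Hypothetical individual-level records keyed by residential ZIP code.
patients = [
    {"patient_id": 1, "zip": "29201", "bmi": 31.2},
    {"patient_id": 2, "zip": "29203", "bmi": None},
    {"patient_id": 3, "zip": "29999", "bmi": 24.8},
]

# Hypothetical community-level table; ZIP 29999 is deliberately absent.
community = {
    "29201": {"vaccination_rate": 0.61},
    "29203": {"vaccination_rate": 0.48},
}


def link_by_zip(patients, community):
    """Left-join community-level variables onto patients using ZIP as the key."""
    linked = []
    for row in patients:
        extra = community.get(row["zip"], {"vaccination_rate": None})
        linked.append({**row, **extra})
    return linked


def impute_median(rows, column):
    """Fill missing values in one column with the median of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


linked = link_by_zip(patients, community)
completed = impute_median(impute_median(linked, "vaccination_rate"), "bmi")
```

The unmatched ZIP shows how the join itself introduces a missing value, which is then filled in the same way as a value that was missing in the EHR.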

Graphical EHR model

We will develop a customised graphical model to represent the temporal relations among patients’ demographic, clinical and sociobehavioural data. A graphical model encodes clinical events as nodes and their semantic relations (including temporal relations) as edges. Figure 1 shows the proposed design of the model. The general modelling procedure has three steps. First, we will retrieve time stamps of clinical events. Some clinical events have explicit time stamps as captured by structured EHR. Many others do not (eg, patient-reported symptoms as documented in clinical notes, trimester and/or gestational age). For those that do not have explicit time stamps, we will infer the time of the event based on neighbouring EHR data. For example, symptom onsets could be found in clinical notes (eg, admission, discharge); trimester and gestational age could be estimated from gestation-related diagnoses (eg, Z3A: weeks of gestation), procedures (eg, ultrasound procedures) and clinical notes (eg, last menstrual period).22 23 Second, we will represent temporal information of clinical events by modelling the occurrences of an event and the instantaneous impact of the event. The occurrences of an event result from the first step. The instantaneous impact of an event is formulated using exponential kernel functions and association rules based on clinical observation. Intuitively, two clinical events with a long interval in between would have less impact on one another, but this can be overridden by events that hold special clinical meanings. Third, we will create recurrent states for every clinical event by embedding the event, its semantic relations (ie, edges in the graph) including the instantaneous impact of the event, and time. These recurrent states will form a recurrent layer to be used for training machine learning models, which will be discussed later.

Figure 1

Electronic Health Records model design.
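One way to picture the second modelling step is as a weighted edge between each pair of time-ordered events, where the weight decays exponentially with the interval unless an association rule fixes it. The sketch below follows that reading. The decay rate, the override table, the event names and the function names are all assumptions made for illustration; the protocol does not specify them.

```python
import math
from datetime import date

# Hypothetical decay rate per day; the protocol does not specify a value.
DECAY_PER_DAY = 0.01

# Hypothetical association rules: event pairs whose impact is fixed by
# clinical judgement regardless of the time elapsed between them.
OVERRIDES = {("HIV diagnosis", "COVID-19 diagnosis"): 1.0}


def impact(source, target, days_apart):
    """Exponential-kernel weight that decays with the interval between events."""
    if (source, target) in OVERRIDES:
        return OVERRIDES[(source, target)]
    return math.exp(-DECAY_PER_DAY * days_apart)


def build_edges(events):
    """Connect each event to every later event with a time-aware weight."""
    ordered = sorted(events, key=lambda e: e[1])
    edges = []
    for i, (src, src_day) in enumerate(ordered):
        for dst, dst_day in ordered[i + 1:]:
            gap = (dst_day - src_day).days
            edges.append((src, dst, gap, impact(src, dst, gap)))
    return edges


events = [
    ("HIV diagnosis", date(2015, 3, 1)),
    ("ART prescription", date(2021, 12, 1)),
    ("COVID-19 diagnosis", date(2022, 1, 10)),
]
edges = build_edges(events)
```

Here the HIV diagnosis keeps full weight towards the COVID-19 diagnosis despite the multi-year gap, while its link to the ART prescription decays with time, mirroring the intuition described in the text.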

Machine learning modeling

We will use supervised machine learning to examine patterns and sequences of phenotypic traits for their predictive ability in clinical outcomes and disease prognosis. Existing studies conclude differently on clinical outcomes among individuals with coinfection as well as the factors correlated with these clinical outcomes. We provide two hypotheses. Hypothesis 1: individual patients respond differently to the coinfection. Hypothesis 2: patients’ clinical outcomes and disease prognosis are attributable to the temporal dynamics of clinical events. If these hypotheses are supported, we will be able to delineate the impact of coinfection on individuals’ clinical outcomes. Therefore, we will customise recurrent neural network (RNN) models to be used for predicting clinical outcomes and disease prognosis in real time by learning from patients’ retrospective EHR at the individual level (personalised) as time progresses. We adopt RNN for its unique advantage in capturing temporal dependencies of data.24 Trained RNN models will be tested for their performance, and the best-performing model will be used to identify patterns/sequences of phenotypic traits predictive of clinical outcomes and prognosis.


Based on the estimated >13 000 patients with the coinfection in our dataset, we will add controls (COVID-19 patients without HIV) for each output variable using a match ratio of 1:2, stratified by sex, race/ethnicity and age. Where the sample is small, for example, death cases (n<1000 with coinfection), the alternative strategy is to create synthetic cases to impute and balance the sample.
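A minimal sketch of 1:2 stratified matching is shown below, assuming exact matching on sex and race/ethnicity and on a 10-year age band. The band width, the field names, the sample records and the function names are hypothetical, and the protocol does not state whether matching is exact or uses another scheme such as propensity scores.

```python
import random
from collections import defaultdict


def age_band(age, width=10):
    """Bucket age into bands (a hypothetical 10-year width)."""
    return age // width


def stratum(person):
    """Matching key: sex, race/ethnicity and age band."""
    return (person["sex"], person["race_ethnicity"], age_band(person["age"]))


def match_controls(cases, controls, ratio=2, seed=0):
    """Draw up to `ratio` controls per case from the same stratum, without reuse."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for control in controls:
        pool[stratum(control)].append(control)
    for members in pool.values():
        rng.shuffle(members)

    matched = {}
    for case in cases:
        available = pool[stratum(case)]
        # Fewer than `ratio` controls are returned if the stratum runs out.
        n_take = min(ratio, len(available))
        matched[case["id"]] = [available.pop()["id"] for _ in range(n_take)]
    return matched


cases = [{"id": "case-1", "sex": "F", "race_ethnicity": "Black", "age": 47}]
controls = [
    {"id": "ctrl-1", "sex": "F", "race_ethnicity": "Black", "age": 41},
    {"id": "ctrl-2", "sex": "F", "race_ethnicity": "Black", "age": 49},
    {"id": "ctrl-3", "sex": "F", "race_ethnicity": "Black", "age": 44},
    {"id": "ctrl-4", "sex": "M", "race_ethnicity": "Black", "age": 45},
    {"id": "ctrl-5", "sex": "F", "race_ethnicity": "Black", "age": 62},
]
matched = match_controls(cases, controls)
```

Sampling without reuse keeps each control attached to at most one case, at the cost of possibly leaving some cases with fewer than two controls in sparse strata.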

Machine learning input

A complete and longitudinal health history together with linked external epidemiological and community-level data will be included as the input of machine learning models, from which the models will learn to predict individuals’ in-time clinical outcomes and disease prognosis. We will include but are not limited to the following phenotypic traits: demographics, SDOH, diagnoses, underlying conditions, vitals, laboratory tests, procedures, medication prescriptions/dispensing, medication exposure (eg, vaccine) and annotated clinical notes. Because there is no gold standard measure for ART adherence, we will use ‘multiple measures’ to estimate levels of ART adherence.25 Multiple measures include medication events (inpatient dispensing), HIV-1 RNA copies (laboratory results) and medication adherence data from clinical notes. For those with complete medication adherence data, we will use the proportion of days covered to categorise ART adherence levels (eg, <50%, 50%–80%, 80%–85%, 85%–90%, ≥90%).26 We will categorise antiviral medications into integrase inhibitor-based, non-nucleoside reverse transcriptase inhibitor-based, protease inhibitor-based and other regimens. The approach to measuring ART adherence using EHR has limitations, but the limitations can be mitigated by the well-presented and large-scale national data. We will collect both CD4+ counts as an indicator of existing damage and plasma HIV-1 RNA copies as an indicator of projected disease progression.
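The proportion of days covered (PDC) is the number of days in an observation window on which medication was on hand, divided by the length of the window. The sketch below computes it from dispensing records and assigns the example bands quoted above. The fill records, the observation window and the handling of band boundaries (lower bound inclusive) are assumptions for illustration.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, start, end):
    """Share of days in [start, end] on which the patient had medication on hand.

    `fills` is a list of (fill_date, days_supply) tuples; overlapping supplies
    are counted once per calendar day.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)
    total_days = (end - start).days + 1
    return len(covered) / total_days


def adherence_category(pdc):
    """Map PDC to the example bands; lower bounds are treated as inclusive (assumed)."""
    if pdc < 0.50:
        return "<50%"
    if pdc < 0.80:
        return "50-80%"
    if pdc < 0.85:
        return "80-85%"
    if pdc < 0.90:
        return "85-90%"
    return ">=90%"


# Hypothetical dispensing history over a 90-day window.
fills = [(date(2022, 1, 1), 30), (date(2022, 2, 15), 30), (date(2022, 3, 10), 30)]
pdc = proportion_of_days_covered(fills, date(2022, 1, 1), date(2022, 3, 31))
category = adherence_category(pdc)
```

In this example 75 of the 90 days are covered, because the gap in early February is not filled and the overlapping days in March are counted only once.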

Machine learning output

Clinical outcome measures as machine learning output include inpatient admissions, length of stay (LOS), ICU admission, ICU LOS, comorbidities and primary discharge diagnosis.27 In addition to the measures within the acute phase of COVID-19, we will also use symptoms, diagnoses, comorbidities and PASC-associated readmissions as outcome and prognosis measures for individuals in the postacute phase.

Model design

We will use RNN as the machine learning architecture to learn from patients’ longitudinal EHR and predict current and future clinical outcomes and disease prognosis. We adopt RNN models because this neural network architecture is specialised for capturing temporal dependency among event sequences. The RNN models will be trained to learn from model input and to predict model output. With respect to the embedding, we will include standard long short-term memory (LSTM) as well as phased LSTM and other variants wherever appropriate.24 We will use the bag-of-pattern matrix as the baseline embedding method to be compared against LSTM; this baseline method does not fully consider temporal dependency. To test against RNN, we will use Support Vector Machine (SVM) as the benchmark algorithm; SVM is a well-performing kernel-based algorithm28 but does not take full advantage of temporal dependency (hypothesis 2) and personalised health records (hypothesis 1). Because our machine learning outputs are binary variables, the proposed machine learning tasks are essentially binary classification tasks. We will use Python for machine learning modelling.

Model evaluation (internal validity)

To test the effectiveness of the prediction model, we will use 10-fold cross validation. With respect to evaluation metrics, we will use the F score, precision, recall and the area under the receiver operating characteristic curve (AUC) to assess the models’ predictive performance. We expect the F score, assuming balanced data, to reach a minimum of 0.8. If trained models fail to meet this expectation, alternative strategies include manually adding features handpicked by researchers after error analysis of models.
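For reference, with true positives TP, false positives FP and false negatives FN, precision is TP/(TP+FP), recall is TP/(TP+FN) and the F score is their harmonic mean. The sketch below implements these metrics and a plain k-fold index split. Shuffling, stratification and AUC are omitted for brevity, and the labels and predictions are made-up values.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels and predictions for one held-out fold.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

folds = list(k_fold_indices(25, k=10))
```

In practice each fold's test indices would be scored by a model trained on the corresponding train indices, and the metrics averaged across the 10 folds.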

CDS system

Based on the automatic clinical outcomes and disease prognosis prediction model, we will design and implement a CDS prototype in collaboration with Prisma Health clinics. The proposed CDS prototype is anticipated to assist providers in screening and identifying patients who are at high risk of worse COVID-19 clinical outcomes (see table 1), and worse disease prognosis, including individuals with PASC. Specifically, the CDS will identify individuals with a high risk of disease progression from their current clinical state (eg, not hospitalised, hospitalised, postacute phase) by learning from the trained machine learning models. The effectiveness of the CDS demonstrates the external validity of the internally validated predictive model and will be assessed by (1) appropriate identification for at-risk individuals, (2) appropriate clinical actions and (3) CDS system usability.

Table 1

Patient state according to WHO clinical progression scale

CDS workflow

The proposed CDS is a hybrid of knowledge-based and non-knowledge-based systems.29 It has (1) a machine learning-based prediction module (non-knowledge-based) for identifying high-risk patients and (2) a provider-curated medical logic module (knowledge-based) for generating clinical actions for identified high-risk patients. The CDS testing will take place retrospectively (ie, using retrospective EHR).
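The two-module structure can be pictured as a scoring step followed by a rule lookup. The sketch below is purely illustrative: the scoring function is a stand-in for the trained model, and the threshold, the rule table, the state labels and the action labels are hypothetical placeholders rather than clinical guidance or the study's actual logic.

```python
def predict_risk(patient):
    """Stand-in for the prediction module; returns a probability-like score.

    Hypothetical placeholder: a real module would call the trained model.
    """
    return patient["features"].get("model_score", 0.0)


# Hypothetical provider-curated rules: patient state -> suggested action type.
# The labels paraphrase action categories named in the protocol, not real orders.
LOGIC_RULES = {
    "ambulatory": "referral and outreach",
    "hospitalised": "medication reconciliation considering ART regimen",
    "postacute": "specialist consultation for PASC",
}

RISK_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the protocol


def run_cds(patient):
    """Flag high-risk patients, then look up a suggested action for them."""
    score = predict_risk(patient)
    if score < RISK_THRESHOLD:
        return {"patient_id": patient["patient_id"], "high_risk": False, "action": None}
    action = LOGIC_RULES.get(patient["state"], "provider review")
    return {"patient_id": patient["patient_id"], "high_risk": True, "action": action}


alert = run_cds(
    {"patient_id": "p-001", "state": "postacute", "features": {"model_score": 0.82}}
)
no_alert = run_cds(
    {"patient_id": "p-002", "state": "ambulatory", "features": {"model_score": 0.10}}
)
```

Keeping the rule table separate from the scoring function reflects the hybrid design: the model can be retrained without touching the provider-curated logic, and vice versa.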

Cohort definition and data collection

Using retrospective EHR data (2-year baseline~2023) from Prisma Health’s Epic system, we will first group the existing PLWH who have COVID-19 (sampling n>500) based on their state at the point of CDS screening. The patient states include (1) ambulatory patients with COVID-19, (2) patients hospitalised for COVID-19 with moderate disease, (3) patients hospitalised for COVID-19 with severe disease and (4) patients in the postacute phase of COVID-19 (ie, beyond 4 weeks after symptom onset).30 See table 1 for definitions of states 1–3 based on WHO’s clinical progression scale for COVID-19.

Prediction module

For patients in each state, we will use the trained machine learning model to learn from previous medical records and predict worsening clinical outcomes as time progresses (ie, acute and postacute phases, every 3 months). The prediction will include primary COVID-19 clinical outcomes (Box 1) developed by the WHO Working Group on the Clinical Characterisation and Management of COVID-19.31

Box 1

Key clinical outcome measures

Organ dysfunction

  • Murray score

  • Sequential organ failure assessment score, multiple organ dysfunction score

  • Acute coronary syndrome; arrhythmias

  • Delirium


  • Pulmonary, cardiovascular, renal, neurological, etc

Secondary infection

  • Bacterial, viral

Biochemical parameters

  • C reactive protein, D-dimers, IL-6, and ferritin serum concentrations, and leucocyte counts

Radiological findings

  • CT scan of the chest, X-ray of the chest

Duration of intervention

  • Inpatient admission, length of stay (LOS)

  • ICU admission, ICU LOS

  • Ventilation

  • Organ support or hospital-free days

Pregnancy outcomes

  • Preterm delivery, miscarriage

  • Fetal status

  • Severe maternal morbidity measures


  • All-cause mortality at hospital discharge

Quality of life

  • Longer-term survival and primary diagnoses for readmission (postacute phase)

Medical logic module

Patients identified by the CDS to have an increased risk of worse clinical outcomes will be reviewed and discussed by two providers who specialise in HIV and COVID-19. First, the providers will generate a gold-standard judgement on whether a patient is correctly identified by the prediction module, which later will be used for assessing the effectiveness of the CDS. Second, the providers will generate appropriate clinical actions on chart review. These clinical actions will be kept up to date with the ‘NIH Guidance for COVID-19 and People with HIV’, including treatment options based on cohorts and risk factors, medication reconciliation considering ART regimens, consultation with specialists for multiorgan system complications and PASC, referrals and outreach.32 Providers’ decision-making processes will be programmed in the knowledge base using Arden Syntax (V.3) or Clinical Quality Language,33 depending on the specific Epic EHR data model.

Effectiveness of CDS (external validity)

There are two evaluation metrics. (1) Appropriate identification of individuals at high risk for adverse clinical outcomes (Box 1), assessed by comparing model-identified cases against the gold standard generated from chart review. We will use F measure (>0.8), AUC, precision and recall for assessment. (2) Appropriate clinical actions, assessed using a quasiexperimental design. We will compare outcomes of patients who naturally used the medical logic module-suggested care against those who did not (n=100 each). The outcomes include but are not limited to readmissions (eg, same day, 7, 14, 30 days) and healthcare utilisation (eg, LOS, emergency room (ER)/observation visits, ICU admission). We will use mixed-effect generalised regression models to estimate model effectiveness wherever appropriate.

Usability testing

We will assess CDS usability by adopting the ‘think aloud’ protocol.34 The two providers from Prisma Health will participate in the test. Each one will be presented with randomly selected EHR (n=5 at-risk cases+n=5 control cases) along with the CDS output. In each case, participants will be instructed to verbalise their reasoning procedures (eg, phenotypic traits from EHR that can be used in the reasoning, logic flow) towards identifying at-risk patients and corresponding clinical decisions. Sessions will be audio recorded and then coded (eg, by content, understandability, navigation, workflow, visibility and usability) independently by two researchers for downstream analyses.

Patient and public involvement

No patient involved.

Ethics and dissemination

The study was approved by the institutional review boards at the University of South Carolina (Pro00121828) as a non-human subject study.

This study will result in a comprehensive knowledge base that documents clinical outcomes and disease prognosis for individuals with the coinfection, their risk factors (eg, underlying conditions, ART adherence, comorbidities, sociobehavioural) and their responses to therapeutics. This study will also result in a prototype CDS that can identify patients at high risk of worsening clinical outcomes and prognosis in real time. These results are generalisable and will form a foundation for developing comprehensive real-world CDS systems for implementation in state-wide and national HIV and COVID-19 clinics.

Study findings will be presented at academic conferences and published in peer-reviewed journals. This study will disseminate urgently needed clinical evidence for guiding clinical practice for individuals with the coinfection at Prisma Health.

Ethics statements

Patient consent for publication



  • Contributors CL conceived the study design and drafted the manuscript. CL completed preliminary data collection. SW, BO, EG-CP, MY and XL contributed critical edits to the manuscript. All authors reviewed and approved the manuscript.

  • Funding Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Number R21AI170171. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; peer reviewed for ethical and funding approval prior to submission.