Predicting patient-level new-onset atrial fibrillation from population-based nationwide electronic health records: protocol of FIND-AF for developing a precision medicine prediction model using artificial intelligence

ABSTRACT

Introduction Atrial fibrillation (AF) is a major cardiovascular health problem: it is common, chronic and incurs substantial healthcare expenditure because of stroke. Oral anticoagulation reduces the risk of thromboembolic stroke in those at higher risk, but for a number of patients, stroke is the first manifestation of undetected AF. There is a rationale for the early diagnosis of AF, before the first complication occurs, but population-based screening is not recommended. Previous prediction models have been limited by their data sources and methodologies. An accurate model that uses existing routinely collected data is needed to inform clinicians of patient-level risk of AF, inform national screening policy and highlight predictors that may be amenable to primary prevention.

Methods and analysis We will investigate the application of a range of deep learning techniques, including an adapted convolutional neural network, recurrent neural network and Transformer, on routinely collected primary care data to create a personalised model predicting the risk of new-onset AF over a range of time periods. The Clinical Practice Research Datalink (CPRD)-GOLD dataset will be used for derivation, and the CPRD-AURUM dataset will be used for external geographical validation. Both comprise a sizeable representative population and are linked at patient level to secondary care databases. The performance of the deep learning models will be compared against classic machine learning and traditional statistical predictive modelling methods. We will only use risk factors accessible in primary care and endow the model with the ability to update risk prediction as it is presented with new data, to make the model more useful in clinical practice.

Ethics and dissemination Permissions for the CPRD-GOLD and CPRD-AURUM datasets were obtained from CPRD (ref no: 19_076). The CPRD ethical approval committee approved the study. The results will be submitted as a research paper for publication to a peer-reviewed journal and presented at peer-reviewed conferences.

Trial registration details A systematic review to incorporate within the overall project was registered on PROSPERO (registration number CRD42021245093). The study was registered on ClinicalTrials.gov (NCT04657900).

INTRODUCTION

Atrial fibrillation (AF) is the most common sustained cardiac arrhythmia. The current estimated prevalence is between 2% and 4%, and a 2.3-fold rise is expected due to extended longevity in the general population. 1 While AF may present with symptomatic palpitations, for many patients, the first diagnosis of AF is only after they present with stroke or cardiac decompensation. The frequency of AF in ischaemic strokes is 20%-30%, and these are usually severe, resulting in permanent disability or fatality. 2

Strengths and limitations of this study
► Large and nationwide dataset representative of the UK primary care population.
► Using artificial intelligence technology may discover new predictive variables by efficiently incorporating temporal information of event data.
► The model will not just predict the risk of a patient developing atrial fibrillation, but also provide a representation of how risk develops over time to enable more focused screening.
► The derivation and validation work will be undertaken in datasets collected over the same time period in the UK; therefore, further validation work may be pursued with newly collected data and for international contexts.
► The derivation data will not include unstructured natural language free text; future research could explore if incorporating free text into representation learning improves predictive accuracy.

Open access
Oral anticoagulants can reduce the risk of stroke by up to two-thirds in those with AF at higher risk of stroke. 3 International guidelines recommend that patients with AF at elevated thromboembolic risk are offered stroke prophylaxis with an oral anticoagulant. 1 Most patients with AF will have stroke risk factors, making them eligible for an oral anticoagulant. Many will have concomitant cardiovascular disease (such as hypertension, valvular heart disease or heart failure) making them eligible for further investigation or treatment. Equally, in those with AF who are at low risk of stroke (and therefore do not qualify for oral anticoagulation), surveillance for increasing stroke risk is advisable.
Thus, the early diagnosis of AF, before the manifestation of the first complication, remains a major public health challenge. Screening for AF in the community has been proposed as an approach to optimise early AF detection. 4 Previous studies have shown that the risk of AF (often asymptomatic) increases with age. 5 Opportunistic screening is cost-effective in patients aged 65 years or over, and among individuals aged 75-76 years undergoing 2-week intermittent ECG screening. [6][7][8] Nevertheless, there is no current recommendation for UK population-wide systematic screening. 9 Prediction models could contribute to AF screening by discriminating patients into risk categories, from which investigation intensity could be planned. 10 However, models based solely on analysis of investigations may not apply in the community setting (for example, routine ECGs are not always available). 11

To date, several multivariable prediction models have been created or tested for prediction of incident AF in the community. The earliest models were derived from structured follow-up of prospective cohorts, including the Framingham Heart Study score for Atrial Fibrillation and CHARGE-AF (Cohorts for Heart and Ageing Research in Genomic Epidemiology for Atrial Fibrillation). 12 13 The proliferation of electronic health records (EHRs) has led to the development of several models from local registries, including Maccabi Healthcare Services and C2HEST (Coronary artery disease/chronic obstructive pulmonary disease, Hypertension, Elderly, Systolic heart failure, Thyroid disease). 14 15 Structured EHRs offer larger sample sizes to assess more candidate variables and derive more generalisable models.
A systematic review found that the models derived from prospective cohorts had been more frequently externally validated. CHARGE-AF was the only model that showed significant overall discrimination in meta-analysis, but its transportability to EHRs is still being investigated. 16 17 Each model to date is, however, limited by one or more of the following: use of small, geographically remote or historical datasets; lack of temporal information; crude risk modelling with consequent suboptimal model performance; and/or predictor variables not readily available in primary care. 18 None has yet reached widespread clinical practice.
Machine learning is a data-driven approach that can identify non-linear associations and complex interactions between variables without the need to specify these relationships a priori. 19 A recent study applying this methodology to a nationwide UK dataset produced a model with a greater discriminative capability than CHARGE-AF (area under the receiver operating characteristic curve (AUC) 0.827 vs 0.725) in EHR. 19 Artificial intelligence (AI) has several desirable features for prediction modelling from EHRs. It facilitates the use of vast quantities of event data and associated temporal information, handles many predictors with automatic variable selection techniques, accommodates non-linearities and interactions among variables, and enables a live learning approach (whereby the prediction model is automatically updated). A range of AI techniques have been applied to EHR data and have demonstrated better prediction power over traditional statistical approaches. 20 Furthermore, deep learning, a subfield of machine learning, can learn complex patterns from data to characterise higher level correlations among clinical events. 21 Accordingly, it may derive robust patient representations from raw EHR data for prediction modelling without the need for the manual, expert-dependent feature engineering of classic machine learning techniques, which places a limit on scalability and generalisability. 22

Using AI to develop a predictive algorithm from routinely collected primary care EHRs could offer several advantages:
1. A model could predict the risk that a person will develop new-onset AF, and how that risk evolves over time, whereas current prediction models only provide a fixed prediction horizon. This would allow phenotype-specific and temporal-specific screening, which could make screening more efficacious and cost-effective.
2. A model created from routinely collected EHRs could be more smoothly translated into clinical practice by being embedded into existing clinical EHR systems.
3. The predictive magnitude of variables for the development of AF may identify novel risk markers for AF, which could then be studied for causality.

Research aim
The aims of the Future Innovations in Novel Detection of Atrial Fibrillation (FIND-AF) study are to:
1. Develop a deep learning model for predicting the risk, and evolution of the risk, of new-onset AF in primary care.
2. Identify and quantify the magnitude of risk markers of new AF among routinely collected primary care data.
3. Externally validate the model in a geographically distinct dataset to assess generalisability.

METHODS AND ANALYSIS Data sources and permissions
The derivation dataset for training and testing the model will be the Clinical Practice Research Datalink-GOLD (CPRD-GOLD) dataset. This is an ongoing primary care database, established in 1987, that comprises anonymised medical records and prescribing data contributed by general practices using Vision software. It contains data for approximately 17.5 million patients, with 30% of contributing practices in England, and represents the UK population in terms of age, sex and ethnicity. 23 In order to contribute to the database, general practices and other health centres must meet prespecified standards for research-quality data ('up-to-standard'). 23 24

To ascertain whether the prediction model is generalisable, we will externally validate its performance in the geographically distinct CPRD-AURUM dataset. This was launched in 2017 and encompasses only practices using EMIS Web software. It contains data for approximately 26.9 million patients and draws on data collected from practices in England only. 25 Any practices which previously contributed to CPRD-GOLD have been removed from the CPRD-AURUM cohort to ensure that these datasets reflect different populations. CPRD undertakes various levels of validation and quality assurance on the daily general practice data collection, comprising over 900 checks covering the integrity, structure and format of the data. 25

Recorded information in both datasets includes patients' demography, clinical symptoms, signs, investigations, diagnoses, prescriptions, referrals, behavioural factors and test results entered by clinicians and other practice staff. All clinical information is coded using Read codes in CPRD-GOLD and SNOMED clinical terms (CT) in CPRD-AURUM. 26 27

In the proposed study, extracted patients will have patient-level data linked to Hospital Episode Statistics (HES) Admitted Patient Care (APC) and Diagnostic Imaging Dataset (DID), Office for National Statistics (ONS) Death Registration, patient-level deprivation and practice-level deprivation to provide a more comprehensive dataset. The CPRD dataset has been used to develop or validate a range of risk prediction models, including in cardiovascular disease. 19 28

The extracted datasets, including linked data, comprise all patients for the period between 2 January 1998 and 30 November 2018 from the snapshot of CPRD-GOLD and CPRD-AURUM provided in October 2019. Over this study period, the CPRD-GOLD dataset comprises approximately 4.5 million patients eligible for data linkage at an up-to-standard practice, with over 200 000 patients having a record of AF. The CPRD-AURUM dataset comprises approximately 18 million patients eligible for data linkage, with almost 800 000 patients having a record of AF. A sample of 245 general practices will be randomly selected from 800 general practices in CPRD-AURUM to approximately match the size of CPRD-GOLD.

Patient and public involvement
Patients and the public were not involved in the design of this research. However, a Scientific Advisory Board, including representatives from the Arrhythmia Alliance, lay members of the National Institute for Health and Care Excellence AF guideline committee and EHR software providers, has been created to provide expert advice on the research, the dissemination of results and the translation of the findings of this study into clinical practice.

Inclusion and exclusion criteria
The study population will comprise all available patients in CPRD-GOLD and CPRD-AURUM eligible for data linkage and with at least 1-year follow-up in the period between 2 January 1998 and 30 November 2018. Patients will be excluded if they were 18 years of age or under at the date of the first registration in CPRD, diagnosed with AF or atrial flutter (AFl) before 1 January 1998, registered for less than 1 year in CPRD or ineligible for data linkage.

Outcome ascertainment
The outcome of interest is first diagnosed AF or AFl after baseline (1 January 2009). We have included AFl as an outcome since it has similar clinical relevance to AF, including thromboembolic risk and anticoagulation guidelines. 1 These will be identified using Read codes and SNOMED CT in the CPRD datasets. For HES APC events and the underlying cause of death variable in the ONS Death Registration data file, the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10) codes will be used. Misclassified data can lead to systematic prediction errors, and accuracy of data may vary over time. 29 CPRD has converted older ICD codes to the newer version, increasing confidence in their validity. Nonetheless, we will check data accuracy by year and include the year of AF diagnosis in the prediction models to assess its impact.
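In ICD-10, atrial fibrillation and atrial flutter sit together in the I48 block, so linked hospital and death records can be flagged by code prefix. A minimal sketch, assuming a hypothetical record structure and field name (`icd10_codes`); the study's actual curated Read, SNOMED CT and ICD-10 code lists are defined separately:

```python
# Illustrative outcome flagging by ICD-10 prefix. The record structure is
# hypothetical, not the HES APC table layout.
AF_ICD10_PREFIX = "I48"  # ICD-10 block: atrial fibrillation and flutter

def has_af_outcome(episodes):
    """Return True if any episode carries an I48* diagnosis code."""
    return any(code.upper().startswith(AF_ICD10_PREFIX)
               for episode in episodes
               for code in episode.get("icd10_codes", []))

patient_episodes = [{"icd10_codes": ["I10", "E11.9"]},   # hypertension, diabetes
                    {"icd10_codes": ["I48.0"]}]          # paroxysmal AF
print(has_af_outcome(patient_episodes))  # True
```

In practice the prefix match would be one of several checks alongside the primary-care code lists and date filters described above.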

Sample size
To develop a prognostic prediction model, the required sample size may be determined by three criteria suggested by Riley et al. 30 For example, suppose a maximum of 200 parameters will be included in the prediction model and the Cox-Snell generalised R² is assumed to be 0.01. A total of 377 996 patients will be required to meet Riley's criterion (1) with a global shrinkage factor of 0.95; this sample size also ensures a small absolute difference (Δ<0.05) in the apparent and adjusted Nagelkerke R² (Riley's criterion (2)) and ensures a precise estimate of overall risk with a margin of error <0.001 (Riley's criterion (3)). According to the Quality and Outcomes Framework, the prevalence of AF in England is 1.7%. 31 32 Given an AF prevalence of 1.7%, 6425 of these 377 996 patients would be expected to develop AF. Therefore, the number of patients in the CPRD dataset with AF will provide sufficient statistical power to develop and validate a deep learning prediction algorithm with the predefined precision and accuracy.
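Riley's criterion (1) can be computed directly from n ≥ p / ((S − 1) ln(1 − R²_CS / S)), where p is the number of parameters and S the global shrinkage factor. A quick sketch under the assumptions stated above (the function name is ours, not from a published package):

```python
import math

def riley_minimum_n(n_params, shrinkage, r2_cs):
    """Minimum sample size under Riley's criterion (1):
    n >= p / ((S - 1) * ln(1 - R2_CS / S))."""
    return n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))

n = riley_minimum_n(n_params=200, shrinkage=0.95, r2_cs=0.01)
print(math.ceil(n))  # approximately 378 000, in line with the protocol's 377 996
```

Both (shrinkage − 1) and the log term are negative, so the quotient is positive; a larger assumed R²_CS or a stricter shrinkage target would raise the required n.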

Predictor variables
A systematic review has highlighted 22 predictor variables included in varying combinations by 10 preceding prediction models developed to detect incident AF in the community. 16 In addition, a more recently published machine learning model has established six further time-varying variables (eg, change in body mass index between the latest two quarters). 19 To capture the full potential of deep learning and this large dataset, we will broaden our search for candidate predictors to all available variables, while retaining temporal information (including all clinical assessments, hospitalised events and medications). The potential predictors may include the following:
1. Sociodemographic variables including age, sex, ethnicity and indices of multiple deprivation.
2. All disease conditions during follow-up, including hospitalised diseases and procedures, such as other cardiovascular diseases, diabetes mellitus, chronic lung disease, renal disease, inflammatory disease, cancer, hypothyroidism and surgical procedures.
3. Clinical assessments including heart rate, systolic and diastolic blood pressure, height, weight and body mass index.
4. Medications prescribed including antihypertensives, statins, antidepressants, anxiolytics/hypnotics and antipsychotics.
5. Lifestyle factors including smoking status and alcohol consumption.
6. All biomarkers routinely collected during follow-up including total, high-density lipoprotein and low-density lipoprotein cholesterol, triglycerides, creatinine, C reactive protein and erythrocyte sedimentation rate.
Predictive factors will be identified using the appropriate codes. In CPRD, we will use Read codes for diagnoses and measurements (eg, systolic and diastolic values) and product codes (prodcodes) for medications. In HES APC, we will use ICD-10 codes and Office of Population Censuses and Surveys Classification of Interventions and Procedures version 4 (OPCS-4) codes. In the ONS Death Registration data file, we will use ICD-10 codes (and ICD-9 codes for the period before 2001).

Missing data
Missing data are expected in EHR data and will be handled using multiple imputation according to the approaches suggested by Carpenter and Kenward, depending on the amount of missing data. 33 It is likely there will be misclassification of baseline characteristics, such as smoking status, cholesterol, blood pressure and weight, that change over time. To account for this, we will define baseline information using only measures recorded within the last year. We will then use measures taken within the last 2-5 years and in the year after baseline as parameters in the chained imputation equations used to impute baseline covariates with ≤40% missingness. In addition, median imputation, a common approach to dealing with missing values in machine learning algorithms, will be used to test the model robustness. 34

Data analysis plan

Development and external validation of model
The CPRD-GOLD and CPRD-AURUM data will be cleaned and preprocessed for model development and validation, respectively. Specifically, patient features with binary values will be mapped to 0 and 1. Variables with multiple categories will be split into their component categories, each given a binary value to indicate the presence or absence of the variable for each patient. Continuous variables will be kept as continuous. To reduce the high cardinality of Read codes and ICD codes, we will map both to Caliber codes, 35 an expert-checked mapping directory from University College London. Prodcodes can be mapped to level 2 of British National Formulary codes. 36

A number of deep learning models will be investigated for prediction of AF in CPRD-GOLD. These will include a convolutional neural network (CNN), a recurrent neural network (RNN) and a Transformer architecture. They each possess characteristics that can capture the temporality of EHR data and the progression of a patient's health status.
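The binary encoding of categorical variables and the fallback median imputation described above can be sketched as follows; the variable and category names are purely illustrative:

```python
from statistics import median

def one_hot(field, value, categories):
    """Split a multi-category variable into binary indicator columns."""
    return {f"{field}={c}": int(value == c) for c in categories}

def impute_median(column):
    """Median imputation: replace missing (None) entries with the observed median."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

print(one_hot("smoking", "ex", ["never", "ex", "current"]))
# {'smoking=never': 0, 'smoking=ex': 1, 'smoking=current': 0}
print(impute_median([120.0, None, 140.0]))  # [120.0, 130.0, 140.0]
```

The multiple imputation used for the main analyses is considerably more involved (chained equations over auxiliary time windows); median imputation here serves only as the robustness check the text mentions.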
Although CNNs are typically associated with static content (eg, images and documents), they are increasingly used to uncover temporal relationships in EHR. [37][38][39][40] The patient EHR can be converted to a temporal matrix, with one dimension corresponding to time and the other dimension corresponding to medical events (figure 1A). 37 A one-sided convolution operation can be applied to each possible window of features in the event matrix to produce a feature map, followed by max pooling (to capture the most important features) and culminating in a fully connected layer and softmax classifier. If the time dimension is embedded in 1-day increments, we have the option to learn temporal features by extending connectivity in the time dimension through a range of fusion strategies. Temporal early fusion could combine information across an entire time window to establish global patterns, temporal late fusion can be performed in the fully connected layer to model local connections, and temporal slow fusion could capture both by extending the connectivity of all convolutional layers. 37 Patient EHR events including diagnoses, procedures and medications can also be represented as a sequence of codes over time, similar to words in a sentence (figure 1B). After embedding into a lower dimensional space, a one-dimensional convolutional operation over the temporal dimension with a combination of filters of different lengths could capture temporal dependencies at multiple levels. 38 A sequence-based representation also makes patient EHR amenable to techniques that have provided breakthroughs in natural language processing, especially RNNs. [41][42][43] RNNs with a gated recurrent unit model design have been successfully applied in modelling sequential structured EHR data to predict diagnoses, as they can accept an input vector at each time step while storing information in a hidden layer which changes over time.
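The convolution-plus-max-pooling step over a temporal event matrix can be illustrated with a toy example. All values here are invented, and the single all-ones filter stands in for the many filters a real CNN would learn end to end:

```python
# Toy temporal matrix: rows = medical events, columns = time steps
# (1 = event recorded at that step). All values are illustrative.
ehr = [
    [0, 1, 0, 1, 0],   # eg, a hypertension diagnosis code
    [0, 0, 1, 0, 1],   # eg, an antihypertensive prescription
    [1, 0, 0, 0, 1],   # eg, a blood pressure measurement
]

def conv_maxpool(matrix, width=2):
    """Slide a window of `width` time steps across the matrix, summing all
    events in each window (a single all-ones filter), then max pool to keep
    the strongest activation, as in the feature-map step described above."""
    n_time = len(matrix[0])
    feature_map = [sum(row[t + k] for row in matrix for k in range(width))
                   for t in range(n_time - width + 1)]
    return max(feature_map)

print(conv_maxpool(ehr))  # 3: the busiest 2-step window is the final one
```

A learned filter would carry real-valued weights rather than ones, and the pooled activations would feed the fully connected layer and softmax classifier rather than being read directly.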
21 44-51 We will apply the reverse time attention model (RETAIN), which has been tested for the prediction of heart failure in CPRD. 52 RETAIN can generate temporal attention vectors at both visit and variable level by running RNNs backwards, which somewhat mimics a physician's behaviour in examining a patient's past visits in reverse time order, and could capture the short-term time-varying events that were found to improve prediction in a previous shallow neural network model. 19 46 Timestamps can be included to calculate the attentions for the entire visit sequence and thus provide temporal information for each visit. 46 Finally, Transformers have become state of the art in natural language processing, 53 54 and we will apply an adaptation for EHR, BEHRT (BERT [Bidirectional Encoder Representations from Transformers] for EHR), which was originally developed in CPRD. 55 This model's structure, depicting each diagnosis code as a word, each visit as a sentence and the entire medical history as a document, facilitates multihead self-attention, positional encoding and a masked language model for EHR. BEHRT can define a representation incorporating information on the diseases afflicting a patient, the positional interactions of diseases and the sequence of events, with the age of a patient linked to each visit to give a sense of time between diagnoses. 55 This representation can then be used for a number of tasks including disease prediction.
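Visit-level attention, reduced to its bare core, can be sketched as follows. This deliberately omits the two RNNs and the variable-level attention of the published RETAIN architecture; the visit vectors and query are invented toy values:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def visit_attention(visits, query):
    """Score each visit embedding against a query vector, walking the
    sequence in reverse time order (most recent visit first) as RETAIN does,
    then return softmax attention weights and the weighted patient summary."""
    recent_first = visits[::-1]
    scores = [sum(v * q for v, q in zip(visit, query)) for visit in recent_first]
    weights = softmax(scores)
    summary = [sum(w * visit[d] for w, visit in zip(weights, recent_first))
               for d in range(len(query))]
    return weights, summary

visits = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 2-dimensional visit vectors
weights, summary = visit_attention(visits, query=[0.5, 0.5])
print(round(sum(weights), 6))  # 1.0: the weights form a probability distribution
```

In the real model, the query is produced by the backward RNNs rather than fixed, and the weights offer the interpretability the text describes: each visit's contribution to the prediction can be read off directly.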
Preprocessed patient-level data in CPRD-GOLD will be randomly split in an 80:20 ratio to create training and testing samples. The split ratio is not a significant factor given the large sample size. The model parameters and dropout rate will be chosen through a grid search, and 10-fold cross-validation will be used (ie, 10% of the training data will be randomly selected as the cross-validation set).
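A patient-level 80:20 split can be sketched as follows; splitting on patient identifiers rather than individual records keeps all of one patient's data on the same side of the split. The seed and cohort size are illustrative:

```python
import random

def patient_split(patient_ids, test_fraction=0.2, seed=42):
    """Shuffle patient IDs reproducibly and split them into train and test."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train, test)

train_ids, test_ids = patient_split(range(1000))
print(len(train_ids), len(test_ids))  # 800 200
```

The cross-validation folds would then be drawn from `train_ids` only, leaving the 20% test sample untouched until final evaluation.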
While the outcome is binary, instead of just predicting 0 or 1 for a patient (new AF or not), the probability of that patient developing AF by different time points over months (1, 3, 6) and years (1, 5, 10) may be predicted. The probability of developing AF at each time point could be plotted to give an understanding of the evolving risk of AF. The clinical risk prediction performance of the deep learning models will be compared against a range of classic machine learning techniques and traditional statistical predictive modelling methods, including support vector machine, random forest, naïve Bayes and the Cox proportional hazards model. Discrimination (Harrell's c-statistic and AUC) and calibration metrics will be supplemented with positive and negative predictive values, precision, recall and area under the precision-recall curve for all models.
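Of these metrics, the AUC has a simple rank interpretation: the probability that a randomly chosen patient who develops AF is scored above a randomly chosen patient who does not. A self-contained sketch on toy labels and scores:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney formulation: the fraction of positive/negative
    pairs where the positive case receives the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]           # 1 = developed AF (toy data)
scores = [0.1, 0.4, 0.35, 0.8]  # model-predicted risks
print(auc(labels, scores))  # 0.75
```

At an AF prevalence near 1.7%, the classes are heavily imbalanced, which is precisely why the protocol supplements the AUC with the precision-recall curve and predictive values.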
The CPRD-AURUM dataset will then be used to externally validate the model performance to assess generalisability. It has been shown that a lack of external validation has hampered the implementation of previous machine learning models in routine clinical practice. 56

Identification and quantification of the magnitude of predictors
The proposed deep learning models can extract informative risk factors from EHR data. Specifically, a risk factor selection strategy proposed by Huang et al 57 will be adapted to identify informative risk factors. The model will provide weights of the identified risk factors to help understand the significance of risk factors at different risk levels. The impact of the number of risk factors on AF risk prediction performance will be assessed through curves of both AUC and prediction accuracy plotted against the number of risk factors. Some predictors, such as body mass index, blood pressure, frequency of visits and strength of the prescribed medication, may change over time. The incremental prognostic value of including these variable trajectories will be explored, and the impact on predictive accuracy will be assessed.

Software
The deep learning models, including data preprocessing, missing data imputation, model development and validation, will be implemented either in R (through the RStudio interface) with TensorFlow, or in Python with PyTorch.

Ethics and dissemination
The study has been approved by CPRD (ref no: 19_076). Those handling data have completed University of Leeds information security training. All analyses will be conducted in concordance with the CPRD study dataset agreement between the Secretary of State for Health and Social Care and the University of Leeds.

Figure 1 An example of how a patient's EHR could be represented as a temporal matrix (A) compared with a sequence (B). In (A), time is on the x dimension and medical events are on the y dimension. In (B), the temporal information, in this example, is represented as the intervisit interval through timestamps (eg, t2-t1). EHR, electronic health record.
The study is informed by the Prognosis Research Strategy (PROGRESS) framework and recommendations. 58 The subsequent research paper will be submitted for publication in a peer-reviewed journal and will be written following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) and REporting of studies Conducted using Observational Routinely collected health Data (RECORD) guidelines. 59 60 If the model succeeds (defined as improving predictive accuracy by at least 5% compared with existing models), the algorithm could be made readily available through free-to-use software. The model will be designed to be amenable to in situ updating with new information so that prediction of an individual's AF risk is updated contemporaneously. The algorithm could be a built-in tool for use in general practices to 'screen' for patients at high risk of developing new-onset AF. Future research will be needed to assess the clinical impact of this risk model. At the point when utilisation in clinical practice is possible, the applicable regulation on medical devices will be adhered to. 61 When in clinical use, the model itself could also be reviewed and updated by a prespecified expert consensus group on an annual basis, incorporating evidence from post-service utilisation and the curation of more data.

CONCLUSIONS
AF is a common clinical problem with potentially catastrophic sequelae. A prediction model that may identify in a community setting which individuals will develop AF, and when this is most likely to occur, could enable targeted screening. This British Heart Foundation-funded study will contribute to knowledge about the detection of AF through prediction using a data science approach in routine EHR data. The use of AI technology may uncover new predictors in EHR and facilitate easier translation into clinical practice.