Determining the feasibility of calculating pancreatic cancer risk scores for people with new-onset diabetes in primary care (DEFEND PRIME): study protocol

Introduction Worldwide, pancreatic cancer has a poor prognosis. Early diagnosis may improve survival by enabling curative treatment. Statistical and machine learning diagnostic prediction models using risk factors such as patient demographics and blood tests are being developed for clinical use to improve early diagnosis. One example is the Enriching New-onset Diabetes for Pancreatic Cancer (ENDPAC) model, which employs patients’ age, blood glucose and weight changes to provide pancreatic cancer risk scores. These values are routinely collected in primary care in the UK. Primary care’s central role in cancer diagnosis makes it an ideal setting to implement ENDPAC but it has yet to be used in clinical settings. This study aims to determine the feasibility of applying ENDPAC to data held by UK primary care practices. Methods and analysis This will be a multicentre observational study with a cohort design, determining the feasibility of applying ENDPAC in UK primary care. We will develop software to search, extract and process anonymised data from 20 primary care providers’ electronic patient record management systems on participants aged 50+ years, with a glycated haemoglobin (HbA1c) test result of ≥48 mmol/mol (6.5%) and no previous abnormal HbA1c results. Software to calculate ENDPAC scores will be developed, and descriptive statistics used to summarise the cohort’s demographics and assess data quality. Findings will inform the development of a future UK clinical trial to test ENDPAC’s effectiveness for the early detection of pancreatic cancer. Ethics and dissemination This project has been reviewed by the University of Surrey University Ethics Committee and received a favourable ethical opinion (FHMS 22-23151 EGA). Study findings will be presented at scientific meetings and published in international peer-reviewed journals. Participating primary care practices, clinical leads and policy makers will be provided with summaries of the findings.


Introduction
Worldwide, pancreatic cancer has a poor prognosis. Earlier diagnosis may improve survival by enabling curative treatment. However, non-specific symptoms and a lack of suitable biomarkers make this challenging. Statistical and machine learning prediction models using risk factors such as patient demographics, symptoms and blood tests are being developed for clinical use to improve earlier diagnosis. One example is the Enriching New-onset Diabetes for Pancreatic Cancer (ENDPAC) model, which employs patients' age, blood glucose and weight changes to provide pancreatic cancer risk scores. However, ENDPAC has yet to be used in clinical settings. In the United Kingdom (UK), blood tests and weight measurements are routinely collected in primary care which, given primary care's central role in assessing and addressing patients' cancer risk, makes it an ideal setting to assess ENDPAC's feasibility.

Methods and analysis
DEFEND PRIME will be a multi-centre observational study determining the feasibility of extracting data from primary care providers in the UK to calculate ENDPAC scores. After developing the data extraction methods, 20 UK GP practices will provide anonymised data on participants aged 50+ years, with a glycated haemoglobin (HbA1c) test result of ≥ 48 mmol/mol and no previous abnormal HbA1c results. ENDPAC scores will be calculated, and descriptive statistics used to summarise the cohort's demographics and assess data quality. Findings will inform the development of a future clinical study, in which ENDPAC scores will be calculated and participants with elevated scores will be invited for clinical workup.


Introduction
Pancreatic cancer and early diagnosis
Pancreatic cancer is the seventh leading cause of global cancer deaths, with only 10-20 % of patients diagnosed at a sufficiently early stage for curative intervention.(1,2) Survival can be dramatically improved if the cancer is diagnosed earlier, at a local rather than distant stage: 37 % versus 3 % 5-year survival rate, respectively.(3,4) However, there are multiple barriers to early diagnosis, including the non-specific nature of the early symptoms (5) and the lack of suitable diagnostic biomarkers, although advances are being made in this area.(6)(7)(8)(9)(10)(11)(12) As with other health conditions, including cancers of other sites,(13) statistical and machine learning clinical prediction models are being developed for clinical use to facilitate earlier diagnosis of pancreatic cancer, in particular its most common subtype, pancreatic ductal adenocarcinoma.(3,14,15) These models range widely in complexity,(16)(17)(18)(19)(20) with the simplest models including only a few variables that can also be routinely collected in primary care, making them potentially feasible for use in this setting.

Primary care's role
In most developed countries, primary care constitutes a key element of healthcare provision; in the UK, 90 % of contacts with the National Health Service (NHS) are through primary care.(21) Primary care providers, including general practitioners (GPs), are expected to play a central role in assessing and addressing patients' cancer risk.(22) However, it is estimated that GPs see only one new case of pancreatic cancer every five years (23) and, when combined with the non-specific nature of its early symptoms, detection can be very difficult. Clinical prediction models are therefore of real potential value for these clinicians, especially given the challenges of diagnosing pancreatic cancer in the context of their busy work schedules.

Enriching New-Onset Diabetes for Pancreatic Cancer (ENDPAC) model
The simplicity of the ENDPAC model makes it ideally suited for use in primary care as it only uses patient age, weight change and blood glucose measurements, which are routinely collected.(20,24,25) The model is based on the well-documented association of pancreatic cancer with older age and the paradoxical development of diabetes with weight loss.(26)(27)(28)(29)(30)(31)(32)(33)(34) It also captures the more rapid onset of glycaemic dysregulation found in pancreatic cancer-related diabetes than found in type 2 diabetes. Due to the clinical diagnosis of diabetes sometimes occurring months or even years after diabetes onset,(35)(36)(37) ENDPAC instead uses the glycaemic onset to avoid these potential delays and thereby maximises new-onset diabetes' potential for the early diagnosis of pancreatic cancer.(20) The model has undergone external validation in two separate studies in the United States, which established that ENDPAC demonstrates a reasonable ability to differentiate patients with type 2 diabetes from those with glycaemia-defined diabetes who later develop pancreatic cancer.(24,25) It is for these reasons that this study will investigate ENDPAC's feasibility for use in UK primary care settings.

ENDPAC scores
ENDPAC calculates a score for the risk that a patient has pancreatic cancer, using their age and the changes over time in their weight and blood glucose results. According to the model's developers, a score ≤ 0 has a sufficiently high negative predictive value for pancreatic cancer that those with this score can be deemed as only needing management for type 2 diabetes, given their very low risk of pancreatic cancer. A score ≥ 3 is considered to warrant clinical workup for pancreatic cancer.(20) This is because in the original development study and two subsequent external validation studies, patients with a score ≥ 3 had, respectively, a 3.6 %, 2.0 % and 2.6 % 3-year risk of pancreatic cancer, with sensitivities of 78 %, 63 % and 42 %.(20,25,24) The reduced performance in the external validation studies is unsurprising, as performance is often lower when models are applied to populations different from those used to build the model.(38) Furthermore, Sharma et al. (20) suggest that with sufficient additional case review processes, 50 % of false positives can be removed, increasing the 3-year risk of pancreatic cancer for patients with a score ≥ 3 from 3.6 % to 10 %.

Rationale for this feasibility study
Before ENDPAC can be used in clinical practice in the UK, it is important to ascertain in what form the required information to calculate ENDPAC scores will be available in UK primary care records, as this is presently unclear. This study will address this, in addition to other pertinent challenges, such as how the data required to calculate ENDPAC scores can be extracted from the multiple different patient record management software platforms used in UK primary care.(39) Preliminary exploratory work with the stakeholder GPs involved in this study has demonstrated that the two primary systems in use, accounting for 85 % of the national market share,(39) EMIS Web by EMIS Health and SystmOne by The Phoenix Partnership (TPP), require the development of different approaches to extract the data needed to calculate ENDPAC scores. The reason for developing the extraction methods centrally and providing these for the GP practice staff to use is threefold: first, to maximise the number of GP practices able to calculate ENDPAC scores, as few are likely to have staff specialised in writing data extraction software; second, to minimise the time spent by GP practice staff in extracting the data, given their already busy work schedules; and third, to ensure consistency in the extraction methods used by the GP practices, so that there can be confidence in the results from each GP practice.
In addition, as values such as weight are often captured opportunistically, it may be that the required data are unavailable for many patients. Furthermore, the units of measurement used for certain variables such as height, weight and HbA1c may differ between practices, as may data management processes, potentially leading to differing data quality levels. The availability of the data and its quality will directly impact whether ENDPAC scores can be calculated, and therefore this information is of critical importance to learn in advance of considering using ENDPAC in clinical practice. This study will address these potential issues and will establish whether it is feasible to calculate ENDPAC scores using routine data contained within electronic healthcare records in UK primary care.

Study aim and objectives
The aim is to determine the feasibility of calculating ENDPAC scores for people with new-onset diabetes in UK GP practices.
The objectives are to:
1. Develop data extraction methods for primary care. We will work with software developers and GP practice staff who specialise in primary care computer systems to develop open source software and documented search strategies to enable data extraction.
2. Extract anonymised data for eligible participants from 20 GP practices and evaluate the quality and availability of data.
3. Calculate risk scores and undertake descriptive data analysis. We will report the number of people with ENDPAC scores warranting referral for pancreatic cancer investigations and their clinical and demographic characteristics.

Design and setting
Determining the feasibility of calculating pancreatic cancer risk scores for people with new-onset diabetes in primary care (DEFEND PRIME) is a multi-centre cohort study. We will extract anonymised data for people with new-onset diabetes from 20 GP practices in the UK. We will analyse the demographics of the cohort and their ENDPAC scores, and assess the quality and availability of data.

GP practice recruitment
We will use several recruitment strategies:
• Presentations and networking at conferences and meetings attended by clinicians and academics working in the early detection field.
• Newsletters sent by the Pancreatic Cancer Action charity, the Surrey and Sussex Cancer Alliance, the National Institute for Health and Care Research Clinical Research Network and the Royal College of General Practitioners.
• Advertising on social media channels to increase study awareness.
• Dissemination of study information by stakeholders and colleagues through their professional networks.
Practices will enrol by completing a data sharing agreement. Based on an hourly rate of £50 for an estimated seven hours' work to extract the data, each practice will be reimbursed £350.

Participant eligibility
For their data to be included, participants must be at least 50 years old, with new-onset diabetes identified between 1st January 2020 and 31st December 2022. For the purposes of this study, new-onset diabetes will be defined by an abnormal glycaemic test result of HbA1c ≥ 48 mmol/mol. All prior HbA1c test results for the participants must be below this level.
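The eligibility rules above can be sketched in code. This is a minimal illustration only, not the study's extraction software: the function names, the `(date, value)` tuple layout and the idea that the extract supplies the participant's age at the first abnormal result are all assumptions made for the example.

```python
from datetime import date

# Values taken from the eligibility criteria in the protocol.
HBA1C_THRESHOLD = 48.0            # mmol/mol
WINDOW_START = date(2020, 1, 1)
WINDOW_END = date(2022, 12, 31)
MIN_AGE = 50                      # years


def find_index_date(hba1c_results):
    """Return the date of the earliest HbA1c result at or above the threshold.

    hba1c_results: iterable of (date, value_in_mmol_mol) tuples, in any order
    (hypothetical layout). Taking the earliest abnormal result guarantees
    that every prior result is below the threshold.
    Returns None if no result reaches the threshold.
    """
    for result_date, value in sorted(hba1c_results):
        if value >= HBA1C_THRESHOLD:
            return result_date
    return None


def is_eligible(age_at_index, hba1c_results):
    """Apply the age, threshold and date-window rules to one participant.

    age_at_index: age in whole years at the first abnormal result
    (assumed here to be available in the anonymised extract).
    """
    index_date = find_index_date(hba1c_results)
    if index_date is None:
        return False
    if not (WINDOW_START <= index_date <= WINDOW_END):
        return False
    return age_at_index >= MIN_AGE
```

A participant whose first abnormal result predates 2020 is excluded even if later abnormal results fall inside the window, because the later results are then not the onset of hyperglycaemia.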

Data extraction methods
We will develop the data extraction methods with professional software developers and staff in the GP practices who specialise in creating searches in patient record management systems. The extraction methods developed will include software and detailed instructions to enable practice staff to perform the data extraction.
Table 1 details the data that will be extracted, which is modified from Sharma et al. (20) Prior to transfer to the research team, HbA1c, weight and BMI results will be selected according to the priorities defined by Sharma et al. (20) from the results available at the multiple defined timepoints for these variables shown in Table 1, with any excess results removed.
The final extract file containing anonymised data will then be securely transferred to the University of Surrey and stored on secure research drives accessible only by the research team. Participants will not be identified or contacted during this study.

Table 1 (excerpt). Carcinoembryonic antigen (CEA) result: earliest result in participant's history; state whether below or above threshold value and whether before or after index date.
*The term 'child' refers to the relationship between the main code (the 'parent' code) and its related ('child') codes, and does not relate to the age or relationship of the participants.

ENDPAC score calculation
For participants with the required results, ENDPAC scores will be calculated according to the process defined by ENDPAC's developers, using HbA1c mmol/mol results equivalent to the original calculator's fasting blood glucose and estimated average glucose results.(20)
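To illustrate what an HbA1c-to-glucose equivalence can look like, the sketch below applies two widely published conversions: the IFCC-to-NGSP master equation (mmol/mol to %) and the ADAG equation for estimated average glucose. The protocol does not state which conversion the study will use, so treating these two equations as the method is an assumption, and the function names are hypothetical.

```python
def ifcc_to_ngsp(hba1c_mmol_mol):
    """Convert HbA1c from IFCC units (mmol/mol) to NGSP units (%).

    Uses the published master equation: NGSP = 0.09148 * IFCC + 2.152.
    """
    return 0.09148 * hba1c_mmol_mol + 2.152


def ngsp_to_eag_mg_dl(hba1c_percent):
    """Estimated average glucose (mg/dL) from HbA1c (%).

    Uses the ADAG equation: eAG = 28.7 * HbA1c - 46.7.
    """
    return 28.7 * hba1c_percent - 46.7


def hba1c_to_eag_mg_dl(hba1c_mmol_mol):
    """Chain the two conversions: mmol/mol -> % -> mg/dL.

    Assumption for illustration only; the study's actual equivalence
    between HbA1c and glucose values may be defined differently.
    """
    return ngsp_to_eag_mg_dl(ifcc_to_ngsp(hba1c_mmol_mol))
```

Under these equations the study threshold of 48 mmol/mol corresponds to roughly 6.5 % and an estimated average glucose of about 141 mg/dL.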

Data analysis
Table 2 shows the descriptive statistics that will be used to describe the demographics of the cohort, including counts with percentages, means with standard deviations (SD) and medians with interquartile ranges (IQR). We will assume that data are missing at random and therefore results will be calculated using the available data. The proportion of missing data will be described using counts with percentages. We will provide counts with percentages of participants for whom HbA1c, weight, BMI and height results are available, specifying the timepoints for which they are available. For the participants meeting the study's inclusion criteria, we will be able to assess the availability of data for the variables listed through the analysis summarised in Table 2. We will report on the number of cases for whom ENDPAC scores can be calculated per practice, in addition to the distribution of the scores into high- (≥3), intermediate- (1-2) or low-risk (≤0) groups for pancreatic cancer at the time they first meet the glycaemic definition of new-onset diabetes,(20) and include the timepoints from which the HbA1c and weight results were taken. This will enable us to provide estimates on the number of patients who would need clinical workup for pancreatic cancer if the ENDPAC model were to be deployed across the UK, thereby assessing the potential resource burden on the NHS.
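The grouping of scores and the counts with percentages described above could be computed along the following lines. This is a hedged sketch rather than the study's analysis code: the function names are hypothetical, and representing a participant whose score cannot be calculated as `None` is an assumption.

```python
from collections import Counter

GROUPS = ("high", "intermediate", "low")


def risk_group(score):
    """Map an ENDPAC score to the risk groups named in the protocol:
    high (>= 3), intermediate (1-2) or low (<= 0)."""
    if score >= 3:
        return "high"
    if score <= 0:
        return "low"
    return "intermediate"


def summarise_groups(scores):
    """Return {group: (count, percentage)} over the calculable scores.

    scores: list of numeric scores, with None where the score could not
    be calculated (hypothetical convention). Percentages use the number
    of calculable scores as the denominator, i.e. available-data analysis.
    """
    calculable = [s for s in scores if s is not None]
    counts = Counter(risk_group(s) for s in calculable)
    total = len(calculable)
    return {
        group: (
            counts[group],
            round(100 * counts[group] / total, 1) if total else 0.0,
        )
        for group in GROUPS
    }


def missing_summary(scores):
    """Count and percentage of participants with no calculable score."""
    missing = sum(1 for s in scores if s is None)
    pct = round(100 * missing / len(scores), 1) if scores else 0.0
    return missing, pct
```

Running the same summary per practice would give the per-practice counts of calculable scores mentioned above.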

Project governance
The study will be overseen by a steering group of GPs from the Surrey and Sussex Cancer Alliance. They will meet with the study team every two months to discuss progress. The steering group and other stakeholders, including the NHS Cancer Programme strategy team, the charitable patient advocate group Pancreatic Cancer Action (PCA) and a pancreatic cancer survivor, have already been and will continue to be involved throughout the study, providing advice and guidance on study design, recruitment, and dissemination strategies.

Patient and public involvement
PCA has a well-established patient and public involvement group. Their expertise and feedback will be incorporated throughout this study and they will also support study dissemination, including publication writing and seminars.

Discussion
For ENDPAC to be of benefit to patients in the UK, the availability and quality of the required data and the means to extract it from primary care providers need to be established, and this will be the first study the authors are aware of to do so in a UK setting. In achieving this, it will be the first to develop scalable semi-automated methods for data extraction to obtain the results required to calculate ENDPAC scores from UK primary care patient record management software platforms. In addition, the extraction methods will enable GP practice staff to validate a portion of the extracted data prior to transfer to the research team, thereby providing confidence in the findings. Through assessing the availability and quality of the data, the feasibility of rolling out ENDPAC in UK primary care can be established, and the resource impact on the NHS estimated, based on the number of participants warranting clinical workup through sufficiently high ENDPAC scores.
Preliminary exploratory analysis undertaken by the research team of coded diabetes diagnoses in primary care records indicates that on average, approximately 30-40 patients per practice in the UK are diagnosed with new-onset diabetes annually. As the study covers three years, and as we are defining new-onset diabetes solely using HbA1c results rather than relying on coded diagnoses, it is estimated that each participating GP practice will provide data for at least 100 participants. Therefore, with 20 practices participating, approximately 2000 participants' records will be provided. Only anonymised data will be extracted and analysed, meaning that individual participants' ENDPAC scores will not be reported.
The quality of routine data presents a challenge in any data-driven study. For example, weight, BMI and HbA1c measurements are opportunistically collected in clinical practice, and therefore are not necessarily available at regular time intervals. Discussion with the study's steering group established that, even though weight and height are needed to calculate BMI, patient record management systems do not always require the underlying values at that timepoint to be entered when recording or calculating BMI. As ENDPAC requires weight results for score calculation, we are extracting BMI and height values in addition to weight, to enable back-calculation of weight results if only BMI and height values are provided at the required timepoints. This will maximise the number of eligible participants for whom an ENDPAC score can be calculated. We will provide feedback to the practices if any particular issues are encountered with missing results in their practices, and how they might improve on this.
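The back-calculation of weight follows directly from the definition of BMI (weight in kg divided by the square of height in metres), so weight = BMI × height². A minimal sketch is given below; the function names are hypothetical, and it assumes that height has already been normalised to metres and BMI to kg/m², which may need a separate unit-cleaning step given that units can differ between practices.

```python
def back_calculate_weight_kg(bmi, height_m):
    """Recover weight (kg) from BMI (kg/m^2) and height (m).

    Rearranges BMI = weight / height^2 to weight = BMI * height^2.
    """
    if bmi <= 0 or height_m <= 0:
        raise ValueError("BMI and height must be positive")
    return bmi * height_m ** 2


def weight_at_timepoint(weight_kg=None, bmi=None, height_m=None):
    """Choose the weight to use at one timepoint.

    Prefers a directly recorded weight; otherwise falls back to
    back-calculation when both BMI and height are present; otherwise
    returns None to mark the result as unavailable.
    """
    if weight_kg is not None:
        return weight_kg
    if bmi is not None and height_m is not None:
        return back_calculate_weight_kg(bmi, height_m)
    return None
```

For example, a recorded BMI of 25.0 kg/m² with a height of 1.80 m corresponds to a weight of about 81 kg.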
In this study we will use single, unpaired HbA1c results, whilst ENDPAC was originally developed for use with paired results from a combination of fasting blood glucose, random average glucose, HbA1c or oral glucose load test results.(20) This is because both external validation studies reported that participants had substantially more HbA1c results than other blood glucose measurements, and recommended that ENDPAC be applied in a real world setting to those diagnosed with diabetes through HbA1c only.(24,25) In addition, Khan et al.'s external validation successfully used single HbA1c results, rather than requiring paired results.(24) Furthermore, the UK stakeholder GPs involved in planning the current study have highlighted that HbA1c is their preferred means of assessing patients' blood glucose, and is the principal method for diabetes monitoring in the UK.(40) It is for these reasons that HbA1c results will be used in this study.
The three-year time window in use by this study is based on the significantly increased risk of pancreatic cancer in the three years after diagnosis of new-onset diabetes.(20,41,42) Sharma et al. (20) suggest that with sufficient additional case review processes for those having ENDPAC scores calculated, 50 % of false positives can be removed, increasing the 3-year risk of pancreatic cancer for patients with an ENDPAC score ≥ 3 from 3.6 % to 10 %. This process includes reviewing patients' records for other causes of weight loss, recent steroid use causing rapid blood glucose increases, and uncontrolled diabetes causing rapid weight gain pre-index and rapid weight loss post-index. For the purposes of this feasibility study, such additional case review processes are not considered necessary for inclusion within the data extraction process. This is because, depending on the outcome of this feasibility study, we plan to design and deliver a clinical intervention in collaboration with patients and clinicians, aiming to improve early diagnosis by using ENDPAC scores. In the future study, after clinical consultation involving manual case review by the participants' GPs to assess each participant's suitability for participation, participants with an elevated ENDPAC score will be invited for further investigations, such as blood tests and pancreatic scans, to rule out or diagnose pancreatic cancer.
Data extracts created as part of this project will remain under the management of GP practices. Data will not be made open access or deposited in any repository, as outlined in the data sharing agreement. Subject to all necessary approvals, data may be made available for secondary use by the GP practices, who remain data controllers.
Results will be presented at scientific meetings and published in international peer-reviewed journals. Summaries will be provided to the participating GP practices, clinical leads, and policy makers.