Original research
Investigating the optimal handling of uncertain pregnancy episodes in the CPRD GOLD Pregnancy Register: a methodological study using UK primary care data
  1. Jennifer Campbell1,2,
  2. Krishnan Bhaskaran2,
  3. Sara Thomas3,
  4. Rachael Williams1,
  5. Helen I McDonald3,
  6. Caroline Minassian2
  1. 1Clinical Practice Research Datalink, Medicines and Healthcare Products Regulatory Agency, London, UK
  2. 2Faculty of Epidemiology and Population Health, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
  3. 3Faculty of Epidemiology and Population Health, Department of Infectious Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
  1. Correspondence to Jennifer Campbell; jennifer.campbell{at}mhra.gov.uk

Abstract

Objectives To investigate why episodes of pregnancy identified from electronic health records may be incomplete or conflicting (overlapping), and provide guidance on how to handle them.

Setting Pregnancy Register generated from the Clinical Practice Research Datalink (CPRD) GOLD UK primary care database.

Participants Female patients with at least one pregnancy episode in the Register (01 January 1937−31 December 2017) which had no recorded outcome or conflicted with another episode.

Design We identified multiple scenarios potentially explaining why uncertain episodes occur. Criteria were established and systematically applied to determine whether episodes had evidence of each scenario. Linked Hospital Episode Statistics were used to identify pregnancy events not captured in primary care.

Results Of 5.8 million pregnancy episodes in the Register, 932 604 (16%) had no recorded outcome, and 478 341 (8.5%) conflicted with another episode (251 026 distinct conflicting pairs of episodes among 210 593 women). 826 146 (89%) of the episodes without outcome recorded in primary care and 215 577 (86%) of the conflicting pairs were consistent with one or more of our proposed scenarios. For 689 737 (74%) episodes with recorded outcome missing and 215 544 (86%) of the conflicting pairs (at least one episode), supportive evidence (eg, antenatal records, linked hospital records) suggested they were true and current pregnancies. Furthermore, 516 818 (55%) and 160 936 (64%), respectively, were during research quality follow-up time. For a sizeable proportion of uncertain episodes, there is evidence that the recording of historical outcomes by the general practitioner during an ongoing pregnancy may offer an explanation (349 874 (37.5%) of episodes without a recorded outcome and 73 208 (29.2%) of conflicting pairs).

Conclusions This work provides insight to users of the CPRD Pregnancy Register on why uncertain pregnancy episodes exist and indicates that most of these episodes are likely to be real pregnancies. Guidance is given to help researchers consider whether to include/exclude uncertain pregnancies from their studies, and how to tailor approaches to minimise underestimation and bias.

  • maternal medicine
  • public health
  • epidemiology

Data availability statement

Data may be obtained from a third party and are not publicly available. The data used for this study were obtained from the Clinical Practice Research Datalink (CPRD). All data are available via an application to CPRD’s Research Data Governance (RDG) Process (see https://www.cprd.com/research-applications). Data acquisition is associated with a fee and subject to ethics approval.

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

Strengths and limitations of this study

  • This work carefully examines the way in which pregnancies are recorded in electronic health data in order to maximise its usefulness for pregnancy research.

  • Detailed scenarios were developed as to why uncertain pregnancy episodes may occur along with criteria which researchers can apply to ascertain which episodes may fit each scenario.

  • Clinician advice and clinical guidelines were used to generate assumptions as to why and when clinicians may record information relating to pregnancy; however, these may not be correct in every case.

  • Electronic health data are not collected for the purposes of research and can be messy for a variety of reasons, some of which may not have been captured in this study.

Introduction

Understanding how diseases, drugs and other exposures affect pregnant women and their children is an important public health priority. However, pregnant women are excluded from many trials due to potential risks to the woman and her unborn child. Observational research using electronic healthcare records (EHRs) has thus become a well-established and vital tool for investigating disease prevalence, risk factors and pharmacovigilance in pregnant women. UK primary care databases are particularly useful because, under the gate-keeper healthcare system, all antenatal care is overseen by a general practitioner (GP).1 One example of such a database is CPRD GOLD. This database is produced and maintained by the Clinical Practice Research Datalink (CPRD), a government research service collecting de-identified and fully coded patient-level EHRs from primary care practices across the UK.2 However, challenges such as incomplete data capture in EHR data can make it difficult to accurately identify the start and end of pregnancies. Recently, a collaboration between CPRD and the London School of Hygiene and Tropical Medicine established a Pregnancy Register of all pregnancies in CPRD GOLD3 which includes approximately 6 million estimated pregnancies (henceforth, pregnancies in the Register will be referred to as pregnancy episodes).

Previous approaches to generating pregnancy registers have been limited by the exclusion of pregnancies without identified outcomes and pregnancy records which do not fit chronologically into an identified pregnancy episode.4 Ignoring these records potentially excludes periods when women were pregnant. If these pregnancies systematically differ from those captured more completely, their exclusion may lead to bias. For example, pregnancies ending in miscarriage may be less likely to have the outcome recorded than pregnancies ending in live birth.3 Ignoring pregnancy data which are challenging to interpret may therefore underestimate adverse outcomes. Incomplete capture of pregnancies also impacts descriptive studies that need pregnancies as denominator data, such as vaccine uptake studies. A further limitation of previous approaches is that some women have pregnancies that seemingly overlap in the data, and these are not addressed. These conflicting pregnancies highlight that estimated timings of some pregnancies may be suboptimal and/or some pregnancy episodes may not be true pregnancies. Approaches which exclude incongruent or incomplete pregnancy data may lead to misclassification of exposure timings.

The unique advantage of the CPRD Pregnancy Register is that it uses all pregnancy data in CPRD GOLD, thereby capturing all documented pregnancies regardless of completeness. However, this also presents interpretational challenges: approximately 950 000 pregnancy episodes (16% of all pregnancy episodes) have no outcome recorded and approximately 500 000 pregnancy episodes conflict with another episode for the same woman (episodes identified by the algorithm with at least 1 day of overlap). These episodes are flagged in the Register enabling researchers to identify them when designing their study. However, there may be multiple reasons for the occurrence of uncertain episodes and therefore absolute rules on whether to include or exclude them from a study may be inappropriate.

We therefore aimed to investigate possible reasons why the algorithm used to generate the CPRD Pregnancy Register identifies uncertain episodes and thus generate information to guide future use of this important resource. Our specific objectives were:

  1. To identify potential scenarios which may result in pregnancy episodes without a recorded outcome or those which conflict with another episode for the same woman.

  2. To use available data (including linked data) to investigate these potential scenarios and flag pregnancy episodes which are consistent with each one.

  3. To provide information to researchers using the Register to help inform their decisions on how to handle these uncertain episodes when designing studies.

Methods

Data sources

CPRD primary care data and the Pregnancy Register

The CPRD GOLD UK primary care database contains registration information and all care events that general practice staff record to support clinical care. This includes demographic information (birth year, sex, etc), clinical events (signs, symptoms, medical diagnoses), referrals to specialists and secondary care, prescriptions issued in primary care, vaccinations, test results, lifestyle information (eg, smoking status) and other care administered as part of GP practice.5 CPRD data also contain indicators of data quality at the patient level (known as the acceptability flag; online supplemental appendix 1) and at the practice level (known as the practice up-to-standard (UTS) date; online supplemental appendix 1). As CPRD GOLD is a longitudinal database, updated monthly, it contains variables indicating whether the patient and practice are still contributing data.

The Pregnancy Register lists and characterises all pregnancies identified in CPRD GOLD based on an algorithm.3 A single record represents a unique pregnancy episode. Each woman may have multiple episodes. Information includes the estimated start and end of pregnancy, its outcome (when recorded) and whether it was a singleton or multiple pregnancy. For live birth pregnancies, patient identifiers of linked babies identified through the CPRD Mother-Baby-Link6 are provided. Figure 1 gives an overview of the algorithm steps, including how gestational ages were applied, and online supplemental appendix 2 gives a list of the variables provided in the Register. The figure in online supplemental appendix 3 shows an example of how a real pregnancy might manifest in (a) raw CPRD GOLD data and (b) the processed Pregnancy Register dataset.

Figure 1

Pregnancy register algorithm steps used to create the CPRD Pregnancy Register. CPRD, Clinical Practice Research Datalink.

Linked data

Person-level linkage of CPRD primary care data with other datasets (eg, Hospital Episode Statistics (HES)) is available for English practices that have consented to participate in the linkage scheme.7 These linkages cover approximately 56% of contributing CPRD GOLD practices in the UK. Where available, we used linked data to look for further information about the pregnancy episodes within the Register. HES APC (Admitted Patient Care) data include information on admission and discharge dates, diagnoses, specialists seen and procedures undertaken for linked patients with a hospitalisation record.8 We searched HES APC data for records of pregnancy outcomes using International Classification of Diseases (ICD-10) and Operating Procedure Codes (OPCS) (online supplemental appendices 4 and 5). HES APC maternity records were also used: a recording of an acceptable value in any of the variables identified as relating to delivery (online supplemental appendix 6) was taken as evidence that a delivery had taken place.

The HES Diagnostic Imaging Dataset (DID) provides detailed information about diagnostic imaging tests, including X-rays, MRI scans and fetal growth scans, taken from National Health Service (NHS) providers' radiological information systems. This was used for records of fetal scans. Office for National Statistics (ONS) mortality data were also used to ascertain additional death records which may have been missing from CPRD.

We used set 17 of the CPRD linked data for which the coverage periods were: HES APC 01 April 1997–31 July 2017; HES DID 01 April 2012–31 July 2017; ONS Mortality Data 02 January 1998–19 September 2017.

Study population

This study included all individuals who had at least one pregnancy episode without a recorded outcome or at least one conflicting pregnancy episode in the February 2018 version of the Pregnancy Register. All pregnancy records for these patients were extracted from the CPRD GOLD database using the pregnancy code-list upon which the pregnancy algorithm is based,3 thereby creating a dataset which included all pregnancy records and the summary Pregnancy Register information for these women. Women were followed up until the minimum of leaving the practice, death or practice last collection date. In the linked data analysis, women with HES records beyond this point were followed up until the end of linked data coverage.

Identifying scenarios to explain the occurrence of uncertain episodes

Potential scenarios which may result in uncertain pregnancy episodes, including those without recorded outcomes and those which conflicted with another episode, were identified through discussions with the creators of the Register (CM, ST, RW), clinicians and CPRD data experts. The scenarios are based on the structure of the CPRD GOLD data and the Pregnancy Register algorithm (figure 1, steps 1–8). The scenarios are not mutually exclusive; thus, episodes may be consistent with more than one scenario.

Pregnancy episodes with recorded outcome missing

Scenarios with the potential to result in episodes with missing outcomes were identified. There are four overarching problems with various specific scenarios within them: the pregnancies are true and current, but the outcome was not captured in CPRD primary care data; the pregnancies are true and current, but the pregnancy was still ongoing at the end of follow-up in the database; the patient was not pregnant at the time of the database record; the pregnancy is really part of another pregnancy episode in the Register. The 12 scenarios which fall under these problems are described in table 1.

Table 1

Description of potential scenarios leading to pregnancy episodes with no recorded outcome and scenario criteria applied

Conflicting pregnancy episodes

Scenarios with the potential to result in conflicting episodes were proposed and are described in detail in table 2. Identifying the scenarios was an iterative process: after applying initial scenarios, we took a sample of 50 conflicting pregnancy episodes and reviewed the patient data. This allowed us to validate existing scenarios and identify further scenarios. Scenarios can be grouped under four overarching problems: both pregnancies are true but one is a historical pregnancy; both pregnancies are historical; both pregnancies are true and current but the gestation of the second pregnancy estimated by the algorithm is too long; the woman was pregnant, but one pregnancy has been split into multiple episodes by the rules of the algorithm (online supplemental appendix 3).

Table 2

Description of potential scenarios leading to conflicting episodes and scenario criteria applied

Applying criteria to identify evidence of each scenario

Evidence in HES

For each episode, it was ascertained whether the woman was eligible for linkage to other data and whether the episode occurred within the coverage period of each linked data source. For pregnancy episodes occurring within the linkage coverage period, the linked HES data were examined for evidence of pregnancy outcomes. The period for which outcomes were searched was from the episode start date to 9 months after the episode end date; we excluded from this analysis pregnancies where this period was entirely outside the coverage dates for linked HES data.
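The search window described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, "9 months" is assumed to mean calendar months, and clipping the window to the coverage period is an added assumption. The HES APC coverage dates in the usage note are those stated for set 17 of the linked data.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole calendar months, clamping to the month end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def hes_search_window(episode_start, episode_end, coverage_start, coverage_end):
    """Return the outcome search window for one episode, or None.

    The window runs from the episode start to 9 months after the episode
    end. Episodes whose window lies entirely outside linked-data coverage
    are excluded (None). Clipping to coverage is an assumption made here.
    """
    window_start = episode_start
    window_end = add_months(episode_end, 9)
    if window_end < coverage_start or window_start > coverage_end:
        return None
    return max(window_start, coverage_start), min(window_end, coverage_end)
```

For example, with HES APC coverage of 01 April 1997 to 31 July 2017, an episode running from 1 January 1995 to 1 September 1995 has a window ending on 1 June 1996 and would be excluded, whereas an episode ending on 15 October 1996 has a window that reaches into the coverage period and would be retained.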

ICD-10 and OPCS code lists were used to look for evidence of outcomes in the HES APC Episodes, Diagnosis and Procedures tables (online supplemental appendices 4 and 5). In the HES APC maternity data, a recording of an acceptable value in any of the variables identified as relating to delivery (online supplemental appendix 6) was flagged as evidence that a delivery had taken place. In the HES outpatient data, an ICD-10 code list for evidence of delivery, termination or early pregnancy loss was used. SNOMED codes (online supplemental appendix 14) were used to identify all fetal scan records in the HES DID data.

Pregnancy episodes with recorded outcome missing

All episodes coded as outcome unknown (‘13’ in the outcome field) were extracted from the Pregnancy Register. For each episode, we extracted information on the timing of the episode in relation to the start and end of patient follow-up and the period of research standard (UTS) data recording in CPRD, and we also searched for relevant codes in the patient’s record, namely: early pregnancy codes which were likely to be recorded in the patient’s first antenatal visits to the GP; codes which are likely to be recorded by the GP as clinically important in the patient’s medical history even when the patient was not pregnant; codes which may indicate an outcome but were originally classified by the Register as antenatal; codes which are likely to be recorded by the GP as part of a consultation about the potential health impacts on a patient of becoming pregnant (code lists in online supplemental appendices 7–9).

For each scenario, a set of criteria based on how these should appear in the data were established (described in detail in table 1). Criteria were systematically applied to the data to establish which episodes were consistent with each scenario.
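As an illustration of how date-based criteria of this kind might be applied, the sketch below flags two of the scenario groups using only episode and follow-up dates. The full criteria are those in table 1 and are not reproduced here; the field names, the `Episode` class and the exact comparisons (for example, treating an episode as pre-registration when its start precedes the later of the registration and UTS dates) are assumptions for illustration only. Because scenarios are not mutually exclusive, the function returns a set of flags.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Episode:
    start: date               # estimated start from the Register
    end: date                 # estimated (predicted) end from the Register
    registration_start: date  # patient's current registration date
    uts_date: date            # practice up-to-standard date
    follow_up_end: date       # leaving practice, death or last collection


def scenario_flags(ep):
    """Return the set of (hypothetical) scenario labels an episode fits."""
    flags = set()
    # Scenario 1c: episode predates current registration or UTS follow-up.
    research_start = max(ep.registration_start, ep.uts_date)
    if ep.start < research_start:
        flags.add("1c")
    # Scenarios 2a/2b: follow-up ended before the predicted end of pregnancy.
    if ep.follow_up_end < ep.end:
        flags.add("2a/2b")
    return flags
```

An episode can carry both flags, one flag or none; episodes with no flag would need to be checked against the remaining code-based scenarios.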

Conflicting pregnancy episodes

All conflicting episodes (those with at least 1 day of overlap with another episode for the same woman) were ascertained using the conflict flag in the Register. Pregnancy episodes may conflict with more than one other episode. Each conflicting pair was treated separately and therefore an individual pregnancy episode could appear in the analysis multiple times. A dataset was created which contained one row per pair of conflicting pregnancy episodes.
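The one-row-per-pair structure can be sketched as below for a single woman's episodes. In the study the conflicts were taken from the conflict flag in the Register; this sketch instead recomputes them from dates, assuming start and end dates are inclusive so that an episode starting on the day another ends counts as 1 day of overlap. The function name and tuple layout are hypothetical.

```python
from datetime import date
from itertools import combinations


def conflicting_pairs(episodes):
    """List conflicting episode pairs for ONE woman.

    episodes: iterable of (episode_id, start, end) tuples.
    Returns (earlier_id, later_id) tuples, earlier start date first.
    """
    ordered = sorted(episodes, key=lambda e: (e[1], e[2]))
    pairs = []
    for first, second in combinations(ordered, 2):
        # first starts no later than second, so the two episodes overlap
        # by at least 1 day when second starts on or before first ends.
        if second[1] <= first[2]:
            pairs.append((first[0], second[0]))
    return pairs
```

An episode that overlaps two others appears in two rows, which is why an individual episode can be counted more than once in the pair-level analysis.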

Episodes were ordered by start date with episode one being the earlier start date of the two. Descriptive variables were added to the dataset from the CPRD GOLD data to indicate if the episodes were during current registration and UTS follow-up. Pregnancy episode outcomes were grouped into three categories: delivery, loss or missing, and a variable was generated to indicate the combination of outcomes in each conflicting pair (online supplemental appendix 12).
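A minimal sketch of the outcome grouping and the pair-level combination variable is given below. The outcome labels and their assignment to the three categories are hypothetical placeholders (the Register stores coded outcomes, and the actual grouping is given in online supplemental appendix 12); keeping the combination in episode order is also an assumption.

```python
# Hypothetical outcome labels mapped to the three analysis categories.
OUTCOME_GROUP = {
    "live birth": "delivery",
    "stillbirth": "delivery",
    "miscarriage": "loss",
    "termination": "loss",
    "ectopic": "loss",
    "unknown": "missing",
}


def outcome_combination(first_outcome, second_outcome):
    """Combine the grouped outcomes of a pair, earlier episode first."""
    return f"{OUTCOME_GROUP[first_outcome]}+{OUTCOME_GROUP[second_outcome]}"


def has_loss(first_outcome, second_outcome):
    """True when at least one episode in the pair ended in loss."""
    groups = (OUTCOME_GROUP[first_outcome], OUTCOME_GROUP[second_outcome])
    return "loss" in groups
```

An unrecognised label raises a `KeyError` rather than being silently treated as missing, so that coding errors are not mistaken for episodes without an outcome.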

For each scenario, a set of criteria based on how these should appear in the data were established (described in detail in table 2). Criteria were systematically applied to the data to establish which conflicting pairs were consistent with each scenario.

Patient and public involvement

There was no patient or public involvement in this methodological work.

Results

There were 2 438 493 women with a pregnancy episode in the February 2018 version of the Pregnancy Register; of these patients, 731 368 (30%) had at least one uncertain episode. Mean patient follow-up time for all women was 4720 days; this was slightly lower for women with a missing outcome record (4349 days) (table 2). Women with an uncertain episode were more likely to be over 30 years of age. Uncertain pregnancy episodes were also more likely to be recent (after 2000) (table 2).

Pregnancy episodes with recorded outcome missing

Of the 5.8 million pregnancy episodes in the Pregnancy Register, there were 932 604 (16%) episodes with no recorded outcome, of which over half (516 818, 55.4%) were during UTS follow-up and current registration (table 3). A total of 826 146 (89%) had evidence consistent with at least one of the identified scenarios (table 4). Overall, 689 737 (74%) had evidence of a scenario indicating they were true (either current or historical) pregnancies (scenarios 1a, 1b, 1c, 2a, 2b or 4e). The largest proportion of pregnancy episodes occurred before the patient registered at their current practice which contributed the data to CPRD or before that practice was deemed to be contributing research standard data (415 807, 44.6% scenario 1c). A total of 211 070 (22.6%) episodes had data in HES consistent with the outcome occurring in hospital and not being fed back to the GP (scenario 1a), representing approximately 50% of episodes with recorded outcome missing which were eligible for linkage. HES APC data were the most useful linked data source for ascertaining pregnancy outcomes, with a small number found in HES outpatient data (online supplemental appendix 15).

Table 3

Baseline characteristics of the pregnancy episodes in the February 2018 Pregnancy Register

Table 4

Numbers of pregnancy episodes with recorded outcome missing which were consistent with applied criteria for each scenario*

The second most common potential explanation for pregnancies without outcome was scenario 4d, where a code relating to the patient’s pregnancy history may have been recorded by the GP while the patient was pregnant. A total of 349 874 (37.5%) episodes without outcome were consistent with this scenario. Relatively fewer episodes were consistent with scenarios 4a, 4b and 4e, and none were consistent with 4c. For 242 698 (26%) episodes, follow-up ended before the predicted end of the pregnancy (scenarios 2a and 2b); for 822 of these episodes (<0.1% of all episodes with no recorded outcome), follow-up ended due to death. Only small proportions of episodes were consistent with other scenarios. The distribution of scenarios that occurred during the period left censored by the practice UTS date and patient current registration date was similar to that of the Pregnancy Register as a whole (table 4, online supplemental appendix 16).

Conflicting pregnancy episodes

There were 478 341 (8.5%) pregnancy episodes with a conflict recorded in the February 2018 Pregnancy Register, amounting to 251 026 conflicting pregnancy pairs. Over half of the pairs (160 936, 64%) were during UTS follow-up and current registration. There were 215 577 (86%) pairs which were consistent with at least one identified scenario. Of the remaining 35 449 (14%), less than half were during UTS follow-up and current registration (a table showing these pregnancies by scenario is given in online supplemental appendix 17). Across all scenarios, at least 40% were during UTS follow-up and current registration. Of the pregnancy pairs, 215 544 (86%) had evidence of a scenario indicating that at least one episode was a true and current pregnancy (scenarios 1a, 1b, 3a, 3b and 4a–e). Most conflicting pairs had at least one pregnancy episode ending in loss (201 783, 80.3%) (online supplemental appendix 18). Furthermore, 41% (101 760) of pairs included at least one pregnancy with no outcome recorded.

A total of 75 672 (30%) of all conflicting pairs were shown to have evidence that they were consistent with problem 1, that a patient had a record relating to the outcome of a previous pregnancy recorded during a current pregnancy. This includes scenario 1b: a record of a previous loss recorded during a pregnancy ending in delivery or vice-versa, one of the most common scenarios (29% of conflicting pairs) (table 5).

Table 5

Numbers of conflicting pregnancy episodes which were consistent with applied criteria for each scenario*

A total of 73 191 (29%) of pairs were consistent with scenario 4e: that adjustment of pregnancy dates by the algorithm had led to unassigned records. Of these, over 96% (70 472) were consistent with this scenario only, and 73% (53 464) of these pairs had a linked baby identified. A total of 43 581 (17.4%) of pairs had evidence that they were consistent with further antenatal information having been recorded after the end of pregnancy (scenario 4b).

For approximately 16% (39 373) of conflicting pairs, there was evidence to suggest that the gestation of the second pregnancy episode specified by the algorithm may have been too long, leading to an overlap (scenarios 3a and 3b).

Ten per cent of conflicting pairs had a loss and delivery recorded on the same date and no ‘current pregnancy’ antenatal codes suggesting they may have been recorded as part of an obstetric history (scenario 2a). Only small percentages of episodes were consistent with other scenarios. Proportional distribution of the scenarios was similar when restricted to those recorded during UTS and current registration to that of the whole Pregnancy Register.

Discussion

This work has shown that uncertain pregnancy episodes in the CPRD Pregnancy Register can contain valuable information about a woman’s pregnancy. A high proportion of the uncertain episodes were during research quality follow-up time and therefore comprise data which would usually be included in study designs.9 We have systematically identified potential reasons for the existence of uncertain episodes within the Pregnancy Register to allow researchers to consider in more detail whether inclusion is appropriate for their study. This work adds further value to the CPRD Pregnancy Register, which is already unique in its inclusion of all pregnancy data regardless of completeness.3 4 To our knowledge, no previous studies have attempted to examine uncertain pregnancies in EHR data in this way, and many of the scenarios we have described will also be applicable to other EHR data sources.

We found that most episodes with a missing outcome could be explained by the outcomes not being captured in the CPRD GOLD primary care database; either the patient was not registered at the time of the pregnancy, the outcome was not recorded by the GP but could be found in linked data, or follow-up ended before the outcome. These are likely to be genuine and contemporaneous pregnancies which would be missed if episodes with recorded outcome missing were excluded from the Register. In fact, most of the scenarios we identified are consistent with the episodes being true and current pregnancies. When conducting drug utilisation or vaccine uptake studies, researchers may wish to include episodes where the database follow-up ended before the outcome to avoid underestimation especially for new drugs or vaccination programmes. Further to our objective to provide guidance, table 6 outlines potential considerations for researchers deciding whether to include or exclude uncertain episodes from their study.

Table 6

Issues with different approaches to dealing with uncertain episodes and recommendations

There is evidence to suggest that historical outcomes being recorded by the GP during an ongoing pregnancy may explain a sizeable proportion of the uncertain episodes generated by the algorithm. This can lead to true pregnancies being split by the algorithm and depending on the timing, this will either generate an additional episode with outcome missing or two separate episodes with outcomes (figure 1, step 3). In either case, the resulting episodes may conflict with one another. Based on our findings, this appears to be something that happens fairly frequently. One concern is that these episodes are likely to appear more frequently for women with a history of complicated pregnancy outcomes. For example, previous caesarean sections may be likely to be noted by the GP during current care as would outcomes such as ectopic pregnancies. Researchers should be aware that exclusion of women who have overlapping pregnancies for this reason might therefore systematically exclude those with a history of pregnancy complications, introducing bias.

It is also possible that current pregnancies with serious complications are more likely to have an uncertain episode in the Register. For example, women with pre-eclampsia are more likely to have consultant-led antenatal care carried out in hospital, increasing the chances that their primary care record is incomplete and has no recorded outcome.10 This data pattern is likely to result in the pregnancy being split into multiple episodes without outcome (figure 1, step 8). Dropping all uncertain episodes at the study design stage may mean that these patients are missed. Researchers who are interested in specific pregnancy complications should take this into consideration and use a tailored approach when selecting a study population.

While some conflicting episodes may be caused by poor quality data, there are many conflicting episodes for which it may be possible to clarify which time period is likely to be the true pregnancy. We found that episode conflicts were more likely to occur for pregnancies ending in loss; this is of little surprise given the wider variation around the true gestation of such pregnancies.11 There was also a large overlap between the conflicting episodes and those that were missing an outcome. Again, this is not surprising as the start and end dates for the missing outcome episodes have large margins of error, given they are often estimated based on one or two antenatal codes (figure 1, step 8).3 Not including uncertain episodes may lead to underascertainment of miscarriage as an outcome. However, including them all may lead to exposure status misclassification due to mistimed start and end dates or past pregnancy outcomes being counted.

Researchers may consider using multiple imputation to handle missing outcomes. However, there is a strong likelihood that pregnancy outcomes are not missing at random, in which case both multiple imputation and listwise deletion could produce biased results. Investigation of the linked HES data has shown that using these additional data alongside the Register could help users to identify many missing outcomes.7 8 12 Potentially useful pregnancy outcome data were found in multiple places across the HES APC database (NHS Digital, 2021). Identifying outcomes in HES could allow users of the Register to adjust the dates of the pregnancy episodes. While HES data are useful as a complementary source of information, HES is also an EHR database derived from data that were not collected for research purposes and there may be gaps in recording. It is, however, less likely that pregnancy outcome events which happen in hospital will be recorded retrospectively and therefore dates of recorded outcomes may be considered more reliable.

Furthermore, using the HES DID data to access antenatal scan records offers a useful way to validate the dates of primary care pregnancy episodes as patients are unlikely to have an antenatal scan when they are not currently pregnant.13 When using linked data, we recommend that the study population be restricted to those patients in the Pregnancy Register who are eligible for linkage.

The main limitation of this work is that it relies on the assumption that real-life scenarios will consistently result in the same data patterns. EHR data such as CPRD GOLD are not collected for the purposes of research and can be messy for a variety of reasons. As the criteria we applied to identify our proposed scenarios may not have been a true fit to each pregnancy episode, this may have resulted in misclassification of the true underlying cause. While we did validate a random sample of pregnancy episodes by looking at the individual Read codes recorded, it was not possible to look at every episode in detail. Furthermore, some of our scenarios relied on assumptions as to why and when GPs may record clinical information relating to pregnancy. While this was informed by clinician advice and clinical guidelines, it may not be correct in every case. There is also the possibility that there are other scenarios which we did not identify, and special cases of scenarios that we could not test. For example, since 2007, women in the UK have been given the option of accessing midwife-led care directly. While information about the pregnancy should be fed to their GP, this may not always be the case. A survey report by the Care Quality Commission published in 2020 estimated that in 2018, 47% of women accessed antenatal care directly through a midwife.14 As yet, no routinely linked data allow for investigation of this special case of scenario 1a.

We have described in detail reasons why uncertain pregnancy episodes may occur in the CPRD Pregnancy Register and criteria which researchers can apply to ascertain which episodes may fit each scenario. This work offers researchers the opportunity to tailor their study to accommodate these episodes where appropriate (table 6).

Conclusions

This work has shown evidence that most uncertain pregnancy episodes are consistent with true and current pregnancies for which the data contain valuable information. It is important that researchers carefully consider the impact of including or excluding these episodes from their study. We have demonstrated that examining patterns of events within the primary care data or looking for further evidence in linked data can help to identify possible explanations. Here we offer users of the Pregnancy Register an insight into why these episodes exist and guidance on how to tailor their study population accordingly.

Data availability statement

Data may be obtained from a third party and are not publicly available. The data used for this study were obtained from the Clinical Practice Research Datalink (CPRD). All data are available via an application to CPRD’s Research Data Governance (RDG) Process (see https://www.cprd.com/research-applications). Data acquisition is associated with a fee and subject to ethics approval.

Ethics statements

Patient consent for publication

Ethics approval

This study involves human participants and was approved by the Independent Scientific Advisory Committee (ISAC) for Medicines and Healthcare Products Regulatory Agency Database Research (protocol no: 17_285R2 and 19_140) and the London School of Hygiene and Tropical Medicine Ethics Committee. This study uses de-identified electronic health records only.

Acknowledgments

This work uses data provided by patients and collected by the NHS as part of their care and support.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors JC, KB, ST, RW, HIM and CM contributed to the initiation, planning and design of the study. JC performed the analysis. KB, ST, RW and CM conducted study supervision. HIM and ST provided clinical input. JC wrote the manuscript with KB, ST, RW, HIM and CM performing critical revision. JC is acting as guarantor for this work.

  • Funding This work forms part of JC’s PhD which is funded by CPRD (grant number N/A). KB is funded by a Wellcome Senior Research Fellowship (220283/Z/20/Z). CM was supported by a UKRI Innovation Fellowship at Health Data Research UK London (MR/S003932/1). HIM and ST were funded by the National Institute for Health Research (NIHR) Health Protection Research Unit (HPRU) in Immunisation (IS-HPU1112-10096) at the London School of Hygiene and Tropical Medicine in partnership with Public Health England (PHE).

  • Disclaimer The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health and Social Care, or PHE.

  • Competing interests JC and RW are employees of CPRD. There are no other conflicts of interest to report.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.