Linkage of maternity hospital episode statistics birth records to birth registration and notification records for births in England 2005–2006: quality assurance of linkage

Objectives The objectives of this study were to describe the methods used to assess the quality of linkage between records of babies’ birth registration and hospital birth records, and to evaluate the potential bias that may be introduced because of these methods.
Design/setting Data from the civil registration and the notification of births previously linked by the Office for National Statistics (ONS) had been further linked to birth records from the Hospital Episode Statistics (HES) for babies born in England. We developed a deterministic, six-stage algorithm to assess the quality of this linkage.
Participants All 1 170 790 live, singleton births, occurring in National Health Service hospitals in England between 1 January 2005 and 31 December 2006.
Primary outcome measure The primary outcome was the number of successful links between ONS birth records and HES birth records. Rates of successful linkage were calculated for the cohort and the characteristics associated with unsuccessful linkage were identified.
Results Approximately 92% (1 074 572) of the birth registration records were successfully linked with a HES birth record. Data quality and completeness were somewhat poorer in HES birth records compared with linked birth registration and birth notification records. The quality assurance algorithms identified 1456 incorrect linkages (<1%). Compared with the linked dataset, birth records were more likely to be unlinked if babies were of white ethnic origin; born to unmarried mothers; born in East England, London, North West England or the West Midlands; or born in March.
Conclusions It is possible to link administrative datasets to create large cohorts, allowing researchers to explore important questions about exposures and childhood outcomes. Missing data, coding errors and inconsistencies mean it is important that the quality of linkage is assessed prior to analysis.


INTRODUCTION
The use of routinely collected datasets within research has increased rapidly over the past decade as an alternative to conducting large, observational studies, which can be extremely costly and often suffer from poor recruitment and retention rates. 1 While there are many advantages to using linked administrative data for medical research, they also present challenges. One is the quality of data recorded and, in consequence, the quality of data linkage, as this is highly reliant on the availability and accuracy of personal identifiers 2 and other supporting information.
Strengths and limitations of this study
► This is the first study to evaluate the linkage of birth registration, birth notification and Hospital Episode Statistics (HES) birth records in England.
► We were able to build on existing work relating to the quality assurance of birth registration, birth notification and HES delivery records.
► Access to personal identifiers meant that we were able to evaluate the quality of linkage, identify poor quality matches and assess the level of bias introduced by linking these datasets.
► Accessing personal identifiers also led to delays because of the length of approval processes and the need to travel to access these data in designated secure settings. However, the Office for National Statistics is developing a remote access system for the secure setting, so future research studies may not be affected by the need to travel.
► We examined only singleton births in National Health Service hospitals so we did not assess whether the linkage of multiple births is of comparable quality. However, previous work linking birth registration records with mothers' delivery records in multiples suggests that the linkage rate is similar to that for singletons.

Open access

The Digital Economy Act 2017 3 was introduced with the aim of facilitating data sharing for research purposes, but only if the data have been 'deidentified' and the research is deemed to be in the public interest. 4 Without access to personal identifiers, successful linkage between datasets becomes more challenging. Therefore, the 'Trusted Third Party' model, whereby the full identifiers are transferred to an organisation, which will link them with its own data and return the linked data to the data controller, is now the preferred method of linkage in most research projects. While Trusted Third Parties typically publish their linkage algorithms, they usually do not publish results of quality assurance (QA) of their methods. Therefore, it is essential that researchers assess the quality of linkage and validity of data prior to conducting statistical analyses.
The work described in this paper was conducted as part of the TIGAR (Tracking the Impact of Gestational Age on Health, Educational and Economic outcomes: a Longitudinal Records Linkage Study) study, which is a population-based, record-linkage study of births and hospital admission data in England. The study aimed to estimate the association between gestational age at birth and rates of hospital admission throughout childhood. This used previously linked data from two sources. 5 The first of these was civil registration of births, a legal process in which parents register the birth and provide mainly demographic information to a specially trained local registrar of births, deaths and marriages who records the information, issues birth certificates and forwards the data to national systems. The second was the data recorded when the birth is notified to the National Health Service (NHS) within 36 hours by the attending midwife or other birth attendant, at which point the baby's NHS Number, a unique identifier, is allocated. These combined Office for National Statistics (ONS) birth records were linked with birth records from Hospital Episode Statistics (HES), the national hospital discharge system for England. This was done by the data owner, the Health and Social Care Information Centre, now known as NHS Digital, as part of an earlier, larger study by City, University of London. 6
Two HES records are generated when a baby is born in England: one for the mother and one for each baby. Each consists of an Admitted Patient Care (APC) record common to all hospital in-patient stays plus a 'tail' with information about the birth (online supplemental figure S1). The mother's HES delivery record contains APC information relating to the mother's delivery and a 'maternity tail'. The baby's HES birth record contains the baby's APC record and a 'baby tail' containing details of the birth and overlapping extensively with the 'maternity tail'. Maternity HES data are downloaded from hospitals' administrative systems. As these do not all come from the same supplier, there are some differences in the ways in which data are entered and there are differences between systems and hospitals in the extent to which data items are missing.
The team at City, University of London, has already evaluated the quality of linkage between birth registration, birth notification and the mother's HES delivery records for births from 2005 to 2014. The authors reported a linkage rate of 95% and uncovered some linkage errors. 7 Therefore, the main objectives of the current study were to assess the quality of linkage between baby's birth registration and notification records and the baby's HES birth records, and to evaluate the potential bias introduced to the study cohort by the linkage. This has relevance for analyses of similar linked administrative datasets.

METHODS
All live, singleton babies born in NHS hospitals between 1 January 2005 and 31 December 2006 to a mother living in England were eligible for inclusion in this study cohort. The analysis was restricted to births in NHS hospitals, but they accounted for 96.6% of women giving birth in 2006. Home births accounted for 2.7% of deliveries in 2006, but although most received NHS care, many were not recorded on hospital systems and so not included in HES. There were extremely few HES records for the 0.5% of deliveries in non-NHS hospitals and the 0.2% delivering elsewhere. 8 All analyses were conducted using STATA V.14 9 within the ONS' Secure Research Service (SRS). An overview of the procedures involved in the QA process for this study is presented in figure 1.

Data sources
Two datasets containing data about births in England from 1 January 2005 to 31 December 2006 were linked (online supplemental figure S1). They and the linkage file were saved as three separate STATA datasets: (1) ONS births; (2) HES birth records; and (3) the linkage file containing unique ONS birth identifiers (ONSID) and corresponding unique HES identifiers (HESID). These files are described in the online supplemental information.

ONS births
The master dataset comprises data from two sources: birth registration and birth notification. These two datasets have been routinely linked by the ONS since 2006 and the combined dataset is referred to as 'ONS births' throughout this paper. ONS births contains personal identifiers, sociodemographic characteristics and birth characteristics. 10

Hospital Episode Statistics (HES birth records)
HES is a large database containing records of all episodes of care and births in NHS hospitals in England since 1989. The records used here are from the HES APC dataset, which contains records of all inpatient admissions, including birth and delivery records, to NHS hospitals across England. A full description of the database can be found elsewhere. 11 Briefly, HES inpatient admissions are structured as 'episodes' of care, with an episode defined as a period of care under one consultant or midwife. Each episode contains details relating to the individual, care provider and care received (including diagnosis and procedural codes). If a patient receives care in more than one department, this generates multiple episodes; together these are referred to as a 'spell', which represents an uninterrupted period of care within one hospital. A new spell is generated when the patient is transferred to a different hospital to continue care. A continuous inpatient stay may consist of one or more episodes and spells, and ends when the patient is discharged from an NHS hospital. Data about hospital episodes are primarily collected for financial reimbursement, and therefore, the datasets are divided into financial years, beginning 1 April and ending 31 March. Episodes are labelled as 'finished' once the patient is discharged from hospital. However, if an episode begins in one financial year and ends during the next, two episodes will be generated: one in the financial year in which the episode begins and one in the financial year in which it ends. In this case, the first episode will be defined as 'unfinished'.
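The financial-year split described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of HES or of the study's Stata code; the only rule taken from the text is that financial years run from 1 April to 31 March.

```python
from datetime import date


def financial_year_start(d: date) -> int:
    """Calendar year in which the financial year containing `d` begins (1 April)."""
    return d.year if d.month >= 4 else d.year - 1


def crosses_financial_year_end(episode_start: date, episode_end: date) -> bool:
    """True when an episode starts in one financial year and ends in a later one.

    Under the rule described in the text, such an episode would appear twice,
    with the first copy labelled 'unfinished'.
    """
    return financial_year_start(episode_start) != financial_year_start(episode_end)


# Illustrative dates only: a stay running from 30 March to 1 April 2005.
spans = crosses_financial_year_end(date(2005, 3, 30), date(2005, 4, 1))
print(spans)
```

A check of this kind could flag candidate unfinished episodes before deduplication, although the study's own procedure is described only in its supplemental information.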
Everyone for whom records are stored in HES is assigned a unique identifier, called the HESID. 12 This is generated using a combination of NHS Number, local patient identifier, postcode, sex and date of birth to enable data users to uniquely track patients throughout the NHS. Descriptions of the variables available are in NHS Digital's HES Data Dictionary. 13 When a baby is born, the general inpatient episode becomes part of the Maternity HES dataset and 19 additional variables relating to the delivery or birth are appended. For each birth, two maternity HES records are generated. First, a HES delivery record, which includes the general inpatient record for the mother and 19 additional variables, referred to as the maternity tail. Second, a HES birth record, which includes the general inpatient record for the baby, along with 19 additional variables, referred to as the baby tail. The maternity and baby tails contain similar information relating to the delivery and birth; however, this study evaluated the linkage of HES birth records only. These additional data items in the baby tail include variables such as gestational age at birth and neonatal level of care. For a full list, see section A in the online supplemental information.

Linkage file
The linkage file contained the unique identifiers from each dataset as linked by NHS Digital. The dataset contained unique identifiers from ONS births (ONSID) that had been successfully linked with the corresponding unique identifier from HES births (HESID) and those which had not. It also contained unique HES birth identifiers that were not linked to a unique ONS birth identifier.

Linkage of ONS births and Maternity HES
Linkage between ONS birth records and HES birth records is a two-step process. First, ONS birth records were linked to the HES patient index (an index of personal details relating to all individuals with access to NHS hospitals) by NHS Digital using a deterministic algorithm in order to assign the HESID; and second, records were linked to the corresponding HES birth records using the HESID and other identifiers. 5 Here, each linked record was assigned a match rank score, indicating the stage in the algorithm at which the records had been matched (one being highest quality). Steps of the algorithm are summarised in online supplemental table S1.

Data preparation for QA checks
Full details of available variables, data cleaning procedures and preparation steps for the included datasets are described elsewhere. 6 Many key variables were available from both birth registration and birth notification. The preferred sources are summarised in the online supplemental table S2. A number of additional steps were taken to ensure that key variables in HES births and ONS births were in consistent formats, ready for comparison during the QA procedure (online supplemental table S3). Further checks were conducted on HES birth records and those with: (1) discharge date occurring before admission; (2) discharge occurring before baby's date of birth; or (3) admission dates occurring before 1 January 2005 or after 31 December 2006 were excluded.
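The three exclusion checks on HES birth records listed above could be expressed along the following lines. This is a sketch only: the function and constant names are hypothetical, all three dates are assumed to be present, and the handling of missing dates (which the text does not describe) is left out.

```python
from datetime import date
from typing import Optional

# Study window taken from the text; the names are illustrative.
STUDY_START = date(2005, 1, 1)
STUDY_END = date(2006, 12, 31)


def exclusion_reason(admission: date, discharge: date, birth: date) -> Optional[str]:
    """Return why a HES birth record would be excluded, or None if it passes."""
    if discharge < admission:
        return "discharge before admission"
    if discharge < birth:
        return "discharge before date of birth"
    if admission < STUDY_START or admission > STUDY_END:
        return "admission outside study period"
    return None


# Illustrative record: admitted and born on 2 May 2005, discharged two days later.
print(exclusion_reason(date(2005, 5, 2), date(2005, 5, 4), date(2005, 5, 2)))
```

Returning a reason rather than a bare true/false makes it straightforward to tabulate how many records fail each check.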

Multiple episodes with the same HESID (i.e. duplicate HES birth records)
Birth episodes with the same HESID can occur due to a number of reasons: (1) unfinished episodes, such as a baby being born on the 30 March, but then discharged on the 1 April 11 ; (2) administrative errors 7 ; (3) a birth spell containing multiple episodes, such as a baby being transferred to a different hospital or consultant; and (4) HESID is incorrectly assigned to more than one baby. 14 The steps which were taken have been summarised in the online supplemental information. In addition to this, episode start dates that occurred more than two days after the previous episode ended and did not include a transfer code (see online supplemental information) were considered a new admission and were saved separately.
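The rule for separating a new admission from a continuation of the birth spell can be sketched as follows. The function name and the boolean transfer flag are assumptions made for illustration; the actual transfer codes are defined in the online supplemental information and are not reproduced here.

```python
from datetime import date


def is_new_admission(previous_end: date, next_start: date, has_transfer_code: bool) -> bool:
    """Apply the rule in the text: a gap of more than two days with no transfer code.

    `has_transfer_code` stands in for whatever admission or discharge codes
    mark a transfer in the real data.
    """
    gap_days = (next_start - previous_end).days
    return gap_days > 2 and not has_transfer_code


# Illustrative episodes: discharged 3 May 2005, next episode starts 10 May 2005.
print(is_new_admission(date(2005, 5, 3), date(2005, 5, 10), has_transfer_code=False))
```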

QA of ONS births and Maternity HES linkage
After merging the three datasets, we identified cases where two different ONS birth records had been linked to the same HES birth record and key characteristics were compared to identify the correct link. This was done by comparing the following variables which were available in both data sources: baby's date of birth, gestational age, birth weight, sex, mother's date of birth and postcode. The record with the highest number of matching variables was identified as the correct link. When records matched on the same number of variables, the record that matched exactly on birth weight was identified as the correct match. If birth weight was missing or did not match with either record, then the record with the highest match rank score from the original linkage (online supplemental table S1) was identified as the correct link. The remaining records were then compared manually, but in cases where records had a high proportion of missing data or the same number of matching variables and the same match rank score, it was not possible to identify the correct link and therefore all were excluded.
The deterministic algorithm developed for evaluating the record linkage is summarised in table 1. The QA algorithm was adapted from a previous study which assessed the quality of linkage between ONS birth records and the mother's HES delivery records, 7 and was based on location of birth, baby's date of birth, sex, birth weight and gestational age, mother's date of birth and postcode. The location of birth was defined as the NHS hospital trust running the hospital that the baby was born in. An NHS hospital trust in England is an organisational unit within the NHS and usually refers to a group of hospitals that are in close proximity to each other. This was used instead of the hospital of birth to account for potential transfers between hospitals within the birth admission. The hospital trust variable was developed as part of a previous study. 6 To account for differences in rounding between hospitals, birth weights that differed by up to 100 g were considered a match.
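The tie-breaking logic for HES birth records linked to more than one ONS birth record can be illustrated with a short Python sketch. Several details are assumptions rather than the authors' implementation: the field names are hypothetical, missing values are treated as non-matching, the 100 g birth weight tolerance is applied when counting matches, a lower match rank number is read as the better rank (the text states that one is the highest quality), and unresolved ties return None to stand for manual review or exclusion.

```python
FIELDS = ("baby_dob", "gestation_weeks", "birth_weight_g", "sex", "mother_dob", "postcode")


def field_matches(name, a, b):
    """Compare one variable; missing values (None) never count as a match."""
    if a is None or b is None:
        return False
    if name == "birth_weight_g":
        return abs(a - b) <= 100  # tolerance for rounding differences
    return a == b


def count_matches(ons, hes):
    """Number of the six comparison variables on which the two records agree."""
    return sum(field_matches(f, ons.get(f), hes.get(f)) for f in FIELDS)


def pick_link(hes, candidates):
    """Choose among candidate ONS records (at least one) linked to one HES record."""
    scores = [count_matches(c, hes) for c in candidates]
    best = max(scores)
    pool = [c for c, s in zip(candidates, scores) if s == best]
    if len(pool) == 1:
        return pool[0]
    # Tie on number of matching variables: prefer an exact birth weight match.
    weight = hes.get("birth_weight_g")
    exact = [c for c in pool if weight is not None and c.get("birth_weight_g") == weight]
    if len(exact) == 1:
        return exact[0]
    # Otherwise fall back to the match rank from the original linkage.
    pool = exact or pool
    best_rank = min(c["match_rank"] for c in pool)
    pool = [c for c in pool if c["match_rank"] == best_rank]
    return pool[0] if len(pool) == 1 else None
```

Keeping the comparison of each variable in one small function makes it easy to change a single rule, such as the birth weight tolerance, without touching the tie-breaking order.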
Final checks after the QA procedures were completed included: (1) checking that the baby's date of birth in ONS births was within the admission and discharge dates from the HES birth record; (2) baby's date of birth in ONS births did not occur before 1 January 2005 or after 31 December 2006; (3) hospital discharge date did not occur before the admission date; and (4) further checks to ensure that all stillbirths had been excluded (International Classification of Diseases, version 10 (ICD10) diagnosis code=Z37.1).

Assessment of linkage bias
The distributions of key characteristics were compared between the linked and unlinked study samples and the eligible study sample. The χ 2 test was used to assess whether distributions differed significantly between the linked and unlinked study samples. Because the sample sizes were so large, we also looked at differences in distributions of variables, for example, differences between proportions of more than 1%. In addition to this, we compared the distributions of key characteristics between the linked study sample and all live births in England in 2005 and 2006 to see what effect excluding stillbirths, multiple and non-NHS births had on the study cohort.
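For a single binary characteristic, the comparison between linked and unlinked records reduces to a 2×2 table. The sketch below computes the Pearson χ² statistic without continuity correction, its p value on one degree of freedom, and the difference in proportions in percentage points. It is an illustration only: the function names and the example counts are hypothetical, and the study's analyses were run in Stata, possibly on tables with more than two categories.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def percentage_point_difference(a, b, c, d):
    """Proportion with the characteristic among linked (a, b) minus unlinked (c, d), in points."""
    return 100 * (a / (a + b) - c / (c + d))


# Hypothetical counts: rows are linked and unlinked, columns are with and without the characteristic.
chi2, p_value = chi2_2x2(20, 10, 10, 20)
diff = percentage_point_difference(20, 10, 10, 20)
print(chi2, p_value, diff)
```

With cohorts of this size almost any difference yields a small p value, which is why the absolute difference in proportions is reported alongside the test.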

Patient and public involvement
The TIGAR study was supported by a patient, parent and public advisory group, which provided input to different aspects of the study. This group met at the start of the study and gave input into the study protocol and the lay summary of the project.

RESULTS
There were 1 257 884 ONS birth records. After excluding multiple births and babies who were not eligible for the TIGAR study, 1 170 790 live, singleton babies born from 1 January 2005 to 31 December 2006 in NHS hospitals to women living in England remained in the study cohort. There were 1 243 373 HES birth records and the linkage file included 1 242 938 records (figure 1).

Linkage of ONS births to Maternity HES
Of the 1 170 790 eligible ONS birth records, 1 074 571 (92%) were successfully linked with a HES birth record. Of the 96 219 unlinked records, 88 471 had a HESID but no corresponding HES birth record and 7747 had no corresponding HESID in the linkage file. The majority of records linked had a match rank score of one, which meant they exactly matched on all four variables in the NHS Digital algorithm, with fewer than 1% of records linking in stages 3-6 of the algorithm (online supplemental table S1). A higher proportion of links had a match rank score of one in 2006 than in 2005.

Data preparation for QA checks
All key variables in ONS birth records had data missing for less than 1% of births. In contrast, data were missing from substantial numbers of HES birth records (online supplemental table S4).

Multiple birth episodes with the same HESID
Of the 1 243 373 HES birth records, 73 307 (6%) had linked with more than one record with the same HESID. The most common reasons for this were multiple episodes within a hospital spell (62%), unfinished episodes (15%) and duplicate episodes due to administrative errors (11%; figure 1). There were 1 197 999 unique HES birth records remaining.

QA of the ONS births and Maternity HES linkage
A total of 979 (0.1%) HES birth records had linked with more than one ONS birth record. The high proportion of missing data in HES birth records meant that it was difficult in many cases to identify the correct match. Of these, 499 records were judged to be incorrectly linked and the links were broken. It was not possible to ascertain the status of 95 records, which had the same matching variables and match rank score. Many data items were missing from these records and all 95 were excluded (table 2).
Key variables of the 1 074 571 (92%) ONS birth records which had been successfully linked were then compared using the algorithm in table 1. The vast majority (99.5%) of correct links were found to be within the first three stages of the QA algorithm (table 3). The 645 records identified as correct links in stage four of the algorithm appeared to be correct matches; the hospital location code differed slightly, which may have resulted from either a transfer to a neighbouring hospital trust during the birth admission or data entry errors. In stage five, 1973 (0.2%) records were identified as correct links. Of these, 76% had a match rank score of one, suggesting the date of birth had been entered incorrectly in the HES birth record. The majority of these matches looked like data entry errors where the month and day had inadvertently been swapped. In stage six of the algorithm, a small number of records were identified as correct links. These records had a lot of data missing, including 64% of birth weights, 70% of gestational ages, 77% of postcodes and 67% of mothers' dates of birth. Of the 242 records with partially matching dates of birth, 91% had a match rank score of one or two, suggesting good quality matches but with data entry errors in the HES birth records. Interestingly, almost 62% of records in this stage exactly matched on date of birth, but differed by sex. Most of these records also exactly matched on birth weight, gestational age, postcode and mother's date of birth, suggesting they are correct links, again with data entry errors in the HES birth record.
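One form of partial date match mentioned above, a day and month entered the wrong way round, can be detected with a check like the one below. The function name is hypothetical, and the full definition of a partially matching date of birth used in the QA algorithm is given in table 1 rather than here, so this covers only the swapped day and month case.

```python
from datetime import date


def day_month_swapped(a: date, b: date) -> bool:
    """True when two different dates in the same year have day and month interchanged."""
    return a != b and a.year == b.year and a.day == b.month and a.month == b.day


# Illustrative dates: 4 March 2005 recorded as 3 April 2005.
print(day_month_swapped(date(2005, 3, 4), date(2005, 4, 3)))
```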
Overall, 860 records with incorrect links were identified and 64% of these were in births in 2005, suggesting an improvement in data and linkage quality in 2006. When exploring these broken links in relation to their match rank score, 69% had a score of six, meaning their NHS numbers were missing. The main reasons for broken links were completely different dates of birth, as opposed to exact or partial matching, and data missing for large numbers of variables. Among these records, birth weight, gestational age, postcode and mother's date of birth were missing for approximately 92%, 91%, 85% and 91%, respectively. There were a small number of records that appeared to be incorrectly broken due to cleaning errors, for example, postcodes which included an 'O' instead of a zero.
A number of interesting observations were made when reviewing the broken links: (1) many that were broken differed by gestational age, but in most cases by just 1 week and the babies were full term, for example, 39 and 40 weeks; and (2) more than one-third of broken links were for babies born in London. As the numbers of broken links were small, distributions of numbers of broken links cannot be presented, as they could be disclosive.

... suggesting an improvement in data quality between the two years. The addition of partially matching date of birth in the QA algorithm increased sensitivity and identified more correct linkages, which would have otherwise been missed from the study cohort. Duplicate HES records and large amounts of data missing for some variables created challenges when assessing linkage quality, highlighting the importance of these procedures before beginning statistical analysis. Finally, birth records were more likely to be unlinked if babies were of white, British ethnic origin, born outside marriage, born in East England, London, North West England or the West Midlands, or born in March.

Strengths and limitations
Key strengths of this study included building on previous work conducted into the linkage of mothers' hospital records with babies' birth records, obtaining data from multiple sources, and the use of personal identifiers, which allowed us to evaluate the linkage more easily. However, the need to access personal identifiers also increased the problems involved in accessing the data, notably in terms of the length of approvals processes. Other limitations include the restriction of the assessment of linkage to live, singleton births in 2005 and 2006. We were unable to assess how well the algorithms designed for this study would perform with multiple births. These tend to have more complex data, although similar algorithms were successfully used for the quality assurance of the linkage of multiple births to mothers' delivery records. 6 In addition, the algorithms used in this study were deterministic, rather than probabilistic. The latter can be more effective when dealing with complex records, such as those with a lot of missing data or coding errors. 15 16

Interpretation of findings
Approximately 8% of ONS birth records did not link to a HES birth record. One of the key reasons for unlinked records appeared to be missing data in key linking fields. Quality and completeness are always a concern when using routinely collected datasets and the large sample sizes they provide are always traded off against these limitations. Data quality was somewhat poorer in HES birth records than in ONS birth records and this is a commonly cited issue when working with HES data. 17 18 Studies using HES birth data from later years show that data quality improved over time. In 2009/2010, the baby tail was far more complete, with only 18% and 14% of births with missing gestational age and birth weight, respectively. 17 Therefore, researchers who wish to analyse the linked data for the remaining years (2007-2014) should find a higher linkage rate with better quality linkage.
However, in previous work with mother's HES delivery records, the linkage rate plateaued at around 98% between 2010 and 2014 for singleton births, whereas the linkage rate in multiples began to decrease from 2010. 7 Compared with the full linked dataset, records of babies born in East England, North West England, London or the West Midlands were more likely to be unlinked. It is likely that these variations are due to differences between hospital trusts in the ways in which definitions or protocols are used, to errors occurring during the transfer of data from one organisation to another, or to differences in the overall quality of Maternity HES for certain hospitals. It is also possible that regional reporting differences account for the higher proportion of white babies and births outside marriage in the unlinked dataset. Babies born in March were also more likely to have unlinked records and this was also the case with the mothers' HES delivery records. 7 HES uses financial years (1 April-31 March) for reporting, so differences in reporting standards prior to the financial year-end may account for this.
We found a small proportion of records that were potentially incorrect linkages, suggesting that linkage performed by 'Trusted Third Parties', such as NHS Digital, may not be 100% accurate. Therefore, it is essential that researchers understand the quality of linkage undertaken by a Trusted Third Party prior to performing any statistical analysis. While the responsibility should fall to the Trusted Third Party to conduct quality assessment of its linkage methods and make them publicly available to researchers, this does not happen in practice. 7 The Digital Economy Act 2017 is designed to enable research for public benefit through the sharing of data, but the legislation is limited to the sharing of data that have been deidentified. This means reliance on the Trusted Third Party model in research is likely to increase over time. It is still unclear how this affects health and social care data, especially as they are not covered by the Digital Economy Act. 4 If researchers do not have access to personal identifiers, QA of the linkage will become more challenging. The findings in this paper will offer some insight into the quality of linkage between ONS birth and HES birth records and be of use to other researchers using the same linkage. As the quality and completeness of HES improved over time, it is likely that the quality of linkage will also have improved as a consequence, but the signs that the improvement in quality may not have been sustained are worrying.

Recommendations for future practice and research
The findings from this study have demonstrated a need for Trusted Third Parties to evaluate linkage methods and publish findings for researchers to use prior to beginning statistical analysis. In cases where this is not possible, a number of studies have shown that datasets can be linked using deidentified datasets and produce generalisable cohorts; therefore, future work could explore ways to quality assure linkage without using personal identifiers. This will increase the likelihood of easier data access, potentially reducing the time required to complete the linkage and cleaning of routinely collected datasets. Finally, we did not assess whether the quality of linkage for multiples is comparable with live, singleton, NHS hospital births although QA of the linkage of multiple births to mothers' delivery records suggested that it was better than had been initially assumed. 6

CONCLUSIONS
By linking together administrative datasets, such as birth registration, birth notification and hospital admission data, it is possible to create more complete datasets about births which can then be linked to other administrative datasets to look at longer term outcomes. 6 If data are missing or there are coding errors and inconsistencies, the resulting datasets can often be of poor quality. It is therefore essential that the quality of linkage is assessed prior to analysis. The work presented in this study provides a guide to steps taken to quality assure the linkage of births in England and may be of use to other researchers working with similar datasets. The findings will be particularly useful for researchers working with the same dataset but without access to personal identifiers.
Funding This work was funded by the Medical Research Council: MR/M01228X/1. VC and MQ had full access to all the data in the study and final responsibility for the decision to submit for publication.
Disclaimer The funder had no input into the study design, data analysis, interpretation of results or writing of the manuscript.
Competing interests None declared.
Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.
Patient consent for publication Not required.
Ethics approval Ethics approval for this study was granted by the Health Research Authority Research Ethics Committee (South West - Frenchay; REC reference Ethics 15/SW/0294). The TIGAR study used data linked as part of a previous study led by City, University of London. 6 For that study, ethics approval 05/Q0603/108 and subsequent substantial amendments were granted by East London and City Local Research Ethics Committee 1 and its successors. Permission to use patient-identifiable data without consent under Regulation 5 of the Health Service (Control of Patient Information) Regulations 2002 ('section 251 support') was initially granted by the Patient Information Advisory Group PIAG 2-10(g)/2005. Renewals and amendments and a second permission, CAG 9-08(b)2014, under Regulation 5 of the Health Service (Control of Patient Information) Regulations 2002 (or 'same legislation') were granted by the Secretary of State for Health and the Health Research Authority following advice from the Confidentiality Advisory Group (CAG) to use patient-identifiable data without consent and create a research database held at the ONS for analyses relating to inequalities in the outcome of pregnancy and to inform maternity service users about the outcome of midwifery, obstetric and neonatal care. For the TIGAR study, permission from the Health and Social Care Information Centre for the work described in this article was included in Data Sharing Agreement NIC-273840-N0N0 N.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement The authors do not have permission to supply data or identifiable information to third parties, including other researchers, but the team at City, University of London has permission under Regulation 5 of the Health Service (Control of Patient Information) Regulations 2002 to analyse patient-identifiable data for England and Wales without consent and create a research database that could be accessed by other researchers using the SRS at the ONS. The TIGAR team has permission under Regulation 5 of the Health Service (Control of Patient Information) Regulations 2002 to analyse these. Anyone wishing to access the linked datasets for research purposes should apply via the Confidentiality Advisory Group (CAG) to the Health Research Authority to access patient-identifiable data without consent and then to the ONS and NHS Digital. In the first instance, enquiries about access to the data should be addressed to Alison Macfarlane.
Open access This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.