Original research
Validation of ethnicity in administrative hospital data in women giving birth in England: cohort study
Jennifer Elizabeth Jardine (1, 2), Alissa Frémeaux (2), Megan Coe (2), Ipek Gurol Urganci (1, 2), Dharmintra Pasupathy (3), Kate Walker (1)
  1. Health Services Research and Policy, London School of Hygiene and Tropical Medicine, Faculty of Public Health and Policy, London, UK
  2. Clinical Quality, Royal College of Obstetricians and Gynaecologists, London, UK
  3. Reproduction and Perinatal Centre, Faculty of Medicine and Health, The University of Sydney, Sydney, New South Wales, Australia
Correspondence to Dr Jennifer Elizabeth Jardine; jennifer.jardine{at}lshtm.ac.uk

Abstract

Objective To describe the accuracy of coding of ethnicity in National Health Service (NHS) administrative hospital records compared with self-declared records in maternity booking systems, and to assess the potential impact of misclassification bias.

Design Secondary analysis of data from records of women giving birth in England (2015–2017).

Setting NHS Trusts in England participating in a national audit programme.

Participants 1 237 213 women who gave birth between 1 April 2015 and 31 March 2017.

Primary and secondary outcome measures (1) Proportion of women with complete ethnicity; (2) agreement on coded ethnicity between maternity (maternity information systems (MIS)) and administrative hospital (Hospital Episode Statistics (HES)) records; (3) rates of caesarean section and obstetric anal sphincter injury by ethnic group in MIS and HES.

Results 91.3% of women had complete information regarding ethnicity in HES. Overall agreement between data sets was 90.4% (κ=0.83); 94.4% when collapsed into aggregate groups of white/South Asian/black/mixed/other (κ=0.86). Most disagreement was seen in women coded as mixed in either data set. Rates of obstetrical events and complications by ethnicity were similar regardless of data set used, with the most differences seen in women coded as mixed.

Conclusions Levels of accuracy in ethnicity coding in administrative hospital records support the use of ethnicity collapsed into groups (white/South Asian/black/mixed/other), but findings for mixed and other groups, and more granular classifications, should be treated with caution. Robustness of results of analyses for associations with ethnicity can be improved by using additional primary data sources.

  • health informatics
  • obstetrics
  • statistics & research methods

Data availability statement

Data may be obtained from a third party and are not publicly available. Details of how to apply for data are available from the authors on request.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Strengths and limitations of this study

  • This study uses a large data set of ethnicity as reported to midwives at the time of booking pregnancy to validate ethnicity in administrative hospital data from birth episodes.

  • The use of routine data for validation ensures the study is large and representative.

  • The main limitation of this study is that it is restricted to largely healthy women giving birth.

Introduction

Routinely collected electronic health records offer the opportunity to evaluate care, outcomes and associations among large numbers of service users. There is wide interest in using routinely collected data to explore in more detail inequalities in care/outcomes between ethnic groups.1 The information about ethnicity in routinely collected data sources needs to be accurately recorded to enable such analyses.2 3

Hospital Episode Statistics (HES), an administrative data set which captures admissions at National Health Service (NHS) hospitals in England, records ethnicity at each attendance. Previous validation studies of ethnicity in HES have demonstrated that completeness has improved over time, and that there is overall good agreement between HES and general practice records4 and patient self-reported ethnic group in patients with cancer.5 However, these studies have also demonstrated heterogeneity between hospitals and disagreement between data sources in the recording of non-white ethnicity.4 5 It is unknown to what extent these discrepancies still exist, and whether similar patterns are seen in young, ethnically diverse groups such as women giving birth.

In this study, we make use of linked data on women giving birth to examine the accuracy of ethnicity recording in HES. We compare ethnicity in HES to that in electronic maternity records in maternity information systems (MIS) in England. MIS records reflect self-reported ethnicity reported to midwives at the time of the pregnancy booking appointment, where a woman’s social, medical and maternity history are comprehensively reviewed. Recording of ethnicity is mandatory within the Maternity Data Standard in England, and is used to guide care (eg, screening for gestational diabetes mellitus).6 Ethnicity data in HES is also self-reported, derived from the hospital’s record systems entered at the time of the admission (in this case, for birth).

The aims of this study were (1) to ascertain the completeness of ethnicity data in HES in the records of a young, ethnically diverse population: women giving birth in England; (2) to compare agreement between HES and maternity data sources; (3) to examine how sensitive the findings of statistical analyses are to the ethnicity data source, using rates of emergency caesarean section and obstetric anal sphincter injury as illustrative examples. Based on our findings we develop recommendations for the use of ethnicity coding and the interpretation of results using HES ethnicity data.

Methods

Data sets

This study used two data sets, linked together for the purpose of the National Maternity and Perinatal Audit (NMPA) in England: administrative data for the hospital admission resulting in the birth episode from HES, and maternity data from MIS. HES was provided via NHS Digital. Individual hospital trusts provided extracts directly from the MIS to the NMPA.7 Furthermore, for the purposes of comparison to national data on ethnic group, publicly available aggregate data from the Office for National Statistics (ONS) based on the 2011 census was used.8

The cohort consisted of 1 165 252 women who gave birth in the NHS in England between 1 April 2015 and 31 March 2017 and who had a linked record available in both MIS and HES (figure 1).

Figure 1

Flow diagram for study cohort. HES, Hospital Episode Statistics; MIS, maternity information systems.

Linkage

Data sets were linked by a trusted third party (NHS Digital) using deterministic methods based on the NHS number, postcode and maternal date of birth.9
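As a rough illustration of what deterministic linkage involves, the sketch below joins two sets of records on an exact match of all three identifiers. The field names (`nhs_number`, `postcode`, `dob`) and the normalisation rules are assumptions made for this example; the algorithm used by the trusted third party is not described here and may apply stepwise matching rules rather than a single exact match.

```python
def normalise_key(record):
    """Build an exact-match key; return None if any identifier is missing."""
    nhs = (record.get("nhs_number") or "").replace(" ", "")
    postcode = (record.get("postcode") or "").replace(" ", "").upper()
    dob = (record.get("dob") or "").strip()
    if not (nhs and postcode and dob):
        return None
    return (nhs, postcode, dob)


def link_records(mis_records, hes_records):
    """Return (linked pairs, unlinked MIS records) using exact key matches."""
    hes_index = {}
    for rec in hes_records:
        key = normalise_key(rec)
        if key is not None:
            hes_index.setdefault(key, rec)  # keep the first HES record per key
    linked, unlinked = [], []
    for rec in mis_records:
        key = normalise_key(rec)
        match = hes_index.get(key) if key is not None else None
        if match is None:
            unlinked.append(rec)
        else:
            linked.append((rec, match))
    return linked, unlinked
```

Under this kind of rule, a record with any identifier missing cannot be linked at all, which is why incomplete identifying information translates directly into lower linkage rates.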

Coding of ethnicity

All three data sets code ethnicity using the 16+1 ONS categorisation system from the 2001 census.10 Ethnicity was considered ‘complete’ if it was not missing and not ‘unknown’. To examine different levels of granularity, ethnicity was considered both as individual codes and collapsed into the five aggregated groups used by ONS: white; South Asian or British Asian; black or black British; mixed; and other (comprising Chinese and any other ethnic group (free text, not categorised)).
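The collapsing step and the completeness rule can be sketched as a simple lookup. The single-letter codes below are an assumption (the letter coding commonly used for the 16+1 categories in hospital data), as is the placement of ‘any other Asian background’ in the South Asian group; the study's own mapping is not given in the text.

```python
# Assumed letter codes for the 16+1 categories; not taken from the study.
CODE_TO_GROUP = {
    "A": "white", "B": "white", "C": "white",
    "D": "mixed", "E": "mixed", "F": "mixed", "G": "mixed",
    "H": "south_asian", "J": "south_asian", "K": "south_asian", "L": "south_asian",
    "M": "black", "N": "black", "P": "black",
    "R": "other", "S": "other",
}


def aggregate_group(code):
    """Return the aggregated group, or None when ethnicity is not complete."""
    if code is None:
        return None
    # Codes outside the 16 categories (eg 'Z', not stated) map to None.
    return CODE_TO_GROUP.get(code.strip().upper())


def completeness(codes):
    """Proportion of records with a complete (non-missing, known) code."""
    codes = list(codes)
    if not codes:
        return float("nan")
    complete = sum(1 for c in codes if aggregate_group(c) is not None)
    return complete / len(codes)
```

For example, `completeness(["A", "Z", None, "H"])` counts two of four records as complete.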

Analysis

For the 1 165 252 women for whom both HES and MIS records were available, ethnicity codes were compared for completeness. Cross-validity was checked for individual ethnicity codes and for the five aggregated ethnic groups. Agreement was assessed using Cohen’s kappa (κ) statistic, which measures agreement on a categorical variable between two sources after accounting for the agreement expected by chance: a value of 0 indicates agreement no better than chance and a value of 1 indicates perfect agreement.11
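For readers who want to reproduce this kind of check, κ is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the marginal distributions. The following is a minimal sketch in Python (the study itself used Stata); the function name and input layout are illustrative.

```python
from collections import Counter


def cohens_kappa(pairs):
    """Return (observed agreement, kappa) for (source_1, source_2) category pairs."""
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("no paired observations")
    observed = sum(1 for a, b in pairs if a == b) / n
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    # Chance agreement: sum over categories of the product of marginal proportions.
    expected = sum(first[c] * second[c] for c in first) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return observed, (observed - expected) / (1 - expected)
```

Only records with a complete code in both sources should be passed in, mirroring the restriction to women with complete ethnicity in both data sets.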

To evaluate how sensitive the results of statistical analyses were to the data source for ethnicity, we examined the relationship between ethnicity and rates of a common outcome of birth (emergency caesarean section) and an uncommon outcome (obstetric anal sphincter injury (OASI)), both known to be associated with ethnic group.12 13 Both outcomes are well coded and are used for national quality comparisons.7 Women were included in this analysis if they had a singleton birth at term (37+0 to 42+6 weeks). Singleton birth at term, emergency caesarean section and OASI were defined using the coding framework developed by the NMPA.14 Poisson regression was used to examine the associations between each outcome and ethnicity, both as individual codes and collapsed into groups, using the ethnicity recorded in HES and in MIS in turn.
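To make the rate ratio concrete: for a Poisson model with a single binary covariate (one ethnic group versus a reference group), the fitted rate ratio equals the ratio of the crude rates, and the model-based standard error of its logarithm is the square root of the sum of the reciprocal event counts. The sketch below computes that unadjusted case only. It is an assumption that this matches the models fitted in the study, which may have used robust variance estimates or other specifications.

```python
import math


def rate_ratio(events_group, n_group, events_ref, n_ref, z=1.96):
    """Crude rate ratio versus a reference group, with a Wald CI on the log scale."""
    if min(events_group, events_ref) <= 0 or min(n_group, n_ref) <= 0:
        raise ValueError("event counts and denominators must be positive")
    rr = (events_group / n_group) / (events_ref / n_ref)
    # Model-based Poisson variance of log(RR): 1/events in each group.
    se_log = math.sqrt(1 / events_group + 1 / events_ref)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper
```

Running the same calculation twice, once with groups defined by HES and once with groups defined by MIS, gives the kind of side-by-side comparison described above.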

Two supplementary analyses were carried out. First, the frequency of complete ethnicity codes was tabulated and compared with the published ethnicity of women aged 16–49 from the 2011 census.8 Second, in order to assess whether there was any bias in linkage to HES, the likelihood of having a linked record was tabulated by ethnic group for all women with a record in HES.
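The second supplementary analysis amounts to a tabulation of the unlinked proportion within each ethnic group. A minimal sketch, assuming each record is reduced to an aggregated group (or `None` when no ethnicity is recorded) and a flag indicating whether it was linked:

```python
from collections import defaultdict


def unlinked_proportion_by_group(records):
    """records: iterable of (ethnic_group_or_None, linked_flag) tuples."""
    totals = defaultdict(int)
    unlinked = defaultdict(int)
    for group, linked in records:
        label = group if group is not None else "not recorded"
        totals[label] += 1
        if not linked:
            unlinked[label] += 1
    return {label: unlinked[label] / totals[label] for label in totals}
```

Keeping records without a recorded ethnicity as their own category, rather than dropping them, allows their linkage rate to be compared with the other groups.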

All analyses were performed in Stata V.14.1.

Data access statement

The data are available for further research and service evaluation following approval from the data controllers, which are the Healthcare Quality Improvement Partnership (www.hqip.org.uk) for the data derived from the MIS and NHS Digital for HES.

Patient and public involvement

The NMPA advisory group for inequalities provided the motivation for investigating this question, and will guide the dissemination of this research.

Results

Data completeness

Complete codes for ethnicity were present in 91.3% of the 1 165 252 HES records linked to MIS. Among the 1 165 252 women, 95.5% had a complete code for ethnicity in at least one of HES and MIS (table 1).

Table 1

Assessing agreement between ethnicity group recorded in MIS and in HES for 1 165 252 women with records in both data sets

Agreement between ethnicity in HES and MIS

Of the 1 165 252 women with records in both HES and MIS, 1 007 881 (86.5%) had complete ethnic codes in both data sets. The overall agreement between aggregated ethnic groups was 94.4% (κ=0.86) and between individual ethnic codes was lower at 90.5% (κ=0.83) (table 2).

Table 2

Overall agreement between ethnic origin coded in HES and MIS in 1 007 881 women with complete ethnicity in both data sets

When ethnicity was recorded as white, South Asian or black, there was between 91% and 99% agreement between data sources on aggregated ethnic group, with the highest agreement in women recorded as white (table 1). The largest discrepancy between HES and MIS was in the recording of women with mixed ethnicity (table 1). A larger proportion of women were coded as mixed ethnicity in HES than in MIS (table 1). Of the women coded as mixed ethnicity in HES, only 35% were recorded as mixed ethnicity in MIS, with 43% recorded as white in MIS (table 1 and online supplemental table 1). For women recorded as mixed ethnicity in MIS there was a relatively high agreement in HES (72%).
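Agreement for a given group depends on which data set is taken as the starting point, because the denominator changes. The sketch below computes row proportions from paired codes in either direction; the counts are made up for illustration and are not study data.

```python
from collections import Counter, defaultdict


def row_proportions(pairs):
    """For (given, other) pairs, return {given: {other: proportion}}."""
    table = defaultdict(Counter)
    for given, other in pairs:
        table[given][other] += 1
    return {
        given: {other: count / sum(row.values()) for other, count in row.items()}
        for given, row in table.items()
    }


# Illustrative (hes_group, mis_group) pairs; made-up counts, not study data.
pairs = (
    [("mixed", "mixed")] * 6
    + [("mixed", "white")] * 10
    + [("white", "mixed")] * 2
    + [("white", "white")] * 82
)

given_hes = row_proportions(pairs)
given_mis = row_proportions((mis, hes) for hes, mis in pairs)
```

In this toy example, `given_hes["mixed"]["mixed"]` is much lower than `given_mis["mixed"]["mixed"]` because more records are coded as mixed in the first source than in the second, the same asymmetry as reported above.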

Women whose ethnic group was recorded as a mix of two ethnicities in one data set were often recorded in the other data set as just one of these two ethnicities. For example, of those women coded as ‘White/South Asian’ in MIS, 59% were assigned to the same group in HES; 15% were coded in a ‘White’ group; 10% were coded in a ‘South Asian’ group; 7% were coded as ‘Other Mixed’; and 5% had no ethnicity recorded in HES (online supplemental table 1). None of these codes is fully inconsistent with the ‘White/South Asian’ group in MIS. Similar patterns were seen for groups within white and black: only 60% of those recorded as ‘White Irish’ in MIS were assigned to the same group in HES, but a further 31% were assigned to ‘White British’ or ‘Other White’ groups; for women coded as ‘Other Black’ in MIS, only 45% were assigned to the same group in HES (the lowest agreement of all groups), but a further 38% were recorded as ‘Black African’ or ‘Black Caribbean’, which again may not be fully inconsistent (online supplemental table 1).

Sensitivity of statistical analyses to ethnicity data source

The overall rates and rate ratios comparing OASI and emergency caesarean section by ethnic group were very similar regardless of whether HES or MIS was used to classify ethnicity (table 3 and online supplemental table 2). However, small differences were seen in the estimates of the rates of caesarean section, and the rates and rate ratios of OASI, in the mixed and other groups, which were the ethnicity groups with the lowest agreement between data sources. For example, the estimated rate of caesarean section in women from mixed ethnic groups was 14.9% (95% CI 14.4% to 15.4%) in MIS and 15.3% (14.9% to 15.6%) in HES (table 3).

Table 3

Rates of emergency caesarean section and obstetric anal sphincter injury by ethnic group using both HES and MIS coding structures, among 1 056 029 women with a record in both data sets who had a singleton term birth between 1 April 2015 and 31 March 2017

Supplementary analyses

The prevalence of white ethnicity was lower in women in this study than in women aged 16–49 in the aggregate census data (82.6% in census data, and 77.1% and 75.7% among complete values in MIS and HES, respectively). Women in HES were twice as likely to have their ethnicity recorded as mixed as women in MIS or the census (4.0% compared with 1.9% and 2.3%, respectively) (online supplemental table 3).

Women whose recorded ethnicity was black or mixed, or who did not have a recorded ethnicity, were less likely to have had complete identifying information in both data sets to enable linkage between the data sets than white women (7.1%, 7.3% and 7.6% unlinked compared with 5.7%); women with ethnicity recorded as South Asian were more likely to have linked data (4.6% unlinked) (online supplemental table 4).

Discussion

Main findings

Ethnicity is complete in administrative hospital records for 91% of women giving birth in England. Overall, administrative and maternity data sets demonstrated very good agreement on aggregated ethnicity group, with κ over 0.85. However, there was poor agreement on the recording of mixed ethnicity, with a substantial proportion of those women coded as mixed ethnicity in HES recorded as white in MIS. In addition, women who had their ethnicity coded as black or mixed in their maternity record were less likely to have their record linked to an administrative record. Estimates of associations between ethnicity and each of a common and uncommon outcome were largely unaffected by the data source for ethnicity.

These results indicate that ethnicity in HES for women giving birth in England is highly complete, with good validity when compared with other data sources, and can be used to draw robust conclusions about associations between aggregated ethnicity groups and outcomes. The exception is the coding of mixed and other ethnicity, which is inconsistent between data sources and for which results are not entirely robust to the choice of data source. Furthermore, for analyses using linked data sets, statistical approaches such as imputation for missing data are needed to deal with the lower linkage rate for women from black and mixed ethnic backgrounds, so that these women are not under-represented in such studies. Studies which are restricted to only those individuals with complete information about ethnic group may exclude a substantial proportion (in our study, 9%) of the population.

Strengths and limitations

This study uses a large data set of self-reported ethnicity in a young, ethnically diverse population to validate ethnicity in HES. MIS have been in widespread use for more than a decade; the recording of this information was mandatory at the time of coding15 and is known to be used for quality monitoring.7 16 Primary data collection on self-reported ethnicity would have to be extensive to ensure appropriate representativeness and this is expensive and logistically challenging. Our approach, using two routine data sets to establish validity, ensures the study is robust while maintaining feasibility and cost-effectiveness.

The main limitation of this study is that it is restricted to largely healthy women of childbearing age. However, we have no reason to think that these findings would not translate to other groups of healthcare service users, or that the quality of ethnicity data would be worse in those groups. A further limitation is that the two data sources may not be entirely independent; it is possible that in some hospital settings or in some cases, both sources are derived from the woman’s reported ethnic group at the time of booking her pregnancy (eg, if coding in both data sets is derived from a single set of paper medical notes).

Comparison with existing literature

In common with previous studies in other areas of health and older populations, we found that recording is more often inconsistent between data sets in mixed ethnic groups.4 5 Previous studies have demonstrated that the quality of routinely collected ethnicity data in HES has increased over time.4 We provide up-to-date information about the validity of ethnicity recording in HES, and additionally demonstrate that completeness of information required for linkage to external data sets is lower in minority ethnic groups.

Our cohort has a higher proportion of non-white women than the 2011 census, and particularly of women from South Asian backgrounds. This finding may be partially explained by population changes in the intervening years, but also aligns with previous evidence that women in South Asian groups, particularly Pakistani and Bangladeshi women, have a higher fertility rate than other ethnic groups in the UK.17

Our finding that women from minority ethnic groups are less likely to have the relevant information to enable linkage to other data sets including HES has been demonstrated elsewhere.18 This is an important source of potential bias in analysis.

Implications

COVID-19 has emphasised the extent to which existing ethnic and socioeconomic inequalities continue to govern health outcomes.1 19 In women giving birth, as across many areas of healthcare, it is well recognised that those from non-white ethnic groups experience poorer outcomes in the UK and across the world.20–25 Reducing these inequalities requires a multifaceted approach, including access to good-quality data for monitoring care and outcomes stratified by ethnic group.2 Electronic health records offer the potential to understand the associations between ethnicity and healthcare and outcomes in more detail, using statistical methods to understand to what extent associations are mediated through other factors such as socioeconomic deprivation and comorbidities. The findings of this study demonstrate that such studies could draw robust conclusions.

The potential of these analyses is, however, limited by incompleteness, inconsistencies and selective missingness in records for individuals from ethnic minority groups, including missing identifying information which may inhibit linkage. Reasons for this are likely to be multifactorial. Women from ethnic minorities are more likely to be recent immigrants to the UK and therefore to have no NHS number, which would enable linkage from their maternity records to HES. Women born outside the UK or in ethnic minority groups are also more likely to book late for antenatal care, which may limit the completeness of their data.21 26 27

Among individuals from mixed groups, although discrepancy between data sets is greater, the coding is not always fully conflicting: many women recorded as a mix of two ethnicities in one data set are recorded as only one of these in the other data set. This may be due to limitations in the design of the data collection framework, in the input of the data, or in both.2 The design of the current classification system is limited: for many people from mixed ethnic groups, which are heterogeneous, there are no appropriate categories for inclusion. This may directly affect the data input: when faced with a classification which does not adequately reflect their ethnic group, individuals may default to a choice which is consistent with societal expectations and inbuilt structural racism,2 which may in part explain why in this study 42.6% of women who were coded in HES at birth as mixed reported their ethnicity to their midwife as white. Furthermore, inconsistent input may reflect true variation in self-perceived ethnicity: there is evidence from sibling studies that there may be uncertainty about parental ethnic origin, leading to inconsistent self-reports of ethnic group.28 Some established minority groups, for example, mixed groups which do not fall into available categories (eg, mixed black/South Asian), are not explicitly included in the current classification system. This lack of inclusion limits appropriate classification, introducing inconsistencies in the data and preventing studies from establishing health outcomes in these minority groups.

For researchers and policymakers using these data, it is important to understand the potential biases introduced by misclassification and missing information. Robustness can be improved by using linked data sets to improve completeness and by performing sensitivity analyses to assess bias in the recording of ethnicity. The strength of analyses can be further improved by clear approaches to missing data: if, as in this study, other linked data sources exist, these can be used to inform imputation procedures for missing data. Rather than simply infilling values from one data set into another, values of ethnicity in one data set can be included in a model to impute missing values in the other data set using multiple imputation. This approach has the advantage that it takes into account the uncertainty due to missing data while incorporating the (very informative) information in the other data set.
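One way to picture the imputation approach described above is a hot-deck scheme: missing HES groups are drawn from the observed HES groups of women with the same MIS group, the draw is repeated to give several completed data sets, and estimates are pooled with Rubin's rules. The sketch below is a simplified stand-in for a full imputation model, with hypothetical function names; a real analysis would typically condition on further covariates and use established multiple imputation software.

```python
import random
from collections import defaultdict


def impute_hes_group(records, m=5, seed=0):
    """records: list of (mis_group, hes_group_or_None). Returns m completed lists."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for mis, hes in records:
        if hes is not None:
            donors[mis].append(hes)
    completed = []
    for _ in range(m):
        # Approximate Bayesian bootstrap: resample the donors first so that
        # uncertainty about the conditional distribution is carried through.
        resampled = {mis: rng.choices(pool, k=len(pool)) for mis, pool in donors.items()}
        filled = []
        for mis, hes in records:
            if hes is None:
                pool = resampled.get(mis)
                if not pool:
                    raise ValueError(f"no donor records for MIS group {mis!r}")
                hes = rng.choice(pool)
            filled.append((mis, hes))
        completed.append(filled)
    return completed


def pool_rubin(estimates, variances):
    """Pool m >= 2 estimates: returns (pooled estimate, total variance)."""
    m = len(estimates)
    mean = sum(estimates) / m
    within = sum(variances) / m
    between = sum((e - mean) ** 2 for e in estimates) / (m - 1)
    return mean, within + (1 + 1 / m) * between
```

The analysis of interest is run on each completed data set and the results are passed to `pool_rubin`, so the between-imputation variance reflects the uncertainty due to the missing values.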

We recommend that those using ethnicity data in HES where possible test their results by repeating their analysis using other available primary sources, and that HES is used primarily to draw conclusions about associations between aggregated ethnic group and outcomes, with use of individual ethnic groups treated more cautiously, particularly in mixed and other ethnic groups.

Conclusions

Our findings support the validity of the use of ethnicity, collapsed into aggregated groups of white/South Asian/black/mixed/other, in administrative hospital data in England, both for monitoring care by ethnic group and for understanding the associations between ethnic group and outcomes. However, while ascertainment of ethnic group can be improved by using multiple data sources, there remains a need to improve the completeness and accuracy of recording, particularly among people from mixed ethnic groups, where reporting may be limited by a lack of appropriate categories and may be vulnerable to inconsistencies in self-reporting. Researchers and analysts should be aware of the potential for misclassification bias, particularly among mixed and other ethnic groups and when the most granular level of available data is used. Analysts should also be aware of the potential for linkage bias due to lower levels of the identifying information required for linkage in records of individuals from ethnic minority groups. National efforts are required to improve the quality, completeness and accuracy of coding of ethnic group in administrative hospital data; to ensure equity in the recording of identifying information; and to provide appropriate and up-to-date classification systems for ethnicity.

Ethics statements

Ethics approval

This study used data collected to evaluate service provision and performance and therefore it was exempt from ethical review by the NHS Health Research Authority. The use of data without patient consent was approved by the Confidentiality Advisory Group of the NHS Health Research Authority for the purpose of national clinical audit and health service evaluation (16/CAG/0058).

Acknowledgments

We are grateful to NHS Trusts and NHS Digital for supplying the data for this study, and to Jan van der Meulen for his comments on drafts of this manuscript.

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Twitter @jenejardine

  • Contributors JEJ and KW conceived the study. JEJ and AF performed the analyses. JEJ, AF, MC, IGU, DP and KW evaluated the results. JEJ wrote the first draft. JEJ, AF, MC, IGU, DP and KW revised the paper. KW is senior author.

  • Funding The National Maternity and Perinatal Audit is commissioned by the Healthcare Quality Improvement Partnership (HQIP), as part of the National Clinical Audit and Patient Outcomes Programme funded by NHS England and the Scottish and Welsh Governments. Neither HQIP nor the funders had any involvement in designing the study; collecting, analysing and interpreting the data; writing the report; or in making the decision to submit the article for publication.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.