Article Text

Download PDFPDF

Comparison of ethnic group classification using naming analysis and routinely collected data: application to cancer incidence trends in children and young people
  1. Lesley Smith1,
  2. Paul Norman2,
  3. Melpo Kapetanstrataki1,
  4. Sarah Fleming3,
  5. Lorna K Fraser4,
  6. Roger C Parslow1,
  7. Richard G Feltbower1
  1. 1 Division of Epidemiology and Biostatistics, School of Medicine, University of Leeds, Leeds, UK
  2. 2 School of Geography, University of Leeds, Leeds, UK
  3. 3 Cancer Epidemiology Group, Section of Epidemiology and Biostatistics, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
  4. 4 Department of Health Sciences, University of York, York, UK
  1. Correspondence to Lesley Smith; L.F.Smith{at}


Objective Inpatient Hospital Episode Statistics (HES) ethnicity data are available but not always collected and data quality can be unreliable. This may have implications when assessing outcomes by ethnicity. An alternative method for assigning ethnicity is using naming algorithms. We investigate if the association between ethnicity and cancer incidence varied dependent on how ethnic group was assigned.

Design Population-based cancer registry cohort study.

Setting Yorkshire, UK.

Participants Cancer registrations from 1998 to 2009 in children and young people (0–29 years) from a specialist cancer register in Yorkshire, UK (n=3998) were linked to inpatient HES data to obtain recorded ethnicity. Patients’ names, recorded in the cancer register, were matched to an ethnic group using the naming algorithm software Onomap. Each source of ethnicity was categorised as white, South Asian (SA) or Other, and a further two indicators were defined based on the combined ethnicities of HES and Onomap, one prioritising HES results, the other prioritising Onomap.

Outcomes Incidence rate ratios (IRR) between ethnic groups were compared using Poisson regression for all cancers combined, leukaemia, lymphoma and central nervous system (CNS) tumours.

Results Depending on the indicator used, 7.1%–8.6% of the study population were classified as SA. For all cancers combined there were no statistically significant differences between white and SA groups using any indicator; however, for lymphomas significant differences were only evident using one of the ‘Combined’ indicators (IRR=1.36 (95% CI 1.08 to 1.71)), and for CNS tumours incidence was lower using three of the four indicators. For the other ethnic group the IRR for all cancers combined ranged from 0.78 (0.65 to 0.94) to 1.41 (1.23 to 1.62).

Conclusions Using different methods of assigning ethnicity can result in different estimates of ethnic variation in cancer incidence. Combining ethnicity from multiple sources results in a more complete estimate of ethnicity than the use of one single source.

  • Ethnicity
  • Hospital Episode Statistics
  • naming algorithm
  • childhood cancer

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Statistics from

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.

Strengths and limitations of this study

  • We compared ethnic differences in cancer incidence trends in 0–29 years using ethnicity recorded in Hospital Episode Statistics and results from the naming algorithm program Onomap.

  • Combining ethnicity data from more than once source ensured complete ethnicity information was available for all patients in the study.

  • We considered three broad ethnic groups, white, South Asian and Other, but were unable to look at specific ethnic groups within these due to small sample sizes.

  • Name analysis is limited for the classification of ethnicity particularly to individuals of mixed ethnicities.


In order to investigate ethnic differences in health, reliable and complete data on ethnic group must be collected. Ideally this information will be collected at the individual level through a self-reported choice within a pre-established classification of ethnic groups, such as those defined by the Office for National Statistics.1 However in routine data sets this information is not always collected or the quality of the data collected may be unreliable.

Since 1995 in England it has been mandatory to collect ethnic group data for all hospital inpatient admissions, which are recorded in Hospital Episode Statistics (HES),2 which should be self-reported by the patient. Initially high levels of missing ethnic codes (either reported as ‘not known’ or ‘not stated’) were reported; in 1996/1997, 44% of inpatient admissions had missing ethnicity recorded.3 However, over time the completeness of ethnicity recording has improved; the percentage of hospital admissions with a missing ethnic group recorded decreased to 14% in 2011/2012.4 Analysis by patient age has shown that children have the highest percentage of missing ethnicity data within HES.5 Furthermore, there is evidence of incorrect coding of ethnicity within HES, which is particularly an issue for non-white patients.4 6

When routinely collected self-reported ethnicity data are not available, naming algorithm software provides a cost-effective method to assign an ethnic group to each individual, provided there is access to the individual’s name.7 When some degree of ethnicity information is already available, name-based classification can provide complementary information to help detect inconsistencies and complete missing data.7

South Asians include individuals whose ancestry is in the countries of the Indian subcontinent, including India, Pakistan and Bangladesh,8 and they make up the largest minority ethnic group in England and Wales, comprising 5.3% of the England and Wales population in the 2011 census.9 Two commonly used naming program to classify individuals into South Asian or non-South Asian groups are Nam Pehchan10 and South Asian Names and Group Recognition Algorithm (SANGRA).11

More recently Onomap has been developed, based on surnames and forenames from public name registers from over 26 countries, which classifies individual names into cultural ethnic or linguistic groups.12 13 It includes all ethnic minority groups in the UK, and unlike Nam Pehchan and SANGRA is not limited to the South Asian group only. The Onomap taxonomy classifies names into one of 185 different types, which are nested within 66 subgroups, which are then nested within 16 larger groups. Onomap has been validated based on regional data in Scotland on two data sets, birth registrations and a school pupil census, with the sensitivity for South Asians ranging from 72% to 75%.14 Onomap defines one group of names as Muslim with origins in the Middle East and this group cannot be directly mapped to the South Asian group. However, in the birth registration study,14 when this group was added to the South Asian group, the sensitivity increased to 90%.14 Onomap has also been validated with HES-recorded ethnicity in patients with cancer in the West Midlands with a sensitivity of 99.8% for white patients and 82% for South Asians, but it was very low for black and Chinese/Other groups (4% and 0%, respectively).15 This study assumed that HES-recorded ethnicity was the ‘gold standard’ and excluded 23% of patients as they did not have a valid ethnic group recorded in HES.15 Neither of these studies used the alternative ethnic group indicators to assess ethnic differences in health outcomes.

In this study we examined cancer incidence trends in children and young people, specifically focusing on groups where routine HES ethnicity data are more likely to be missing,5 using data from the Yorkshire Specialist Register of Cancer in Children and Young People (YSRCCYP).16 Our first aim was to determine the level of agreement between HES and Onomap. Second we investigated if the association between ethnicity and cancer incidence trends varied according to the ethnicity source used.


Data were extracted from the YSRCCYP database,16 which is a population-based register of children and young people (0–29 years) diagnosed with cancer residing in the Yorkshire and Humber region.16 Data on all registered patients diagnosed between 1998 and 2009 were extracted. Diagnoses were categorised into histological groups according to the International Classification of Childhood Cancer, 3rd Edition.17 We analysed incidence rates for all cancers combined and the three most common diagnostic groups in this age group: leukaemia, lymphoma and central nervous system (CNS) tumours. Other data items extracted included age at diagnosis, year of diagnosis and sex. Patients’ names were available from the YSRCCYP database only and not from HES data.

HES ethnicity

Patients were linked to inpatient HES data (1996–2015) using National Health Service (NHS) number, date of birth, sex and postcode by NHS Digital using their standard deterministic algorithm18; 97.7% of patients were matched to at least one hospital admission. Ethnicity recorded in HES is based on ethnic groups used in the census, and different ethnic groups were recorded in HES from 2001 onwards to reflect changes to the census ethnic group categories,19 for example the inclusion of mixed and Other Asian groups was only available from 2001 onwards. We initially considered five higher level ethnic groups (white, South Asian (including Indian, Pakistani and Bangladeshi), black, Chinese and Other (including mixed and Other South Asian)) (see online supplementary table 1). Within HES data each patient may have more than one admission and multiple ethnic codes may be recorded. For patients with multiple ethnicities recorded (based on the five categories above) (n=195, 5%), the most common ethnic group was assigned to each individual.4 20 21 Due to small numbers in the black and Chinese groups, ethnicity was then categorised as white, South Asian or Other.

Supplemental material

Onomap ethnicity

The surname and forename recorded in the YSRCCYP database for each person were matched to an Onomap type, which is the lowest level in the Onomap classification system. As above, we initially defined individuals into five higher level ethnic groups (see online supplementary table 2) but then aggregated these to white, South Asian or Other to be consistent with the ethnicity groups from HES. Details of the Onomap types included in each of the ethnic groups are shown in online supplementary table 2. The South Asian group includes Indian, Pakistani and Bangladeshi, and a further Onomap type ‘South Asian’. Names that Onomap was unable to recognise were classified as missing (n=14, 0.4%).

Supplementary table 3 shows the numbers in each of the five-category ethnic groups and subgroups for HES and Onomap.

Combined ethnicity

Two further ethnic group indicators were defined based on the combined results of the HES and Onomap ethnic groups (shown in figure 1). If both sources agreed then this group was assigned to each patient. If either source was missing (unclassified in the case of Onomap) but the other was not, then we used the available ethnic group. A further Onomap type is the Muslim group, which is defined as having origins in the Middle East and this group cannot be directly mapped to the South Asian group.14 Where Onomap assigned Muslim and HES was recorded as South Asian, we assigned a South Asian ethnicity. If Onomap and HES ethnic groups were missing, then we assigned the individual to the Other ethnic group (as not picked up as either of white or South Asian origin in the naming algorithm). There were still 151 (3.8%) patients with discrepancies between the two classifications; therefore, we created two further ethnic group indicators: one where these patients were assigned to the HES ethnic group (labelled ‘Combined HES priority’ (CHP)) and another where they were assigned to the Onomap ethnic group (labelled ‘Combined Onomap priority’ (COP)).

Figure 1

Flow chart of ethnic classification for YSRCCYP data. HES, Hospital Episode Statistics; SA, South Asian; YSRCCYP, Yorkshire Specialist Register of Cancer in Children and Young People.

We included four ethnic group indicator variables based on different sources in our analysis: HES, Onomap and ‘Combined HES priority’ and ‘Combined Onomap priority’, which included the combined categories of HES and Onomap as described above and shown in figure 1.

Populations at risk

Denominators for 1998–2009 for the Yorkshire and Humber region were aggregated from the local government districts which comprise the study area. Midyear population estimates by sex, year, 5-year age band and ethnic group (white, South Asian and Other) were obtained from the ‘ETHPOP’ database for 2001–2009.22 Data for the midyears prior to 2001 were estimated by back-casting the population. The 1998–2009 time-series of annual estimates by ethnic group were constrained to be consistent with the official midyear estimates.23 The ethnic group population estimates were based on broad ethnic groups; therefore, the South Asian group included the Other South Asian category, which was included in the 2001 census but not prior to this. In the 2001 census the Other South Asian group made up 0.5% of the England and Wales population.9 While inclusion of this group in the population denominators may bias our results as there was a discrepancy between the included ethnicities in our numerator and denominator, this was likely to be small as the estimation of rates is much more sensitive to changes in the numerator compared with the denominator.24

Statistical analysis

Agreement between ethnicity from HES and Onomap was measured by the kappa statistic. Direct age standardised incidence rates (ASR) were calculated according to the European Standard population and reported per 1 000 000 person years. Rates over time were presented using four 3-year time periods. Incidence rates were modelled using Poisson regression adjusting for 5-year age group, sex and year of diagnosis. All rates and models were calculated for each of the four ethnic group indicators for all cancers combined, then separately for leukaemia, lymphoma and CNS tumours.


Between 1998 and 2009, 3998 children and young people (0–29 years) were diagnosed with a primary cancer in the Yorkshire and Humber region. Onomap was unable to classify 14 names (0.4%) while HES ethnicity was missing in 455 (11.4%) patients; 90 of these did not match to any HES data while for the remaining 365 ethnic group was not known or not stated in the HES data. After excluding missing data, 87%–88% of the study population were classified as white. The percentage of patients classified as South Asian was 7.1% for HES and Onomap, 7.5 for CHP and 8.6% for COP (table 1). The percentage of patients in the Other ethnic group ranged from 3.0% (COP) to 5.3% (CHP). There was excellent agreement between HES and Onomap ethnic groups (kappa=94.1%).

Table 1

Ethnic group classification by source, n (%)

For the white group using HES only provided lower estimates of incidence rates than using Onomap or either of the ‘Combined’ indicators, which all produced similar rates (figure 2, table 2): for all cancers combined, ASR for HES 154 per 1 000 000 population (95% CI 149 to 159) compared with 174 for Onomap (95% CI 168 to 180), 172 for CHP (95% CI 166 to 177) and 174 for COP (95% CI 169 to 180). For the South Asian group, the ASRs were lowest based on HES (ASR=138 (95%CI 121 to 155)) and highest based on COP (ASR=187 (95% CI 167 to 206)). The ASRs for the Other group showed the greatest variability ranging from 135 (95% CI 110 to 159)) based on COP to 241 (95% CI 209 to 274)) based on CHP. These trends were also observed across diagnostic groups (table 2).

Table 2

ASR (95% CI) (directly standardised to the European Standard population) per 1 000 000 population by ethnic group, for all cancers and by diagnostic group, 0–29 years, Yorkshire, 1998–2009

For all cancers combined, generally the trends over time were in the same direction for all four indicators for the white and South Asian groups; for the white group the rates were stable over time and there was some indication of an increase over time for South Asians. For the Other ethnic group the direction of the trends varied by indicator mainly due to small numbers within this group (figure 2).

Figure 2

Trends in ASR by ethnic group using different ethnicity classifications, all cancers combined. ASR, age standardised incidence rate; HES, Hospital Episode Statistics.

Results from the Poisson regression models (table 3) showed that for all cancers combined, statistically significant differences between ethnic groups were evident using three of the four indicators but with differing direction of associations found between the white and Other ethnic groups. Incidence was 32% higher in the Other ethnic group based on HES (95% CI 1.13 to 1.53), 41% higher based on Onomap (95% CI 1.23 to 1.62) and 22% lower based on COP (95% CI 0.65 to 0.94). For lymphoma incidence was higher in South Asians when using COP only (incidence rate ratio=1.36 (95% CI 1.08 to 1.71)), while for CNS tumours relatively consistent differences were found based on three indicators (HES, Onomap and CHP), where incidence was between 27% and 35% lower in South Asians compared with whites.

Table 3

IRR and 95% CI from multivariable models by ethnic group, for all cancers and by diagnostic group, 0–29 years, Yorkshire, 1998–2009


It is essential to have valid and complete ethnicity data in order to investigate differences between ethnic groups and develop strategies to tackle inequalities and allocate resources appropriately.25 We set out to determine if using different sources of ethnicity data would result in different conclusions being made when assessing ethnic inequalities in cancer incidence in children and young people, comparing ethnic groups recorded in HES and ethnic groups based on patient names obtained from the Onomap program. These two sources were then combined to create alternative ethnic groups using information from both sources. Our results show that in general incidence trends over time were similar in direction regardless of the ethnicity source used. However, the size of the estimates and the magnitude of the differences between ethnic groups varied depending on the indicator used. This may even lead to differences in statistical significance and therefore different conclusions being drawn: for example, considering all cancers combined, the two combined indicators resulted in differing directions of associations between the white and Other ethnic groups. However for CNS tumours three of the four indicators produced similar estimates.

Some studies have found incidence rates in children and young adults to be higher in South Asians for all cancers combined,26–28 while others have found similar rates in South Asians and non-South Asians.16 29 30 These studies included a variety of sources to assign ethnic group, including Nam Pehchan,26 30 a combination of Nam Pehchan and SANGRA,29 a combination of Nam Pehchan and HES,16 HES alone but only including the Indian group,27 and ethnicity recorded in medical records.28 Several of these studies used a dichotomous classification of South Asian and non-South Asians.16 26 29 30 In our study we were able to consider three ethnic groups white, South Asian and Other, although the numbers in the Other group were based on relatively small numbers. Childhood cancers are rare and our analysis showed that grouping a few cases in different groups may change the direction and statistical significance of results, as we observed for the Other group for all cancers combined.

It is now mandatory to collect ethnicity data in HES.2 Although the completeness of this information has improved over time, there are still a substantial percentage of HES episodes without ethnicity recorded,3 including 9% of our study population. These patients were excluded for the analysis of HES data and we observed that the ASR for the white group based on HES ethnicity alone was lower compared with the other ethnicity indicators, which suggests that missing HES ethnicity data may be more likely for white patients. Another issue with using HES ethnicity is that it is collected for each admission and patients in this study may have had more than one hospital admission (94% of patients had more than one admission with a median of 17 admissions per patient (IQR 7–37)), and therefore potentially more than one ethnic group recorded; this applied to 5% of our study population. Several studies have recommended assigning the most common ethnic group as this uses the most information for each patient4 20 21; we also adopted this strategy to deal with multiple recordings of ethnicity. Of the 5% of patients with more than one ethnic group recorded, according to the most common HES ethnic group, 37% were white, 30% South Asian and 33% Other, compared with Onomap ethnic group distribution of 46% white, 35% South Asian and 19% Other (agreement of 63%). We only considered three broad ethnic groups, but further discrepancies may be evident when considering the full range of ethnic groups recorded in HES. The validity of HES-recorded ethnicity has been compared with self-reported survey data for patients with cancer in England in 2011 and found that 5% of patients had ethnicity incorrectly recorded in their HES record, with greater levels of incorrect coding for ethnic groups other than white British.6 Another study assessing the completeness and usability of HES data in England from 1997 to 2001 found different ethnic groups were recorded for 6% of inpatients with ethnicity recorded on more than one occasion.4 Not only is it important to ensure the completeness of HES ethnicity, the accuracy of the data needs to be improved particularly for ethnic groups other than white British. Users of HES data should be aware of the potential misclassification of ethnicity within the data.

Onomap has been validated in several studies,14 including against HES15; however, these studies did not assess any outcomes in relation to the different ethnicity sources. Onomap has also been used to assign missing ethnicity information for patients when it was lacking from other hospital and clinical data sets.31 We used Onomap to assign ethnic group when HES data were missing, but our combined indicators also considered both sources of ethnic group to improve precision. For example, Onomap classified 3% of our study patients as Muslim with origins in the Middle East, and this group could not be mapped directly to be of South Asian origin14; however, by combining with HES data, we were able to define these patients as South Asian if this was also recorded in HES. When combining HES and Onomap ethnic groups, there were still a small number of discrepancies between the two sources and we found that using either source to assign ethnic group resulted in very similar patterns of trend and conclusions for the white and South Asian group but not for the Other group, mainly due to the small sample size.

Ethnic group information has been included in UK censuses since 1991 and provides important denominator data for the calculation of rates.32 In this study we use these denominators for all calculations of rates regardless of the indicator used (HES or Onomap or the ‘Combined’ indicators). Unfortunately we were unable to estimate the population denominator for the South Asian group excluding the Other South Asian group; therefore, there was a slight discrepancy between the definitions of the South Asian group in the numerator and denominator for the calculation of rates. However, we think this will have limited impact on our results as population estimates are in themselves an estimate and therefore prone to error, and these were based on ethnicity collected from the census combined with other sources, all of which may also be prone to classification bias.33 Further, the estimation of the rates is much more sensitive to changes in the numerator rather than the denominator, particularly in this study where we have calculated rates per million population as standard in childhood cancer epidemiology. Therefore even with these caveats the population denominator data used represent the best available data to estimate these rates and are still valid.

One of the main limitations of name analysis is the classification of ethnicity to individuals of mixed ethnicities and women’s ethnicity in mixed marriages.7 Names analysis also may not be able to distinguish between names that are common to more than one ethnic group. Over time as population trends change, such as increases in marriage between ethnic groups, the use of names analysis may be more limited.7 34 In the future we may need to collect more than ethnicity data alone and include additional items such as language and religion to represent the multidimensionality of ethnicity.35 This may be further complicated in studies of children and young people as a proxy report (from parent or carer, for example) may be needed for infants and young children until children reach an age when they are able to report their own ethnic group; therefore, depending on the age range of the study, there may be more than one method used to ascertain ethnicity.

In our study we used three broad ethnic groups, white, South Asian and Other. Although there is heterogeneity within these broad ethnic groups, we were unable to look at specific ethnic groups within these due to small sample sizes. Much of our South Asian group was of Pakistani origin, which reflects the ethnic distribution of the Yorkshire and Humber region.36 Ideally we would consider the Pakistani, Indian and Bangladeshi groups as separate groups; however, this was not possible in this study due to small numbers in the Indian and Bangladeshi groups. Due to changes in the census ethnic categories over time, we were unable to include the Other South Asian group in the South Asian group as this information was not available in HES prior to 2001, so was included in the Other ethnic group. The Other group comprises a heterogeneous mix of ethnicities, therefore limiting the interpretation of the results, but they were included in this study for completeness.

Many studies lack information on the methods used to record and classify ethnic groups35 and exclude patients where ethnicity data are missing. It is important that individuals with missing ethnicity are not excluded, and alternative methods, such as linking to other data sources and utilising naming algorithms, are used to assign an ethnic group.25 This study proposes an alternative method to combine two commonly used ethnicity sources in epidemiological studies. If ethnicity is not self-assigned by patients in a source but names are available, it seems likely that Onomap has the potential to provide results with research utility. Further examination and validation of these methods in other studies is needed.


Self-reported ethnicity is the gold standard; however, this is not always available, and in routine data sets such as HES may be incomplete or inaccurate, and therefore other ways to obtain ethnicity data may be needed, including the application of naming algorithms. Utilising different ethnicity sources alone may result in different conclusions regarding inequalities between ethnic groups; therefore, combining ethnicity from more than one source, even if incomplete, provides a more complete and robust estimate of ethnicity.


We are grateful to Paula Feltbower for meticulous data collection, and the cooperation of all oncologists, pathologists, GPs and medical records staff in Yorkshire. We thank the NHS Digital for provision of linked HES data.


  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34.
  35. 35.
  36. 36.


  • Contributors LS, PN, RCP and RGF contributed to the study concept and design. LS analysed the data and drafted the manuscript. LS, PN, MK, SF, LKF, RCP and RGF contributed to the interpretation of results and critical revision of the manuscript. All authors approved the final manuscript.

  • Funding This work was supported by the Candlelighters Trust, which funded the Yorkshire Specialist Register of Cancer in Children and Young People.

  • Competing interests None declared.

  • Ethics approval The YSRCCYP has ethical approval from the Northern and Yorkshire Research Ethics Committee (MREC/00/03/001) and approval for holding identifiable patient data from the National Information Governance Board for Health and Social Care under section 251 of the NHS Act (2006).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement Data from the YSRCCYP are not publicly available. The data supporting the conclusions of this article are included within the article.