Linkage of survey data with district-level lung cancer registrations: a method of bias reduction in ecological studies
Gillian A Lancaster (1), Mick Green (2), Steven Lane (1)

(1) Centre for Medical Statistics and Health Evaluation, University of Liverpool, Liverpool, UK
(2) Centre for Applied Statistics, Fylde College, Lancaster University, Lancaster, UK

Correspondence to: G Lancaster, Centre for Medical Statistics and Health Evaluation, University of Liverpool, Shelley’s Cottage, Brownlow Street, Liverpool, L69 3GS, UK; g.lancaster{at}liv.ac.uk

Ecological studies play a useful part in establishing an initial association between potential risk factors and disease outcomes.1 These studies use data measured at an aggregate level for geographically defined small areas, because individual-level information is not readily available or is too costly to obtain. Data are usually taken from existing disease registries, vital statistics or the national census. When the results of an ecological analysis are used to make inferences on the people living in an area, they are prone to a bias known as the ecological fallacy effect, which can also be called aggregate bias or cross-level bias.2 As an illustration, Gatrell et al3 studied access to tertiary cardiac services in east Lancashire and found that in electoral ward areas with a higher proportion of Asians, there was lower uptake of angiograms and percutaneous transluminal coronary angioplasty operations. However, although they postulated that it was the Asians living in the wards who had the lower uptake of services, it could equally well have been the white people living in those wards. The finding of an ecological association between ethnicity and uptake therefore warranted further investigation at the individual level to confirm this hypothesis. The problem of the ecological fallacy has been discussed in many publications, with possible reasons for its occurrence.4–6

In epidemiology and health services research, the availability of information on risk factors for ecological analysis is limited, particularly at smaller units of aggregation, such as electoral wards. The UK national census has provided socioeconomic information at the enumeration district, ward and local authority district levels, but only one measure of ill health—namely limiting long-term illness—was available in the 1991 census.7 To overcome this, disease rates taken from different sources are often amalgamated with the census socioeconomic information for ecological analysis of other health outcomes.8 As an exception, the Office for National Statistics Longitudinal Study is a specially commissioned survey that links census data to cancer data.9 However, census data do not contain risk factors of lifestyle, smoking, diet and the environment. In large-scale health surveys where this information on risk factors is available—for example, the Health Surveys for England10 or the Health and Lifestyle Survey11—the health outcome of interest may not have been recorded. It would therefore be beneficial to be able to link survey information to other health outcome data, which can be carried out using a stratified ecological model.12 In this model, each areal unit is divided into strata based on age and sex groupings, and information from a survey is used to provide covariate data, cross-tabulated by age and sex, on the people in each stratum. This is possible when survey data include an aggregate-level identifier for each person as in the UK Census Sample of Anonymised Records.13 However, the identifier is usually restricted to the local authority district or higher levels for confidentiality reasons.

The aim of this study was to investigate the potential of the stratified ecological model, compared with the standard ecological model, for reducing bias in ecological studies. Analyses were carried out to examine the associations between socioeconomic risk factors and lung cancer in the north of England.

METHODS

Population

All people aged between 45 and 74 years (3 667 188 people) living in 92 local authority districts in Greater Manchester, Merseyside, Tyne and Wear, Cleveland, Humberside, Cheshire, Cumbria, Durham, Lancashire, Northumberland and Yorkshire in 1991 were included in the analyses.

Sources of data

The covariate data for local authority districts for the standard ecological analysis were extracted from the UK national census datasets for 1991, which are available through the Manchester Information and Associated Services national data centre at Manchester University (found at http://www.mimas.ac.uk/). The main outputs from the 1991 census are tables of aggregate data for constituent areas of Great Britain, called the Small Area Statistics and Local Base Statistics,14 which contain information on households and people enumerated through detailed self-completed questionnaires on the day of the census.

The covariate data for the stratified ecological analysis were taken from the census Sample of Anonymised Records (SAR).13 This is a 2% sample drawn from the 1991 census, from which identifying information has been removed to protect confidentiality. The SAR are microdata files with a separate record for each person, similar to the data obtained from a sample survey. The SAR covers the full range of census topics including housing, education, health, transport, employment and ethnicity.

Population outcome data on lung cancer registrations were obtained from three regional cancer registries covering the north of England. Regional cancer registries across the UK have been collecting population-based cancer data for the past 40 years and supply data to the Office for National Statistics for the provision of national cancer statistics. All the UK registries collect information on every new diagnosis of cancer occurring in their regional populations. Their main priority has been to ensure a uniform process for registering cancers regionwide, which will deliver timely, comparable and high-quality data. The main sources of registrations are from pathology reports, medical records, radiotherapy records, hospices, independent hospitals, specialist tumour registers, screening services and death certificates.

Socioeconomic risk factors

Six socioeconomic risk factors from the 1991 Census Small Area Statistics (SAS)14 were extracted for all residents aged between 45 and 74 years living in households. These SAS data provided the covariate information for the standard ecological analysis carried out at the local authority district level. The data were in tabulated form aggregated by district and, in some cases, data restrictions meant that the exact subgroup of residents could not be selected; for example, the age range might differ, or all residents had to be selected where data were not separately available for those living in households. Each category of a multicategory covariate was represented by a separate variable expressed as the proportion of people falling in that category (eg, proportion who were employed, proportion who were unemployed, proportion who were economically inactive, etc).15 The six covariates, with categories expressed as proportions, were ethnicity (white or non-white), housing tenure (owner-occupier, renting privately or renting from local authority), car ownership (one car, no car, or two or more cars), social class (I+II, III non-manual, III manual or IV+V), employment status (employed, unemployed or economically inactive) and qualifications (qualified with a diploma or degree or unqualified). The reference category was taken to be the first category listed in each case.

The same covariate information was extracted at the individual level from the census 2% SAR.13 These data were used in the stratified ecological analysis to provide more detailed information on the associations between age, sex and socioeconomic status. In the stratified analysis, each local authority district was stratified into 14 age and sex groups. The socioeconomic data were cross-tabulated with age (40–44, 45–49, 50–54, 55–59, 60–64, 65–69 and 70–74 years) and sex (male or female) to obtain a unique covariate proportion for each age and sex grouping in each district. These covariate data were then incorporated into the stratified ecological model (see Statistical analysis). Categorical covariates representing each age and sex group were also included in the model, which enabled an age and sex interaction term to be fitted.
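The cross-tabulation step described above can be sketched as follows. This is a minimal illustration only: the records, district labels, category names and the `stratum_proportions` helper are all hypothetical, and the real SAR files have a different layout.

```python
from collections import defaultdict

# Hypothetical individual-level records: (district, age_band, sex, tenure).
# These values are invented for illustration; they are not SAR data.
records = [
    ("A", "45-49", "M", "owner"),
    ("A", "45-49", "M", "rent_la"),
    ("A", "45-49", "M", "owner"),
    ("A", "45-49", "F", "rent_private"),
    ("A", "45-49", "F", "owner"),
    ("B", "45-49", "M", "rent_la"),
    ("B", "45-49", "M", "rent_la"),
]


def stratum_proportions(rows):
    """Return {(district, age_band, sex): {category: proportion}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for district, age_band, sex, category in rows:
        counts[(district, age_band, sex)][category] += 1
    result = {}
    for stratum, by_category in counts.items():
        total = sum(by_category.values())
        # One proportion per category within the age-sex stratum of a district.
        result[stratum] = {cat: n / total for cat, n in by_category.items()}
    return result


props = stratum_proportions(records)
for stratum in sorted(props):
    print(stratum, props[stratum])
```

Each stratum's proportions sum to one, so in a regression one category per covariate (the reference category) would be left out.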

Outcome measure

Population estimates of lung cancer registrations were obtained for all districts in the north of England from three regional cancer registries. The data were provided in a standard aggregate form as cross-tabulations of observed frequency counts by age, sex and district. This mirrored the usual form of outcome data for ecological analysis and required no special permission for their use. As some of the counts were small, amalgamation of some districts was carried out by the registry providers to maintain confidentiality before the data could be released. There were 52 “super districts” remaining after the amalgamations. These data were used to obtain age and sex-specific cancer rates for the north of England to indirectly standardise the disease rates. All new cases of lung cancer registered in the years 1993–6 were analysed in relation to the socioeconomic risk factors taken from the 1991 census. The lag between exposure and disease occurrence was to avoid ill people being socioeconomically reclassified into “unhealthy” categories owing to their being ill.16 For example, in relation to employment, someone who might normally have been in employment may become economically inactive because of their illness. Lung cancer was chosen, as it is one of the major cancer sites, and because there were already known socioeconomic differentials in the incidence of lung cancer17 that would provide an interesting illustration of the method.
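The indirect standardisation mentioned above can be illustrated with a short sketch. The counts and populations below are invented, only two strata are used instead of 14, and the variable names are hypothetical; the standard population is taken to be all districts pooled, as an analogue of using the north of England as the standard.

```python
# Hypothetical observed cases and populations by district and age-sex stratum.
cases = {
    "A": {"M45-49": 4, "F45-49": 2},
    "B": {"M45-49": 10, "F45-49": 4},
}
population = {
    "A": {"M45-49": 1000, "F45-49": 1000},
    "B": {"M45-49": 2000, "F45-49": 1000},
}
strata = ["M45-49", "F45-49"]

# Stratum-specific rates in the standard population (all districts pooled).
standard_rate = {
    s: sum(cases[d][s] for d in cases) / sum(population[d][s] for d in population)
    for s in strata
}

# Expected count per stratum: stratum population times the standard rate.
expected = {
    d: {s: population[d][s] * standard_rate[s] for s in strata} for d in population
}

# District totals, and the ratio of observed to expected cases.
expected_total = {d: sum(expected[d].values()) for d in expected}
observed_total = {d: sum(cases[d].values()) for d in cases}
ratio = {d: observed_total[d] / expected_total[d] for d in cases}

for d in sorted(ratio):
    print(d, observed_total[d], round(expected_total[d], 2), round(ratio[d], 2))
```

Because the standard rates are computed from the same pooled data, the expected counts sum to the observed total across districts.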

Statistical analysis

As some small districts had to be amalgamated to retain the confidentiality of lung cancer registrations, the socioeconomic census data were also amalgamated into 52 super district units for the analyses. The models were fitted by maximisation of the likelihood using standard statistical procedures in STATA V.8.2.

Standard ecological model

For the standard ecological analysis, the total number of observed people with lung cancer in each district was regressed on the SAS covariate proportions using a Poisson model of the form

$$\ln E(y_k) = \ln e_k + \sum_{j=1}^{J} \beta_j x_{jk}$$

where $y_k$ is the frequency of developing lung cancer in district $k$ ($k = 1, \ldots, K$), $x_{jk}$ is the $j$th ($j = 1, \ldots, J$) covariate value for district $k$ and $\beta_j$ is the parameter of the $j$th covariate to be estimated. Illness rates were indirectly standardised, taking the north of England as the standard population, to obtain the total number of people expected to have lung cancer in each district $k$ if that district experienced the same age and sex-specific rates of illness as the standard population. The log of the expected counts ($\ln e_k$) was included in the model as an offset. In this model, one observed and one expected frequency were used per district. Therefore, the unit of observation was the district, and the covariate proportions for each risk factor were measured at the district level.
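As a rough illustration of fitting a Poisson model with a log expected-count offset by maximum likelihood, the sketch below uses Newton-Raphson in plain Python. It is not the Stata procedure used in the study. The function names, the data and the inclusion of a constant column (an intercept) are assumptions made for the example, and the "observed" values are constructed to follow the model exactly so that the known coefficients are recovered.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_poisson(y, design, expected, iterations=25):
    """Newton-Raphson for ln E(y_k) = ln e_k + sum_j beta_j x_jk."""
    p = len(design[0])
    beta = [0.0] * p
    for _ in range(iterations):
        mu = [e * math.exp(sum(b * x for b, x in zip(beta, row)))
              for e, row in zip(expected, design)]
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        score = [sum((yk - mk) * row[j] for yk, mk, row in zip(y, mu, design))
                 for j in range(p)]
        info = [[sum(mk * row[i] * row[j] for mk, row in zip(mu, design))
                 for j in range(p)] for i in range(p)]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta


# Hypothetical districts: a constant column (assumed intercept) and one proportion.
proportion = [0.1, 0.2, 0.3, 0.4]
design = [[1.0, x] for x in proportion]
expected = [10.0, 20.0, 15.0, 30.0]
# Constructed (non-integer) outcomes that satisfy the model with beta = (-0.2, 1.5).
y = [e * math.exp(-0.2 + 1.5 * x) for e, x in zip(expected, proportion)]

beta = fit_poisson(y, design, expected)
rate_ratio = math.exp(beta[1])
print(beta, rate_ratio)
```

In practice a statistical package would be used, and real counts are integers that scatter around the fitted means.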

Stratified ecological model

In the stratified model, the observed and expected frequency counts were left expanded over each of the 14 age and sex strata used in the indirect standardisation procedure, to give 14 observations per district. The corresponding covariate information was then taken from the SAR cross-tabulations to obtain a unique covariate proportion for each stratum in each district. This was carried out using the SAR individual-level data because the SAS does not provide this level of information for every covariate. A similar Poisson model was applied to the data for the stratified ecological analysis as follows:

$$\ln E(y_{ks}) = \ln e_{ks} + \sum_{j=1}^{J} \beta_j x_{jks}$$

where $y_{ks}$ is the frequency of developing lung cancer in stratum $s$ of district $k$, $x_{jks}$ is the $j$th covariate value for stratum $s$ in district $k$, $\beta_j$ is the parameter of the $j$th covariate to be estimated and $\ln e_{ks}$ is the offset term. When age and sex terms are included in the model, the offset could be simplified to $\ln n_{ks}$, where $n_{ks}$ is the number of people in stratum $s$ of district $k$. This is because the age and sex terms, together with an age and sex interaction, should in theory adjust for any age and sex differences in lung cancer rates, so that standardisation should not be necessary. However, for ease of comparison between models, $\ln e_{ks}$ is used as the offset throughout this paper.
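The remark about the two offsets can be made explicit with a short derivation. The symbols $r_s$ (the standard-population rate in stratum $s$) and $\alpha_s$ (a separate effect for each of the 14 age and sex strata, which is what age, sex and their interaction amount to) are introduced here for illustration and do not appear in the original notation.

```latex
\begin{align*}
e_{ks} &= n_{ks}\, r_s
  \quad\Longrightarrow\quad \ln e_{ks} = \ln n_{ks} + \ln r_s, \\
\ln E(y_{ks}) &= \ln e_{ks} + \alpha_s + \sum_{j} \beta_j x_{jks} \\
  &= \ln n_{ks} + \underbrace{(\alpha_s + \ln r_s)}_{\alpha_s'} + \sum_{j} \beta_j x_{jks}.
\end{align*}
```

Under these assumptions, switching the offset from $\ln e_{ks}$ to $\ln n_{ks}$ only shifts the stratum effects by $\ln r_s$ and leaves the covariate parameters $\beta_j$ unchanged, provided a full set of stratum effects is fitted.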

RESULTS

Table 1 displays the characteristics of the SAS and SAR samples. It illustrates the restrictions of the SAS data for some variables with respect to differing denominator populations and age groups, causing some slight discrepancies in prevalence between the samples with respect to housing tenure, car ownership and social class.

Table 1

 Characteristics of the Small Area Statistics (SAS) and the Sample of Anonymised Records (SAR) samples used in the standard and stratified ecological analyses of lung cancer

Table 2 gives the results of the lung cancer data regressed on socioeconomic covariate proportions for the 52 super districts, comparing the standard and stratified ecological analyses. As the cancer rates were small in all age groups, Poisson models were applied. For comparability of deviances, the standard ecological model deviance was recalculated on an expanded dataset, where each age and sex stratum had the same covariate value repeated over the 14 categories. This created a data structure similar to that used for the stratified model containing (52×14) 728 observations, and provided identical parameter estimates and standard errors to the collapsed model containing 52 observations.
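A brief sketch of why the expanded and collapsed versions of the standard model give the same estimates: writing $\eta_k = \sum_j \beta_j x_{jk}$ for the district-level linear predictor (a shorthand introduced here), and using $y_k = \sum_s y_{ks}$ and $e_k = \sum_s e_{ks}$, the Poisson log-likelihoods are

```latex
\begin{align*}
\ell_{\text{expanded}}(\beta)
  &= \sum_{k=1}^{K}\sum_{s=1}^{14}
     \left[ y_{ks}\left(\ln e_{ks} + \eta_k\right) - e_{ks}\exp(\eta_k) \right]
     + \text{const} \\
  &= \sum_{k=1}^{K} \left[ y_k\,\eta_k - e_k \exp(\eta_k) \right]
     + \sum_{k=1}^{K}\sum_{s=1}^{14} y_{ks} \ln e_{ks} + \text{const}, \\
\ell_{\text{collapsed}}(\beta)
  &= \sum_{k=1}^{K} \left[ y_k\,\eta_k - e_k \exp(\eta_k) \right]
     + \sum_{k=1}^{K} y_k \ln e_k + \text{const}.
\end{align*}
```

The two expressions differ only in terms that do not involve $\beta$, so the score and information, and hence the parameter estimates and standard errors, coincide. The deviances differ because the saturated models being compared against are different.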

Table 2

 Rate ratios of lung cancer for years 1993–6 by standard and stratified ecological regression

The rate ratio (RR) results of the standard ecological analysis suggest that districts with a higher proportion of people living in local authority rented accommodation, with a higher proportion of people in social class III non-manual, or with a higher proportion of unemployed people, had a higher risk of lung cancer than the respective reference category. Districts with a higher proportion of non-white people, or with a higher proportion of people living in privately rented accommodation, had a decreased risk of lung cancer. Districts with a higher proportion of non-car owners or two-car owners, of people in social class III manual or social classes IV and V, or of economically inactive or unqualified people had no increased risk of lung cancer compared with the reference category. In the stratified ecological analysis, the spuriously large effect for unemployment was considerably reduced, as was that for the social class III non-manual group, so that social class now showed more of an increasing gradient in risk. In addition, districts with a higher proportion of non-car owners now had a significantly increased risk of lung cancer, and districts with a higher proportion of economically inactive people a significantly decreased risk. The effects for the non-white and privately renting groups became non-significant, indicating no evidence of a difference in lung cancer risk between white and non-white people or between those renting privately and owner occupiers. The effects of the two-car ownership and unqualified groups remained non-significant in both analyses.

A sensitivity analysis of men and women separately using the stratified approach showed that a social class gradient was more apparent in the analysis for men (III non-manual RR 2.28, 95% CI 1.38 to 3.75; III manual RR 1.68, 95% CI 1.24 to 2.28; IV and V RR 2.80, 95% CI 2.04 to 3.84) than for women (III non-manual RR 1.57, 95% CI 1.08 to 2.27; III manual RR 1.03, 95% CI 0.57 to 1.87; IV and V RR 1.63, 95% CI 1.12 to 2.36). Also, the risk of lung cancer in districts with a higher proportion of economically inactive men (RR 0.92, 95% CI 0.63 to 1.35) was not significantly different from the employed group, and districts with a higher proportion of women in privately rented accommodation had an increased risk of lung cancer (RR 3.71, 95% CI 1.63 to 8.45) compared with the owner occupier group. In all other respects, the results were similar to those for the combined analysis.
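To help read rate ratios such as those above, the following sketch moves between a rate ratio with its 95% CI and the underlying log-scale coefficient and standard error. It assumes that the reported intervals are Wald intervals (symmetric on the log scale) and that each covariate is a proportion on a 0 to 1 scale, so that the RR contrasts a proportion of 1 with a proportion of 0; neither assumption is stated explicitly in the text, and the function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def log_scale_summary(rr, lower, upper, z=Z_95):
    """Recover the coefficient and standard error implied by an RR and its CI."""
    beta = math.log(rr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


def rate_ratio_ci(beta, se, z=Z_95):
    """Rate ratio and Wald confidence limits from a log-scale coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Reported value for men, social class III non-manual: RR 2.28 (1.38 to 3.75).
beta, se = log_scale_summary(2.28, 1.38, 3.75)
rr, lower, upper = rate_ratio_ci(beta, se)

# Implied RR for a 10 percentage point higher proportion, under the 0 to 1 scaling.
rr_per_10_points = math.exp(0.10 * beta)

print(round(beta, 3), round(se, 3))
print(round(rr, 2), round(lower, 2), round(upper, 2), round(rr_per_10_points, 3))
```

Under these assumptions, a large RR for a proportion covariate corresponds to a much smaller change in risk for a realistic difference between districts.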

DISCUSSION

The relationships described by the standard ecological model show some exaggerated and spurious associations, which are counterintuitive to known socioeconomic differentials for lung cancer.17 The results show how non-significant effects for non-car ownership, social class III manual, social class IV and V, and economic inactivity in the standard ecological analysis became significant in the stratified ecological analysis. The large positive effects for the social class III non-manual and unemployment groups were also considerably reduced in the stratified analysis. Although there is no clear explanation for these large effects in the standard model, we suggest that they may be confounded with the age and sex effects, which were separated out in the stratified model. The stratified ecological approach also enabled a sensitivity analysis by sex to compare results with the findings of Kogevinas17 for earlier years. Kogevinas studied cancer incidence (as well as survival) using the Office for National Statistics Longitudinal Study for the years 1971–83. He showed marked socioeconomic differentials in lung cancer incidence in men, with higher incidence in the more socioeconomically disadvantaged groups, for housing tenure (standardised incidence ratios (SIR) 138 for council tenants, 116 for those privately renting and 75 for owner occupiers), social class (SIR 48, 77, 86, 105, 116, 124 for classes I, II, IIIN, IIIM, IV and V, respectively) and employment (SIR 150 for unemployed, 102 retired, and 96 for employed). No results were presented for the other socioeconomic variables considered in our study. For women, the differentials were less marked but significant for housing tenure (SIR 122 for council tenants, 111 for those privately renting and 83 for owner occupiers) and non-significant for manual compared with non-manual jobs; economic position was not reported. Although these results were not for the same time period as our study, they do show the trends in socioeconomic disadvantage that we might expect in our study and that have been shown for other disease outcomes.18–21 The results of the sensitivity analyses in particular are broadly supported by these findings.

The datasets used in this study were generated from population registers, where completeness of ascertainment can be an issue. Coverage of the national population in the 1991 census is estimated to be 98%. Although people in less advantaged socioeconomic groups tend not to answer inquiries such as this, and therefore prevalence estimates may be underestimated, their omission would probably not have greatly affected the results, as the focus here was on associations between variables, and the findings showed trends in the socioeconomically disadvantaged that were consistent with previous work. The discrepancies between the two census samples shown in Table 1, due to differences in denominator populations, were small and therefore were also likely to have had minimal effect on our findings. This was confirmed in a sensitivity analysis of social class, where it was calculated using SAR data for both the head of the household’s social class and the person’s own social class, and the findings remained robust. Cancer registries in the UK provide the best source of population data for the study of specific cancers. Incompleteness of cancer registry data, when apparent, is generally due to a breakdown in reporting procedures and not to individual patient attributes. In general, ascertainment is high, and there is no tendency for inaccurate registration to occur in specific regions.17

In this analysis, the covariate information was limited to the variables available in the census SAS and SAR. For example, no data were available on smoking, a known risk factor for lung cancer. This highlights the advantage of the stratified ecological method in incorporating individual-level data from other large-scale surveys in which information on smoking may have been collected. In this respect, it is likely that some of the socioeconomic variables here acted as a proxy for smoking, with those in the lower social classes, for example, being more likely to smoke than those from less disadvantaged groups.17 In this study, we only fitted a simple age and sex interaction term. It could be argued that this interaction was not needed, because an offset term was included in the model and most of these effects were non-significant. However, there were significant interaction effects for the two older age groups, indicating the necessity for a correction factor to adjust for variation in the socioeconomic variables by age and sex not accounted for by the offset. The inclusion of the interaction term also illustrates the potential of the method to fit more complex interactions with other risk factors if required.

Therefore, there are several advantages in using the stratified ecological model. Firstly, it provides a more detailed analysis that takes into account population differences in the age and sex structure of the area through the strata, which is not possible in a standard ecological model where only the overall disease rate ratio for each area is known. Secondly, it is able to incorporate individual-level survey information into an ecological analysis, which opens up the way for taking additional covariate information from large-scale surveys, such as those held on the Economic and Social Research Council’s Data Archive, which contain a district-level identifier. This type of analysis will typically only be feasible at the district level, as access to survey risk factor information at smaller geographical units will usually breach confidentiality. Smaller aggregate units are preferred whenever possible to further reduce bias.6,12 Thirdly, by leaving the illness rates expanded over the age and sex strata, interaction terms can be incorporated into the model between age and sex, and the other risk factors, giving a more flexible model. The socioeconomic variables can even be summarised into a “deprivation” score to facilitate interpretation of more complex interactions.15 It is important to note, however, that the model has the potential to reduce ecological bias but not to eradicate it totally, and therefore results should still be treated with caution as some ecological bias will remain. Associations suggested at the aggregate level can only be confirmed through large-scale epidemiological studies, such as cross-sectional surveys, case–control or cohort studies, conducted on individual people.

Several other methods for reducing bias in ecological studies have been proposed in the literature. In particular, Cleave et al22 reviewed four methods using examples of voting transitions between two different elections at the ward level, and advocated the aggregated compound multinomial model. Lancaster et al12 evaluated this method in comparison to two other potential methods, endorsing the stratified approach used in this study. They also reviewed the aggregated individual-level model, proposed by Prentice and Sheppard.23 This model is appealing, as it too can combine data from population disease registries with individual-level survey data. However, most examples found in the literature have been carried out on simulated data,24,25 and where empirical results have been obtained they have been shown to have convergence problems.12 Tranmer and Steel26 presented methods using SAR data to provide adjusted correlation coefficients at the aggregate level, with adjustments made using individual-level variables that explained much of the within-area homogeneity. A Bayesian hierarchical modelling approach has also been implemented for modelling spatial dependence in disease rates in ecological regression.27 However, these are fairly complex procedures and are not routinely used by epidemiologists.

In conclusion, stratified ecological analysis incorporating individual-level covariate information reduced the bias seen in a standard ecological analysis. It is straightforward to apply and allows the linkage of health data with data from any large-scale complex survey, where district of residence is known. Further empirical examples are needed to verify its potential in ecological regression.

What is already known

It is well known that standard ecological regression is prone to ecological bias when results from this type of analysis are used to make inferences about the people living in geographically defined areas.

What this study adds

  • Stratified ecological regression is a method for reducing bias in ecological studies.

  • It can link area-level health outcome data with individual-level information on risk factors from large-scale surveys that include an area-level identifier, it allows age and sex interaction terms to be fitted, it is straightforward to apply, and it reduced ecological bias in our example.

Policy implications

  • Valuable health service resources and interventions are targeted at people in most need, but identification of vulnerable groups is difficult.

  • Ecological analysis is a useful first step in identifying associations between disease outcomes and areas containing a higher percentage of socioeconomically disadvantaged people, but these analyses may misconstrue the relationships.

  • Methods for bias reduction including the one reported in this paper therefore make an important contribution to eliminating spurious associations and to identifying target groups within areas for further study.

Footnotes

  • Funding: This project was funded by the Economic and Social Research Council (ESRC) grant number RES-000-22-0143. Lung cancer data were provided by three regional cancer registries covering the Northern and Yorkshire region, the North West, and Merseyside and Cheshire. The SAS and SAR are Crown copyright and supplied by the Census Microdata Unit at the University of Manchester, with the support of the ESRC Joint Information Systems Committee.

  • Competing interests: None.
