Original articles
Analysis of Case-Cohort Designs

https://doi.org/10.1016/S0895-4356(99)00102-X

Abstract

The case-cohort design is most useful in analyzing time to failure in a large cohort in which failure is rare. Covariate information is collected from all failures and a representative sample of censored observations. Sampling is done without respect to time or disease status, and, therefore, the design is more flexible than a nested case-control design. Despite the efficiency of the methods, case-cohort designs are not often used because of perceived analytic complexity. In this article, we illustrate computation of a simple variance estimator and discuss model fitting techniques in SAS. Three different weighting methods are considered. Model fitting is demonstrated in an occupational exposure study of nickel refinery workers. The design is compared to a nested case-control design with respect to analysis and efficiency in a small simulation. In this example, case-cohort sampling from the full cohort was more efficient than using a comparable nested case-control design.

Introduction

Large cohort designs with few observed failures may require enormous resources to ascertain covariate values. Case-cohort designs can reduce data collection by efficiently sampling the censored (nondiseased) individuals. Unlike the nested case-control design, sampling is done a priori without regard to case status or time. While conceptually simple, the analysis of a case-cohort design is nontrivial and may be daunting if relying on the statistical literature for guidance. This article describes the techniques needed to fit such models and the software that can be used.

Analysis of case-cohort data resembles a standard Cox [1] regression approach with some modification. We assume that if data on the full cohort were available, then a standard Cox regression analysis would be used. Observed failures are typically more influential on the parameter estimates than censored observations. Accordingly, Prentice [2] proposed the case-cohort design to analyze cohort data efficiently when most observations are censored. Conceptually, a random sample of the cohort, or “subcohort,” is designated prospectively as the source of comparison observations for the observed failures. All failures are included whether they occur in the random sample or not, but censored observations are included only if in the subcohort.
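The sampling rule in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy cohort, the field names (`id`, `failed`), and the function name `case_cohort_sample` are hypothetical and do not come from the article.

```python
import random

def case_cohort_sample(cohort, subcohort_size, seed=0):
    """Pick a random subcohort without regard to outcome, then add all failures.

    `cohort` is a list of dicts with hypothetical keys "id" and "failed".
    Returns (subcohort ids, case ids, ids whose covariates would be collected).
    """
    rng = random.Random(seed)
    ids = [person["id"] for person in cohort]
    # The subcohort is drawn from the whole cohort, ignoring failure status.
    subcohort = set(rng.sample(ids, subcohort_size))
    # Every failure is included, whether or not it falls in the subcohort.
    cases = {person["id"] for person in cohort if person["failed"]}
    # Censored observations enter the analysis only through the subcohort.
    return subcohort, cases, subcohort | cases

# Toy cohort (illustrative only): 1000 people, 20 of whom fail.
cohort = [{"id": i, "failed": i % 50 == 0} for i in range(1000)]
subcohort, cases, analysis_ids = case_cohort_sample(cohort, subcohort_size=100)
```

Because the subcohort is drawn before outcomes are considered, the same `subcohort` set could be reused as the comparison group for a second failure-time outcome.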

The design appears to be very efficient because controls can be used in all risk sets for which they qualify. Furthermore, as the random sample subcohort is chosen without regard to outcome, several failure time outcomes can be analyzed with the same comparison group. Despite these advantages, the design is used infrequently in practice, with most investigators choosing nested case-control designs instead. For the period 1990–1998 MEDLINE shows 484 occurrences of the keywords “nested case-control” versus 55 occurrences of “case-cohort.” Some investigators may have been deterred by the difficult variance estimation and lack of software. Others may have been influenced by arguments that nested case-control designs may be more statistically efficient in some circumstances [3, 4]. Finally, others may be simply unaware of the design and its potential advantages.

This article describes methods of analysis including different weighting schemes used in estimation. We describe how a robust covariance matrix is computed to give standard errors for the parameter estimates. Details of how to fit the models in standard software are given. The nickel refinery dataset described in Breslow and Day [5] is used to illustrate both the case-cohort and nested case-control designs. This is an occupational cohort with staggered entry and fixed covariates. We perform a small simulation that considers different sampling fractions and shows how the estimated βs, standard errors, and efficiency vary with the analytic method.

Section snippets

Case-cohort design

The term “case-cohort” was coined by Prentice [2] to describe a design that is a cross between a cohort design and a case-control design, incorporating the best features of both. The design was actually proposed earlier by Miettinen [6] and called the “case-base” design, but Prentice extended the design to include failure time analysis. We describe the case-cohort design as if a prospective study were being conducted, although in many cases a retrospective study is actually performed. Consider a

Case-cohort analysis

Consider a proportional hazards model with no ties among the observed failure times. For individual $i$ at time $t$, let $z_i(t)$ be the covariate vector (possibly time-dependent), and let $Y_i(t)$ indicate whether person $i$ is at risk at time $t$. We assume a standard exponential form for the relative risk. If covariates are evaluated on everyone, a standard Cox model is used. If person $i$ fails at time $t_j$, then the contribution to the partial likelihood, assuming no tied failure times, is $\frac{Y_i(t_j)\,e^{z_i(t_j)\beta}}{\sum_{k=1}^{n} Y_k(t_j)\,e^{z_k(t_j)\beta}}$.
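As a rough numerical illustration of the full-cohort partial likelihood contribution described above, the following Python sketch evaluates the ratio for one failure. It assumes fixed (not time-dependent) covariates and represents the at-risk indicator through hypothetical `entry` and `exit` times; the data and function names are invented for the example and are not from the article.

```python
import math

def at_risk(subject, t):
    # At-risk indicator Y_k(t): 1 if the subject is under observation at time t.
    return 1 if subject["entry"] < t <= subject["exit"] else 0

def linear_predictor(z, beta):
    # Inner product z * beta for a fixed covariate vector z.
    return sum(z_j * b_j for z_j, b_j in zip(z, beta))

def contribution(i, t_j, subjects, beta):
    """Partial likelihood contribution of subject i failing at time t_j."""
    numerator = at_risk(subjects[i], t_j) * math.exp(
        linear_predictor(subjects[i]["z"], beta))
    denominator = sum(
        at_risk(s, t_j) * math.exp(linear_predictor(s["z"], beta))
        for s in subjects)
    return numerator / denominator

# Hypothetical data: subject 0 fails at t = 2; subject 2 enters only at t = 3.
subjects = [
    {"entry": 0.0, "exit": 2.0, "z": [1.0]},
    {"entry": 0.0, "exit": 5.0, "z": [0.0]},
    {"entry": 3.0, "exit": 6.0, "z": [1.0]},
]
value_null = contribution(0, 2.0, subjects, beta=[0.0])
value_log2 = contribution(0, 2.0, subjects, beta=[math.log(2)])
```

With `beta = [0.0]` the two subjects at risk at t = 2 are equally weighted, so the contribution is 1/2; with `beta = [log 2]` the exposed failure carries twice the weight of the unexposed subject, giving 2/3.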

Robust estimation of the variance

The score contributions from the pseudolikelihood maximization are not independent owing to the method of sampling [2]. Intuitively, the correlation arises because a case outside the subcohort suddenly appears at its own failure time and was not previously included in the earlier failure times. Consequently, martingale theory cannot be directly applied and more complicated asymptotics are required [2, 7]. Prentice [2] proposed a variance estimator that corrects for this correlation among score

Design considerations

In the introduction the case-cohort design was described for a prospective study with everyone entering at time 0. The design is also applicable to open cohorts with staggered entry. In this case it is necessary to have each new member of the cohort have probability α of being a member of the subcohort. The exact variance estimate of Prentice [2] is difficult to compute in open cohorts, so the robust variance estimate is preferred.
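One way to meet the requirement that each new cohort member has probability α of joining the subcohort is an independent Bernoulli draw at entry. The sketch below assumes that scheme; the function name `assign_subcohort`, the worker identifiers, and the value α = 0.1 are hypothetical.

```python
import random

def assign_subcohort(entrants, alpha, seed=0):
    """Flag each entrant as a subcohort member with probability alpha.

    The draw is made at entry and does not depend on entry time or on
    any later outcome.
    """
    rng = random.Random(seed)
    return {person_id: rng.random() < alpha for person_id in entrants}

# Hypothetical open cohort of 2000 workers who enter over time.
entrants = [f"worker-{k}" for k in range(2000)]
membership = assign_subcohort(entrants, alpha=0.1)
fraction = sum(membership.values()) / len(membership)
```

Under this scheme the realized subcohort size is random rather than fixed, so `fraction` will be close to, but generally not exactly, α.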

In some cases, the subcohort may become small after many

Software for modeling case-cohort data

Some recent changes in software make fitting case-cohort data much easier. Improvements in S-Plus and the SAS procedure PHREG now allow direct modeling with appropriate construction of the dataset. We have written an SAS macro that computes the weighted estimates and the robust covariance matrix. This macro is available on the internet through Statlib (http://lib.stat.cmu.edu/general/robphreg). The program allows any of the three weighting schemes discussed here.

To use the SAS macro, it is

Comparison to the nested case-control design

The nested case-control design is often used in the same setting in which a case-cohort design might be used. After all outcomes are determined, risk sets are formed at each failure time that enumerate all controls at risk at that same time point [12]. For a 1:m nested design, m controls are selected at random from those available at that time point. Within the risk set, sampling is without replacement, but individuals are selected with replacement across time points. Individuals are included only
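The risk-set sampling just described can be sketched as follows. This is a simplified illustration that assumes no tied failure times; the field names (`id`, `entry`, `exit`, `failed`) and the toy data are hypothetical.

```python
import random

def nested_case_control(subjects, m, seed=0):
    """Draw up to m controls per failure from those at risk at that time."""
    rng = random.Random(seed)
    cases = sorted((s for s in subjects if s["failed"]), key=lambda s: s["exit"])
    sampled_sets = []
    for case in cases:
        t = case["exit"]
        # Risk set: everyone else under observation at the failure time,
        # including people who fail later.
        eligible = [s["id"] for s in subjects
                    if s["id"] != case["id"] and s["entry"] < t <= s["exit"]]
        # Without replacement within a risk set; a fresh draw at each
        # failure time means one person can be a control more than once.
        controls = rng.sample(eligible, min(m, len(eligible)))
        sampled_sets.append({"time": t, "case": case["id"], "controls": controls})
    return sampled_sets

# Toy data: ten subjects entering at time 0; subjects 2 and 5 fail.
subjects = [{"id": k, "entry": 0.0, "exit": float(k + 1), "failed": k in (2, 5)}
            for k in range(10)]
sets = nested_case_control(subjects, m=2)
```

Unlike a case-cohort subcohort, these control sets are tied to specific failure times and cannot be formed until the outcomes are known.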

Example

As an example, consider the Welsh nickel refinery workers and subsequent development of nasal cancer. Breslow and Day [5, p. 223] show an analysis that uses “years since first employed” as a time axis. We consider the same model using the four continuous variables: (1) log(age at first employment - 10 years); (2) (year first employed - 1915)/10; (3) (year first employed - 1915)^2/100; and (4) log(exposure + 1). The full cohort results are shown at the bottom of Table 5.10 in Breslow and Day [5].
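For concreteness, the four transformed covariates listed above can be computed as in the sketch below. It assumes natural logarithms and reads the third variable as the square of (year first employed - 1915) divided by 100; the function name and the example worker are hypothetical and are not taken from the dataset.

```python
import math

def nickel_covariates(age_first_employed, year_first_employed, exposure):
    """Return the four continuous covariates for one worker."""
    return (
        math.log(age_first_employed - 10),          # (1) log(age - 10)
        (year_first_employed - 1915) / 10,          # (2) centered, scaled year
        (year_first_employed - 1915) ** 2 / 100,    # (3) squared year term
        math.log(exposure + 1),                     # (4) log(exposure + 1)
    )

# Hypothetical worker: first employed at age 20 in 1925 with zero exposure.
row = nickel_covariates(age_first_employed=20.0,
                        year_first_employed=1925.0,
                        exposure=0.0)
```

Adding 1 before taking the logarithm keeps the exposure term defined for workers with zero recorded exposure.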

Simulation results

Suppose that rather than obtain covariate information for the entire nickel refinery cohort, sampling of the cohort was performed. We conducted a small-scale simulation comparing the sampling methods under different sampling fractions. For each sampling scheme, 200 samples were drawn from the full cohort and an analysis with four continuous covariates was performed. Our interest centers on the parameter estimate of log exposure, its standard error, and the associated hypothesis test.

For the

Discussion

The limited simulation suggests that the case-cohort design may be more efficient in some applications than the nested case-control design and that the unweighted analysis may be preferable. The weighted analysis may be appealing intuitively, but it could be biased away from the null hypothesis.

Langholz and Thomas [3, 4] reported that the efficiency of the case-cohort design compared with the nested case-control design was less than expected in standard survival analyses and could even be

Acknowledgements

This research was supported in part by National Cancer Institute grants CA61114 and CA63731 and the Centers for Disease Control RFP 200-95-0947. The SAS program is modeled on SAS Version 6.10 sample program PHR610EX.SAS. SAS procedure PHREG is described in the SAS Institute publication SAS/STAT Software: Changes and Enhancements through Release 6.11, 1996. The S-Plus function coxph is described in Splus4: Guide to Statistics, 1997, from Mathsoft Corporation in Seattle, WA. The Epicure program

References (25)

  • V.L. Ernster. Nested case-control studies. Prev Med (1994).
  • P.A. Van den Brandt et al. A large scale prospective cohort study on diet and cancer in The Netherlands. J Clin Epidemiol (1990).
  • D.R. Cox. Regression models and life tables. J R Stat Soc Series B (1972).
  • R.L. Prentice. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika (1986).
  • B. Langholz et al. Nested case-control and case-cohort methods of sampling from a cohort: A critical comparison. Am J Epidemiol (1990).
  • B. Langholz et al. Efficiency of cohort sampling designs: Some surprising results. Biometrics (1991).
  • N.E. Breslow et al. Statistical Methods in Cancer Research, Volume 2: The Design and Analysis of Cohort Studies (1987).
  • O.S. Miettinen. Design options in epidemiologic research: An update. Scand J Work Environ Health (1982).
  • S.G. Self et al. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat (1988).
  • W.E. Barlow. Robust variance estimation for the case-cohort design. Biometrics (1994).
  • D.Y. Lin et al. Cox regression with incomplete covariate measurements. J Am Stat Assoc (1993).
  • W.E. Barlow et al. Residuals for relative risk regression. Biometrika (1988).