Variable selection and Bayesian model averaging in case-control studies

Stat Med. 2001 Nov 15;20(21):3215-30. doi: 10.1002/sim.976.

Abstract

Covariate and confounder selection in case-control studies is often carried out using a statistical variable selection method, such as a two-step method or a stepwise method in logistic regression. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in the variable selection process, and so may underestimate uncertainty about relative risks. We report on a simulation study designed to be similar to actual case-control studies. This shows that p-values computed after variable selection can greatly overstate the strength of conclusions. For example, for our simulated case-control studies with 1000 subjects, of variables declared to be 'significant' with p-values between 0.01 and 0.05, only 49 per cent actually were risk factors when stepwise variable selection was used. We propose Bayesian model averaging as a formal way of taking account of model uncertainty in case-control studies. This yields an easily interpreted summary, the posterior probability that a variable is a risk factor, and our simulation study indicates this to be reasonably well calibrated in the situations simulated. The methods are applied and compared in the context of a case-control study of cervical cancer.
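The posterior probability that a variable is a risk factor can be illustrated with a small sketch. This is not the authors' implementation: it assumes a BIC approximation to each model's marginal likelihood, a uniform prior over all subsets of candidate covariates, and a simple prospective simulation rather than true case-control sampling. The function names (`fit_logistic_loglik`, `posterior_inclusion_probs`) and the simulated data are hypothetical and serve only to show how model weights are averaged into a per-variable probability.

```python
import itertools
import numpy as np

def fit_logistic_loglik(X, y, n_iter=50, tol=1e-8):
    """Return the maximized log-likelihood of a logistic regression,
    fitted by Newton-Raphson. X must already contain an intercept column."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu)
        # Small ridge term keeps the Hessian numerically invertible.
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(p)
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def posterior_inclusion_probs(X, y):
    """Approximate P(variable j is in the model | data) by enumerating all
    subsets and weighting each model by exp(-BIC / 2) (assumed approximation,
    uniform prior over models)."""
    n, k = X.shape
    intercept = np.ones((n, 1))
    models, bics = [], []
    for r in range(k + 1):
        for subset in itertools.combinations(range(k), r):
            design = np.hstack([intercept, X[:, list(subset)]])
            loglik = fit_logistic_loglik(design, y)
            bics.append(-2.0 * loglik + design.shape[1] * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))  # subtract min for stability
    weights /= weights.sum()
    pip = np.zeros(k)
    for subset, weight in zip(models, weights):
        for j in subset:
            pip[j] += weight
    return pip

# Hypothetical data: 1000 subjects, 4 candidate covariates,
# only the first one is a true risk factor.
rng = np.random.default_rng(0)
n_subjects = 1000
X = rng.normal(size=(n_subjects, 4))
true_beta = np.array([1.0, 0.0, 0.0, 0.0])
prob = 1.0 / (1.0 + np.exp(-(-0.5 + X @ true_beta)))
y = rng.binomial(1, prob).astype(float)

pip = posterior_inclusion_probs(X, y)
print(np.round(pip, 3))
```

Unlike a p-value computed after stepwise selection, each entry of `pip` sums the weight of every model containing that variable, so uncertainty about which model is correct is carried into the summary. Exhaustive enumeration is only feasible for a handful of covariates; larger problems need a search or sampling strategy over models.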

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Analysis of Variance
  • Bayes Theorem*
  • Biometry*
  • Case-Control Studies*
  • Female
  • Humans
  • Odds Ratio
  • Risk Factors
  • Uterine Cervical Neoplasms / epidemiology