Original Article
Post hoc choice of cut points introduced bias to diagnostic research

https://doi.org/10.1016/j.jclinepi.2005.11.025

Abstract

Background and Objective

To examine the extent of bias introduced to diagnostic test validity research by the use of post hoc, data-driven analysis to generate an optimal diagnostic cut point for each data set.

Methods

Analysis of simulated data sets of test results for diseased and nondiseased subjects, comparing data-driven with prespecified cut points for various sample sizes and disease prevalence levels.

Results

In studies of 100 subjects with 50% prevalence, a positive bias of five percentage points in sensitivity or specificity was found in 6 of 20 simulations. In studies of 250 subjects with 10% prevalence, a positive bias of the same size was observed in 4 of 20 simulations.

Conclusion

The use of data-driven cut points exaggerates test performance in many simulated data sets, and this bias probably affects many published diagnostic validity studies. Prespecified cut points, when available, would improve the validity of diagnostic test research in studies with fewer than 50 cases of disease.

Introduction

The principle of testing an a priori hypothesis is well established in research methodology as one of the required elements of a strong research design [1]. For example, in clinical trials it is good practice to prespecify the outcome measures, the degree of difference that is expected, and any subgroup analyses that will be undertaken. These methodologic standards avoid the pitfalls of a “data-driven” analysis in which the comparisons can be chosen to give the study results favored by the investigators; however, the conventional analysis of diagnostic research is just such a data-driven analysis.

The problem was eloquently described by Sackett: “Another threat to the validity of estimates of accuracy generated in phase III studies arises whenever the selection of the ‘upper limit of normal’ or cutoff point for the diagnostic test is under the control of the investigators. When the investigators are free to place the cutoff point wherever they wish, it is natural for them to place it where it maximises sensitivity, specificity, or the total number of patients correctly classified in that particular ‘training’ set of patients” [2].

Research into diagnostic methods is a field that lags behind research into interventions, with high-quality research approaches becoming formalized and adopted only in recent decades, assisted by such efforts as the STARD initiative [3], published in 2003. The basic design of diagnostic research is to apply the candidate test and a "gold standard" test to each of a group of subjects who might have disease, to generate values for the sensitivity and specificity of the new test. If the result of the candidate test is a continuous variable, then a cut point for defining abnormality must be determined. The usual method is to try all possible cut points and use the one that gives the greatest sum of sensitivity plus specificity. The post hoc nature of this analysis suggests that the test performance determined with a data-driven optimal cut point will be better than that determined with a previously derived cut point, if one is available from prior research. This problem is referred to in STARD item 9, although to my knowledge the size of such a bias has not been established.
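
The data-driven step is easy to state in code. The sketch below is a minimal illustration in Python rather than a description of any particular study's implementation; the function names and the assumption that higher test values indicate disease are mine.

import numpy as np

def sens_spec(values, diseased, cut_point):
    """Sensitivity and specificity when results >= cut_point are called positive."""
    positive = values >= cut_point
    sensitivity = np.mean(positive[diseased])       # true positives among the diseased
    specificity = np.mean(~positive[~diseased])     # true negatives among the nondiseased
    return sensitivity, specificity

def optimal_cut_point(values, diseased):
    """Try every observed value as a cut point; keep the one maximizing sensitivity + specificity."""
    return max(np.unique(values), key=lambda c: sum(sens_spec(values, diseased, c)))

Because the cut point is chosen to maximize performance in the same data that are then used to report performance, the reported sensitivity and specificity are, on average, optimistic.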

The problem of data-driven analysis leading to unreproducible findings is well described for multivariate predictive models [4], [5], along with data resampling methods such as bootstrapping, which go some way toward detecting the problem. These methods are of some use, but are rarely seen in the analysis of diagnostic validity studies. It has been suggested that it is advisable to perform one or more separate external validation studies in independent but clinically similar populations [6]; however, in the diagnostic research literature it is rare to find any "confirmatory" studies, with most authors choosing cut points based on their own current data as if it were the first study of the question.
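
As one illustration of the resampling approaches mentioned above, a Harrell-style bootstrap optimism correction could be applied to a data-driven cut point. The sketch below is my own illustration of that general technique, not a method reported in this article; it estimates how much the apparent sum of sensitivity and specificity is inflated by re-deriving the cut point in each bootstrap sample.

import numpy as np

def sens_plus_spec(values, diseased, cut):
    """Sum of sensitivity and specificity when results >= cut are called positive."""
    pos = values >= cut
    return np.mean(pos[diseased]) + np.mean(~pos[~diseased])

def best_cut(values, diseased):
    """Data-driven cut point maximizing sensitivity + specificity."""
    return max(np.unique(values), key=lambda c: sens_plus_spec(values, diseased, c))

def optimism_corrected(values, diseased, n_boot=200, seed=0):
    """Bootstrap optimism correction for the apparent sensitivity + specificity."""
    rng = np.random.default_rng(seed)
    apparent = sens_plus_spec(values, diseased, best_cut(values, diseased))
    n, optimism = len(values), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                   # resample subjects with replacement
        cut_b = best_cut(values[idx], diseased[idx])  # re-derive the cut point in the bootstrap sample
        optimism.append(sens_plus_spec(values[idx], diseased[idx], cut_b)
                        - sens_plus_spec(values, diseased, cut_b))
    return apparent, apparent - np.mean(optimism)     # apparent and optimism-corrected estimates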

The best method for choosing a diagnostic cut point for clinical or screening situations comes from decision analytic theory, and takes account not only of the discriminatory power of the test but also of the losses incurred through false positive and false negative results [7]; however, my work addresses the common situation in which the losses from false results have not been quantified.
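
For completeness, one standard decision-analytic formulation (drawn from the general ROC literature rather than from this article; the notation is mine) places the cut point at the operating point on the ROC curve where the slope equals

m = \frac{1 - P}{P} \times \frac{C_{FP} - C_{TN}}{C_{FN} - C_{TP}}

where P is the prevalence of disease and the C terms are the expected losses attached to false-positive, true-negative, false-negative, and true-positive classifications. When these losses have not been quantified, as in the situation addressed here, this machinery cannot be applied and investigators fall back on purely statistical criteria such as the maximum sum of sensitivity and specificity.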

I set out to examine the extent of bias introduced to diagnostic research by the use of post hoc data-driven analysis by comparing it to a prespecified analysis in simulated data sets.


Method

Sets of hypothetical test results for diseased and nondiseased populations were generated as normally distributed random numbers with known mean and standard deviation. The diseased set had a mean of 50 and the nondiseased set a mean of 30; both sets had a standard deviation of 10. Simulations were run first for 20 studies with a prevalence of 50% and a total of 100 subjects, and then for various combinations of sample size and prevalence. Calculations were performed in Microsoft Excel 97.
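
A sketch of this simulation in Python (rather than the Excel 97 workbook actually used; parameter and function names are mine) is given below, including the comparison with the prespecified cut point of 40, the midpoint of the two population means, used in the Results.

import numpy as np

def simulate_study(n_subjects=100, prevalence=0.5, rng=None):
    """One simulated validity study: normal test results, diseased mean 50, nondiseased mean 30, both SD 10."""
    rng = rng or np.random.default_rng()
    n_diseased = round(n_subjects * prevalence)
    diseased = np.array([True] * n_diseased + [False] * (n_subjects - n_diseased))
    values = np.where(diseased, rng.normal(50, 10, n_subjects), rng.normal(30, 10, n_subjects))
    return values, diseased

def sens_plus_spec(values, diseased, cut):
    """Sum of sensitivity and specificity when results >= cut are called positive."""
    pos = values >= cut
    return np.mean(pos[diseased]) + np.mean(~pos[~diseased])

def run_simulations(n_studies=20, n_subjects=100, prevalence=0.5, prespecified_cut=40.0, seed=1):
    """Difference between the data-driven and prespecified analyses for each simulated study."""
    rng = np.random.default_rng(seed)
    differences = []
    for _ in range(n_studies):
        values, diseased = simulate_study(n_subjects, prevalence, rng)
        optimal = max(np.unique(values), key=lambda c: sens_plus_spec(values, diseased, c))
        differences.append(sens_plus_spec(values, diseased, optimal)
                           - sens_plus_spec(values, diseased, prespecified_cut))
    return differences  # positive values are the apparent gain from the data-driven cut point

if __name__ == "__main__":
    diffs = run_simulations()
    print(f"Studies with an apparent gain of at least 0.05: {sum(d >= 0.05 for d in diffs)} of {len(diffs)}")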

Results

The results of simulations of studies with 100 subjects and 50% prevalence are shown in Table 1.

Each row of the table represents one simulated diagnostic test validity study, analyzed both with the cut point optimized for the maximum sum of sensitivity plus specificity (shown on the left-hand side of the table) and with the prespecified cut point of 40 (shown on the right-hand side of the table). The difference between these two analyses for each study is in the

Discussion

These simulations demonstrate that there is a small positive bias in favor of test accuracy caused by the use of data-driven analysis in many studies of this size. Where the smallest column total is 50 or less, a data-driven analysis should be viewed as subject to positive bias. This will occur in studies with a small sample size, or studies with a large sample size but low prevalence of disease. The positive bias is less in studies with bigger samples, because such studies have increased

Conclusion

The post hoc derivation of a diagnostic threshold can introduce a small bias into diagnostic test validity studies if the number of cases is smaller than about 50. This bias always works in the direction of increased test performance. It could be avoided by the use of cut points derived from previous work if good quality prior studies are available. The effects of this bias could be further explored in meta-analyses by searching for an association between the method of choosing a cut point and

Acknowledgments

Helpful comments on the ideas in this article were made by Dr. John Attia, Dr. Daniel Ewald, Prof. Wayne Smith, and Dr. Kate D'Este. Dr B. Ewald is a PhD scholar supported by the National Health and Medical Research Council.


Cited by (64)

  • Data-driven methods distort optimal cutoffs and accuracy estimates of depression screening tools: a simulation study using individual participant data

    2021, Journal of Clinical Epidemiology
    Citation excerpt:

    No studies attributed a divergent optimal cutoff to a small sample size or to data-driven cutoff selection methods (Appendix-eMethods1). We know of only four studies that have investigated the degree to which data-driven selection of cutoff may influence diagnostic accuracy estimates [15-18]. These studies each reported that data-driven cutoff selection produces overly optimistic estimates, particularly in small samples.

  • Cardiovascular biomarkers in the evaluation of patent ductus arteriosus in very preterm neonates: A cohort study

    2020, Early Human Development
    Citation excerpt:

    It is a strength that we were able to use pre-specified cut points for the evaluation of the biomarkers NT-proBNP and MR-proANP, as it has been shown that the use of data-driven cut points tends to exaggerate test performance [62]. Neonates that were diagnosed with a PDA were born at a lower gestational age compared to neonates with no PDA.
