Original Article

Post hoc choice of cut points introduced bias to diagnostic research
Introduction
The principle of testing an a priori hypothesis is well established in research methodology as one of the required elements of a strong research design [1]. For example, in clinical trials it is good practice to prespecify the outcome measures, the degree of difference that is expected, and any subgroup analyses that will be undertaken. These methodologic standards avoid the pitfalls of a “data-driven” analysis in which the comparisons can be chosen to give the study results favored by the investigators; however, the conventional analysis of diagnostic research is just such a data-driven analysis.
The problem was eloquently described by Sackett: “Another threat to the validity of estimates of accuracy generated in phase III studies arises whenever the selection of the ‘upper limit of normal’ or cutoff point for the diagnostic test is under the control of the investigators. When the investigators are free to place the cutoff point wherever they wish, it is natural for them to place it where it maximises sensitivity, specificity, or the total number of patients correctly classified in that particular ‘training’ set of patients” [2].
Research into diagnostic methods is a field that lags behind research into interventions, with high-quality research approaches becoming formalized and adopted only in recent decades, assisted by such efforts as the STARD initiative [3], published in 2003. The basic design of diagnostic research is to use the candidate test and a "gold standard" test on each of a group of subjects who might have disease to generate values for the sensitivity and specificity of the new test. If the result of the candidate test is a continuous variable, then a cut point for defining abnormality must be determined. The usual method is to try all possible cut points and use the one that gives the greatest sum of sensitivity plus specificity. The post hoc nature of this analysis suggests that the test performance determined with a data-driven optimal cut point will be better than that determined with a previously derived cut point, if one is available from prior research. This problem is referred to in STARD item 9, although to my knowledge the size of such a bias has not been established.
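The data-driven procedure described above — trying every possible cut point and keeping the one that maximizes the sum of sensitivity plus specificity — can be sketched as follows. This is a minimal illustration in Python with NumPy, not code from the original study; the function name, random seed, and sample sizes are illustrative assumptions.

```python
import numpy as np

def optimal_cut_point(diseased, nondiseased):
    """Try every observed value as a cut point and keep the one that
    maximizes sensitivity + specificity (values >= cut count as positive).
    Illustrative sketch; not the original study's implementation."""
    best_cut, best_sum = None, -1.0
    for c in np.concatenate([diseased, nondiseased]):
        sensitivity = np.mean(diseased >= c)    # true positive rate
        specificity = np.mean(nondiseased < c)  # true negative rate
        if sensitivity + specificity > best_sum:
            best_cut, best_sum = c, sensitivity + specificity
    return best_cut, best_sum

# Hypothetical data with the distributions used later in the paper:
# diseased mean 50, nondiseased mean 30, both with SD 10
rng = np.random.default_rng(0)
diseased = rng.normal(50, 10, 50)
nondiseased = rng.normal(30, 10, 50)
cut, acc = optimal_cut_point(diseased, nondiseased)
```

Because the search is free to exploit sampling noise in this particular "training" set, the resulting `acc` is an optimistic estimate of the test's true accuracy.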
The problem of data-driven analysis leading to unreproducible findings is well described for multivariate predictive models [4], [5], along with data-resampling methods, such as bootstrapping, that go some way toward detecting the problem. These methods are of some use but are rarely seen in the analysis of diagnostic validity studies. It has been suggested that it is advisable to perform one or more separate external validation studies in independent but clinically similar populations [6]; however, in the diagnostic research literature it is rare to find any "confirmatory" studies, with most authors choosing cut points based on their own current data as if theirs were the first study of the question.
The best method for choosing a diagnostic cut point for clinical or screening situations comes from decision-analytic theory, and takes into account not only the discriminatory power of the test but also the losses incurred through false positive and false negative results [7]; however, my work addresses the common situation in which the costs of false results have not been quantified.
I set out to examine the extent of bias introduced to diagnostic research by the use of post hoc data-driven analysis by comparing it to a prespecified analysis in simulated data sets.
Section snippets
Method
Sets of hypothetical test results for diseased and nondiseased populations were generated as normally distributed random numbers with known mean and standard deviation. The diseased set had a mean value of 50 and the normal set a mean value of 30; both sets had a standard deviation of 10. Simulations were run first for 20 studies with a prevalence of 50% and a total of 100 subjects, and then for various combinations of sample size and prevalence. Calculations were performed in Microsoft Excel 97
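The simulation design can be reproduced in outline as follows. This is a sketch of the stated design in Python with NumPy, not the author's Excel 97 implementation; the seed and function name are assumptions, while the distribution parameters, prespecified cut point of 40, and the 20-study run of 100 subjects at 50% prevalence follow the description above.

```python
import numpy as np

rng = np.random.default_rng(42)

def one_study(n_diseased, n_normal, prespecified_cut=40.0):
    """Simulate one diagnostic validity study; return the sum of
    sensitivity + specificity under (a) the data-driven optimal cut
    point and (b) the prespecified cut point."""
    d = rng.normal(50, 10, n_diseased)  # diseased: mean 50, SD 10
    h = rng.normal(30, 10, n_normal)    # normal: mean 30, SD 10
    candidate_cuts = np.concatenate([d, h])
    optimized = max(np.mean(d >= c) + np.mean(h < c) for c in candidate_cuts)
    fixed = np.mean(d >= prespecified_cut) + np.mean(h < prespecified_cut)
    return optimized, fixed

# First simulation: 20 studies, 100 subjects each, 50% prevalence
studies = [one_study(50, 50) for _ in range(20)]
mean_bias = float(np.mean([opt - fix for opt, fix in studies]))
```

The data-driven optimum can never fall below the accuracy at the prespecified cut point in the same sample, so the per-study difference is nonnegative and its average estimates the bias.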
Results
The results of simulations of studies with 100 subjects and 50% prevalence are shown in Table 1.
Each row of the table represents one simulated diagnostic test validity study, for which the result was determined by optimizing the cut point for the maximum sum of sensitivity plus specificity (shown on the left-hand side of the table) and against the prespecified cut point of 40 (shown on the right-hand side of the table). The difference between these two analyses for each study is in the
Discussion
These simulations demonstrate that there is a small positive bias in favor of test accuracy caused by the use of data-driven analysis in many studies of this size. Where the smallest column total is 50 or less, a data-driven analysis should be viewed as subject to positive bias. This will occur in studies with a small sample size, or in studies with a large sample size but a low prevalence of disease. The positive bias is less in studies with bigger samples, because such studies have increased
Conclusion
The post hoc derivation of a diagnostic threshold can introduce a small bias into diagnostic test validity studies if the number of cases is smaller than about 50. This bias always works in the direction of increased test performance. It could be avoided by the use of cut points derived from previous work, if good-quality prior studies are available. The effects of this bias could be further explored in meta-analyses by searching for an association between the method of choosing a cut point and
Acknowledgments
Helpful comments on the ideas in this article were made by Dr. John Attia, Dr. Daniel Ewald, Prof. Wayne Smith, and Dr. Kate D'Este. Dr B. Ewald is a PhD scholar supported by the National Health and Medical Research Council.
References (10)
- et al. Assessment of the accuracy of diagnostic tests: the cross-sectional study. J Clin Epidemiol (2003)
- et al. Optimal cut-points when screening for more than one disease state: an example from the Canadian Study of Health and Aging. J Clin Epidemiol (1996)
- et al. Principles and procedures of statistics
- et al. Evidence base of clinical diagnosis: the architecture of diagnostic research. BMJ (2002)
- The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem (2003)