Commentary
Everything You Always Wanted to Know About Evaluating Prediction Models (But Were Too Afraid to Ask)
A Nomogram is Not the Same Thing as a Prediction Model
Urology researchers commonly use the terms nomogram and prediction model interchangeably. For example, authors of papers about prediction models often describe their aim as "to develop a nomogram." But a nomogram is simply a graphic calculating device. The first nomogram one of us saw, for instance, was for calculating sample sizes for a clinical trial; other well-known nomograms include those used in weather forecasting and electrical engineering. Urology researchers typically analyze datasets using regression methods, such as logistic or Cox regression, and the resulting prediction model can then be presented in any number of formats; a nomogram is only one of them, alongside tables, risk groups, and online calculators.
Predictive Accuracy is Not the Same as Discrimination
It is routine to see investigators make claims such as "our model has a predictive accuracy of 70.2%." The 70.2% figure is typically either an area under the curve (AUC) or a concordance index (C-index). Both the AUC and the C-index estimate the probability that the model will correctly identify which of two individuals with different outcomes actually has the disease (eg, the AUC of a model to predict prostate cancer on biopsy) or had the event sooner (eg, the C-index of a model to predict time to recurrence after surgery). These are measures of discrimination only. They say nothing about calibration, that is, whether the predicted risks are close to the risks actually observed, and so cannot by themselves be equated with "predictive accuracy."
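As a concrete illustration of this pairwise-probability interpretation, here is a minimal Python sketch (ours, not from the paper; the risks and outcomes are invented) that computes the AUC by counting, over all case/control pairs, how often the case received the higher predicted risk.

```python
import itertools

def pairwise_auc(risks, outcomes):
    """Estimate the AUC as the proportion of case/control pairs in which
    the case has the higher predicted risk; ties count as half, by convention."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    concordant = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c, k in itertools.product(cases, controls)
    )
    return concordant / (len(cases) * len(controls))

# Invented predicted risks from a hypothetical biopsy model (1 = cancer found).
risks = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
outcomes = [1, 1, 0, 1, 0, 0]
print(pairwise_auc(risks, outcomes))  # 8 of 9 pairs concordant: about 0.89
```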
Discrimination Depends on How Much the Predictors Vary
The C-index or AUC depends critically on the variation of the predictors in the study cohort. As a simple illustration, imagine that two models for life expectancy were published, both of which included only age as a predictor, and both of which reported exactly the same odds ratio of 1.10 for a one-year increase in age. However, one model was tested on a group of men aged 50 to 60 years and the other on a group of men aged 40 to 70 years. The C-index would be less than 0.60 for the first group but considerably higher for the second, even though the underlying model is identical: the wider the spread of a predictor in a cohort, the easier discrimination becomes.
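This thought experiment is easy to reproduce in simulation. The sketch below (our own; the intercept, sample size, and seed are arbitrary choices) applies the same age-only logistic model, with the same odds ratio of 1.10 per year, to a narrow and a wide age range and computes the AUC each time.

```python
import math
import random

random.seed(0)

def simulate_auc(age_lo, age_hi, n=20000, or_per_year=1.10):
    """AUC of an age-only model in a cohort with ages uniform on [age_lo, age_hi]."""
    beta = math.log(or_per_year)                    # log-odds per year of age
    ages, deaths = [], []
    for _ in range(n):
        age = random.uniform(age_lo, age_hi)
        p = 1 / (1 + math.exp(-beta * (age - 60)))  # arbitrary intercept at age 60
        ages.append(age)
        deaths.append(1 if random.random() < p else 0)
    # AUC via the Mann-Whitney rank-sum identity
    order = sorted(range(n), key=lambda i: ages[i])
    rank_sum = sum(rank + 1 for rank, i in enumerate(order) if deaths[i] == 1)
    n1 = sum(deaths)
    return (rank_sum - n1 * (n1 + 1) / 2) / (n1 * (n - n1))

print(simulate_auc(50, 60))  # narrow cohort: AUC below 0.60
print(simulate_auc(40, 70))  # wide cohort: roughly 0.70, same model
```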
Internal Validation Often Does Not Help Much
The discrimination and calibration of a model are assessed by applying it to a dataset and comparing predictions with outcomes. Very commonly, the dataset to which the model is applied is the same one used to create the model in the first place. This is known as "internal validation," and it is somewhat problematic for two reasons. First, models tend to fit the data on which they were created better than average simply because of the play of chance. As a trivial example, if the recurrence rate happened, purely by chance, to be unusually high among patients with some particular characteristic, the model would incorporate that characteristic and fit the development dataset well, even though the association would not hold in new data. Second, internal validation says nothing about how the model will perform in a different population, which is usually the question of real interest.
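The chance component is easy to demonstrate. In the hypothetical sketch below (ours; all numbers are arbitrary), we generate 50 pure-noise "markers," select the one that best separates outcomes in the data at hand, and then check it on a fresh sample from the same population.

```python
import random

random.seed(1)

def auc(xs, ys):
    """AUC by direct pairwise comparison (ties count as half)."""
    cases = [x for x, y in zip(xs, ys) if y == 1]
    controls = [x for x, y in zip(xs, ys) if y == 0]
    hits = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return hits / (len(cases) * len(controls))

n, n_markers = 100, 50
outcomes = [random.randint(0, 1) for _ in range(n)]                  # coin flips
markers = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_markers)]

# "Internal validation": score the selected marker on its own data.
best = max(markers, key=lambda m: auc(m, outcomes))
print(auc(best, outcomes))        # well above 0.5 by chance alone

# Fresh patients: the marker is noise, so its new values are new noise.
new_outcomes = [random.randint(0, 1) for _ in range(n)]
new_values = [random.gauss(0, 1) for _ in range(n)]
print(auc(new_values, new_outcomes))  # back near 0.5
```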
The Concordance Index is Heavily Influenced by Length of Follow-Up
It is easier to predict what is going to happen tomorrow than in 5 years' time. This means that a C-index must be interpreted in the context of follow-up time. As a simple example, imagine that Brown and Smith had each published separate nomograms predicting life expectancy, with the C-index for Brown being somewhat higher. The typical interpretation would be that Brown's model is superior, but it might also be that Smith's follow-up was longer. To illustrate this point, we looked at the effect of follow-up time directly: the same model, applied to the same cohort, yields a higher C-index when follow-up is short than when it is long.
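One way to see why (a simulation of our own devising, not the authors' analysis) is to let a patient's risk drift after baseline, so that a baseline risk score grows stale: the same score then discriminates short-term outcomes far better than long-term ones.

```python
import math
import random

random.seed(2)

def auc(xs, ys):
    cases = [x for x, y in zip(xs, ys) if y == 1]
    controls = [x for x, y in zip(xs, ys) if y == 0]
    hits = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return hits / (len(cases) * len(controls))

n = 4000
score = [random.gauss(0, 1) for _ in range(n)]   # baseline risk score
death_year = []
for s in score:
    health, year = s, 0
    while year < 40:
        year += 1
        p = 1 / (1 + math.exp(-(health - 3)))    # yearly death probability
        if random.random() < p:
            break
        health += random.gauss(0, 0.5)           # risk drifts after baseline
    death_year.append(year)                      # 40 = still alive at the end

for horizon in (1, 5, 10):
    dead = [1 if y <= horizon else 0 for y in death_year]
    print(horizon, round(auc(score, dead), 3))   # AUC falls as the horizon grows
```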
Calibration is Not an Inherent Property of a Prediction Tool
Investigators commonly describe the results of their study as showing that their model was "well calibrated." But it is perfectly possible for a model to be well calibrated on one dataset and poorly calibrated on another. As a simple example, an investigator might develop a predictive tool for life expectancy and demonstrate that predicted risks are close to observed proportions in a variety of European and US datasets. However, the model might well have poor calibration when applied to a population with a substantially different baseline risk, such as a cohort from a developing country. Calibration, in other words, is a property of a model applied to a particular dataset, not of the model itself.
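The sketch below (hypothetical throughout; the model's intercept and coefficient are invented) makes the point numerically by applying one fixed, previously "published" logistic model to two simulated populations that differ only in baseline risk, then comparing mean predicted with mean observed risk (calibration-in-the-large).

```python
import math
import random

random.seed(3)

def predicted_risk(x):
    """A fixed, previously 'published' logistic model (invented coefficients)."""
    return 1 / (1 + math.exp(-(-1.5 + 1.0 * x)))

def mean_predicted_vs_observed(true_intercept, n=50000):
    xs = [random.gauss(0, 1) for _ in range(n)]
    preds = [predicted_risk(x) for x in xs]
    events = [1 if random.random() < 1 / (1 + math.exp(-(true_intercept + x))) else 0
              for x in xs]
    return sum(preds) / n, sum(events) / n

# Population like the development data: predicted and observed risks agree.
print(mean_predicted_vs_observed(true_intercept=-1.5))
# Population with a higher baseline risk: the model now under-predicts,
# even though its discrimination (driven by the coefficient on x) is unchanged.
print(mean_predicted_vs_observed(true_intercept=-0.5))
```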
A Good Model Can Be Clinically Useless, A Poor Model Very Valuable
Take the case of a model to predict the outcome of prostate cancer biopsy. The model might have good discrimination (eg, 0.85) and perfect calibration, yet have no clinical role. This could be because, for example, the predicted risks from the model ranged from 50% to 95%. Patients with prostate cancer tended to have higher predicted risks, which produced the good discrimination, but no patient was given a probability low enough that a urologist would forgo biopsy. To put it another way, the model never changes a clinical decision: every patient is biopsied regardless of the prediction. Conversely, a model with only modest discrimination can be clinically valuable if it identifies even a small group of patients whose risk is low enough to safely avoid biopsy.
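A trivial sketch (ours; the risk range is taken from the hypothetical above, and the thresholds are illustrative) shows how decision-irrelevant such a model is: at any biopsy threshold a urologist would plausibly use, no biopsies are avoided.

```python
import random

random.seed(4)

# Predicted risks from the hypothetical well-discriminating biopsy model:
# every value falls between 50% and 95%.
risks = [random.uniform(0.50, 0.95) for _ in range(1000)]

for threshold in (0.05, 0.10, 0.20, 0.30):
    avoided = sum(r < threshold for r in risks)
    print(f"threshold {threshold:.0%}: {avoided} of 1000 biopsies avoided")
# Every line prints 0: the model's advice is identical to "biopsy everyone."
```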
A Predictor That Adds Accuracy to a Prediction Model May Not Be Worth Measuring
Just as a model can be accurate but useless, a predictor can improve a model's accuracy without making any difference to clinical decision making. Again, decision analytic techniques are required to investigate the clinical implications of measuring the marker; a sketch of one such technique follows.
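One decision analytic approach is net benefit, in which true positives are credited and false positives are debited at an exchange rate set by the risk threshold; this is the quantity plotted in decision curve analysis. The sketch below (our own; the models, coefficients, and simulated cohort are all invented) compares a base model with an extended model that adds a weak marker: if the gain in net benefit is negligible at clinically relevant thresholds, the marker may not be worth measuring, whatever it does to the AUC.

```python
import math
import random

random.seed(5)

def net_benefit(preds, outcomes, threshold):
    """Net benefit = TP/n - FP/n * pt/(1 - pt), at risk threshold pt."""
    n = len(preds)
    tp = sum(1 for p, y in zip(preds, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(preds, outcomes) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Simulated cohort: the outcome depends on x and, weakly, on a marker m.
data = []
for _ in range(20000):
    x, m = random.gauss(0, 1), random.gauss(0, 1)
    y = 1 if random.random() < 1 / (1 + math.exp(-(-1 + x + 0.3 * m))) else 0
    base = 1 / (1 + math.exp(-(-1 + x)))                 # model without the marker
    extended = 1 / (1 + math.exp(-(-1 + x + 0.3 * m)))   # model with the marker
    data.append((base, extended, y))

outcomes = [y for _, _, y in data]
for t in (0.1, 0.2, 0.3):
    nb_base = net_benefit([b for b, _, _ in data], outcomes, t)
    nb_ext = net_benefit([e for _, e, _ in data], outcomes, t)
    print(t, round(nb_base, 4), round(nb_ext, 4))  # the gains here are tiny
```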
Just Because You Can Create a Predictive Model Does Not Mean That You Should
Figure 1 shows a nomogram to predict the probability that a patient is aged over 70 years on the basis of stage, grade, prostate-specific antigen, and treatment chosen (surgery vs radiotherapy). The model has high discrimination (AUC of 0.78) and good calibration (see Figure 2). In other words, the model is terrific in all ways other than that it is completely useless. So why did we create it? In short, because we could: we had a dataset and a statistical package, and adding the former to the latter produces a model. The question to ask before building a model is not whether one can be built, which is rarely in doubt, but whether it would answer a question of genuine clinical relevance.
Conclusion
In conclusion, papers on nomograms often involve esoteric statistics, such as restricted cubic splines or bootstrap resampling, in an effort to fine-tune models. But small differences in follow-up, or in the heterogeneity of a predictor, can produce very large differences in the C-index or AUC. Moreover, minor differences between cohorts can similarly lead to substantial differences in predictive accuracy between internal and external validation.
Researchers are often told to KISS: "keep it simple, stupid." The same advice applies to the evaluation of prediction models: before reaching for sophisticated statistics, ask the simple questions of how a model performs on independent data and whether using it would change, and improve, clinical decisions.
Supported in part by funds from David H. Koch provided through the Prostate Cancer Foundation, the Sidney Kimmel Center for Prostate and Urologic Cancers, and the P50-CA92629 SPORE grant from the National Cancer Institute to Dr. P. T. Scardino. From Memorial Sloan-Kettering Cancer Center, New York, NY.