Urology

Volume 76, Issue 6, December 2010, Pages 1298-1301

Commentary
Everything You Always Wanted to Know About Evaluating Prediction Models (But Were Too Afraid to Ask)

https://doi.org/10.1016/j.urology.2010.06.019

Section snippets

A Nomogram is Not the Same Thing as a Prediction Model

Urology researchers commonly use the terms nomogram and prediction model interchangeably. For example, authors of papers about prediction models often describe their aims as “to develop a nomogram.” But a nomogram is simply a graphic calculating device. Indeed, the first nomogram one of us saw was one for calculating sample sizes for a clinical trial; other well-known nomograms include those used in weather forecasting and electrical engineering. Urology researchers typically analyze datasets using

Predictive Accuracy is Not the Same as Discrimination

It is routine to see investigators make claims such as “our model has a predictive accuracy of 70.2%.” The 70.2% figure is typically either an area-under-the-curve (AUC) or a concordance index (C-index). Both the AUC and the C-index provide an estimate of the probability that the model will correctly identify which of two individuals with different outcomes actually has the disease (eg, AUC of a model to predict prostate cancer on biopsy) or had the event sooner (eg, C-index of a model to
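
This pairwise interpretation is easy to check by brute force. The sketch below uses entirely hypothetical biopsy data (not from the article) and computes the AUC directly from all case-control pairs, with scikit-learn's roc_auc_score as a cross-check; the two calculations agree.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical biopsy data: y = 1 if cancer was found, 0 if not,
# with a made-up predicted risk for each patient.
y = rng.integers(0, 2, size=200)
risk = np.clip(0.3 * y + rng.normal(0.4, 0.2, size=200), 0, 1)

# The AUC as a concordance probability: across all (case, control) pairs,
# how often does the case have the higher predicted risk? Ties count 1/2.
cases, controls = risk[y == 1], risk[y == 0]
diff = cases[:, None] - controls[None, :]
auc_from_pairs = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(auc_from_pairs, roc_auc_score(y, risk))  # the two numbers agree
```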

Discrimination Depends on How Much the Predictors Vary

The C-index or AUC depends critically on the variation of the predictors in the study cohort. As a simple illustration, imagine that 2 models for life expectancy were published, both of which only included age as a predictor. In addition, both models reported exactly the same odds ratio of 1.10 for a one-year increase in age. However, one model was tested on a group of men aged 50 to 60 years and the other on a group of men aged 40 to 70 years. The C-index would be less than 0.60 for the first
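
The point can be reproduced with a few lines of simulation. In the hypothetical sketch below, a binary endpoint is used as a simple stand-in for life expectancy; age carries exactly the same odds ratio of 1.10 per year in both cohorts, yet the cohort aged 50 to 60 years yields a markedly lower AUC than the cohort aged 40 to 70 years.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def auc_for_age_range(low, high, n=20000, log_or=np.log(1.10)):
    """AUC of age alone as a predictor of a binary event, with the same
    odds ratio (1.10 per year of age) applied in every cohort."""
    age = rng.uniform(low, high, size=n)
    p = 1 / (1 + np.exp(-log_or * (age - 60)))  # risk centred at age 60
    event = rng.binomial(1, p)
    return roc_auc_score(event, age)

print(auc_for_age_range(50, 60))  # narrow cohort: AUC below 0.60
print(auc_for_age_range(40, 70))  # wide cohort: same odds ratio, higher AUC
```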

Internal Validation Often Does Not Help Much

The discrimination and calibration of a model are assessed by applying it to a dataset and comparing predictions with outcome. Very commonly, the dataset to which the model is applied is the same as the one used to create the model in the first place. This is known as “internal validation,” and is somewhat problematic for two reasons. First, models tend to have above average fit to the data on which they were created simply because of the play of chance. As a trivial example, if the recurrence
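
The first problem, optimism from evaluating a model on the very data used to build it, is easy to demonstrate. In the hypothetical sketch below, a logistic model with many noise predictors is fit to a small development set; its apparent AUC on that same set is flattered relative to its AUC on fresh data drawn from an identical population.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def make_data(n, n_noise=20):
    # One weakly informative predictor plus many pure-noise predictors.
    X = rng.normal(size=(n, 1 + n_noise))
    p = 1 / (1 + np.exp(-0.5 * X[:, 0]))
    y = rng.binomial(1, p)
    return X, y

X_dev, y_dev = make_data(200)    # small development dataset
X_new, y_new = make_data(5000)   # fresh data from the same population

model = LogisticRegression(C=1e6, max_iter=1000).fit(X_dev, y_dev)

apparent = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
external = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
print(apparent, external)  # apparent AUC is flattered by chance overfitting
```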

The Concordance Index is Heavily Influenced by Length of Follow-Up

It is easier to predict what is going to happen tomorrow than in 5 years' time. This means that a C-index must be interpreted in the context of follow-up time. As a simple example, imagine that Brown and Smith had each published separate nomograms predicting life expectancy, with the C-index for Brown being somewhat higher. The typical interpretation would be that Brown's model is superior. But it might also be that Smith's follow-up was longer. To illustrate this point, we looked at the
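
One way to see the effect is a stylized simulation (ours, not the data examined in the article) in which a baseline marker strongly predicts early events but says little about later ones. Censoring the same cohort at a short versus a long follow-up changes Harrell's C-index substantially, even though the patients and the marker are unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

def c_index(time, event, risk):
    """Harrell's C: among usable pairs (the patient with the shorter observed
    time had an event), how often does the higher-risk patient fail first?"""
    usable = (event[:, None] == 1) & (time[:, None] < time[None, :])
    higher = risk[:, None] > risk[None, :]
    tied = risk[:, None] == risk[None, :]
    return (np.sum(usable & higher) + 0.5 * np.sum(usable & tied)) / np.sum(usable)

# Stylized cohort: a baseline marker that strongly predicts early deaths,
# while later deaths are largely unrelated to the baseline measurement.
n = 1000
marker = rng.normal(size=n)
dies_early = rng.binomial(1, 1 / (1 + np.exp(-(2 * marker - 2))))
true_time = np.where(dies_early, rng.uniform(0, 2, n), rng.uniform(2, 20, n))

for followup in (2.0, 20.0):                      # years of follow-up
    time = np.minimum(true_time, followup)
    event = (true_time <= followup).astype(int)   # administrative censoring
    print(followup, round(float(c_index(time, event, marker)), 2))
```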

Calibration is Not an Inherent Property of a Prediction Tool

Investigators commonly describe the results of their study as showing that their model was “well calibrated.” But it is perfectly possible for a model to be well calibrated on one dataset and poorly calibrated on another. As a simple example, an investigator might develop a predictive tool for life expectancy and demonstrate that predicted risks are close to observed proportions in a variety of European and US datasets. However, the model might well have poor calibration when applied to a
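
A simple way to examine calibration is to group patients by predicted risk and compare mean predicted risk with the observed proportion of events. The sketch below, with hypothetical coefficients and data, fits an age-based model in one population; it appears well calibrated on new data from that same population but systematically under-predicts risk in a population with a higher baseline risk.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def simulate(n, intercept):
    # Hypothetical mortality data: risk rises with age, but each population
    # also has its own baseline risk (the intercept) unknown to the model.
    age = rng.uniform(40, 80, n)
    p = 1 / (1 + np.exp(-(intercept + 0.08 * (age - 60))))
    return age.reshape(-1, 1), rng.binomial(1, p)

X_dev, y_dev = simulate(5000, intercept=-1.0)   # development population
model = LogisticRegression().fit(X_dev, y_dev)

def calibration_table(X, y, bins=5):
    pred = model.predict_proba(X)[:, 1]
    df = pd.DataFrame({"predicted": pred, "observed": y})
    df["bin"] = pd.qcut(df["predicted"], bins)
    return df.groupby("bin", observed=True)[["predicted", "observed"]].mean()

# Same population as development: predicted and observed risks track closely.
print(calibration_table(*simulate(5000, intercept=-1.0)))
# Higher-risk population: the model under-predicts in every risk group.
print(calibration_table(*simulate(5000, intercept=-0.2)))
```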

A Good Model Can Be Clinically Useless, A Poor Model Very Valuable

Take the case of a model to predict the outcome of prostate cancer biopsy. The model might have good discrimination (e.g., 0.85) and perfect calibration, yet might have no clinical role. This could be because, for example, the predicted risks from the model ranged from 50% to 95%. Patients with prostate cancer tended to have higher predicted risks—leading to good discrimination—but no patient was given a probability low enough that a urologist would forgo biopsy. To put it another way,

A Predictor That Adds Accuracy to a Prediction Model May Not Be Worth Measuring

Just as it is possible for a model to be accurate, but useless, it is possible for a predictor to improve accuracy, but for this to make no difference to clinical decision making. Again, decision analytic techniques are required to investigate the clinical implications of using the marker.
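
One widely used decision-analytic approach is decision curve analysis, whose key quantity, net benefit, counts true positives and subtracts false positives weighted by the odds of the risk threshold. In the hypothetical sketch below, a model with good discrimination but predicted risks that never fall below 50% yields exactly the same net benefit as simply biopsying everyone, so it would not change a single decision.

```python
import numpy as np

rng = np.random.default_rng(5)

def net_benefit(y, risk, threshold):
    """Net benefit of performing a biopsy when predicted risk >= threshold:
    true positives minus false positives, the false positives weighted by
    the odds of the risk threshold (the decision curve analysis formula)."""
    n = len(y)
    biopsy = risk >= threshold
    tp = np.sum(biopsy & (y == 1)) / n
    fp = np.sum(biopsy & (y == 0)) / n
    return tp - fp * threshold / (1 - threshold)

# Hypothetical biopsy cohort with 70% cancer prevalence, and a model with
# good discrimination whose predicted risks never fall below 50%.
y = rng.binomial(1, 0.7, size=1000)
risk = np.clip(0.5 + 0.35 * y + rng.normal(0, 0.1, size=1000), 0.5, 0.95)

for pt in (0.10, 0.20):                               # plausible thresholds
    print(pt,
          net_benefit(y, risk, pt),                   # biopsy if risk >= pt
          net_benefit(y, np.ones_like(risk), pt))     # biopsy everyone
# At these thresholds the model recommends biopsy for every patient,
# so its net benefit is identical to the "biopsy all" strategy.
```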

Just Because You Can Create a Predictive Model Does Not Mean That You Should

Figure 1 shows a nomogram to predict the probability that a patient is aged over 70 on the basis of stage, grade, prostate-specific antigen, and treatment chosen (surgery vs radiotherapy). The model has high discrimination (AUC of 0.78) and good calibration (see Fig. 2). In other words, the model is terrific in all ways other than that it is completely useless. So why did we create it? In short, because we could: we have a dataset, and a statistical package, and add the former to the latter,

Conclusion

In conclusion, papers on nomograms often involve esoteric statistics, such as restricted cubic splines or bootstrap resampling, in an effort to fine-tune models. But small differences in follow-up, or in the heterogeneity of a predictor, can result in very large differences in C-index or AUC. Moreover, minor differences between cohorts can similarly lead to substantial differences in predictive accuracy between internal and external validation.

Researchers are often told to KISS: “keep it

Supported in part by funds from David H. Koch provided through the Prostate Cancer Foundation, the Sidney Kimmel Center for Prostate and Urologic Cancers, and the P50-CA92629 SPORE grant from the National Cancer Institute to Dr. P. T. Scardino. From the Memorial Sloan-Kettering Cancer Center, New York, NY.
