Evaluating the variation in the strength of the effect across studies is a key feature of meta-analyses. This variability is reflected by measures like τ^{2} or I^{2}, but their clinical interpretation is not straightforward. A prediction interval is less complicated: it presents the expected range of true effects in similar studies. We aimed to show the advantages of having the prediction interval routinely reported in meta-analyses.

We show how the prediction interval can help understand the uncertainty about whether an intervention works or not. To evaluate the implications of using this interval to interpret the results, we selected the first meta-analysis per intervention review of the Cochrane Database of Systematic Reviews Issues 2009–2013 with a dichotomous (n=2009) or continuous (n=1254) outcome, and generated 95% prediction intervals for them.

In 72.4% of 479 statistically significant (random-effects p<0.05) meta-analyses in the Cochrane Database 2009–2013 with heterogeneity (I^{2}>0), the 95% prediction interval suggested that the intervention effect could be null or even be in the opposite direction. In 20.3% of those 479 meta-analyses, the prediction interval showed that the effect could be completely opposite to the point estimate of the meta-analysis. We demonstrate also how the prediction interval can be used to calculate the probability that a new trial will show a negative effect and to improve the calculations of the power of a new trial.

The prediction interval reflects the variation in treatment effects over different settings, including what effect is to be expected in future patients, such as the patients that a clinician is interested in treating. Prediction intervals should be routinely reported to allow more informative inferences in meta-analyses.

In many meta-analyses, there is large variation in the strength of the effect.

The prediction interval helps in the clinical interpretation of the heterogeneity by estimating what true treatment effects can be expected in future settings.

In case of heterogeneity, prediction intervals will show a wider range of expected treatment effects than CIs, and thus may lead to different conclusions. This occurred in over 70% of statistically significant meta-analyses with heterogeneity of the Cochrane Database of Systematic Reviews. Completely opposite effects were not excluded in over 20% of those meta-analyses.

Prediction intervals should be routinely reported to allow more informative inferences in meta-analyses.

Limitations are that the calculations and inferences for the prediction interval are based on the normality assumption, which is difficult to ensure. Further, the interval will be imprecise if the estimates of the summary effect and the between-study heterogeneity are imprecise, for example, if they are based on only a few, small studies. Inferences based on the prediction interval are only valid for settings that are similar (exchangeable) to those on which the meta-analysis is based.

Interventions may have heterogeneous effects across studies because of differences in study populations, interventions, follow-up length or other factors like publication bias. This heterogeneity is usually quantified by the between-study variance τ^{2} or the inconsistency measure I^{2}, but there is no straightforward clinical interpretation of τ^{2} or I^{2}. Reporting a prediction interval in addition to the summary estimate and CI will illustrate which range of true effects can be expected in future settings. We describe its merits and provide working examples to show how it can be calculated.

Between-study variation in the magnitude of treatment effects cannot be neglected. One of the main merits of a meta-analysis may even be that it reveals the variation of effects in different studies. However, the information provided by τ^{2} and I^{2} with respect to the variation in the effects is limited. The clinical interpretation of I^{2} is ambiguous: a high I^{2} does not necessarily imply that the study effects are dispersed over a wide range, and a low I^{2} might correspond to high dispersion, because I^{2} depends on the sample size of the included studies. With very large studies, even tiny between-study differences in effect size may result in a high I^{2}, while with small (imprecise) studies, very different treatment effects can yield an I^{2} of 0. Dispersion in treatment effects is better reflected by τ because τ is the SD of the between-study effects. One could, for example, estimate the ratio of the effect size over τ, which can convey how many times larger the treatment effect is compared with the SD of the effect across studies.
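The dependence of I^{2} on study precision can be illustrated with the common approximation I^{2} ≈ τ^{2}/(τ^{2}+s^{2}), where s^{2} is a "typical" within-study variance. The sketch below is illustrative only: the function name and all numbers are hypothetical and are not taken from any review discussed here.

```python
def i_squared(tau2: float, typical_within_var: float) -> float:
    """Approximate I^2 as the share of total variance that is between-study variance."""
    return tau2 / (tau2 + typical_within_var)


# Hypothetical scenarios (not data from any review).
# Very large studies: small between-study SD (tau = 0.05), very precise estimates (SE = 0.01).
large = i_squared(tau2=0.05 ** 2, typical_within_var=0.01 ** 2)
# Small studies: wide dispersion (tau = 0.40), but imprecise estimates (SE = 0.80).
small = i_squared(tau2=0.40 ** 2, typical_within_var=0.80 ** 2)

print(f"large studies: tau = 0.05, I^2 = {large:.0%}")
print(f"small studies: tau = 0.40, I^2 = {small:.0%}")
```

Under these assumed values, the setting with an eight times larger τ has the much lower I^{2}, which is why τ, and not I^{2}, describes the dispersion of effects.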

Not so often reported but much more insightful is the prediction interval, which has a more direct clinical interpretation than τ^{2} or I^{2}. A 95% prediction interval estimates where the true effects are to be expected for 95% of similar (exchangeable) studies that might be conducted in the future.

Some frequently used measures for heterogeneity

Measure | Advantages | Disadvantages |
---|---|---|
τ^{2} | The τ (the square root of τ^{2}) is the SD of the between-study effects. The τ^{2} … | A direct clinical interpretation based on τ^{2} is not straightforward. When the τ^{2} … |
I^{2} | I^{2} … | A direct clinical interpretation of I^{2} is not straightforward. With very large studies, even tiny between-study differences in effect size may result in a high I^{2}. With small (imprecise) studies, very different treatment effects can yield an I^{2} of 0. |
CI | The CI in a random-effects model contains highly probable values for the summary (mean) treatment effect. | The CI gives no information on the range of true treatment effects that are likely to be seen in other settings, for example, in the next study or in the patients a clinician wants to treat in her clinic. |
Prediction interval | The prediction interval in a random-effects model contains highly probable values for the true treatment effects in future settings, if those settings are similar to the settings in the meta-analysis. The values in the interval can be compared with clinically relevant thresholds to see whether they correspond to benefit, null effects or harm. The prediction interval can be used to estimate the probability that the treatment in a future setting will have a true-positive or true-negative effect, and to perform better power calculations. | Conclusions drawn from the prediction interval are based on the assumption that τ^{2} … The estimate of the prediction interval will be imprecise if the estimates of the summary effect and the τ^{2} are imprecise, for example, if they are based on only a few, small studies. |

A 2012 review on the use of topical steroids for treatment of chronic rhinosinusitis with nasal polyps, based on seven randomised studies, found a larger decrease in overall symptom scores in favour of steroids compared with placebo. The I^{2} is 73.9% (95% CI 44.2% to 87.8%), which can be considered substantial heterogeneity, and the estimated τ^{2} is 0.148. Notwithstanding these numbers, it is difficult to evaluate what the clinical consequences of this heterogeneity may be for future settings.

Forest plot of the standardised mean difference (SMD) in symptom scores in nasal polyps. Steroids versus placebo, analysis 1.1 in Cochrane Review CD006549.

In order to estimate the prediction interval for the SMD, we need the point estimate of the SMD, its SE and the estimated τ^{2}. We derive the SE from the 95% CI of the SMD (see the online supplementary material), which gives 0.227. We then calculate SE_{PI} as √(0.148+0.227^{2})=0.447, and the 95% prediction interval as SMD±2.45×SE_{PI}. The value 2.45 results from the t_{1−0.05/2,6} distribution. Prediction intervals with a different coverage could be calculated by using a different t-value, for example, t_{1−0.20/2,6} for an 80% prediction interval (see the online supplementary material).
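The calculation described above can be sketched in a few lines. The values τ^{2}=0.148, SE=0.227 and the t-value 2.45 come from the text; the pooled SMD of −0.51 is not quoted in the text and is an assumption here, inferred as the midpoint of the reported interval (−1.60 to 0.58). The function name is hypothetical, and the t-value for the 80% interval is the tabulated 0.90 quantile for 6 degrees of freedom.

```python
import math


def prediction_interval(estimate, se, tau2, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * sqrt(tau2 + se^2)."""
    se_pi = math.sqrt(tau2 + se ** 2)
    return estimate - t_crit * se_pi, estimate + t_crit * se_pi


# From the text: tau^2 = 0.148, SE = 0.227, t-value 2.45 (6 df).
# ASSUMPTION: SMD = -0.51 is inferred as the midpoint of the reported
# prediction interval; it is not quoted in the text.
SMD, SE, TAU2 = -0.51, 0.227, 0.148

pi95 = prediction_interval(SMD, SE, TAU2, t_crit=2.45)
# 1.440 is the tabulated 0.90 quantile of the t distribution with 6 df.
pi80 = prediction_interval(SMD, SE, TAU2, t_crit=1.440)

print("95%% prediction interval: %.2f to %.2f" % pi95)
print("80%% prediction interval: %.2f to %.2f" % pi80)
```

With these inputs the 95% interval reproduces the reported range of −1.60 to 0.58; the 80% interval is narrower because only the t-value changes.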

The resulting prediction interval, ranging from −1.60 to 0.58, can be interpreted as the 95% range of true SMDs to be expected in similar studies. We present it in

In order to investigate how often there is a discrepancy in conclusions based on prediction intervals and CIs, we evaluated this in statistically significant meta-analyses (p<0.05 by random-effects calculations) of the Cochrane Database of Systematic Reviews Issues 2009–2013, kindly provided by the UK Cochrane Editorial Unit. To avoid subjectivity in the selection, we used the first meta-analysis with a dichotomous or continuous outcome and based on at least two studies in the data and analyses section when these studies were also combined in the original review, as we wanted to reflect the status quo as precisely as possible. Details can be found in another paper. Of the statistically significant meta-analyses, 479 had an estimated I^{2}>0 and 441 had an estimated I^{2}=0.

We used the Hartung-Knapp/Sidik-Jonkman method for the random-effects meta-analyses. We estimated τ^{2} for all meta-analyses, even when the authors originally performed a fixed-effects analysis. Prediction intervals were calculated according to the online supplementary material. We stratified the statistically significant meta-analyses with heterogeneity (I^{2}>0) by number of studies (2–6 studies or >6) and heterogeneity (I^{2}<30%, 30% to 60% or >60%, based on the Cochrane Handbook, in which an I^{2} between 30% and 60% corresponds to moderate heterogeneity). For significant meta-analyses where the heterogeneity estimate was zero, we assessed the impact of possibly low but non-zero heterogeneity by assuming an I^{2} of 20%, calculating prediction intervals as described in the online supplementary material. Proportions were compared using the χ^{2} test. We used R software (R: A language and environment for statistical computing).
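One way to turn an assumed I^{2} into a between-study variance is sketched below. This is an assumption for illustration and not necessarily the procedure of the supplementary material: it uses the Higgins-Thompson "typical" within-study variance s^{2} and the relation τ^{2}=s^{2}×I^{2}/(1−I^{2}). The function names and the within-study variances are hypothetical.

```python
import math


def typical_within_variance(variances):
    """Higgins-Thompson 'typical' within-study variance from inverse-variance weights."""
    w = [1.0 / v for v in variances]
    k = len(w)
    return (k - 1) * sum(w) / (sum(w) ** 2 - sum(x * x for x in w))


def tau2_from_i2(i2, variances):
    """Between-study variance implied by an assumed I^2 (0 <= i2 < 1)."""
    return typical_within_variance(variances) * i2 / (1.0 - i2)


def random_effects_se(variances, tau2):
    """SE of the pooled estimate with weights 1 / (v_i + tau2)."""
    return math.sqrt(1.0 / sum(1.0 / (v + tau2) for v in variances))


# Hypothetical within-study variances for four studies (not real data).
variances = [0.04, 0.05, 0.06, 0.09]
tau2 = tau2_from_i2(0.20, variances)
se_pi = math.sqrt(tau2 + random_effects_se(variances, tau2) ** 2)

print(f"implied tau^2 = {tau2:.4f}, SE for the prediction interval = {se_pi:.3f}")
```

The resulting SE for the prediction interval is always larger than the SE of the summary estimate, so a CI that narrowly excludes the null can correspond to a prediction interval that includes it.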

Overall, 132 (27.6%) of the 479 statistically significant meta-analyses with an I^{2}>0 had both the 95% CI and the 95% prediction interval excluding the null effect (see the table below).

Proportion of statistically significant meta-analyses where both the 95% CIs and PIs excluded the null

Statistically significant meta-analyses | I^{2}=0* | I^{2}>0 | I^{2}>0 and <30% | I^{2} 30% to 60% | I^{2}>60% |
---|---|---|---|---|---|
N | 441 | 479 | 123 | 150 | 206 |
Both 95% CI and 95% PI excluded null, n (%) | 112 (25.4) | 132 (27.6) | 88 (71.5) | 39 (26.0) | 5 (2.4) |

*When the estimated heterogeneity I^{2} was equal to 0, I^{2}=20% was imputed for the calculation of the PI.

PI, prediction interval.

Of the 347 meta-analyses with a prediction interval that contained the null or opposite effect, 199 (57.3%) also had at least one study with an opposite effect. This happened more often in meta-analyses with more than six studies (181/235, 77.0%) than in those based on at most six studies (18/102, 17.6%). Especially in meta-analyses with few studies and substantial heterogeneity, the prediction interval was wider than the range of study outcomes. The opposite (ie, a smaller prediction interval) occurred in meta-analyses based on many studies and with low estimated heterogeneity. Results for meta-analyses with dichotomous and continuous outcomes were not notably different.

If the prediction interval just includes the null effect, this may be less worrying than when it contains the exact opposite effect of the pooled summary effect, for example, if it contains an OR of 0.5 when the meta-analysis summary estimate is an OR of 2. Of the 479 statistically significant meta-analyses with an I^{2}>0, 97 (20.3%) had a prediction interval that contained the opposite effect. This percentage was higher for the meta-analyses with a continuous outcome (65/219, 29.7%) than for those with a dichotomous outcome (32/260, 12.3%; p<0.001). It also occurred more frequently in meta-analyses with more than six primary studies (57/139, 41.0% and 30/178, 16.9% for meta-analyses with a continuous or dichotomous outcome, respectively) than in those based on at most six studies (8/80, 10.0% and 2/82, 2.4%; p<0.001 and p=0.001, respectively).
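The reported comparisons of proportions can be checked with a Pearson χ^{2} test on the quoted counts. The sketch below assumes a test with 1 degree of freedom and no continuity correction; whether a correction was applied in the original analysis is not stated, and the function name is hypothetical.

```python
import math


def chi2_two_proportions(x1, n1, x2, n2):
    """Pearson chi-square test (1 df, no continuity correction) for a 2x2 table."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    n = n1 + n2
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square with 1 df
    return stat, p


# Counts quoted in the text: prediction interval contains the opposite effect.
comparisons = {
    "continuous vs dichotomous": (65, 219, 32, 260),
    "continuous, >6 vs <=6 studies": (57, 139, 8, 80),
    "dichotomous, >6 vs <=6 studies": (30, 178, 2, 82),
}
results = {}
for label, counts in comparisons.items():
    results[label] = chi2_two_proportions(*counts)
    stat, p = results[label]
    print(f"{label}: chi2 = {stat:.2f}, p = {p:.4f}")
```

Under this assumption the first two comparisons give p<0.001 and the third gives p close to 0.001, in line with the values quoted in the text.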

A substantial part of meta-analyses have an estimated I^{2} of 0. However, there is typically very large uncertainty about the exact amount of heterogeneity, and this is demonstrated by very large 95% CIs for the values of I^{2}. The true I^{2} and τ are unlikely to ever be exactly 0, although low values are possible. To assess the impact of possibly low but non-zero heterogeneity among the 441 Cochrane meta-analyses with estimated I^{2}=0 and statistically significant results, we imputed an I^{2}=20% (suggestive of low between-study heterogeneity). Under this assumption, in 329 (74.6%) of these 441 meta-analyses the 95% prediction interval would span both sides of the null (see the table above).

In meta-analyses, a CI is inadequate for clinical decision-making because it only summarises the average effect for the average study. The prediction interval is more informative as it shows the range of possible effects in relation to harm and clinical benefit thresholds. While we have focused on the situation where the separating threshold is the null, a different threshold may be considered. For example, in the prediction interval framework, one can calculate the probability that an effect is larger than B, where B may be a clinically meaningful effect (if the treatment benefit is less than B, then it is felt not to be worth it). A narrow prediction interval that lies completely on the beneficial side of a clinically relevant threshold increases confidence in an intervention. A broad prediction interval may indicate the existence of settings where the treatment has a suboptimal and possibly even harmful effect. In more than 70% of statistically significant meta-analyses of the Cochrane Database with some estimated or assumed between-study heterogeneity, the prediction intervals crossed the no-effect threshold, indicating that there are settings where those treatments will have no effect or even an effect in the opposite direction. In 20.3% of those meta-analyses, the prediction interval even contained the opposite effect of the summary estimate, for example, an OR of 0.5 when the summary point estimate was an OR of 2. This occurred most frequently for meta-analyses with a continuous outcome, probably because heterogeneity can be more prominent in many topics where outcomes are assessed on continuous scales; higher heterogeneity for the continuous outcomes was also observed in the full set of 3263 meta-analyses.
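The probability that the true effect in a new setting lies beyond a threshold B can be sketched as follows, using the t distribution that underlies the prediction interval. The example reuses the nasal polyps values from the text (τ^{2}=0.148, SE=0.227, 6 degrees of freedom); the pooled SMD of −0.51 is an assumption, inferred as the midpoint of the reported prediction interval. The function names are hypothetical, and the t distribution is integrated numerically so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """CDF of Student's t via Simpson's rule on [0, |x|] (steps must be even)."""
    h = abs(x) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(x), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def prob_effect_above(threshold, estimate, se, tau2, df):
    """P(true effect in a new, exchangeable setting > threshold)."""
    se_pi = math.sqrt(tau2 + se ** 2)
    return 1.0 - t_cdf((threshold - estimate) / se_pi, df)


# Negative SMDs favour steroids here, so SMD > 0 means no benefit or harm.
# ASSUMPTION: SMD = -0.51 is inferred from the reported prediction interval.
p_no_benefit = prob_effect_above(0.0, estimate=-0.51, se=0.227, tau2=0.148, df=6)
print(f"P(true SMD > 0 in a new setting) = {p_no_benefit:.2f}")
```

Replacing the threshold 0 by a clinically relevant value B gives the probability that the effect in a new setting falls short of (or exceeds) that value.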

Graham and Moran

It is straightforward to calculate a prediction interval if we can assume that the effects are normally distributed and that τ^{2} is known and stable across studies. However, one should realise that the prediction interval is dependent on this assumption and on the precisions of the estimated τ^{2} and study effect, and will be imprecise if the number of studies in the meta-analysis is small. If the number of studies is large, estimates will be more precise and the normality of the distribution of τ^{2} can be empirically evaluated. A final caveat is that the uncertainty conveyed by the prediction interval pertains to the uncertainty about the extent to which future studies are similar (exchangeable) to those that have already been done, but this applies to all inferences from a meta-analysis. If the future studies evaluate patients and settings that are entirely different from what was evaluated in past studies, this exchangeability is questionable and uncertainty may be even more prominent than what the prediction interval conveys. In practical terms, if the patients treated by a physician are considered to be very different from the patients seen in all studies that have been done in the past, even the prediction interval cannot tell us what we might expect for these patients.

Meta-analysis results can also be used for power calculations for a new study. However, the expected true effect in a new study is not necessarily equal to the point estimate of the meta-analysis: it can be any of the values in the prediction interval. In case of heterogeneity, the probability of a statistically significant result in a new study may differ substantially from an apparent power of 80% based on the point estimate. The latter will be overly optimistic because the power function is asymmetric. If the true study effect is larger than the point estimate, the real probability of a significant study will be higher, up to a maximum of 100%, but if the effect is smaller, the probability may decrease substantially, even to 5% or less in case of a null effect. Consequently, the expected probability of a significant new study in case of heterogeneity will be lower than 80% (see the online supplementary material).
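The effect of heterogeneity on power can be sketched by averaging the power function over the distribution of plausible true effects. The sketch below makes several simplifying assumptions that are not from the text: a two-arm trial with equal arms, SE of the SMD approximated by √(2/n), a normal (rather than t) predictive distribution, and power counted only for significance in the direction of benefit. It reuses τ^{2}=0.148 and SE=0.227 from the nasal polyps example, with the SMD magnitude of 0.51 inferred from the reported prediction interval. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def power(true_effect, n_per_arm, alpha=0.05):
    """Power of a two-arm z-test on an SMD to show benefit (coded as positive)."""
    se = math.sqrt(2.0 / n_per_arm)
    return STD.cdf(true_effect / se - STD.inv_cdf(1 - alpha / 2))


def expected_power(mean, sd_pred, n_per_arm, steps=2000):
    """Average power over a normal predictive distribution (Simpson's rule, +/- 8 SD)."""
    pred = NormalDist(mean, sd_pred)
    lo, hi = mean - 8 * sd_pred, mean + 8 * sd_pred
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        theta = lo + i * h
        total += weight * power(theta, n_per_arm) * pred.pdf(theta)
    return total * h / 3


# Benefit is coded as positive here, so the inferred SMD of -0.51 becomes 0.51.
effect, se, tau2 = 0.51, 0.227, 0.148
z = STD.inv_cdf(0.975) + STD.inv_cdf(0.80)
n = math.ceil(2 * z ** 2 / effect ** 2)  # per-arm size for about 80% power at the point estimate
sd_pred = math.sqrt(tau2 + se ** 2)

print(f"n per arm = {n}")
print(f"power at the point estimate = {power(effect, n):.2f}")
print(f"expected power over the predictive distribution = {expected_power(effect, sd_pred, n):.2f}")
```

Under these assumptions the expected power falls clearly below the nominal 80%, because the power lost when the true effect is smaller than the point estimate outweighs the power gained when it is larger.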

In summary, the prediction interval reflects the variation in true treatment effects over different settings, including what effect is to be expected in future patients, such as the patients that a clinician is interested in treating. Therefore, it should be routinely reported in addition to the summary effect and its CI, and used as a main tool for interpreting evidence, to enable more informed clinical decision-making.

Contributors: JI originated the idea for this study together with JJG. JI drafted the manuscript and conducted the data analysis. All authors read and critically revised the manuscript for important intellectual content and approved the final manuscript.

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Competing interests: None declared.

Provenance and peer review: Not commissioned; externally peer reviewed.

Data sharing statement: Data sets are available on request from the corresponding author.
