GRADE Series - Guest Editors, Sharon Straus and Sasha Shepperd
GRADE guidelines: 4. Rating the quality of evidence—study limitations (risk of bias)

https://doi.org/10.1016/j.jclinepi.2010.07.017

Abstract

In the GRADE approach, randomized trials start as high-quality evidence and observational studies as low-quality evidence, but both can be rated down if most of the relevant evidence comes from studies that suffer from a high risk of bias. Well-established limitations of randomized trials include failure to conceal allocation, failure to blind, loss to follow-up, and failure to appropriately consider the intention-to-treat principle. More recently recognized limitations include stopping early for apparent benefit and selective reporting of outcomes according to the results. Key limitations of observational studies include use of inappropriate controls and failure to adequately adjust for prognostic imbalance. Risk of bias may vary across outcomes (e.g., loss to follow-up may be far less for all-cause mortality than for quality of life), a consideration that many systematic reviews ignore. In deciding whether to rate down for risk of bias—whether for randomized trials or observational studies—authors should not take an approach that averages across studies. Rather, for any individual outcome, when there are some studies with a high risk, and some with a low risk of bias, they should consider including only the studies with a lower risk of bias.

Introduction

Key points

  • In the GRADE approach, both randomized trials (which start as high-quality evidence) and observational studies (which start as low-quality evidence) can be rated down if most of the relevant evidence comes from studies that suffer from a high risk of bias.

  • Risk of bias can differ across outcomes when, for instance, each outcome is informed by a different subset of studies (e.g. mortality from some trials, quality of life from others).

  • Current systematic reviews are often limited in their usefulness for guidelines because they rate risk of bias by studies across outcomes rather than by outcome across studies.

In three previous articles in our series describing the GRADE system of rating the quality of evidence and grading the strength of recommendations, we have described the process of framing the question and introduced GRADE’s approach to rating the quality of evidence. This fourth article deals with one of the five categories of reasons for rating down the quality of evidence, study limitations (risk of bias).

Section snippets

Rating down quality for risk of bias

Both randomized controlled trials (RCTs) and observational studies may incur additional risk of misleading results if they are flawed in their design or conduct—what other publications refer to as problems with “validity” or “internal validity” and we label “study limitations” or “risk of bias.”

Study limitations in randomized trials

Readers can refer to many authoritative discussions of the study limitations that often afflict RCTs (Table 1). Two of these discussions are particularly consistent with GRADE’s conceptualization, which includes a focus on outcome specificity (i.e., the focus of risk of bias is not the individual study but rather the individual outcome, and quality can differ across outcomes in individual trials or in a series of trials) [1], [2]. We shall highlight three of the criteria in Table 1. The importance

Stopping early for benefit

Theoretical considerations [6], simulations [7], and empirical evidence [8] all suggest that trials stopped early for benefit overestimate treatment effects. The most recent empirical work suggests that, in the real world, formal stopping rules do not reduce this bias; that the bias is evident in trials stopped early with fewer than 500 events; and that, on average, the ratio of relative risks in trials stopped early vs. the best estimate of the truth (trials not stopped early) is 0.71 [9].
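As a rough illustration of what an average ratio of relative risks of 0.71 would imply, the sketch below multiplies a relative risk from trials not stopped early by that ratio. The "true" relative risks are hypothetical values chosen for illustration, the function name is ours, and the 0.71 figure from [9] is an average across trials, so this is not a correction to apply to any single trial.

```python
def expected_truncated_rr(true_rr: float, ratio_of_rrs: float = 0.71) -> float:
    """Relative risk a trial stopped early would show on average, given the
    relative risk seen in comparable trials that were not stopped early.

    ratio_of_rrs = RR(stopped early) / RR(not stopped early); 0.71 is the
    average reported in [9], used here purely as an illustration.
    """
    if true_rr <= 0 or ratio_of_rrs <= 0:
        raise ValueError("relative risks must be positive")
    return true_rr * ratio_of_rrs


# Hypothetical "true" relative risks, for illustration only.
for true_rr in (0.9, 0.8, 0.6):
    apparent = expected_truncated_rr(true_rr)
    print(
        f"RR {true_rr:.2f} in trials not stopped early -> "
        f"about {apparent:.2f} in trials stopped early "
        f"(relative risk reduction {1 - true_rr:.0%} vs. {1 - apparent:.0%})"
    )
```

Under this assumption, a modest relative risk reduction can appear considerably larger in a trial stopped early for benefit.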

Because in most

Selective outcome reporting

Critics use the label “selective outcome reporting” when authors or study sponsors selectively report positive outcomes and analyses within a trial. Recent evidence suggests that selective outcome reporting, which tends to produce overestimates of the intervention effects, may be widespread [4], [10], [11], [12], [13].

For example, a systematic review of the effects of testosterone on erection satisfaction in men with low testosterone identified four eligible trials [14]. The largest

Loss to follow-up

Historically, methodologists have sometimes suggested arbitrary thresholds for acceptable loss to follow-up (e.g., less than 20%). The significance of particular rates of loss to follow-up, however, varies widely and is dependent on the relation between loss to follow-up and number of events. For instance, loss to follow-up of 5% in both intervention and control groups would entail little threat of bias if event rates were 20% and 40% in intervention and control groups, respectively. If event
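One way to see why the same loss to follow-up matters more when events are rare is a worst-case calculation. The sketch below is ours rather than a method prescribed by the article. It assumes that the stated event rates apply to the participants who were followed, that every participant lost from the intervention group had the event, and that none of those lost from the control group did. The 20% and 40% rates come from the example above; the 1% and 2% rates are hypothetical.

```python
def observed_rr(rate_intervention: float, rate_control: float) -> float:
    """Relative risk among participants with complete follow-up."""
    return rate_intervention / rate_control


def worst_case_rr(rate_intervention: float, rate_control: float, loss: float) -> float:
    """Relative risk if all participants lost from the intervention group had
    the event and none of those lost from the control group did.

    Rates are event proportions among those followed; `loss` is the
    proportion lost to follow-up, assumed equal in both groups.
    """
    if not 0 <= loss < 1:
        raise ValueError("loss must be in [0, 1)")
    intervention = rate_intervention * (1 - loss) + loss
    control = rate_control * (1 - loss)
    return intervention / control


scenarios = [
    ("event rates 20% vs. 40% (from the text)", 0.20, 0.40),
    ("event rates 1% vs. 2% (hypothetical)", 0.01, 0.02),
]
for label, r_int, r_ctl in scenarios:
    print(
        f"{label}: observed RR {observed_rr(r_int, r_ctl):.2f}, "
        f"worst-case RR {worst_case_rr(r_int, r_ctl, loss=0.05):.2f}"
    )
```

With the higher event rates, the worst-case relative risk stays below 1, so the apparent benefit survives even this extreme assumption. With the hypothetical low event rates, the same 5% loss is enough to reverse the direction of the effect.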

Study limitations in observational studies

Systematic reviews of tools to assess the methodological quality of nonrandomized studies have identified more than 200 checklists and instruments [16], [17], [18], [19]. Table 2 summarizes key criteria for observational studies that reflect the contents of these checklists. Judgments associated with assessing study limitations in observational studies are often complex; here, we address two key issues that arise in assessing risk of bias.

Limitations of GRADE’s approach to assessing risk of bias in individual studies

GRADE’s approach to assessing risk of bias shares two fundamental limitations with the very large number of alternative approaches. First, empirical evidence supporting the criteria is limited—attempts to show systematic difference between studies that meet and do not meet specific criteria have shown inconsistent results. Second, the relative weight one should put on the criteria remains uncertain.

The GRADE approach is less comprehensive than many systems, emphasizing simplicity and parsimony

Summarizing study limitations must be outcome specific

Sources of bias may vary in importance across outcomes. Thus, within a single study, one may have higher quality evidence for one outcome than for another. For instance, RCTs of steroids for acute spinal cord injury measured both all-cause mortality and, based on a detailed physical examination, motor function [23], [24], [25]. Blinding of outcome assessors is irrelevant for mortality but crucial for motor function. Thus, as in this example, if the outcome assessors in the primary studies on

Summarizing risk of bias requires consideration of all relevant evidence

Every study addressing a particular outcome will differ, to some degree, in risk of bias. Review authors and guideline developers must make an overall judgment, considering all the evidence, whether quality of evidence for an outcome warrants rating down on the basis of study limitations.

Table 3 presents the structure of GRADE’s approach to study limitations in RCTs. The second column in Table 3 presents the approach as applied to individual studies; the remaining columns refer to the entire

Existing systematic reviews are often limited in summarizing study limitations across studies

To rate overall quality of evidence with respect to an outcome, review authors and guideline developers must summarize study limitations across all the evidence from multiple studies. For a guideline developer, using an existing systematic review would be the most efficient way to address this issue.

Unfortunately, systematic reviews usually do not address all important outcomes, typically focusing on benefit and neglecting harm. For instance, one is required to go to separate

What to do when there is only one RCT

Many people are uncomfortable designating a single RCT as high-quality evidence. Given the many instances in which the first positive report has not held up under subsequent investigation, this discomfort is warranted. On the other hand, automatically rating down quality when there is a single study is not appropriate. A single, very large, rigorously planned and conducted multicentre RCT may provide high-quality evidence. GRADE suggests especially careful scrutiny of all relevant issues (risk

Moving from Cochrane risk of bias tables in individual studies to rating quality of evidence across studies

Moving from six risk of bias criteria for each individual study to a judgment about rating down for quality of evidence for risk of bias across a group of studies addressing a particular outcome presents challenges. We suggest the following principles.

First, in deciding on the overall quality of evidence, one does not average across studies (for instance if some studies have no serious limitations, some serious limitations, and some very serious limitations, one does not automatically rate
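The idea of not averaging across studies, and of considering only the studies with a lower risk of bias, can be sketched as a simple sensitivity analysis for one outcome. Everything in the example is an assumption made for illustration: the four trials and their results are hypothetical, and the fixed-effect inverse-variance pooling is our choice of method, not one the GRADE approach specifies.

```python
import math

# Hypothetical trials addressing a single outcome: log relative risk,
# its standard error, and an overall risk-of-bias judgment per trial.
STUDIES = [
    {"name": "Trial A", "risk_of_bias": "low", "log_rr": math.log(0.90), "se": 0.10},
    {"name": "Trial B", "risk_of_bias": "low", "log_rr": math.log(0.95), "se": 0.12},
    {"name": "Trial C", "risk_of_bias": "high", "log_rr": math.log(0.60), "se": 0.15},
    {"name": "Trial D", "risk_of_bias": "high", "log_rr": math.log(0.55), "se": 0.20},
]


def pool_fixed_effect(studies):
    """Pooled relative risk using fixed-effect inverse-variance weights."""
    if not studies:
        raise ValueError("no studies to pool")
    weights = [1.0 / s["se"] ** 2 for s in studies]
    weighted_sum = sum(w * s["log_rr"] for w, s in zip(weights, studies))
    return math.exp(weighted_sum / sum(weights))


all_rr = pool_fixed_effect(STUDIES)
low_rr = pool_fixed_effect([s for s in STUDIES if s["risk_of_bias"] == "low"])

print(f"Pooled RR, all trials:            {all_rr:.2f}")
print(f"Pooled RR, low risk of bias only: {low_rr:.2f}")
```

In this made-up data set, the trials at high risk of bias show larger effects, so pooling everything yields a more optimistic estimate than the lower-risk trials alone. Comparing the two estimates is one way to judge whether the best estimate should rest on the lower-risk studies.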

Application of principles

In a systematic review of flavonoids to treat pain and bleeding associated with hemorrhoids [36], with respect to the primary outcome of persisting symptoms, most trials did not provide sufficient information to determine whether randomization was concealed, the majority violated the intention-to-treat principle and did not provide the data allowing the appropriate analysis (Table 5), and none used a validated symptom measure. On the other hand, most authors described their trials as double blind,

Recording judgments about study limitations

One great merit of GRADE is its lucid categorization of factors that decrease quality of evidence and the resultant transparency of judgments. This transparency, however, requires careful documentation of judgments. Including a risk of bias table that summarizes key criteria used to assess study limitations for each outcome for each study helps ensure transparency.
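A minimal sketch of how such judgments might be recorded per study and per outcome follows. The class, field names, and the "low" / "high" / "unclear" labels are our own illustrative choices, and the trial is hypothetical. The criteria listed are the randomized trial limitations named in the abstract.

```python
from dataclasses import dataclass

# Limitations of randomized trials named in the abstract.
CRITERIA = (
    "allocation_concealment",
    "blinding",
    "loss_to_follow_up",
    "intention_to_treat",
    "stopping_early_for_benefit",
    "selective_outcome_reporting",
)
JUDGMENTS = ("low", "high", "unclear")  # illustrative labels for risk of bias


@dataclass
class RiskOfBiasRow:
    """One row of a risk of bias table: one study, one outcome."""
    study: str
    outcome: str
    judgments: dict  # criterion name -> judgment label

    def __post_init__(self):
        for criterion, judgment in self.judgments.items():
            if criterion not in CRITERIA:
                raise ValueError(f"unknown criterion: {criterion}")
            if judgment not in JUDGMENTS:
                raise ValueError(f"unknown judgment: {judgment}")


def rows_for_outcome(rows, outcome):
    """Select the rows relevant to a single outcome across all studies."""
    return [row for row in rows if row.outcome == outcome]


# Hypothetical trial: unblinded outcome assessment matters little for
# all-cause mortality but is a serious concern for motor function.
rows = [
    RiskOfBiasRow("Trial A", "all-cause mortality", {"blinding": "low"}),
    RiskOfBiasRow("Trial A", "motor function", {"blinding": "high"}),
]
for row in rows_for_outcome(rows, "motor function"):
    print(row.study, row.outcome, row.judgments)
```

Keying each row on both study and outcome keeps the record outcome specific, so the same trial can carry different judgments for different outcomes.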

Table 5 presents an example of such a table. Note that the table focuses on only one outcome, symptoms. Each study will need only

References (38)

  • L. Wood et al. Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: meta-epidemiological study. BMJ (2008)
  • S.J. Pocock. When (not) to stop a clinical trial for benefit. JAMA (2005)
  • S.J. Pocock et al. Practical problems in interim analyses, with particular regard to estimation. Control Clin Trials (1989)
  • V.M. Montori et al. Randomized trials stopped early for benefit: a systematic review. JAMA (2005)
  • D. Bassler et al. Stopping randomized trials early for benefit and estimation of treatment effects: systematic review and meta-regression analysis. JAMA (2010)
  • T.A. Furukawa et al. Association between unreported outcomes and effect size estimates in Cochrane meta-analyses. JAMA (2007)
  • A.W. Chan et al. Identifying outcome reporting bias in randomised trials on PubMed: review of publications and survey of authors. BMJ (2005)
  • A.W. Chan et al. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA (2004)
  • A.W. Chan et al. Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ (2004)

The GRADE system has been developed by the GRADE Working Group. The named authors drafted and revised this article. A complete list of contributors to this series can be found on the Journal of Clinical Epidemiology Web site.
