Objectives Assess the frequency and reasons for disagreements in risk of bias assessments for randomised controlled trials (RCTs) included in more than one Cochrane review.
Design Research on research study, using cross-sectional design.
Data sources 2796 Cochrane reviews published between March 2011 and September 2014.
Data selection RCTs included in more than one review.
Data extraction Risk of bias assessment and support for judgement for five key risk of bias items.
Data synthesis For each item, we compared the risk of bias assessments made in each review and calculated the proportion of agreement. Two reviewers independently analysed 50% of all disagreements by comparing the support for each judgement with information from the study report to evaluate whether disagreements were related to a difference in information (eg, one review had contacted the study author) or a difference in interpretation (same support for judgement but different interpretation). They also identified the main reasons for different interpretations.
Results 1604 RCTs were included in more than one review. Proportion of agreement ranged from 57% (770/1348 trials) for incomplete outcome data to 81% for random sequence generation (1193/1466). Most common source of disagreement was difference in interpretation of the same information, ranging from 65% (88/136) for random sequence generation to 90% (56/62) for blinding of participants and personnel. Access to different information explained 32/136 (24%) disagreements for random sequence generation and 38/205 (19%) for allocation concealment. Disagreements related to difference in interpretation were frequently related to incomplete or unclear reporting in the study report (83% of disagreements related to different interpretation for random sequence generation).
Conclusions Risk of bias judgements of RCTs included in more than one Cochrane review differed substantially. Most disagreements were related to a difference in interpretation of an incomplete or unclear description in the study report. Clearer guidance on common causes of incomplete information may improve agreement.
- risk of bias
- systematic reviews
- interrater agreement
- public health
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
- Use of a very large and comprehensive collection of Cochrane reviews to assess agreement in risk of bias assessment and to understand the reasons for disagreement.
- Analysis of the full text of study reports to identify what information was available to review authors and how they used it when assessing risk of bias.
- Focus on disagreements only: a proportion of agreements may have happened ‘by chance’, for example, when review authors express the same risk of bias judgement while using different information or interpreting information differently.
- No evaluation of the potential impact of disagreements on the conclusions drawn at the review level.
Systematic reviews aim to synthesise all existing evidence for a research question by the use of a rigorous and reproducible methodology.1 Because reviews may be affected by bias at the level of individual studies,2 an assessment of the risk of bias in these studies is a crucial step in conducting a systematic review.3 4
Cochrane has developed a tool to provide a standardised approach to the assessment of the risk of bias in randomised controlled trials (RCTs).5 The risk of bias tool is based on specific characteristics related to study design and conduct, selected on theoretical grounds and on empirical evidence from meta-epidemiological studies that these characteristics are associated with differences in treatment effect estimates.6–11 The tool includes seven items (random sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data, selective reporting, other source of bias), which the researchers assess and judge as either ‘high’, ‘low’ or ‘unclear’ risk of bias.11 12
Although Cochrane provides detailed guidance on how to use the tool and recommends consensus between two independent reviewers,11 personal judgement is also involved, which may introduce variability. Several studies have evaluated the reproducibility of the risk of bias tool and have generally found it to be poor.12–19 However, there is some uncertainty about the main causes of disagreements. For example, some reviewers may search for additional information such as protocols or contact study authors, and this difference in available information, rather than a difference in judgement, may explain some of the disagreements.
In this study, we used a large collection of Cochrane reviews to evaluate the reproducibility of risk of bias assessments by identifying RCTs included in more than one Cochrane review and comparing the assessments. In addition, we examined the likely reasons for any disagreements. In particular, we evaluated whether disagreements were related to differences in information available to reviewers or differences in interpreting the same information and what could explain such different interpretation.
This is a research-on-research study on risk of bias assessment, which used a cross-sectional design. We identified RCTs included in more than one review within a large collection of Cochrane reviews. For key risk of bias items, we evaluated agreement between the different systematic reviews, analysed whether disagreements were related to a difference in the information available to reviewers or a difference in interpretation of the same information, and highlighted the main reasons for disagreements through an in-depth, one-by-one evaluation of disagreements.
We obtained data from the 2796 Cochrane reviews, which correspond to all the reviews available in the Cochrane library between March 2011 and September 2014, including updates (March 2011 corresponds to the last update of the risk of bias tool5). Data consisted of one XML file per review, each file containing all data entered by review authors in RevMan, the software used for managing Cochrane reviews.20 All individual XML files were merged in a single database by using R V.3.2.221 with the XML package.22 The vocabulary used for risk of bias items slightly varied across reviews (eg, some reviews could refer to ‘allocation concealment’ as ‘allocation masking’). For this reason, two authors independently evaluated all terms used and classified them according to the vocabulary of the tool. Disagreements were resolved by consensus. This standardisation was done for a previous publication.23
Selection of eligible reviews
We excluded withdrawn or ‘empty’ reviews (ie, systematic reviews not including any study) as well as reviews including observational or non-randomised studies and considered only reviews with an assessment of risk of bias for at least one item of the risk of bias tool.
Selection of eligible RCTs
To identify single RCTs included and assessed for risk of bias in more than one systematic review, we proceeded as follows. For each RCT, we identified the primary reference(s), which was the reference identified by review authors as the main reference(s) for an included study. Then, we used a matching algorithm24 to identify studies that shared the same primary reference. If several primary references were reported, we considered all of them. We manually checked that the studies sharing the same primary reference in the reviews corresponded to the same RCT.
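The matching step can be illustrated with a minimal sketch. The actual matching algorithm (reference 24) is not described here, so the code below substitutes a simple normalise-and-group approach; the function names, review identifiers and citation strings are all hypothetical. As in the study, any candidate match would still need a manual check that it corresponds to the same RCT.

```python
import re
from collections import defaultdict


def normalise_reference(ref):
    """Lower-case a citation string and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", ref.lower()).strip()


def group_by_primary_reference(studies):
    """Group reviews by shared primary reference.

    studies: iterable of (review_id, study_label, [primary references]).
    Returns {normalised reference: sorted review ids} for references that
    appear in more than one review. All primary references of a study are
    considered, mirroring the text above.
    """
    reviews_by_ref = defaultdict(set)
    for review_id, _study_label, references in studies:
        for ref in references:
            reviews_by_ref[normalise_reference(ref)].add(review_id)
    return {
        ref: sorted(reviews)
        for ref, reviews in reviews_by_ref.items()
        if len(reviews) > 1
    }


# Hypothetical toy data: two reviews cite the same trial with different punctuation.
studies = [
    ("CD000001", "Smith 2001", ["Smith J. Trial of drug A. Lancet 2001;1:1-10."]),
    ("CD000002", "Smith 2001a", ["Smith J. Trial of Drug A.  Lancet 2001; 1: 1-10"]),
    ("CD000003", "Jones 1999", ["Jones K. Trial of drug B. BMJ 1999;2:5-9."]),
]
candidates = group_by_primary_reference(studies)
```

Exact matching after normalisation is deliberately conservative: it misses citations that differ in wording, which is one reason a manual check remains necessary.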
Extraction of risk of bias assessment
For each eligible RCT, we extracted the risk of bias assessment and the corresponding support for judgement for each risk of bias item in each review. Whenever a single RCT was included in three or more reviews, we considered only the risk of bias assessment from two reviews chosen at random; this was decided because of workload and to facilitate direct comparison of two assessments and concerned less than 10% of our included RCTs. We focused on five risk of bias items: random sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment and incomplete outcome data. We did not consider selective reporting because it is difficult to evaluate in the absence of the study protocol, which is frequently lacking, especially for older studies.11 12 14 We also did not consider the item other bias because the definition is very wide (ie, ‘any important concerns about bias not covered in the other domains in the tool’11), so comparisons across reviews are difficult.
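The random choice of two reviews for RCTs included in three or more reviews can be sketched as follows. This is an illustration only: the function name, the seed and the review identifiers are hypothetical, and the original selection was not necessarily implemented this way.

```python
import random


def pick_two_reviews(assessments, rng):
    """Keep at most two assessments of one RCT for one risk of bias item.

    assessments: dict mapping review id -> judgement ('low', 'high', 'unclear').
    If the RCT was assessed in three or more reviews, two are drawn at random;
    otherwise the input is returned unchanged (as a new dict).
    """
    review_ids = sorted(assessments)
    if len(review_ids) > 2:
        review_ids = sorted(rng.sample(review_ids, 2))
    return {rid: assessments[rid] for rid in review_ids}


rng = random.Random(2014)  # arbitrary seed so the draw is reproducible
rct = {"CD000001": "low", "CD000002": "unclear", "CD000003": "low"}
chosen = pick_two_reviews(rct, rng)
```

A seeded generator makes the draw reproducible, but a different seed could select a different pair and so yield a different agreement result for that RCT.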
Comparison of risk of bias assessment between reviews
For each item, we compared the risk of bias assessment in terms of ‘high’, ‘low’ or ‘unclear’ risk of bias between the two reviews. According to the Cochrane handbook, the items blinding of outcome assessment and incomplete outcome data should be assessed for each outcome. Therefore, when the reviews reported an assessment of these items at the outcome level, we manually checked that outcomes were identical in both reviews and we retained for our analysis only the assessments that focused on the same outcomes. For blinding, we followed the last version of the Cochrane handbook and we retained only assessments of blinding of participants and personnel and blinding of outcome assessment as two independent items, excluding different types of assessment (ie, blinding as a single item, blinding of only participants or of only personnel).
We calculated the percentage agreement for each risk of bias item, as the proportion of studies with a concordant assessment in both reviews (eg, ‘low’ risk of bias AND ‘low’ risk of bias). Not all reviews assessed all five key risk of bias items for each RCT included; consequently, the number of RCTs evaluated for discrepancies varies depending on the item considered.
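A minimal sketch of the percentage agreement calculation, assuming each RCT contributes one pair of judgements per item and that an item not assessed in one of the reviews is recorded as `None` (a convention chosen here, not taken from the source). The toy data are hypothetical; the last two lines recompute the extreme values reported in the abstract.

```python
def percent_agreement(pairs):
    """Return (concordant, assessed, percentage) for one risk of bias item.

    pairs: list of (judgement_in_review_1, judgement_in_review_2).
    RCTs for which either review did not assess the item are dropped, so the
    denominator can differ between items.
    """
    assessed = [(a, b) for a, b in pairs if a is not None and b is not None]
    if not assessed:
        raise ValueError("no RCT was assessed for this item in both reviews")
    concordant = sum(1 for a, b in assessed if a == b)
    return concordant, len(assessed), 100 * concordant / len(assessed)


# Hypothetical toy data: 4 RCTs assessed in both reviews, 3 concordant.
toy = [("low", "low"), ("low", "unclear"), ("high", "high"),
       (None, "low"), ("unclear", "unclear")]
concordant, total, pct = percent_agreement(toy)  # -> (3, 4, 75.0)

# Recomputing the reported extremes from their numerators and denominators.
incomplete_outcome_data = round(100 * 770 / 1348)      # -> 57
random_sequence_generation = round(100 * 1193 / 1466)  # -> 81
```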
Selection of studies for in-depth analysis of disagreements
For workload reasons, we evaluated in depth the reasons for disagreements for 50% of the studies analysed in the previous step. When a given pair of Cochrane reviews shared more than one RCT, we selected only one RCT at random. To reach 50% of the total sample, we then used simple random selection from the remaining database.
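The two-step selection can be sketched as below. The description leaves some details open, so this is one possible reading: first keep one randomly chosen RCT per pair of reviews, then top up with a simple random draw from the RCTs not yet selected until the target fraction is reached. The function name, the seed and the toy identifiers are hypothetical.

```python
import random
from collections import defaultdict


def select_for_in_depth(rcts, fraction=0.5, seed=0):
    """Select RCTs for in-depth analysis.

    rcts: list of (rct_id, (review_a, review_b)) with unique rct_id values.
    Step 1 keeps one random RCT per review pair; step 2 adds randomly drawn
    RCTs from the remainder until `fraction` of the total is reached. If step 1
    already exceeds the target, nothing is removed.
    """
    rng = random.Random(seed)
    by_pair = defaultdict(list)
    for rct_id, pair in rcts:
        by_pair[frozenset(pair)].append(rct_id)

    # Step 1: one RCT per pair of reviews.
    selected = {rng.choice(ids) for ids in by_pair.values()}

    # Step 2: simple random selection among the remaining RCTs.
    target = round(fraction * len(rcts))
    remaining = [rct_id for rct_id, _pair in rcts if rct_id not in selected]
    shortfall = max(0, target - len(selected))
    selected.update(rng.sample(remaining, min(shortfall, len(remaining))))
    return selected


# Hypothetical toy data: 8 RCTs spread over 3 pairs of reviews.
rcts = [
    ("r1", ("A", "B")), ("r2", ("A", "B")), ("r3", ("A", "B")),
    ("r4", ("C", "D")),
    ("r5", ("E", "F")), ("r6", ("E", "F")), ("r7", ("E", "F")), ("r8", ("E", "F")),
]
selected = select_for_in_depth(rcts)
```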
Classification of disagreements
For the random selection, two reviewers (LB and AD) independently evaluated all disagreements in the risk of bias assessment in the two systematic reviews. They first scrutinised the support for the judgement in each review and evaluated whether it was the same or ‘conceptually’ the same in both reviews (eg, ‘randomised, probably done’; ‘randomised, probably not done’; ‘study only mentions randomisation, but does not specify how randomisation was performed; unclear’; ‘study states it is randomised; low risk’). If the support differed, they assessed any other information regarding the study as reported in both reviews, systematically searching and evaluating the full-text study report indicated in the primary reference. A formalised data extraction process for full texts was not used. Full texts were examined, looking primarily for correspondence between information reported by the reviewers in their support for judgement and the text.
They independently classified each case of disagreement as follows:
- Disagreement related to differences in interpretation:
  - The support for judgement was the same (or ‘conceptually’ the same) in both reviews, but the interpretation differed.
  - One review clearly confused one item of the risk of bias tool with a different one or the review authors misunderstood the definition of the item (eg, for random sequence generation, support for judgement reports ‘600 opaque envelopes, 1 was drawn every time’).
- Disagreement related to differences in information: the support for judgement cites information that is not available in the study report; additional sources are cited (eg, protocol) or the review authors reported that they had contacted the RCT author for additional data.
- Disagreement related to information missed by the review authors: the study report clearly describes the information, but some review authors seemed to have missed this information in the study report.
- Disagreement related to input mistakes: risk of bias assessment in terms of ‘high’/‘low’/‘unclear’ did not match the support for the judgement (eg, ‘randomisation described explicitly’, judgement ‘unclear’).
- Unclear: when it was not possible to classify the disagreement because the support for the judgement was empty or because we could not retrieve the full-text study report.
Any disagreements between reviewers were resolved by discussion to reach consensus. In the online supplementary appendix 1, we report a figure summarising how the in-depth analysis process was conducted.
Identification of main reasons for different interpretation
For each disagreement related to a difference in interpretation, we evaluated the probable reason for disagreement. For example, the interpretation could differ because of confusion with another risk of bias item (eg, random sequence generation and allocation concealment) or because the information was unclear or insufficiently detailed in the article. When we were unsure about the reason, we classified the reason as unclear. Two authors (LB and AD) conducted this process in duplicate by using all available information (ie, support for the judgement, characteristics of the study reported in the review, full-text article), with disagreements resolved by discussion.
Analysis was descriptive with use of frequencies and percentages for qualitative variables. Statistical analysis was conducted with Stata V.13.1.25 We decided to use simple per cent agreement because other statistical approaches were problematic. The kappa statistic requires a defined set of reviewers, which was not the case in our approach. The intraclass correlation coefficient is also unsuitable because it requires assessments to be on an ordinal scale, which does not apply here: there is no continuum between the assessments of low, unclear and high risk of bias.
Patients were not involved in any aspect of the study design, conduct or the development of the research question or outcome measures. This is a research-on-research study, and therefore, there was no active patient recruitment for data collection.
Figure 1 shows the selection process. From the 2796 systematic reviews published between March 2011 and September 2014, 2291 reviews included RCTs only and reported a risk of bias assessment. Of these, 797 included at least one RCT whose primary reference was shared with another review for which a risk of bias assessment was reported. These 797 reviews included 1604 single RCTs evaluated for the same risk of bias item in more than one review. The online supplementary appendix 2 reports the frequency of the different Cochrane groups among those reviews.
Among the 1604 selected RCTs: 1603 had duplicate evaluation for allocation concealment, 1466 for random sequence generation, 375 for blinding of participants and personnel, 583 for blinding of outcome assessment and 1348 for incomplete outcome data.
Evaluation of agreement and distribution of disagreements
The agreement of risk of bias judgements ranged from 57% (770/1348 trials) for incomplete outcome data to 81% (1193/1466 trials) for random sequence generation (figure 2). We identified most disagreements for ‘low’ and ‘unclear’ risk of bias judgements, especially for random sequence generation (231/273 trials, 85%). Disagreements between ‘low’ and ‘high’ risk of bias were generally rare, for example 8/273 of disagreements (3%) for random sequence generation, with the exception of incomplete outcome data for which they were more frequent (190/578, 33%). For blinding of participants and personnel, the most frequent disagreement was between ‘unclear’ and ‘high’ risk of bias (50/107, 47%), then ‘low’ versus ‘unclear’ (34/107, 32%), and ‘low’ versus ‘high’ (23/107, 21%) (figure 2).
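The breakdown of disagreements by type can be reproduced with a short tally. The sketch below rebuilds the pairs for blinding of participants and personnel from the counts reported above (50, 34 and 23 disagreements); the 268 concordant pairs are derived here as 375 duplicate evaluations minus 107 disagreements and serve only to show that agreements are excluded from the tally. The function name is hypothetical.

```python
from collections import Counter


def tally_disagreements(pairs):
    """Count discordant pairs by the unordered pair of judgements involved."""
    return Counter(tuple(sorted(pair)) for pair in pairs if pair[0] != pair[1])


# Pairs rebuilt from the reported counts for blinding of participants and personnel.
pairs = (
    [("unclear", "high")] * 50
    + [("low", "unclear")] * 34
    + [("high", "low")] * 23
    + [("low", "low")] * 268  # concordant pairs (375 - 107), ignored by the tally
)
counts = tally_disagreements(pairs)
total = sum(counts.values())  # 107 disagreements
shares = {kind: round(100 * n / total) for kind, n in counts.items()}
# shares == {("high", "unclear"): 47, ("low", "unclear"): 32, ("high", "low"): 21}
```

Sorting each pair makes the tally symmetric, so ‘low’ versus ‘high’ and ‘high’ versus ‘low’ are counted as the same type of disagreement.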
Classification of disagreements
The in-depth analysis of disagreements included 802 studies: 799 for allocation concealment, 747 for random sequence generation, 206 for blinding of participants and personnel, 297 for blinding of outcome assessment and 660 for incomplete outcome data. The agreement results of this sample and the distribution of disagreements are reported in the online supplementary appendix 3.
For all items, the most common source of disagreement was a difference in interpretation, with frequencies ranging from 88/136 (65%) for random sequence generation to 56/62 (90%) for blinding of participants and personnel (figure 3). The access to additional or different information accounted for disagreements in 32/136 (24%) trials for random sequence generation and 38/205 (19%) for allocation concealment. Access to additional information was less common for the remaining items, with proportions ranging from 2% to 4%. In 80% of the cases, the access to additional information was through the contact of the study author.
The other sources of disagreement were less common; input mistake ranged from 1% to 6%, missed information from 1% to 6%. We could not determine the source of disagreement in 5% of our disagreements. For this analysis, we accessed the full text of 216 different trials to help us in the process. The online supplementary appendix 4 reports some examples of disagreements in which the access to the study report helped us in the classification and the analysis of reasons of disagreement. We could not retrieve or access 19 full texts we deemed necessary for the categorisation of disagreements and this explains the majority of cases where we were unable to categorise the source of disagreement (‘unclear’ source in figure 3).
Main reasons of disagreements for different interpretation
The main reasons for a difference in interpretation for each item are reported in table 1. Additional examples are provided for each item for the high-low disagreements (online supplementary appendix 5). The most common reason across items was related to incomplete or unclear reporting in the RCT. For random sequence generation, disagreements in 73/88 (83%) trials were related to the lack of a precise description of the randomisation process: reviewers judged the bare statement ‘randomised’ in the text as ‘low’, ‘high’ or ‘unclear’ risk of bias. For allocation concealment, the most common reason for disagreement was a different interpretation of the description of the envelopes used to conceal allocation (17%, n=26/149 trials). For the two blinding items, many disagreements occurred when the article mentioned only ‘double blind’ without an additional description (16% of cases, n=9/56 trials for blinding of participants and personnel, 13%, n=9/70 for blinding of outcome assessment). For incomplete outcome data, reviewers assessed differently the statement from the study report of ‘no missing data’ or ‘all data reported’ (10%, 22/220 trials). Another common reason for a difference in interpretation was confusion with another item. Allocation concealment was confused with blinding (10%, n=15/149 trials) but also with random sequence generation (4%, n=6/149). For blinding of participants and personnel, the most common cause for disagreement concerned the interpretation of cases when blinding was not feasible (36%, n=20/56 trials), assessed at high risk by some reviewers and low by others. Another common cause of disagreement for the two blinding items related to the assessment of outcomes that should not be affected by blinding (eg, mortality); it explained 21% (n=12 trials) of disagreements for blinding of participants and personnel and 23% (n=16 trials) for blinding of outcome assessment, often low versus high disagreements.
For incomplete outcome data, the use of different cut-offs for the rate of missing data is the most common reason for disagreement (26%, n=57 trials); also common is considering the explanation of reasons for missing data enough to attribute a low risk of bias (13%, n=28 trials).
In this study, we took advantage of a very large sample of Cochrane reviews to explore the sources of disagreements in risk of bias assessment for trials included in several reviews. We decided to focus on Cochrane reviews because these reviews are produced within a single organisation, so we expected results and procedures to be more comparable. Authors compiling Cochrane reviews are members of the organisation and, in most cases, have undergone similar training for assessing risk of bias. Our results confirm that the agreement for risk of bias assessments is generally suboptimal, with better agreement for random sequence generation and allocation concealment and less agreement for incomplete outcome data. Access to different sources of information explained 24% of the disagreements in the assessment of risk of bias for random sequence generation and 19% for allocation concealment. However, the main source of disagreements was a difference in interpretation of the same information, which was frequently related to incomplete or unclear reporting in the study report.
Strengths and weaknesses
Our study goes beyond previous literature on the topic.3 12–18 26 As compared with most other studies,12–17 we used real-world data to explore agreement of risk of bias assessments in real scenarios. We evaluated a very large and comprehensive collection of Cochrane reviews that spanned multiple specialties and topics, including a number of trials about 10 times larger than the largest study on the topic.12 We completed our analysis by searching individual study reports to give support to our comments on reasons for disagreements, which, to our knowledge, has not been done in previous, smaller works that used a similar methodology.18 While doing this, we developed a suitable classification scheme for sources of disagreements and conducted, in duplicate, an extensive analysis to understand the risk of bias assessment process and explored the most common reasons for disagreements.
Our study has limitations. Whenever a single RCT was included in three reviews or more, we considered only the risk of bias assessment from two reviews chosen at random. Nevertheless, we cannot exclude that different combinations of two chosen evaluations could have produced slightly different results. Although the classification of disagreements was conducted in duplicate following a formalised process, there remains a component of personal judgement. We evaluated only disagreements, but a number of agreements might have occurred ‘by chance’. In our analysis of likely reasons for disagreements, some resulted from confusion between risk of bias items. Similar discrepancies might have occurred among agreements; indeed, previous literature on the topic demonstrated that reviewers do not accurately follow the risk of bias tool.27 We also did not assess the selective reporting item that is frequently judged on incomplete information. We did not evaluate whether disagreements varied depending on the Cochrane review group or year of publication. Finally, we did not evaluate the impact of disagreements and the extent to which the evidence base for making conclusions and providing summary statements of effectiveness may have been affected by changing the rating.
Comparison with other studies
Our findings confirm the importance of issues that were previously identified by Jørgensen et al3 and Savović et al.26 In particular, Savović et al,26 surveying users of the risk of bias tool, reported on the possibility of confusion between random sequence generation and allocation concealment and between allocation concealment and blinding; the uncertainty on how to address unfeasibility of blinding; and the difficulties in assessing incomplete outcome data especially regarding the acceptable rate of missing data. More recently, Jørgensen et al,3 evaluating comments on the use of the risk of bias tool, highlighted how authors complained that judgement often originates from incomplete or missing information.
A previous study identified 46 RCTs included in different systematic reviews in the field of fertility and evaluated the percentage agreement in risk of bias assessment. That analysis showed generally worse agreement than in our study, with percentage agreement ranging from 35% to 71%. Differences in sample size and the particular topic may explain these discrepancies. In addition, although the authors had compared supports for judgement between reviews, this evaluation may have been incomplete, because they did not evaluate the primary study reports.18
Our results confirm that the agreement in risk of bias assessment would be enhanced by more detailed guidance in use of the risk of bias tool with particular focus on common causes of disagreements. We showed that in many cases, the unclear reporting from source material allows reviewers ample space for personal judgement and differences in judgement.
The scientific community continues to stress the importance of improving the reporting of trials,28–31 which may limit disagreements when assessing risk of bias. In parallel, we could also work on restricting the space for personal interpretation when assessing risk of bias. One suggestion is to give clearer instructions on how to evaluate common cases, for example, when confronted with nothing more than the term ‘randomised’ or ‘double blind’ in the study report. Similarly, a threshold could be set for the acceptable proportion of missing data, and guidance could be given on which imputation methods are appropriate and in which situations.
To minimise research waste, it would be useful to have access to risk of bias assessments from other Cochrane groups and the supports they used, including information from authors or from protocols, to help reviewers in their assessments. This process would imply having a unique study identification number across reviews and a central shared repository for all studies included in any Cochrane review.
Following the suggestions based on the findings and comments provided by Jørgensen et al3 and Savović et al,26 Cochrane has been working on a new version of the risk of bias tool, which has recently been released.32 33 The new version has a different approach to the risk of bias assessment, guiding reviewers through the process with the use of ‘signalling questions’, which might leave less room for subjectivity. In addition, there is more guidance in assessing some items. For example, the new tool better clarifies some aspects of the randomisation process, especially about what to do in some cases of incomplete information (eg, randomisation list created by an external centre with no other indication). The new tool also has a different approach to the blinding aspect, oriented to the implications of the masking process. However, the new tool does not cover some of our concerns, especially those related to incomplete outcome data: quota for missing data that are considered acceptable, and whether reviewers should focus more on the reasons for the missing data or their magnitude. It also does not address the common case of authors reporting ‘no missing data’. Research-on-research studies are needed to evaluate whether this new version of the tool results in improved reproducibility.
This analysis of risk of bias assessment for more than 1600 trials included in more than one review showed that agreement remains suboptimal. Most disagreements come from a difference in interpretation of an incomplete or unclear description in the study report. In some cases, the difference in the assessment was due to some but not all review authors obtaining additional information, from a protocol or from contacting the study author.
We thank Camila Olarte Parra for her help during the data management phase and her comments on this manuscript. We thank David Tovey, editor in chief of the Cochrane Library, for agreeing to share data from Cochrane reviews; Javier Mayoral Campos, system administrator; the Cochrane Central Executive for preparing files; and all Cochrane reviewers who collected data. We also thank Laura Smales for English revision of the manuscript.
Contributors LB was involved in the study conception, selection of trials, data extraction, data analysis, interpretation of results and drafting the manuscript. PB was involved in the study conception, data analysis, interpretation of results and drafting the manuscript. IA was involved in the study conception, data extraction and drafting the manuscript. PR was involved in the study conception and drafting the manuscript. AD was involved in the study conception, selection of trials, data extraction, data analysis, interpretation of results and drafting the manuscript.
Funding This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no 676207.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Raw data and analyses are available on request from the authors.
Patient consent for publication Not required.