Objective Unbiased assessment of tumour response is crucial in randomised controlled trials (RCTs). Blinded independent central review is usually used to supplement or monitor local assessment but is costly. The aim of this study was to investigate whether systematic bias existed in RCTs by comparing the treatment effects of efficacy endpoints between central and local assessments.
Design Literature review, pooling analysis and correlation analysis.
Data sources PubMed, from 1 January 2010 to 30 June 2017.
Eligibility criteria for selecting studies Eligible articles are phase III RCTs comparing anticancer agents for advanced solid tumours. Additionally, the articles should report objective response rate (ORR), disease control rate (DCR), progression-free survival (PFS) or time to progression (TTP); the treatment effect of these endpoints, OR or HR, should be based on central and local assessments.
Results Of 76 included trials involving 45 688 patients, 17 (22%) trials reported their endpoints with statistically inconsistent inferences (p value lower/higher than the probability of type I error) between central and local assessments; among them, 9 (53%) trials had statistically significant inference based on central assessment. Pooling analysis presented no systematic bias when comparing treatment effects of both assessments (ORR: OR=1.02 (95% CI 0.97 to 1.07), p=0.42, I2=0%; DCR: OR=0.97 (95% CI 0.92 to 1.03), p=0.32, I2=0%; PFS: HR=1.01 (95% CI 0.99 to 1.02), p=0.32, I2=0%; TTP: HR=1.04 (95% CI 0.95 to 1.14), p=0.37, I2=0%), regardless of funding source, mask, region, tumour type, study design, number of enrolled patients, response assessment criteria, primary endpoint and trials with statistically consistent/inconsistent inferences. Correlation analysis also presented no sign of systematic bias between central and local assessments (ORR, DCR, PFS: r>0.90, p<0.01; TTP: r=0.90, p=0.29).
Conclusions No systematic bias could be found between local and central assessments in phase III RCTs on solid tumours. However, statistically inconsistent inferences could be made in many trials between both assessments.
- blind independent central review
- local assessment
- oncological randomized control trials
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
To our knowledge, this is the largest literature review and pooling analysis comparing treatment effects between blinded independent central review and local assessment in phase III randomised controlled trials on solid tumours.
We performed an exhaustive literature search to include all potential studies fulfilling the inclusion criteria.
We carefully extracted the data based on the independent and double-blind principle, in order to guarantee the accuracy of the data applied for further analysis.
Compared with our study-level analysis, the analysis using individual patients’ data could be more robust.
Because we used only trials that reported both blinded independent central review and local assessment, the findings and conclusions of this research may not be generalisable to all phase III oncological randomised controlled trials: the performance of either assessment is unknown when trials did not implement or report both central and local assessments.
In phase III randomised controlled trials (RCTs), response-related or progression-related endpoints such as objective response rate (ORR), disease control rate (DCR), progression-free survival (PFS) and time to progression (TTP) are key to reflecting the treatment effects of the experimental arm relative to the control arm in patients with advanced solid tumours.1–3 During trials, tumour response should be assessed accurately, which is also a prerequisite for applying standardised response assessment criteria (eg, Response Evaluation Criteria in Solid Tumors (RECIST) and WHO).
Unlike overall survival, these endpoints assessed by local investigators are more influenced by subjective factors, including variability during tumour measurement, target lesion selection, failure to diagnose new lesions and different interpretations of non-target or immeasurable lesions.4 In open-label trials, the knowledge of investigators regarding treatment assignment could influence their assessment. Even in some double-blind trials, the investigators’ knowledge may not be completely eliminated due to the adverse effects; for example, the investigators might be able to tell which treatments are assigned for their patients according to the different manifestations of treatments' adverse effects.5
Treatment effect is one of the main results considered for drug approval. If the aforementioned subjective factors affect the assessment of trial endpoints, the result will overestimate or underestimate the true effect of treatments, which is called systematic bias.6 In order to detect potential bias from local investigators, blinded independent central review is requested by the regulatory authorities (eg, the US Food and Drug Administration (FDA)). During its implementation, all imaging examinations are reviewed by independent radiologists who are blinded to patients’ treatment assignments and clinical information.7 However, this mechanism has some drawbacks. It increases the burden of time and expenditure on trials. Additionally, it may introduce missing data and informative censoring, and may neglect symptomatic progression. These factors could produce discrepancies between central and local assessments, and sometimes among central reviewers themselves, which affect treatment effects and may even cause bias.4 7 8
Given the pros and cons of assessment by central reviewers, the FDA Oncology Drugs Advisory Committee discussed how to design a reliable assessment strategy for clinical trials with central review: if there is no strong evidence indicating systematic bias from two assessments, a sample-based central review could be considered in future usage instead of the complete assessment for all patients in the trials.9 This strategy may effectively reduce the complexity and implementation burden, without compromising the reliability of the RCTs.9
Accordingly, in order to understand the reliability of local assessment, as well as the necessity of central review, we conducted this literature review and analyses in order to investigate whether systematic bias existed in previous phase III RCTs on solid tumours.
Search strategy and study selection
In accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement,10 a PubMed search was conducted by JRZ using the dates of 1 January 2010 to 30 June 2017. The search strategy is shown in online supplementary etable 1. Inappropriate articles such as reviews, systematic reviews and/or meta analyses, guidelines and commentaries were excluded.
Eligible trials were those directly evaluating therapeutic efficacy of anticancer agents in phase III RCTs for patients with advanced solid tumour; additionally, the imaging assessment for tumour response or progression was conducted by both central reviewers and local investigators. As some authors reported their data in more than one article, we used the name and/or National Clinical Trial (NCT) number of eligible RCTs as search terms to re-search PubMed (without the time interval limitation), to find out if there were more available articles on those RCTs. Endnote X7 (Thomson Reuters, New York City, New York, USA) was used in the above process.
Data extraction was carried out independently and double-blindly by three reviewers (JRZ with YYZ and SYT; in blocks of 50 articles allocated at random; discrepancies were resolved by WHL). To ensure consistency between reviewers, we used the same data extraction form, piloted the data extraction on a sample of 16 included trials and held discussions before and during the extraction process to agree on how to properly extract and interpret the data.
The following characteristics of each trial were extracted: author, year, NCT number, funding source (pharmaceutical or academic), mask (open label, single blind or double blind), region (global or intracontinental), tumour type (eg, breast cancer, ovarian cancer, melanoma), study design (superiority, non-inferiority or hybrid; hybrid design includes the design of superiority and non-inferiority), number of enrolled patients, response assessment criteria (RECIST or WHO), primary endpoint (central assessed, local assessed or other) and the statistical inference of the primary endpoint according to whether the p value was lower than the probability of the type I error (positive, negative or indeterminate). We also extracted estimated treatment effects from both central and local assessments, including the OR of experimental arm ORR to control arm ORR, OR of experimental arm DCR to control arm DCR, HR of experimental arm PFS to control arm PFS, and HR of experimental arm TTP to control arm TTP. Regarding overlapped data from more than one article on one trial, we selected data based on primarily larger analysis or recently updated analysis. For PFS and TTP, if both intention to treat (or other methods with a larger population) and per-protocol population were available for trials’ treatment effects, we preferred the former in our research. According to characteristics, the risk of bias was evaluated in each trial (online supplementary efigure 1).
First, we investigated whether there were trials with statistically inconsistent inferences between two assessments in primary and secondary endpoints (including ORR, PFS and TTP). If these trials could be identified, we calculated the percentage of these trials among all our eligible trials. Statistically inconsistent inferences are defined as the treatment effect from one of the assessments (eg, central assessment) indicating significant difference (p value is lower than the probability of the type I error or the confidence interval of the treatment effect does not cross 1), but the treatment effect from another assessment (eg, local assessment) indicating non-significant difference (p value is higher than the probability of the type I error, or the confidence interval of the treatment effect crosses 1).
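The definition above can be expressed as a small check on confidence intervals. The following is a minimal sketch, not the authors' procedure: the function names and the example intervals are hypothetical, and an interval whose limit equals exactly 1 is treated here as crossing 1, which is an assumption.

```python
def is_significant(ci_low, ci_high, null_value=1.0):
    """Return True when the CI of a ratio-type effect (OR or HR) excludes the null value.

    Assumption: a limit exactly equal to the null value counts as crossing it.
    """
    return not (ci_low <= null_value <= ci_high)


def inconsistent_inference(central_ci, local_ci):
    """Return True when exactly one of the two assessments is statistically significant."""
    return is_significant(*central_ci) != is_significant(*local_ci)


# Hypothetical 95% CIs for the HR of PFS in one trial (illustrative values only)
central_ci = (0.62, 0.95)  # excludes 1: significant
local_ci = (0.70, 1.04)    # crosses 1: non-significant
print(inconsistent_inference(central_ci, local_ci))
```

Applied trial by trial, the share of trials for which the check returns true corresponds to the percentage of statistically inconsistent inferences described above.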
Furthermore, to statistically investigate whether systematic bias existed, we compared treatment effects between central and local assessments by conducting a pooling analysis with the inverse variance method and fixed-effect model in Review Manager 5.3 (The Cochrane Collaboration, London, England). In this process, if the corresponding p value for heterogeneity was less than 0.05 or the I2 index was over 50%, we used a random-effect model instead of the fixed-effect model in order to reduce the effect of heterogeneity. The pooled OR and HR were the measure of this comparison, expressed as the ratio of central-assessed treatment effects (eg, OR of ORR, OR of DCR, HR of PFS, HR of TTP) to local-assessed treatment effects.11 An OR (of ORR or DCR) greater than 1 indicated that central review overestimated the efficacy of the therapeutic strategy in the experimental arm, while an HR (of PFS or TTP) greater than 1 indicated that central review underestimated the therapeutic efficacy of the experimental arm (compared with local assessment). Regardless of whether the ratio was higher or lower than 1, we concluded no sign of a significant systematic bias if: (1) the corresponding p value was higher than 0.05, which means the 95% CI of the pooled ratio (HR, OR) crossed 1; (2) the 95% CI of the pooled ratio was extremely tight (<5%) if the first consideration was not met.
For the above summary synthesis of ORR, DCR, PFS and TTP, a funnel plot was used to estimate publication bias (online supplementary efigure 2). Furthermore, we conducted subgroup analysis based on the trial characteristics: funding source, mask, region, trial design, number of enrolled patients (based on median value of all included trials), tumour type, response assessment criteria, primary endpoint and its outcome, as well as statistical inferences between central and local assessments (consistent/inconsistent).
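The pooling step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a reproduction of the Review Manager analysis: the trial values are hypothetical, standard errors are back-calculated from 95% CIs on the log scale, and the variance of each central-to-local ratio is formed as if the two assessments were independent. Because both assessments are made on the same patients, that independence assumption probably overstates the variance.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def se_from_ci(ci_low, ci_high):
    """Back-calculate the standard error of a log ratio from its 95% CI."""
    return (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)


def log_ratio(central, local):
    """Log of (central effect / local effect) and its standard error.

    central, local: (estimate, ci_low, ci_high) for an OR or HR.
    ASSUMPTION: the two estimates are treated as independent.
    """
    est_c, lo_c, hi_c = central
    est_l, lo_l, hi_l = local
    value = math.log(est_c) - math.log(est_l)
    se = math.sqrt(se_from_ci(lo_c, hi_c) ** 2 + se_from_ci(lo_l, hi_l) ** 2)
    return value, se


def pool_fixed(effects):
    """Inverse-variance fixed-effect pooling of (log effect, SE) pairs."""
    weights = [1.0 / se ** 2 for _, se in effects]
    total = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, effects)) / total
    se_pooled = math.sqrt(1.0 / total)
    # Cochran's Q and the I2 index for heterogeneity
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "ratio": math.exp(pooled),
        "ci_low": math.exp(pooled - Z95 * se_pooled),
        "ci_high": math.exp(pooled + Z95 * se_pooled),
        "q": q,
        "i2": i2,
    }


# Hypothetical trials: (central OR of ORR with CI, local OR of ORR with CI)
trials = [
    ((1.80, 1.30, 2.49), (1.70, 1.25, 2.31)),
    ((1.20, 0.90, 1.60), (1.25, 0.95, 1.64)),
    ((2.10, 1.40, 3.15), (2.00, 1.35, 2.96)),
]
result = pool_fixed([log_ratio(c, l) for c, l in trials])
print(result)
```

A pooled ratio whose 95% CI crosses 1 would be read, under rule (1) above, as no sign of significant systematic bias.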
In order to verify the result of the pooling analysis, we conducted correlation analysis for the treatment effects between central and local assessments, by using SPSS V.23 (SPSS, Chicago, Illinois, USA). The test for normality was completed first, followed by correlation analysis with a bivariate model: if normal distribution was indicated, we estimated the correlation by the Pearson correlation coefficient; if not, the Spearman’s correlation was applied. Significant correlation was indicated when the p value was less than 0.05. The correlation between two assessments was also demonstrated in scatterplots, constructed by using Excel 2011 (Microsoft, Seattle, Washington, USA).
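The choice between the two coefficients can be illustrated with plain Python. This is a sketch only: the analysis above was run in SPSS, the paired values are hypothetical, and the outcome of the normality test is passed in as a flag (`normally_distributed`, a hypothetical parameter) instead of being computed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation(x, y, normally_distributed):
    """Pearson if the normality test indicated normality, otherwise Spearman."""
    return pearson(x, y) if normally_distributed else spearman(x, y)


# Hypothetical paired HRs of PFS from central and local assessment
central_hr = [0.55, 0.70, 0.82, 0.95, 1.10]
local_hr = [0.58, 0.68, 0.85, 0.93, 1.15]
print(correlation(central_hr, local_hr, normally_distributed=True))
print(correlation(central_hr, local_hr, normally_distributed=False))
```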
Patient and public involvement
Because this study is a literature review, patients and the public were not involved in this research.
Trial searching and characteristics
Summary and detailed characteristics are presented in table 1 and in online supplementary etable 2. A majority of the 100 articles were published in high-impact journals: Journal of Clinical Oncology (29), Lancet Oncology (24), New England Journal of Medicine (18), Lancet (10), European Journal of Cancer (4), Gynecologic Oncology (4), Annals of Oncology (3), Oncologist (3) and so on. In all 76 included trials, 15 trials13–17 26 27 30 31 41 48 64 67 68 90 97 101 105 109 110 reported both central-assessed and local-assessed treatment effects of ORR and DCR; among them, 14 trials13–17 26 27 30 31 41 64 67 68 90 97 101 105 109 110 had those of ORR, DCR and PFS, including one trial68 with those of ORR, DCR, PFS and TTP. Another 12 trials18 28 29 33 37 51 57 65 79 84 85 91 92 103 with both central and local assessments only contained treatment effects of ORR and PFS.
Statistically inconsistent inferences of central and local assessments
From a total of 76 included trials, 17 trials (22%) had statistically inconsistent inferences (significant difference/non-significant difference) of ORR, PFS and/or TTP between central and local assessments.17 29 33 48 57 66 68 69 79 87 97 105 110 Among these 17 trials, 2 trials29 33 had inconsistent inferences in both the primary endpoint and a secondary endpoint. In total, 9 of the 17 trials (53%) showed a significant difference based on central assessment; 5 (56%) of these 9 trials had an open-label design (table 2).
Systematic bias between central and local assessments
All comparison results of the pooling analysis are presented in table 3. There was no significant difference in the treatment effects of ORR between central and local assessments (OR: 1.02 (95% CI 0.97 to 1.07), p=0.42; heterogeneity: p=0.91, I2=0%; online supplementary efigure 3). Similarly, no sign of significant difference was found in DCR (OR: 0.97 (95% CI 0.92 to 1.03), p=0.32; heterogeneity: p=0.93, I2=0%; online supplementary efigure 4), PFS (HR: 1.01 (95% CI 0.99 to 1.02), p=0.32; heterogeneity: p=1.00, I2=0%; online supplementary efigure 5) and TTP (HR: 1.04 (95% CI 0.95 to 1.14), p=0.37; heterogeneity: p=0.59, I2=0%; online supplementary efigure 6). Subgroup analysis also presented no significant difference between central and local assessments, and no significant interaction effect between different elements of subgroup factors, including open label or blind design (table 3).
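As a rough plausibility check, a two-sided p value can be back-calculated from each pooled ratio and its 95% CI reported above. The sketch below assumes a normal approximation on the log scale. Because the published limits are rounded to two decimals, the result is only approximate and need not match the reported p values, particularly for a narrow interval such as that of PFS.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def p_from_ci(estimate, ci_low, ci_high):
    """Approximate two-sided p value for a ratio, from its estimate and 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)
    z = math.log(estimate) / se
    # two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))


# Pooled central-to-local ratios as reported in the text: (estimate, CI low, CI high)
reported = {
    "ORR": (1.02, 0.97, 1.07),
    "DCR": (0.97, 0.92, 1.03),
    "PFS": (1.01, 0.99, 1.02),
    "TTP": (1.04, 0.95, 1.14),
}
for endpoint, (estimate, low, high) in reported.items():
    print(endpoint, round(p_from_ci(estimate, low, high), 2))
```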
The strength of the correlation between central and local assessments regarding treatment effect of ORR, DCR, PFS and TTP was 0.91 (p<0.01), 0.93 (p<0.01), 0.94 (p<0.01) and 0.90 (p=0.29), respectively (figure 2).
To our knowledge, this is the largest literature review with data analyses investigating blinded independent central review and local assessment in phase III RCTs on solid tumours. Also, it is the first research article to report the statistically inconsistent inferences (significant difference or not) of primary and secondary endpoints assessed by central reviewers and local investigators. We found 22% of trials (17/76) with inconsistent inferences between central and local assessments. However, our subsequent pooling analysis and correlation analysis based on all 76 trials confirmed no sign of systematic bias between central and local assessments, regardless of funding source, mask, region, tumour type, study design, number of enrolled patients, response assessment criteria, primary endpoint and outcome, as well as trials with statistically consistent/inconsistent inferences.
Blinded independent central review is used to detect potential bias introduced by the assessment of local investigators. This consideration is based on a common assumption that local investigators might expect superior efficacy of experimental arm treatments compared with control arm treatments, especially in trials with open-label design. Interestingly, among the 17 trials with statistically inconsistent inferences between central and local assessments, more than half of those 17 studies (9/17; 53%) had a statistically significant difference in central assessment; in these 9 trials, 5 (56%) trials were based on open-label design. This means that central assessment seems to have more positive outcomes in favour of experimental treatments in an open-label design, which contradicts the above common assumption.
With respect to statistically inconsistent inferences between central and local assessments, we assume evaluation variability is one factor accounting for these. As we understand, variability could be impacted by many subjective factors, causing measurement errors or uncertainty.8 This situation occurs when one scan reviewer assesses the response status of different individual patients, as well as when several reviewers conduct the scan assessment for one trial, regardless of whether this is a central or local assessment. In this situation, the evaluation variability attenuates the treatment effect and reduces the statistical power of the clinical trials.6 8 This understanding has been verified based on 21 phase III cancer trials, demonstrating large variability but no sign of systematic bias between two assessments.112
Missing data could be another factor. It occurs when some patients do not have complete follow-up to determine progression or death, or when patients stop receiving randomised treatments or use alternative treatments before they have progression.113 In oncological clinical trials, missing data are regarded as censoring. Similar to evaluation variability, the effect of censoring would not contribute to systematic bias but could attenuate the treatment effect.113
In the trials included in our study, we consider that evaluation variability, censoring and other unmentioned factors simultaneously played a role in attenuating the treatment effects, resulting in statistically inconsistent inferences between two assessments in 17 of the 76 trials. Regardless of what causes statistically inconsistent inferences, the robustness of the trial efficacy outcome needs to be carefully considered when two assessments present statistically inconsistent inferences, especially in the primary endpoint. Even though this inconsistency does not necessarily reflect a systematic bias, it would be interesting to know how policy-makers consider the approval process for the corresponding anticancer agents for the specific patients with cancer.
Considering statistically inconsistent inferences, we believe that blinded independent central review is still a useful method for controlling the risk of bias from local assessment. However, we also question the necessity of central assessment as a routine assessment method for all patients (complete-case fashion) in clinical trials. According to our research, there was no sign of systematic bias: (1) the 95% CIs of all pooled ratios in ORR, DCR, PFS and TTP crossed 1, indicating non-significant difference of the treatment effects between central and local assessments; (2) the 95% CIs were tight as well (especially in PFS), representing quite a precise estimate of the bias that should be negligible. These findings could be further confirmed by our subgroup analysis, even though a small number of the intervals are too wide to be informative due to a limited number of the trials (eg, only one trial used single blind, the OR of ORR was 1.09 (95% CI 0.61 to 1.95)).
When questioning the necessity of the complete central assessment, its drawbacks should be considered as well. First, its implementation in the complete-case fashion is very costly. Second, it is technically hard to conduct a real-time central assessment alongside local assessment in order to determine disease progression independently. In other words, the decision of central reviewers could be affected by local investigators: when the local investigators declare progression, ‘progressed’ patients may start to receive subsequent-line treatments. The progression time of these patients is therefore unknown to central reviewers, which is called informative censoring.5 6 9 11 112 Third, based only on imaging information, central reviewers cannot conclude progression when patients have symptomatic deterioration. Both informative censoring and withdrawal of patients with symptomatic progression (because of no radiological progression in central assessment) may cause bias when the final treatment effects of the experimental arm relative to the control arm in RCTs are calculated.5 8 114 Fourth, central assessment shares some drawbacks with local assessment, such as evaluation variability, target-lesion selection and different interpretations of non-target or immeasurable lesions.4 7
In fact, the continued use of the present response assessment criteria, RECIST and the WHO criteria, has become controversial in the new era of medicine with biomarker-driven therapies, whether for central or local assessment. For instance, when patients are treated with immunotherapies, some tumour lesions might show signs of tumour ‘progression’ based on the RECIST/WHO criteria before showing signs of tumour shrinkage, which is called pseudoprogression.115 Pseudoprogression was initially reported by Wolchok et al. They found that, by using the immune-related response criteria (irRC), at least 10% of ipilimumab-treated patients whose response status was characterised as progressive disease (PD) based on the WHO criteria could have favourable survival.116 In one case in that study, histopathology showed that the enlarged lesion reflected T-cell infiltration rather than tumour proliferation, although PD was considered according to the WHO criteria.116 Similar findings have been reported by another two studies that compared the assessment of irRC with RECIST V.1.1, and immune-modified RECIST with RECIST V.1.1, respectively.117 118 Even though in our subgroup analysis the comparison of central versus local assessments did not show a significant difference regardless of whether the RECIST or WHO criteria were used, these criteria deserve improvement for biomarker-driven therapies.
Our research has several limitations. First, due to using data from RCTs with both assessments, our outcome may not perfectly match all phase III trials, especially when the trials are implemented by only one type of assessment. Another situation that needs to be considered is trials evaluating two radiological assessment methods, but eventually reporting the outcomes based only on one assessment in published articles. In this situation, a statistically positive outcome may be reported in one assessment; whereas, the ‘not-yet-reported’ outcome of another assessment might be negative. Second, we included trials covering all solid tumours instead of focusing on one specific tumour type, in that we assumed that our research outcome could not be strongly impacted by tumours’ biological characteristics when comparing specific trial processes (eg, central and local assessments) based on study-level strategy. Our subgroup analysis based on different tumour types verified our assumption.
Furthermore, individual-level data would have been the best option for our research, but we did not have access to them. However, we consider that using study-level data reported in each published article is still a good option because the aim of our research is to investigate study-level issues. Moreover, given that informative censoring might affect the treatment effects of PFS and TTP, we also included another important endpoint, ORR, in order to acquire a more exact understanding of whether the treatment effects of both assessments are consistent or not. In this circumstance, the effect of informative censoring could be eliminated because, when assessing ORR, central reviewers and local investigators worked independently before local investigators declared progression. Lastly, even though we have done our best to minimise inconsistency during the process of data extraction, it is possible that errors may have occurred. Nevertheless, all reviewers have tried to ensure consistency in data interpretation.
In conclusion, we estimate that there was essentially no systematic bias between local and central assessments, as evidenced by our precisely estimated pooled ratios of OR in ORR and DCR, as well as estimated pooled ratios of HR in PFS and TTP. Despite this, we found that statistically inconsistent inferences could be made in many trials depending on whether central or local assessment was used. Considering these, we think blinded independent central review is still an irreplaceable method for controlling the risk of bias from local assessment, but its routine usage for all patients may be unnecessary in oncological randomised controlled trials.
This study is the first academic output of the Design, Implementation & Report of Oncological Randomized Controlled Trials (DIRORCT) Research Project, which aims to investigate the quality of RCTs from study design and implementation to final report through systematic review and/or further analysis. We thank Ms Xiaoru Deng (Guangzhou Medical University), Dr Peter Coogan, Professor Joseph T. Steensma, Mr Natalicio Serrano (Brown School at Washington University in St. Louis) and Ms Carolyn Smith (The Writing Center at Washington University in St. Louis) for their advice on article searching and manuscript revision. We also sincerely thank all authors, investigators, sponsors and patients for their effort and participation in our included studies.
JZ, YZ and ST contributed equally.
Contributors JRZ, WHL and JXH conceived and designed the study. JRZ conducted article researching and selection, and quality assessment. JRZ, YYZ and SYT took the lead for data extraction, which was checked by YQC, HRL, DFC, YH, XYW, KXD, SHJ, JQZ, JXX and XZC. JRZ, YYZ, SYT, LJ, QHH, LTH, JXH, ZHX and JYW analysed and interpreted the data. JRZ drafted the manuscript, which was critically revised for important intellectual content by all authors. WHL and JXH supervised the study. All authors, including JRZ, YYZ, SYT, LJ, QHH, LTH, JXH, ZHX, JYW, YQC, HRL, DFC, YH, XYW, KXD, SHJ, JQZ, JXX, XZC, WHL and JXH, have read and approved the final manuscript.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient consent Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Requests for additional details regarding the study protocol may be made by contacting the author JRZ (firstname.lastname@example.org).
Presented at Some data of this research have been presented at ESMO 2016 Annual Meeting.