Article Text


Non-inferiority trials: are they inferior? A systematic review of reporting in major medical journals
  1. Sunita Rehal1,2,
  2. Tim P Morris1,2,
  3. Katherine Fielding1,3,
  4. James R Carpenter1,2,4,
  5. Patrick P J Phillips1

  1. MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and Methodology, London, UK
  2. MRC Clinical Trials Unit at UCL, London Hub for Trials Methodology Research, London, UK
  3. MRC Tropical Epidemiology Group, Department of Infectious Disease Epidemiology, London School of Hygiene & Tropical Medicine, London, UK
  4. Department of Medical Statistics, London School of Hygiene & Tropical Medicine, London, UK

  Correspondence to Sunita Rehal; s.rehal{at}ucl.ac.uk

Abstract

Objective To assess the adequacy of reporting of non-inferiority trials alongside the consistency and utility of current recommended analyses and guidelines.

Design Review of randomised clinical trials that used a non-inferiority design published between January 2010 and May 2015 in medical journals that had an impact factor >10 (JAMA Internal Medicine, Archives of Internal Medicine, PLOS Medicine, Annals of Internal Medicine, BMJ, JAMA, Lancet and New England Journal of Medicine).

Data sources Ovid (MEDLINE).

Methods We searched for non-inferiority trials and assessed the following: choice of non-inferiority margin and justification of the margin; power and significance level for the sample size; patient population used and how this was defined; any missing data methods used and the assumptions declared; and any sensitivity analyses used.

Results A total of 168 trial publications were included. Most trials concluded non-inferiority (132; 79%). The non-inferiority margin was reported in 164 (98%), but fewer than half reported any justification for the margin (77; 46%). While most articles reported two different analyses (91; 54%), most commonly intention-to-treat (ITT) or modified ITT together with per-protocol, a large number conducted and reported only one analysis (65; 39%), most commonly the ITT analysis. There was a lack of clarity or an inconsistency between the type I error rate and corresponding CIs for 73 (43%) articles. Missing data were rarely considered, with 99 (59%) articles not declaring whether imputation techniques were used.

Conclusions Reporting and conduct of non-inferiority trials is inconsistent and does not follow the recommendations in available statistical guidelines, which are not wholly consistent themselves. Authors should clearly describe the methods used and provide clear descriptions of and justifications for their design and primary analysis. Failure to do this risks misleading conclusions being drawn, with consequent effects on clinical practice.

  • non-inferiority
  • systematic review
  • randomised controlled clinical trials
  • clinical trial

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/


Strengths and limitations of this study

  • This research clearly demonstrates the inconsistency in the recommendations for non-inferiority trials provided by guidelines for researchers, an inconsistency that is reflected in the trials reviewed.

  • It highlights missing data and sensitivity analyses in the context of non-inferiority trials.

  • It provides recommendations using examples for researchers using the non-inferiority design.

  • Justification of the choice of the margin was recorded as such if any attempt was made to do so, and so one could argue that inadequate attempts were counted as a ‘justification’; however, there was good agreement between reviewers when independently assessed.

  • Only one reviewer extracted information from all articles and therefore assessments may be subjective. However, there was good agreement when a random 5% of papers were independently assessed.

Introduction

Non-inferiority trials assess whether a new intervention is not much worse than a standard treatment or standard of care. These trials ask whether we are willing to accept a new intervention that may be somewhat less effective clinically, yet still beneficial for patients because it has another advantage, such as less-intensive treatment, lower cost or fewer side effects.1 Non-inferiority and equivalence are sometimes, mistakenly, used interchangeably. Equivalence trials are designed to show that a new intervention performs not much worse and not much better than a standard intervention. Both trial designs are different from superiority trials, which aim to show that a new intervention performs better than a control.
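In notation (a standard formulation added here for clarity, not one taken from any of the guidelines), let Δ be the true treatment effect of the new intervention relative to the standard (larger values favouring the new intervention) and δ>0 the non-inferiority margin. The three designs then test:

```latex
\begin{align*}
\text{Superiority:}     &\quad H_0:\ \Delta \le 0        &&\text{vs}\quad H_1:\ \Delta > 0\\
\text{Non-inferiority:} &\quad H_0:\ \Delta \le -\delta  &&\text{vs}\quad H_1:\ \Delta > -\delta\\
\text{Equivalence:}     &\quad H_0:\ |\Delta| \ge \delta &&\text{vs}\quad H_1:\ |\Delta| < \delta
\end{align*}
```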

Poor trial quality can bias trial results towards concluding no difference between treatments.2 This creates more challenges in non-inferiority trials than superiority trials as such bias can produce false-positive results for non-inferiority.3–5 The increasing use of this design6–8 means that it is even more important for trialists to understand the issues around the quality in the design and analysis of non-inferiority trials.

There are several guidelines available to aid researchers using a non-inferiority design, where various considerations of the design are explained and discussed (table 1).

  1. The CONSORT extension statements1,9 focus on the reporting of non-inferiority trials, with the most recent 2012 statement being an elaboration of the 2006 statement.

  2. The draft FDA 20102 document focuses on all aspects and issues relating to non-inferiority trials and gives general guidance.

  3. The EMEA 2000 guideline10 discusses switching between non-inferiority and superiority designs and the EMEA 200611 guideline discusses the choice of the non-inferiority margin, taking into account two-arm and three-arm trials.

  4. The ICH E9 and E10 guidelines12,13 are general statistical guidance documents addressing issues for all clinical trials and designs.

  5. SPIRIT14 is a guidance document for protocols for all trial designs and includes discussions of recently developed methodology.

Table 1

Summary of guidelines

There is some inconsistency between these guidelines regarding the conduct of non-inferiority trials (table 1) that may adversely affect the overall quality and reporting of non-inferiority trials. Non-inferiority trials require more care around certain issues, and so clear guidance on how to design and analyse these trials is necessary. Some of the issues that can influence inferences made about non-inferiority are outlined below.

First, the non-inferiority margin—the value that allows for a new treatment to be ‘acceptably worse’1—is used as a reference for conclusions about non-inferiority. All guidelines recommend that this margin is chosen on a clinical basis, meaning the maximum clinically acceptable extent to which a new drug can be less effective than the standard of care and still show evidence of an effect.15 However, it is unclear whether statistical considerations should also affect the choice of an appropriate margin, as recommended by the draft FDA 2010, ICH E10 and EMEA 2006 guidelines2,11,13 (table 1). Ignoring statistical evidence from meta-analyses or systematic reviews could lead researchers to choose an unrealistic margin.

Second, it is important to choose who is included in the analyses of a non-inferiority trial. The intention-to-treat (ITT) analysis (including all randomised patients irrespective of postrandomisation occurrences) is preferred for superiority trials, as it is likely to lead to a treatment effect estimate closer to no effect and so is conservative.16 For non-inferiority trials, the ITT analysis can bias towards the null, which may lead to false claims of non-inferiority.17 The alternative per-protocol (PP) analysis is often considered instead. However, given that the PP analysis allows for the exclusion of patients, it fails to preserve the balance of patient numbers between treatment arms (ie, randomisation) that the ITT analysis does, and can cause bias in either direction, depending on who the analysis excludes.18 Guidelines often recommend performing both the ITT and PP analyses, although definitions are inconsistent (table 1). In particular, the CONSORT 2006 guidelines describe the PP analysis as excluding patients not taking allocated treatment or otherwise not protocol-adherent,1 whereas the ICH E9 guidelines state that the PP analysis is a “subset of patients who complied sufficiently with the protocol, such as exposure to treatment, availability of measures and absence of major protocol violations.”19 These obscure definitions could lead researchers to arbitrarily exclude patients from analyses. The draft FDA guidelines recommend that researchers use an ITT and an as-treated analysis, although it is unclear what is meant by ‘as-treated’ as this is not defined within the guidelines. Other frequently used classifications, such as modified ITT (mITT), which aims to contain ‘justifiable’ exclusions (eg, patients who never had the disease of interest) from the ITT analysis, are also defined inconsistently.20

Third, while two-sided 95% CIs are widely used for superiority trials, there is inconsistent advice as to whether 90% or 95% CIs should be calculated for non-inferiority trials and whether these should be presented as one-sided or two-sided intervals (table 1).

Fourth, the handling of missing data is generally discussed for all trials but rarely in the specific context of non-inferiority trials. Methods recommended to handle missing data vary between guidelines. The ICH E9 guidelines recommend using a last observation carried forward imputation method,19 and the more recent SPIRIT guidelines recommend multiple imputation, but caution the reader that it relies on untestable assumptions14 (table 1). Methods to handle missing data often contain untestable assumptions and so, sensitivity analyses are essential to test the robustness of conclusions under different assumptions.12 However, it is unclear what sensitivity analyses are appropriate for non-inferiority trials.

Given the inconsistency between guidelines, we hypothesised that poor conduct and reporting would be associated with demonstrating non-inferiority. This review investigates the quality of conduct and reporting for non-inferiority trials in a selection of high-impact journals over a 5-year period. We also provide recommendations to aid trialists who may consider a non-inferiority design.

Methods

Medical journals (general and internal medicine) with an impact factor >10 according to the ISI Web of Knowledge21 were included in the review (correct at the time of the search on 31 May 2015), the rationale being that articles published in these journals are likely to have the highest influence on clinical practice and to be the most rigorously conducted and reported owing to the thorough editorial process. We searched Ovid (MEDLINE) using the search terms ‘noninferior’, ‘non-inferior’, ‘noninferiority’ and ‘non-inferiority’ in titles and abstracts between 1 January 2010 and 31 May 2015 in the New England Journal of Medicine (NEJM), Lancet, JAMA, British Medical Journal, Annals of Internal Medicine, PLOS Medicine and Archives of Internal Medicine (in descending impact order). From 2013, Archives of Internal Medicine was renamed JAMA Internal Medicine, and therefore both journals have been included in this review. All journals refer authors to the CONSORT statement and checklist when reporting. Eligibility of articles was assessed via abstracts by two reviewers (SR and TPM). Articles included were non-inferiority randomised controlled clinical trials. Articles were excluded if the primary analysis was not for non-inferiority. Systematic reviews, meta-analyses and commentaries were also excluded. A small number of trials were designed and analysed using Bayesian methods; these were excluded so that comparisons were made consistently between trials using frequentist methods.

Before performing the review, a data extraction form was developed to extract information from articles. Information extracted was with regard to the primary outcome. The form was standardised to collect information on the year of publication, non-inferiority margin (and how the margin was justified), randomisation, type of intervention, disease area, sample size, analyses performed (how this was defined and what was classed as primary/secondary), primary outcome, p values (and whether this was for a superiority hypothesis), significance level of CIs (and whether both bounds were reported), imputation techniques for missing data, sensitivity analyses, conclusions of non-inferiority and whether a test for superiority was prespecified. Justifications for the choice of the non-inferiority margin were reviewed by two reviewers (SR and PPJP). See online supplementary material for further details on methods.


A quality grading system was developed based on whether the margin was justified (yes vs no/poor), how many analyses were performed on the primary outcome (<2 vs ≥2) and whether the type I error rate was consistent with the significance level of the CI (yes vs no/unclear). Articles were classed as ‘excellent’ if all three criteria were fulfilled and as ‘poor’ if none was fulfilled. Articles that satisfied one criterion were classed as ‘fair’ and articles that satisfied two of the three criteria were classed as ‘good’. The results of this grading were compared to inferences on non-inferiority to assess whether the quality of reporting was associated with concluding non-inferiority at the 5% significance level.
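As an illustrative sketch of this grading rule (our own rendering in Python; the analyses themselves were run in Stata, and the function and argument names here are invented for illustration):

```python
# Minimal sketch of the quality grading described above.
GRADES = {0: "poor", 1: "fair", 2: "good", 3: "excellent"}

def grade_article(margin_justified: bool,
                  two_or_more_analyses: bool,
                  error_rate_consistent_with_ci: bool) -> str:
    """Return the quality grade from the number of criteria fulfilled."""
    n_met = sum([margin_justified, two_or_more_analyses,
                 error_rate_consistent_with_ci])
    return GRADES[n_met]

# Example: justified margin and consistent CI, but only one analysis -> 'good'
print(grade_article(True, False, True))
```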

Additional published online supplementary material was accessed only if it specifically referred to the information we were extracting within articles. As a substudy, all statistical methods, outcomes and sample sizes from protocols and/or online supplementary material were reviewed from NEJM as the journal is known to specifically request and publish protocols and statistical analysis plans alongside accepted publications.

Assessments were carried out by one reviewer (SR), with a random selection of 5% independently reviewed (PPJP). Any assessments that required a second opinion were independently reviewed (TPM). Any discrepancies were resolved by discussion between reviewers.

All analyses were conducted using Stata V.14.

Results

Our search found 252 articles. After duplicate publications were removed, 217 were screened for eligibility using their titles and abstracts. A total of 46 articles were excluded leaving 171 articles to be reviewed. A further three articles were excluded during the full-text review leaving 168 articles (figure 1).

Figure 1

Flow chart of eligibility of articles.

General characteristics of the included studies are summarised in table 2.

Table 2

General characteristics

Margin

The non-inferiority margin was specified in 164 (98%) articles and was justified in fewer than half of the articles (76; 45%). The most common justification was on a clinical basis (29; 17%), which was often worded ambiguously and with little detail. A total of 14 (8%) used previous findings from past trials or statistical reviews to justify the choice of the margin (table 3).

Table 3

Justification of choice of margin, total number of patient populations considered for analyses and patient population included in the analysis

Patients included in analysis

Over a third of articles (65; 39%) declared only one analysis (table 3 and see online supplementary table S1a). The majority of trials classed the ITT analysis as primary and PP analyses as secondary (see online supplementary figure S1a). PP analyses were performed in 90 (54%) trials, of which 11 (12%) did not define what was meant by ‘PP’ (table 3 and see online supplementary table S1b). Definitions of the PP population contained various exclusions, mostly regarding errors in randomised treatment or treatment received.

Type I error rate

Consistency between the type I error rate and the CIs reported was moderate, at 95 (57%) articles (table 4). Most articles (69; 41%) used a one-sided 2.5% or (numerically equivalent) two-sided 5% significance level (table 5), and some used a one-sided 5% significance level (46; 27%). The majority of articles presented two-sided CIs (147; 88%) and 19 (11%) articles presented one-sided CIs. Most two-sided CIs were at the 95% confidence level (125; 74%).

Table 4

Consistency of type I error rate with significance levels of CIs over year of publication

Table 5

Significance level of (a) type I error rate and (b) CIs for all articles by whether CI was one-sided or two-sided

Missing data and sensitivity analyses

Ninety-nine (59%) trials did not report whether or not any imputation was carried out, and only 12 (7%) explicitly declared that no imputation was used. Worst-case imputation and multiple imputation were the most common methods used (table 6). The number of imputations used for multiple imputation was specified in 8 of 11 articles, and 4 of 11 stated at least one of the assumptions underlying Rubin's rules.22 Sixty-four (38%) trials reported using sensitivity analyses to test the robustness of conclusions for the primary outcome; of these, 27 (42%) related to assumptions about the missing data (table 6).

Table 6

Reporting of (a) missing data and (b) sensitivity analyses

Study conclusions

There were seven (4%) articles that could not make definitive conclusions (noted as ‘other’; table 7). For example, if all analyses conducted had to demonstrate non-inferiority for a treatment to be concluded non-inferior, and only one of the analyses did so, then non-inferiority could neither be concluded nor rejected. Non-inferiority was declared in 132 (79%) articles. Ten of these made some reference to equivalence studies within the article (see online supplementary material).

Superiority analyses were performed in 37 (22%) trials after declaring non-inferiority, of which 27 (73%) had explicitly preplanned the superiority analyses. p Values were reported in 98 (58%) articles, of which 29 (30%) were testing a superiority hypothesis.

Subgroup of trials with published protocols

Additional information from protocols published by NEJM was extracted for 57 of 61 articles. Including this additional information provided by NEJM improved reporting of results across all criteria: 39 (64%) articles justified the choice of the non-inferiority margin compared with 19 (31%); most planned two or more analyses (45; 74%) compared with 37 (61%) (there were a couple of cases where two analyses were planned in the protocol but only one was stated in the published article); consistency between type I error rates and CIs was 44 (72%) compared with 36 (59%); imputation techniques were considered in 29 (48%) compared with 17 (28%) articles; and sensitivity analyses were considered in 38 (62%) articles compared with 25 (41%). The majority of articles concluded non-inferiority, with 8 (13%) unable to determine non-inferiority. A total of 14 (23%) articles concluded superiority, of which most were preplanned (9; 64%). Few articles (8/40; 20%) presented superiority p values.

Association between quality of reporting and conclusions

Trials that were classed as having some ‘other’ conclusion about non-inferiority were excluded from the analysis. Overall, there was a suggestive difference between the quality of reporting and concluding non-inferiority (p=0.05; Cochran–Armitage test for trend; table 7). Trials that were poorly reported were less likely to conclude non-inferiority than those that satisfied two or all of the criteria of justifying the choice of the margin, reporting two or more analyses and reporting a CI consistent with the type I error rate.
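For readers wishing to reproduce this kind of trend test, a minimal sketch of the standard Cochran–Armitage statistic with equally spaced scores is shown below (our own Python implementation; the counts are invented placeholders, not the data behind table 7):

```python
import numpy as np
from scipy.stats import norm

def cochran_armitage(successes, totals, scores=None):
    """Two-sided Cochran-Armitage test for a trend in proportions
    across ordered groups (eg, poor/fair/good/excellent reporting)."""
    x = np.asarray(successes, dtype=float)
    n = np.asarray(totals, dtype=float)
    t = np.arange(len(x)) if scores is None else np.asarray(scores, dtype=float)
    p_bar = x.sum() / n.sum()
    # Trend statistic and its variance under the null of no trend
    stat = np.sum(t * (x - n * p_bar))
    var = p_bar * (1 - p_bar) * (np.sum(t**2 * n) - np.sum(t * n)**2 / n.sum())
    z = stat / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))

# Hypothetical counts: articles concluding non-inferiority per quality grade
z, p = cochran_armitage(successes=[5, 30, 55, 40], totals=[10, 40, 65, 46])
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```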

Table 7

Quality of reporting of trials associated with conclusions of non-inferiority

Discussion

Reporting of non-inferiority trials is poor, perhaps partly because of disagreement between guidelines on vital issues. There are some aspects that guidelines agree on, such as the requirement for the non-inferiority margin to be justified, but we find that this recommendation is neglected by the majority of authors. It is remarkable that several authors performed only one analysis for the primary outcome, and the lack of consistency between the significance level chosen in sample size calculations and the CI reported further highlights confusion around non-inferiority trials. Uncertainty over how to deal with missing data, and over which sensitivity analyses are appropriate, adds to this confusion. Together, these findings from high-impact journals and the inconsistency between guidelines indicate that: (1) the non-inferiority design is not well understood by those using it and (2) methods for non-inferiority designs are yet to be optimised.

We anticipated that poor reporting of articles would bias towards concluding non-inferiority; however, the poorly reported trials were less likely to demonstrate non-inferiority. This is somewhat reassuring. Nevertheless, it is essential to ensure that what is reported at the end of a trial was prespecified before the start of a trial: scientific credibility and regulatory acceptability of a non-inferiority trial rely on the trial being well-designed and conducted according to the design.23 It is possible that the quality of a trial may also depend on the quality of the outcome; unresponsive outcomes that miss important differences between treatments may be intentionally or unintentionally chosen to demonstrate non-inferiority. Therefore, it is also important that the outcome chosen is robust.

Almost 80% of studies concluded non-inferiority, although it is unclear whether this is due to the reporting in articles or to publication bias. It appears that positive results (ie, results favouring the alternative hypothesis) are published more often, regardless of trial design, as this figure is consistent with other studies finding that more than 70% of published superiority trials demonstrated superiority.24,25

More than half of articles reported p values, of which approximately a third reported p values for a two-sided test for superiority. p Values, if reported, should be calculated for one-sided tests corresponding to the non-inferiority hypothesis; that is, with H0: δ=margin. p Values for superiority should not be presented unless following the demonstration of non-inferiority, where a preplanned superiority hypothesis is tested.26
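As a hedged illustration of what such a one-sided test looks like for a binary outcome, the sketch below uses a standard Wald-type calculation on the risk difference (all numbers are invented, and this is not the method of any particular trial reviewed):

```python
import math
from scipy.stats import norm

# Hypothetical trial: failure proportions, margin of 10 percentage points
x_new, n_new = 42, 300    # failures / patients, new treatment
x_std, n_std = 36, 300    # failures / patients, standard of care
margin = 0.10             # non-inferiority margin on the risk difference

p_new, p_std = x_new / n_new, x_std / n_std
diff = p_new - p_std      # excess failure risk of the new treatment
se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)

# One-sided test of H0: diff >= margin against H1: diff < margin
z = (diff - margin) / se
p_one_sided = norm.cdf(z)   # small p supports non-inferiority
print(f"difference = {diff:.3f}, one-sided p = {p_one_sided:.4f}")
```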

Comparison with other studies

The value of the non-inferiority margin was almost always reported, but more than half of the articles made no attempt to explain how the choice was justified. While the proportion justifying the margin is low, this is actually an improvement on Schiller et al,27 who reported that 23% of articles gave a justification, although the difference could be because only high-impact journals were included in this review. There were as many articles that planned and reported an ITT analysis alone as articles that performed both ITT and PP analyses. This is surprising given that CONSORT 2006 states that an ITT analysis can bias non-inferiority trials towards showing non-inferiority.1 These results were lower than those found by Wangge et al,28 who reported that 55% used either an ITT or a PP analysis and 42% used both ITT and PP. Most articles presented two-sided 95% CIs, which is consistent with the results of Le Henanff et al.29

There were very few articles that referred to preserving the treatment effect based on estimates of the standard of care arm from previous trials. It is vital that authors acknowledge this to ensure the standard of care is effective. If the control was to have no effect at all in the study, then finding a small difference between the standard of care and new intervention would be meaningless.2

Clinical considerations1,2,9,11–13 used to justify the choice of the margin were often inadequate, such as ‘deemed appropriate’ or ‘consensus among a group of clinical experts’. Non-inferiority is only meaningful if it has strong justification in the clinical context, and so this justification should be reported. If the justification includes a measurable reduction in adverse events, these should be measured and the benefit should be demonstrated. Guidelines recommend that the choice of margin should be justified primarily on clinical grounds; however, previous trials and historical data should also be considered if available. As an example, Gallagher et al30 justify the choice of the margin with as much information as possible, including references to all published reports and data from the institution where the senior author is based.

A statement often used in the articles reviewed was that ‘the choice of the margin was clinically acceptable’. This statement does not contain enough information to justify the choice of the non-inferiority margin. If the choice of the margin is based on a group of clinical experts, authors should provide information on how many experts were involved and how many considered the choice of the margin acceptable: a consensus among a group of 3 clinicians from 1 institution is different from a consensus of 20 clinicians representing several institutions. Radford et al31 justify the choice of the non-inferiority margin after performing a delegate survey at a symposium. This method may be a way forward for researchers to obtain clinical assessment from a large group of clinicians. Even better would be to obtain formal assessments, using, for example, the Delphi method,32 which has been used in the COMET initiative,33 after presenting the proposed research at a conference or symposium so that clinicians can really engage with the question at hand.

Definitions provided by authors were inconsistent in what they classed as ITT, PP, mITT and as-treated; for example, “all patients randomised who received at least one dose of treatment” appeared at least once under each classification. According to the guidelines, the PP definition excludes patients from the analysis, but it is unclear what those exclusions should be. The ambiguity of how PP is defined was evident in this review, as the definitions provided by authors could not be succinctly categorised.

Many articles presented only one analysis, despite most guidelines recommending at least two analyses.1,2,9,10,12 Unfortunately, the guidelines differ in their advice on which of the two analyses conclusions should be based. This regrettable state of affairs was clearly reflected in our review.

The ITT and PP analyses each have their biases, and so neither can be taken as a ‘gold standard’ for non-inferiority trials. The analysis of the primary outcome is the most important result of any clinical trial. The protocol should predefine what patients are expected to adhere to, and consideration should be given at the design stage to what can be done to maximise adherence. Given the variety of definitions provided by authors, particularly for PP analyses where definitions are subjective, it should be made clear exactly who is included in each analysis. Most authors included treatment-related exclusions such as ‘received treatment’, ‘completed treatment’ or ‘received the correct treatment’. Such differences in definitions may be superficially small, but could in fact make critical differences to the results of a trial.

Poor reporting of whether the hypothesis test was one-sided or two-sided, or the absence of the type I error rate from the sample size calculation, meant that over a quarter of articles were not clearly consistent with regard to the type I error rate and the corresponding CI.

Most guidelines advise presenting two-sided 95% CIs, and this is what most articles presented. However, this recommendation may cause some confusion between equivalence and non-inferiority trials. In equivalence trials, a 5% significance level for the two-sided hypothesis is maintained using 95% CIs, whereas non-inferiority tests a one-sided hypothesis, so a one-sided 5% significance level corresponds to a two-sided 90% CI. If a one-sided type I error rate of 2.5% is used in the sample size calculation, then this corresponds to the stricter two-sided 95% CI, not a one-sided 95% CI.34
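A short numeric check of this correspondence (a sketch using standard normal quantiles; not code from any of the trials reviewed):

```python
from scipy.stats import norm

# A level-alpha one-sided test rejects exactly when the relevant bound of
# the two-sided (1 - 2*alpha) CI lies on the non-inferior side of the
# margin, because both use the same critical value z_{1-alpha}.
for alpha, ci_label in [(0.05, "90%"), (0.025, "95%")]:
    z = norm.ppf(1 - alpha)
    print(f"one-sided alpha = {alpha}: critical z = {z:.3f} "
          f"= z used by a two-sided {ci_label} CI")
# one-sided alpha = 0.05: critical z = 1.645 = z used by a two-sided 90% CI
# one-sided alpha = 0.025: critical z = 1.960 = z used by a two-sided 95% CI
```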

The power and type I error rate should be clearly reported within sample size calculations, along with whether the type I error rate is for a one-sided or two-sided test. For example, the CAP-START trial used a one-sided significance level of 0.05 with two-sided 90% CIs, and the authors provide exact details of the sample size calculation in an online supplementary appendix.35 If only one bound of the CI is presented throughout an article, this must be done clearly and consistently, as described by Schulz-Schüpke et al,36 Lucas et al37 and Gülmezoglu et al.38 Recently, JAMA have introduced a policy of presenting the lower bound of the CI with the upper bound tending towards infinity,39 and this has been put into practice in recent non-inferiority trials.40–43
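For concreteness, a textbook-style sample size approximation for a non-inferiority trial with a binary outcome is sketched below (assuming equal true proportions in both arms and a normal approximation; this is an illustration, not the CAP-START calculation):

```python
import math
from scipy.stats import norm

def n_per_arm(p, margin, alpha_one_sided=0.025, power=0.9):
    """Approximate patients per arm to show non-inferiority on a risk
    difference, assuming both arms share the true event proportion p."""
    z_a = norm.ppf(1 - alpha_one_sided)  # type I error (one-sided)
    z_b = norm.ppf(power)                # power
    return math.ceil((z_a + z_b) ** 2 * 2 * p * (1 - p) / margin ** 2)

# eg, 80% success expected in both arms, 10-point margin, one-sided 2.5%
print(n_per_arm(p=0.80, margin=0.10))   # roughly 337 patients per arm
```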

It is unclear whether the potential issues surrounding missing data are well recognised for non-inferiority studies, given that the majority of articles did not explicitly state whether or not methods to handle missing outcome data would be considered. Most trials that used multiple imputation stated the number of imputations used but few discussed the assumptions made, which are particularly critical in this context. Some missing data are inevitable, but naive assumptions and/or analysis threaten trial validity for ITT and PP analyses,14 particularly in the non-inferiority context where more missing data can bias towards demonstrating non-inferiority.44

It is recommended for trials to clearly report whether imputation methods to handle missing data were or were not performed. If imputation was used, it should be clearly stated what method was used along with any assumptions made, following the guidelines of Sterne et al.45

Only about a third of the articles reviewed reported using sensitivity analyses. There was some confusion between sensitivity analyses for missing data and secondary analyses. Sensitivity analyses for missing data should keep the primary analysis model but vary the assumptions about the distribution of the missing data, to establish the robustness of the primary analysis inference to the inevitably untestable assumptions about the missing data. In contrast, a secondary analysis that excludes patients from the primary outcome attempts to answer a separate, secondary question.46 Thus, while EMEA 2000 and CONSORT 2012 describe the latter as sensitivity analysis (and many papers we reviewed followed this), in general it is not, and conflating the two inevitably leads to further confusion.
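A minimal sketch of a missing data sensitivity analysis of this kind for a binary outcome is shown below (a worst-case imputation of the sort some reviewed trials used; all numbers are invented and the Wald-type CI is a simplification):

```python
import math
from scipy.stats import norm

def upper_ci_bound(fail_new, n_new, fail_std, n_std, alpha_one_sided=0.025):
    """Upper bound of the two-sided (1 - 2*alpha) CI for the difference
    in failure proportions (new minus standard)."""
    p1, p2 = fail_new / n_new, fail_std / n_std
    se = math.sqrt(p1 * (1 - p1) / n_new + p2 * (1 - p2) / n_std)
    return (p1 - p2) + norm.ppf(1 - alpha_one_sided) * se

margin = 0.10
# Observed failures/patients plus missing outcomes per arm (hypothetical)
fail_new, obs_new, miss_new = 42, 290, 10
fail_std, obs_std, miss_std = 36, 292, 8

# Primary analysis: complete cases only
primary = upper_ci_bound(fail_new, obs_new, fail_std, obs_std)
# Worst case for non-inferiority: missing patients counted as failures on
# the new arm and as successes on the standard arm
worst = upper_ci_bound(fail_new + miss_new, obs_new + miss_new,
                       fail_std, obs_std + miss_std)
for name, ub in [("complete case", primary), ("worst case", worst)]:
    verdict = "non-inferior" if ub < margin else "inconclusive"
    print(f"{name}: upper bound {ub:.3f} ({verdict})")
```

Under these invented numbers, the complete-case analysis demonstrates non-inferiority but the worst-case imputation does not, illustrating how missing data assumptions can tip the conclusion.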

The focus of the analysis for non-inferiority trials should be on patients who behaved as they were supposed to within the trial, that is, the PP population. However, rather than excluding patients from the PP analysis, an alternative approach would be to make an assumption about the missing data for patients who did not adhere to the predefined PP definition, and then impute outcomes for these patients as if they had continued in the trial without deviating. Sensitivity analyses should then be used to check the robustness of these results. Currently, however, it is unclear what methods are appropriate to achieve this goal.

Subgroup of trials with published protocols

The mandatory publication of protocols by NEJM improved results for all criteria assessed. This reiterates the findings of Vale et al,47 who evaluated risk of bias assessments in systematic reviews based on published reports, but who had also accessed protocols directly from the trial investigators, and found that deficiencies in the medical journal reports of trials do not necessarily reflect deficiencies in trial quality. Given this, it is clear that a major improvement in the reporting of non-inferiority trials would result if all journals followed this practice. Since publication of e-supplements is very cheap, there appears to be no reason not to do so.

Strengths and limitations

This research demonstrates the inconsistency in the recommendations for non-inferiority trials provided by the available guidelines, which was also reflected within this review. We have provided several recommendations using examples for researchers wishing to use the non-inferiority design and have outlined the most important recommendations that we hope will be taken up in future guidelines (box 1). We have also highlighted the importance of missing data and using sensitivity analyses specific to non-inferiority trials. There are also some limitations in this review. First, a justification of the choice of the margin was recorded as such if any attempt was made to do so. Therefore, one could argue that inadequate attempts were counted as a ‘justification’; however, there was good agreement between reviewers when independently assessed. Second, only one reviewer extracted information from all articles and therefore assessments may be subjective. However, there was good agreement when a random 5% of papers were independently assessed, and the categorisation of the justification of the non-inferiority margin was also independently assessed in all papers where a justification was given. Third, an update of the CONSORT statement for non-inferiority trials was published during the period of the search in 2012,9 which could improve the reporting of non-inferiority trials over the next few years. However, the first CONSORT statement for non-inferiority trials published in 20061 was released well before the studies included in our search and we have found that reporting of non-inferiority trials remains poor.

Box 1

Recommendations

▸ Justification of the margin should be made mandatory in journals.

▸ Authors should make reference to preserving the treatment effect based on estimates of the standard-of-care arm from previous trials.

▸ Presentation of the CI should be consistent with the type I error rate used in sample size calculations.

▸ Analyses should be performed to answer the question of interest (ie, the primary outcome) using additional analyses to test the robustness of that definition, rather than to heedlessly satisfy intention-to-treat and per-protocol definitions.

▸ Methods to handle missing data should be considered, and sensitivity analyses should be considered to test the assumptions of missing data made on the primary analysis.

▸ Protocols should always be published as online supplementary material and authors should make use of online supplementary material to include additional detail on methods (such as details for justifying the choice of the non-inferiority margin and full definition of analyses conducted), so that a word limit for a published article should not be an excuse for poor reporting.

Conclusion

Our findings suggest clear violations of available guidelines, including the CONSORT 2006 statement (published 4 years before the first paper in our review), which concentrates on improving how non-inferiority trials are reported and is widely endorsed across medical journals.

There is some indication that the quality of reporting of non-inferiority studies can affect the conclusions made, and therefore the results of trials that fail to clearly report the items discussed above should be interpreted cautiously. It is essential that justifying the choice of the non-inferiority margin becomes standard practice, with the information provided early in the planning of a study and in as much detail as possible. If the choice of the non-inferiority margin changes following approval from an ethics committee, the justification for the change, and any changes to the original sample size calculation, should be made explicit. If journals enforced a policy whereby authors must justify the choice of the non-inferiority margin before publication is accepted, this would encourage authors to provide robust justifications for something so critical, given that clinical practice may be expected to change if the margin of non-inferiority is met.

Sample size calculations include consideration of the type I error rate, which should be consistent with the CIs, as it is the CI that is compared against the margin to draw inferences about non-inferiority. Inconsistency between the two may distort the inferences made, and stricter CIs may lack power to detect true differences under the original sample size calculation. If any imputation was performed, this should be detailed along with its underlying assumptions, supplemented with sensitivity analyses under different assumptions about the missing data. There is an urgent need for research into appropriate ways of handling missing data in the PP analysis for non-inferiority trials; once resolved, this analysis should be the primary analysis.

Information that is only partially prespecified before the conduct of a trial may inadvertently provide opportunities to modify, at the time of reporting and without justification, decisions that were not prespecified. It is therefore crucial for editors to be satisfied that criteria are defined a priori. A compulsory requirement from journals to publish protocols, and even statistical analysis plans, as e-supplements along with the main article would avoid this ambiguity.

Acknowledgments

SR is supported by a Medical Research Council PhD studentship (grant number 1546500). TPM and JRC are supported by the Medical Research Council CTU at UCL (grant numbers 510636 and 532196).


Footnotes

  • Contributors SR conceived the study, carried out data extraction and analysis, and wrote the manuscript. TPM performed data extraction and critical revision of the manuscript. KF and JRC helped in critical revision of the manuscript. PPJP conducted data extraction, analysis and critical revision of the manuscript.

  • Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.