Objectives We sought to understand why randomised controlled trials in septic shock have failed to demonstrate effectiveness in the face of improving overall outcomes for patients and seemingly promising results of early phase trials of interventions.
Design We performed a retrospective analysis of large critical care trials of severe sepsis and septic shock. Data were collected from the primary trial manuscripts, prepublished statistical plans or by direct communication with corresponding authors.
Setting Critical care randomised controlled trials in severe sepsis and septic shock.
Participants 14 619 patients randomised in 13 trials published between 2005 and 2015, enrolling greater than 500 patients and powered to a primary outcome of mortality.
Intervention Multiple interventions including the evaluation of treatment strategies and novel therapeutics.
Primary and secondary outcome measures Our primary outcome measure was the difference between the anticipated and actual control arm mortality. Secondary analysis examined the actual effect size and the anticipated effect size employed in sample size calculation.
Results In this post hoc analysis of 13 trials with 14 619 patients randomised, we highlight a global tendency to overestimate control arm mortality in estimating sample size (actual minus anticipated, absolute difference −9.8%, 95% CI −14.7% to −5.0%, p<0.001). When we compared anticipated and actual effect size of a treatment, there was also a substantial overestimation in proposed values (actual minus anticipated, absolute difference −7.4%, 95% CI −9.0% to −5.8%, p<0.0001).
Conclusions An interpretation of our results is that trials are consistently underpowered in the planning phase by employing erroneous variables to calculate a satisfactory sample size. Our analysis cannot establish if, given a larger sample size, a trial would have had a positive result. It is disappointing so many promising phase II results have not translated into durable phase III outcomes. It is possible that our current framework has biased us towards discounting potentially life-saving treatments.
- clinical trials
- septic shock
- randomized control trials
- sample size calculation
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
This study captures large contemporary trials in severe sepsis and septic shock powered to mortality.
Includes critical care trials independent of intervention employed.
Examines evidence from sources outside of published trial manuscripts.
Applies statistical analysis to the discrepancy between the anticipated and actual variables used in trial design.
The retrospective examination of effect size is limited by the proposed errors in sample size calculation.
The mortality from severe sepsis and septic shock has fallen demonstrably over the last few years.1 Despite this, many large interventional trials in critical care, encompassing both novel therapeutics and optimised treatment strategies, have failed to confirm improved effectiveness of interventions in comparison with placebo or best routine care. Occasionally, the promise hinted at by small phase II efficacy trials has not been confirmed in larger phase III effectiveness trials.
During the planning of an interventional trial, investigators must determine both the population and the number of subjects to be enrolled. This is to ensure that the sample size is adequate to identify with reasonable statistical certainty both true differences and true lack of differences between groups, that is, avoiding both false-positive and false-negative results. To calculate the sample size, investigators identify a significance level (commonly 0.05) and a ‘power’ (commonly 80% to 90%). The significance level, often referred to as the alpha, is the chance of concluding there is evidence of a difference when actually there is not and the difference found in the trial was due to chance. Crudely, the power is the capacity of the trial to avoid erroneously reporting no difference when in fact there is a true difference between treatments. The input factors required to calculate the required sample size in a trial powered to a dichotomous outcome such as mortality are:
The control arm event rate.
The effect size.
The alpha level—to control for type 1 error (false positive).
And the power—to control for type 2 error (false negative).
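As an illustration of how these four inputs combine, one common normal-approximation formula for the number of patients per arm is given below. This is a sketch under stated assumptions: equal allocation, a two-sided test and unpooled variances. The trials reviewed here may have used other formulae or software. The symbols are introduced for this illustration only: p_c is the control arm event rate, p_t the anticipated intervention arm event rate (so p_c − p_t is the absolute effect size), α the significance level and 1 − β the power.

```latex
\[
n_{\text{per arm}} \;=\;
\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
      \left[\,p_c(1-p_c) + p_t(1-p_t)\,\right]}
     {\left(p_c - p_t\right)^{2}}
\]
```

Because the squared effect size sits in the denominator, halving the absolute effect size roughly quadruples the required sample size. A lower control arm event rate with a fixed relative effect shrinks the absolute effect in the same way.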
Control arm event rates are commonly derived from prior trials, historical epidemiological studies or inception studies. Investigators draw inferences on outcome event rate from a population akin to their prospective study population (by employing similar inclusion and exclusion criteria). The effect size is defined as the difference in primary outcome rate between the control and intervention arms of a trial. Anticipated effect sizes are typically estimated from earlier phase ‘efficacy trials’; these data provide a prediction of the magnitude of an intervention in the trial population to be enrolled.
Given the appreciable numbers of recent trials that have not demonstrated statistically significant effectiveness, we speculated that the assumptions which investigators were making in planning their trials may have inadvertently designed in weaknesses such that the final result of the trial did not necessarily reflect the veracity or otherwise of the hypothesis being tested. To put this another way, we speculated that the assumptions used in determining trial design were erroneous and have systematically eroded the capacity of such trials to demonstrate either a true positive effect of the treatment being investigated or to confidently exclude any such treatment effect.
Thus, we sought to investigate the role of these two important estimated variables on the relative risk reduction in primary outcome measure. To do this, we examined first the difference between anticipated outcome control arm event rate used during trial planning and the actual control arm event rate in the trial report, and second the difference between the anticipated effect size of novel treatment and that subsequently reported in the trial publications.
This study was performed as a post hoc analysis of published results of clinical trials in severe sepsis and septic shock.
A MEDLINE search was performed in August 2016 to identify appropriate trials by using the medical subject heading (MeSH) terms ‘sepsis’, ‘septic shock’ and ‘randomised control trial’ with publication dates limited to 2005 to 2015. The publication list was independently accessed by two investigators (JLCW and SJB) and trials filtered against an a priori defined list of inclusion criteria (figure 1).
Those trials not matching all inclusion criteria were rejected. The two lists were then cross-checked to ensure uniformity. The list was subsequently passed on to a third sepsis expert investigator (ACG) to ensure the sample set was representative of the prevailing peer-reviewed literature. Two further trials were suggested at this point but were subsequently rejected, as the sample sizes were too small to be included in the analysis. The minimum number of enrolled subjects was set at 500 for trials to be included in the analysis. This was somewhat arbitrary, but was chosen so that large efficacy studies were not included in the analysis; our objective was to explore the impact of our variables of interest in effectiveness trials with a clinical outcome measure, mortality. The statistical analysis plans from the papers identified were systematically analysed to collect information on trial power, anticipated control arm mortality and anticipated effect size. Sources of additional information included pretrial published protocols or statistical analysis plans and, where necessary, direct contact with the corresponding authors of the primary or protocol publications. The latter was undertaken if there were concerns about potential ambiguity in our own interpretation of the published reports. These data were collated using Microsoft Excel (Microsoft, Seattle, Washington, USA), with statistical analysis performed using GraphPad Prism V.6 (GraphPad Software, La Jolla, California, USA) and R.2
Data were initially checked for normality using the D’Agostino-Pearson omnibus test and subsequently analysed using t-tests or non-parametric equivalents where appropriate.
Measures of uncertainty, specifically CIs for the actual control arm mortality, intervention arm mortality and effect size, were calculated in R from the data provided in the trial reports using standard formulae.3 The differences between the actual and anticipated control arm mortality have been summarised using a random-effects meta-analysis to allow for heterogeneity between studies. The power curve (figure 5) was created in R using the standard power calculation formula for comparing two proportions.4
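The random-effects pooling described above can be sketched as follows. The text does not state which estimator was used, so the DerSimonian-Laird method here is an assumption, as is the choice to treat the anticipated mortality as a fixed constant so that each trial's variance comes only from the observed control arm proportion. The function name and the input values are hypothetical and are not the data from table 1.

```python
from math import sqrt
from statistics import NormalDist


def random_effects_pool(estimates, variances, level=0.95):
    """DerSimonian-Laird random-effects pooled estimate with a normal-theory CI.

    Returns (pooled, lower, upper, tau2), where tau2 is the estimated
    between-study variance.
    """
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    fixed = sum(w * y for w, y in zip(weights, estimates)) / total_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    c = total_w - sum(w * w for w in weights) / total_w
    # Method-of-moments estimate of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    se = sqrt(1.0 / sum(re_weights))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return pooled, pooled - z * se, pooled + z * se, tau2


# Hypothetical inputs for illustration only (NOT the trials in table 1).
actual = [0.25, 0.30, 0.19]        # observed control arm mortality
anticipated = [0.40, 0.35, 0.28]   # mortality assumed at the design stage
n_control = [400, 650, 800]        # control arm sizes

diffs = [a - e for a, e in zip(actual, anticipated)]
# Anticipated rate treated as a constant: variance is binomial in the actual rate.
variances = [a * (1 - a) / n for a, n in zip(actual, n_control)]

pooled, lower, upper, tau2 = random_effects_pool(diffs, variances)
print(f"pooled difference {pooled:.3f} (95% CI {lower:.3f} to {upper:.3f}), tau^2 {tau2:.5f}")
```

A negative pooled difference in this sketch corresponds to actual control arm mortality being lower than anticipated.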
Patient and public involvement
Patients and public were not involved in the creation of this manuscript as it examines previously published study data.
Trials included in the final analysis
An initial MEDLINE search identified 251 articles matching the initial search criteria. Of these, 236 trials were excluded, as they did not meet the prespecified inclusion criteria; this resulted in 16 trials (figure 1).
Three trials were subsequently removed from the final analysis despite meeting the initial inclusion criteria. These were VISEP,5 APROCCHSS6 and ART123.7 VISEP and APROCCHSS were excluded as one trial was stopped early for safety reasons and in the other the investigational drug (drotrecogin alfa (activated)) was withdrawn during the period of the trial. The resulting publications from these two studies did not provide the data necessary to explore our research questions. ART123 was a phase IIb trial, which has now been restarted as a phase III trial (ClinicalTrials.gov NCT01598831); we were unable to acquire sufficient data to explore our research questions from the original publication or following correspondence with the primary author. The 13 trials included in the final analysis are summarised in table 1. They form the basis for our exploration of control arm mortality and effect size estimate.
The primary end point of mortality was assessed at day 28 in seven trials, day 60 in one trial, day 90 in four trials and as in-hospital mortality in one trial. Sample size projections stated in statistical methods correlated well with the numbers of patients analysed in the final published intention-to-treat analysis groups. In ProCESS,8 the sample size was recalculated after the first interim analysis, reducing the required sample size from 1950 to a total of 1341 patients at the end of enrolment. This recalculation preserved power at the same absolute effect size, and the investigators adjusted their analysis plan for the expenditure of power due to an interim analysis. The ALBIOS9 trial size increased from an initial sample size of 1350 patients to a total of 1818 patients. This occurred after the second interim analysis by the Data Safety and Monitoring Board according to an a priori agreement with the investigators (L Gattinoni personal communication).
Anticipated control arm mortality and effect size
While all of the trials included patients with severe sepsis and septic shock, there remained considerable heterogeneity in detailed entry criteria. The trials and key characteristics are summarised in table 1. Figures 2 and 3 demonstrate the difference in anticipated and actual control arm mortality rate for the trials included in our study. Overall, there was a tendency to substantially overestimate the control arm mortality, with strong evidence that the actual control arm mortality is lower than the anticipated control arm mortality (actual minus anticipated, −9.8 percentage points; 95% CI −14.7 to −5.0; p<0.001). In addition, there was a tendency to overestimate the treatment effect, with very strong evidence that the actual effect size is smaller than the anticipated effect size (actual minus anticipated, −7.4 percentage points; 95% CI −9.0 to −5.8; p<0.0001), summarised in figure 4.
Taken overall, in these trials there has been a tendency to overestimate the anticipated rate of the primary outcome and overestimate the effect size of the treatment being investigated. This equates to a tendency for the trials to have an inadequate sample size resulting in less trial power than planned and increasing the risk of a type 2 error. The problem is demonstrated by the wide CIs around the estimates of mortality and effect size in table 1.
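The loss of power described above can be illustrated with a short calculation. This is a minimal sketch using the normal approximation for comparing two proportions with equal arm sizes. The function name and all of the mortality figures and sample sizes below are hypothetical and are not taken from any of the included trials.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Uses the unpooled normal approximation and ignores the negligible
    probability of rejecting in the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = sqrt((p_control * (1.0 - p_control) + p_treat * (1.0 - p_treat)) / n_per_arm)
    return NormalDist().cdf(abs(p_control - p_treat) / se - z_alpha)


# Hypothetical design assumptions: 40% control mortality, 7.5-point absolute reduction.
n_per_arm = 641  # roughly what those assumptions imply for 80% power
planned = power_two_proportions(0.40, 0.325, n_per_arm)

# Hypothetical reality: control mortality is 30% and the true reduction is 3 points.
achieved = power_two_proportions(0.30, 0.27, n_per_arm)

print(f"planned power {planned:.2f}, achieved power {achieved:.2f}")
```

With these illustrative numbers the achieved power falls well below the planned level, which is the mechanism by which an overestimated event rate and effect size raise the risk of a type 2 error.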
We set out to explore whether or not there was a consistent pattern of overestimating control group mortality rates and using overambitious estimates of treatment effect in the design of sepsis trials, and we chose to examine this using effectiveness trials in severe sepsis and septic shock. Our results suggest that both of these ideas are correct.
Despite the overall improvements observed in mortality,1 the outlook for patients presenting with septic shock remains frustratingly uncertain. At best, an 18.8% control arm mortality rate (ARISE) still equates to 1 in 5 of this patient cohort dying as a result of their illness.
The consistent overestimation of control arm event rate (or lower-than-anticipated actual control arm event rate) may have systematically led to undersized trials from the outset; that is, had the actual control arm mortality been known in advance, the trials would have been designed to include more patients. This has likely increased the risk of type 2 errors in many sepsis trials, which could have led to potentially useful treatments being disregarded.
An anticipated control arm event rate will remain, with traditional prospective trial design, an estimate. Power calculations should take into account the uncertainty in this input factor. The source data on which the estimate will be based are always somewhat historical; investigators will attempt to match the proposed study population with a similar group of previous patients. The ‘goal posts’ are, however, moving. We know patients are increasingly doing better and, while we have not been able to ascertain the exact factors that are driving these improvements, they have the potential to influence our ability to evaluate novel treatments. Even the ARISE inception study,10 which collected data in 2006 and 2007 (although for only 3 months) and generated anticipated control arm data for the early goal-directed therapy (EGDT) trial ARISE11 (enrolling in 2008 to 2014, n=1600), overestimated control arm mortality by roughly 9 percentage points (28% vs 18.8%). There seem to be a number of plausible explanations. There are potential influences from the quality and assimilation of source data, where investigators are not anticipating secular improvements in outcome or are using data from previous control arms and historical cohorts that are not sufficiently current. Furthermore, investigators may, in some cases, be bound, consciously or subconsciously, by economic constraints that limit the largest sample size that can feasibly be funded. The total number of subjects required to statistically evaluate an intervention, in the context of contemporary severe sepsis and septic shock mortality, is formidable. If we consider the sample size required in a trial that aims to demonstrate a relative effect size of 10% (with fixed variables of alpha (0.05) and power (70%, 80% or 90%)), then a fall in the population’s control arm mortality has a profound effect (figure 5).
This relative effect size applied to a contemporary mortality figure in septic shock of 18.4% would require a trial of 13 400 patients for the intervention arm’s mortality rate to fall to 16.56% (a 10% reduction relative to the control arm).
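The arithmetic behind this figure can be sketched as follows, using the usual normal-approximation sample size formula for two proportions with equal allocation and a two-sided alpha of 0.05. The text does not state which power level underlies the 13 400 figure; 80% is assumed here because it gives a total in that region. The function name and the comparison mortality rates of 40% and 30% are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, relative_reduction, alpha=0.05, power=0.80):
    """Patients per arm needed to detect a relative reduction in event rate.

    Normal approximation with unpooled variances, equal allocation,
    two-sided significance level alpha.
    """
    p_treat = p_control * (1.0 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_treat * (1.0 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treat) ** 2)


# Total trial size for a 10% relative reduction at several control arm
# mortality rates (40% and 30% are illustrative; 18.4% is the figure in the text).
for p_control in (0.40, 0.30, 0.184):
    totals = {pw: 2 * n_per_arm(p_control, 0.10, power=pw) for pw in (0.70, 0.80, 0.90)}
    summary = ", ".join(f"{int(pw * 100)}% power: {n}" for pw, n in totals.items())
    print(f"control mortality {p_control:.1%} -> total patients at {summary}")
```

The same relative effect requires several times as many patients at 18.4% control arm mortality as at 40%, which is the relationship summarised by the power curve in figure 5.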
Effect size is a variable that is difficult to assess. We have used a post hoc analysis comparing the effect sizes that investigators anticipated an intervention would have with the actual event rate in the study treatment arms. Investigators will use data from prior phase II (efficacy) studies, often with subtly differing enrolment and exclusion criteria from the subsequent phase III (effectiveness) trials. Phase II trials are conducted under the best circumstances that can be achieved and with a per protocol analysis of the primary endpoint. Phase III effectiveness trials appropriately test interventions in the ‘real world’ and report using the intention-to-treat principle. This is a sterner and more complete test of an intervention. In addition, investigators estimate what they consider will be a clinically relevant effect.
The potential for patients enrolled in trials to fare better than those in a similar non-trial population is well recognised; teasing apart this ‘Hawthorne effect’ is difficult but may be important. The greater level of monitoring and general clinical surveillance may deliver subtle benefit. Trials of process of care may lose separation between groups, and thus erode measurement of treatment effect, as staff subtly alter their behaviour in response to what they see in the active or novel treatment arm. For trials aimed at delivering overall quality improvement, often with ‘cluster’ randomisation,12 this is arguably not so important. For studies aimed at determining the absolute impact of particular interventions, including an ethnographic element to describe changes in care and some metrics of the actual ‘dose’ of the element of care under test delivered may be wise.13
In failing to demonstrate relatively large effects, these trials have not excluded small beneficial effects; however, the confident demonstration of smaller effects requires much larger trials, which may not be fundable, or the effects sufficiently persuasive to change practice. Further complexities include treatment heterogeneity across disease severity, and heterogeneity of treatment risk, which do not necessarily align; a hypothetical smaller beneficial effect may not be matched with a smaller side-effect risk.14
Using conservative estimates of event rate and effect size seems an obvious solution; however, in trial terms critical illness is ‘noisy’, and inflating numbers to overcome such noise in conventional trials is questionable. To quote David Sackett,15 “Reducing confidence intervals by increasing the size of an RCT should be your last resort”.
In the absence of consensus opinion on an appropriate effect size that would sway clinicians to employ a new entity in clinical practice, where should we draw the line? Have we reached a point where traditional randomised control trial methodology no longer provides us with informative and practice changing results? Should clinicians leading these trials, in collaboration with funders, shift towards composite endpoints (eg, long-term quality-of-life indices) instead of powering trials to a mortality benefit and would these metrics be sufficiently persuasive to change practice? Perhaps these difficulties in critical care research need to be addressed by exploring adaptive trial designs.16 17 Based on a series of a priori determined decision rules and rolling interim analyses, such trials evolve and redesign themselves as they proceed. The challenge here may be communicating the theoretical benefits of such designs to funders and review boards, and not least learning how to explain these to potential study subjects—our patients—and their families.
Contributors JLCW: substantial contribution towards conception, design, analysis and interpretation of the data. AJM: substantial contribution towards analysis and interpretation of data. ACG: substantial contribution in drafting work and revising it critically for important intellectual content. SJB: substantial contribution towards conception, design, analysis and interpretation of the data. All authors approved the final version. We agreed to be accountable for all aspects of the work and ensure accuracy and integrity.
Funding This study was supported by the National Institute for Health Research (NIHR) Comprehensive Biomedical Research Centre based at Imperial College Healthcare NHS Trust and Imperial College London. ACG is funded by a National Institute for Health Research (NIHR) Research Professorship award (RP-2015-06-018).
Disclaimer The views expressed are those of the authors and not necessarily those of the NIHR, the NHS or the UK Department of Health.
Competing interests None declared.
Patient consent Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement The data in this paper are available from the published peer-reviewed literature.