The use of benchmarks to assess the performance of implants such as those used in arthroplasty surgery is a widespread practice. It provides surgeons, patients and regulatory authorities with the reassurance that implants used are safe and effective. However, it is not currently clear how or how many implants should be statistically compared with a benchmark to assess whether or not that implant is superior, equivalent, non-inferior or inferior to the performance benchmark of interest.

We aim to describe the methods and sample size required to conduct a one-sample non-inferiority study of a medical device for the purposes of benchmarking.

Simulation study.

Simulation study of a national register of medical devices.

We simulated data, with and without a non-informative competing risk, to represent an arthroplasty population and describe three methods of analysis (z-test, 1−Kaplan-Meier and competing risks) commonly used in surgical research.

We evaluate the performance of each method using power, bias, root-mean-square error, coverage and CI width.

1−Kaplan-Meier provides an unbiased estimate of implant net failure, which can be used to assess if a surgical device is non-inferior to an external benchmark. Small non-inferiority margins require significantly more individuals to be at risk compared with current benchmarking standards.

A non-inferiority testing paradigm provides a useful framework for determining if an implant meets the required performance defined by an external benchmark. Contemporary benchmarking standards have limited power to detect non-inferiority, and substantially larger sample sizes, in excess of 3200 procedures, are required to achieve a power greater than 60%. When benchmarking implant performance, net failure estimated using 1−KM is preferable to crude failure estimated by competing risk models.

We propose a one-sample non-inferiority design for assessing the failure rate of medical devices against an external benchmark, using arthroplasty as an exemplar.

Using a simulation study, we demonstrate that device failure rate estimated using Kaplan-Meier is appropriate and provides unbiased estimates of implant failure in the context of benchmarking arthroplasty.

The number of individuals required at the beginning of a study to obtain nominal power at five different non-inferiority margins is described.

The performance of three methods under two data generating processes is described in terms of bias, root-mean-square error, coverage and CI width.

We assume that simple analyses using 1−Kaplan-Meier derived from population registry studies can provide causal interpretations.

Arthroplasty prostheses are not currently required to undergo randomised clinical trials prior to their introduction into routine clinical practice, and postmarket surveillance is used to determine their efficacy. Without a head-to-head comparison against existing products, devices can be compared with external references or benchmarks. Product benchmarking in medical devices is common and is intended to help surgeons and healthcare administrators select safe and effective medical devices, yet there is no consensus on how this should be performed.

In arthroplasty, the process of benchmarking prosthetic implants is extensive. However, there is considerable debate with regards to the standards and criteria that should be adopted.

Benchmarking bodies around the world were quick to respond and the Orthopaedic Device Evaluation Panel (UK),

Despite the recommendations of a maximum failure rate of 5% at 10 years,

While benchmarking in arthroplasty is used as an exemplar, similar problems and arguments can be made in any medical discipline that utilises medical devices or implants, for example, cardiac or cosmetic surgery.

In a simple setting of a clinical study with no loss to follow-up and no censoring, if implant survival is calculated and it is less than or equal to the benchmark, do we conclude that the implant has reached the benchmark? That is, _{1}

The temptation is to simply rearrange

Simple hypothesis testing framework and illustration of type I and II errors, power and true negatives.

While there has been some debate with regards to the use of formal sample size calculations, which are determined on the basis of power and type I errors or by prespecifying the desired width of the CI,

However, in the context of a benchmarking system, the definition of superiority is restrictive. For example, if the true failure rate of the implant of interest is exactly equal to the benchmark

Despite the linguistic similarities between superiority and non-inferiority studies, the analysis and interpretation are different. A non-inferiority framework requires the interested parties to place limits around what could be described as non-inferior, that is, a non-inferiority margin (

Schematic representation of inferiority, non-inferiority and superiority studies.
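The categories in the schematic can be expressed as a decision rule on the CI of the failure rate. The sketch below is illustrative only: the function name, the returned keys and the use of strict inequalities are assumptions, and lower failure is taken to be better. It is applied to a device with an estimated failure of 5.5% (95% CI 5.25% to 5.75%), a 5% benchmark and a 1% non-inferiority margin.

```python
def classify_against_benchmark(ci_lower, ci_upper, benchmark, margin):
    """Compare a failure-rate CI with a benchmark (lower failure is better).

    Hypothetical helper: all arguments are proportions, e.g. 0.05 for 5%.
    """
    limit = benchmark + margin  # worst failure rate still tolerated
    return {
        # Whole CI below the benchmark: statistically better than benchmark.
        "superior": ci_upper < benchmark,
        # Whole CI below benchmark + margin: non-inferior.
        "non_inferior": ci_upper < limit,
        # Whole CI above the benchmark: statistically worse than benchmark.
        "statistically_worse": ci_lower > benchmark,
        # Whole CI above benchmark + margin: inferior beyond the margin.
        "inferior": ci_lower > limit,
    }


result = classify_against_benchmark(0.0525, 0.0575, benchmark=0.05, margin=0.01)
print(result)
```

Under these assumptions the device is flagged as both non-inferior and statistically worse than the benchmark, which is the distinction between clinical non-inferiority and statistical inferiority.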

If the failure rate was 5.5% and the CI ranged between 5.25% and 5.75%, we would still conclude that this device was clinically non-inferior, despite being statistically inferior, with a non-inferiority margin of 1%. The methods by which one should choose an appropriate non-inferiority margin are inherently subjective, and choosing too large a margin risks exposing patients to inferior and less efficacious products. This is opposed to a margin that is too small, which in turn limits products of similar performance being introduced to the market; both the Food and Drug Administration

The aim of this study is to investigate the sample size required at the beginning of a study to demonstrate superiority and non-inferiority of implant failure compared with an external benchmark level of performance, that is, a one-sample non-inferiority study design in the presence of censoring, and the consequences of using three common methods of estimating failure in a simulation study.

The simulation study will be described using an Aims, Data generating process, Method of analysis, Estimands and Performance structure.

The aim of this study is to describe the sample size required to identify if a prosthetic implant has a failure rate non-inferior to an external benchmark using simple analytical solutions. In addition, we will use a simulation study to determine the power to detect superiority and non-inferiority with different sample sizes, different estimands, in the presence of a non-informative CR (mortality) and when the true implant failure rate is the same as the benchmark.

Using analytical solutions of a z-test in a non-inferiority setting (see

Two different data generating mechanisms with varying sample sizes (n=100, 200, 400, 800, 1600, 3200, 6400) were explored: (DGP1) implant failure with no censoring and (DGP2) implant failure with censoring for mortality. Implant failure and mortality data were simulated independently (non-informative censoring) from a parametric survival distribution (2-parameter Weibull) (failure:

We investigated the first DGP (no CRs) using the z-test (see

A failure function,

The second DGP was similarly investigated using three approaches: (1) a z-test, where the proportion of failures was calculated excluding those who died prior to 10 years; (2) a failure function,

All analyses were conducted in Stata (Stata Statistical Software: Release 14.1).
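Although the study itself used Stata, the two data generating processes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the Weibull shape parameters and the 30% 10-year mortality are hypothetical values chosen for the example, and only the 5% 10-year failure probability (the benchmark) and the independence of failure and mortality follow from the text.

```python
import math
import random

HORIZON = 10.0  # years of follow-up


def weibull_scale(prob_by_horizon, shape, horizon=HORIZON):
    """Scale parameter such that P(T <= horizon) equals prob_by_horizon."""
    return horizon / (-math.log(1.0 - prob_by_horizon)) ** (1.0 / shape)


def simulate_cohort(n, with_mortality, rng,
                    fail_prob=0.05, fail_shape=1.0,     # shape is hypothetical
                    death_prob=0.30, death_shape=1.5):  # both hypothetical
    """Return a list of (time, status): 0 censored, 1 failure, 2 death."""
    fail_scale = weibull_scale(fail_prob, fail_shape)
    death_scale = weibull_scale(death_prob, death_shape)
    cohort = []
    for _ in range(n):
        # Failure and death times are drawn independently of each other.
        t_fail = rng.weibullvariate(fail_scale, fail_shape)
        t_death = (rng.weibullvariate(death_scale, death_shape)
                   if with_mortality else math.inf)
        time = min(t_fail, t_death, HORIZON)
        if t_fail <= min(t_death, HORIZON):
            status = 1  # implant failure observed first
        elif t_death <= HORIZON:
            status = 2  # death before failure and before 10 years
        else:
            status = 0  # administratively censored at 10 years
        cohort.append((time, status))
    return cohort


rng = random.Random(2024)
dgp1 = simulate_cohort(800, with_mortality=False, rng=rng)  # DGP1
dgp2 = simulate_cohort(800, with_mortality=True, rng=rng)   # DGP2
print(sum(s == 1 for _, s in dgp1), sum(s == 2 for _, s in dgp2))
```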

The estimand of interest was cumulative implant failure at 10 years and its 95% CI in a hypothetical world where implant failure is the only possible outcome.
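To make the distinction between net and crude failure concrete, the sketch below computes three estimates on a small hand-made dataset: a simple proportion that excludes deaths, 1−Kaplan-Meier with deaths treated as censored, and a cumulative incidence function that treats death as a competing event. The dataset and function names are hypothetical, and the implementation is a bare-bones illustration rather than a substitute for the Stata routines used in the study.

```python
def failure_estimates(records, horizon=10.0):
    """records: list of (time, status); 0 censored, 1 failure, 2 death.

    Returns (simple proportion excluding deaths, 1-KM, CIF) at the horizon.
    """
    events = sorted(r for r in records if r[0] <= horizon)
    at_risk = len(records)
    km_surv = 1.0         # Kaplan-Meier survival, death treated as censoring
    event_free = 1.0      # probability of neither failure nor death so far
    cif = 0.0             # cumulative incidence of failure (crude failure)
    i = 0
    while i < len(events):
        t = events[i][0]
        d_fail = d_death = d_cens = 0
        while i < len(events) and events[i][0] == t:  # group tied times
            status = events[i][1]
            d_fail += status == 1
            d_death += status == 2
            d_cens += status == 0
            i += 1
        if at_risk > 0:
            cif += event_free * d_fail / at_risk
            km_surv *= 1.0 - d_fail / at_risk
            event_free *= 1.0 - (d_fail + d_death) / at_risk
        at_risk -= d_fail + d_death + d_cens
    n_fail = sum(1 for t, s in records if s == 1 and t <= horizon)
    n_death = sum(1 for t, s in records if s == 2 and t <= horizon)
    simple = n_fail / (len(records) - n_death)
    return simple, 1.0 - km_surv, cif


# Ten hypothetical patients: two failures, two deaths, six event free at 10 years.
toy = [(1, 1), (2, 2), (3, 1), (4, 2)] + [(10, 0)] * 6
simple, one_minus_km, cif = failure_estimates(toy)
print(round(simple, 4), round(one_minus_km, 4), round(cif, 4))
# → 0.25 0.2125 0.2
```

In this toy example the cumulative incidence function gives the lowest estimate and the simple proportion the highest, with 1−KM in between.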

The first DGP explicitly simulates data where implant failure is the only possible outcome, and the analytical methods purport to estimate net failure, whereas the second DGP simulates data where there are two potential outcomes (implant failure or death). The simple proportion of failures (

Performance was assessed in the superiority study setting using bias,

Using analytic sample size calculations, non-inferiority against a benchmark sample size was calculated and is presented in

The sample size required to detect non-inferiority of the failure proportion with non-inferiority margins (δ) at 1%, 2%, 3%, 4% and 5%.

With a non-inferiority margin of 3% failure, power of 50%, 203 individuals are required at the beginning of the study, whereas with 90% power, 555 individuals are required at the beginning of the study. There is an approximately log-linear association between sample size and power between 50% and 90%, at all non-inferiority margins. However, sample size rapidly increases as the non-inferiority margin reduces.
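The two figures quoted above for a 3% margin can be reproduced with a standard normal-approximation formula for a one-sample proportion, n = (z_{1−α} + z_{1−β})² p(1−p) / δ². The sketch below assumes a one-sided α of 0.025, a true failure rate equal to the 5% benchmark and no censoring. These settings are assumptions made for illustration and are not stated explicitly in this section; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def noninferiority_sample_size(benchmark, margin, power, alpha_one_sided=0.025):
    """Individuals needed for a one-sample non-inferiority z-test.

    Assumes the true failure proportion equals the benchmark and that there
    is no censoring (normal approximation to the binomial).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha_one_sided)
    z_beta = z.inv_cdf(power)
    variance = benchmark * (1.0 - benchmark)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / margin ** 2)


print(noninferiority_sample_size(0.05, 0.03, power=0.50))  # → 203
print(noninferiority_sample_size(0.05, 0.03, power=0.90))  # → 555
```

Because the margin enters the denominator squared, halving the margin roughly quadruples the required sample size, which is why small margins are so demanding.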

Results from the simulation study using a superiority design are presented in

Performance characteristics of five analyses: z-test and Kaplan-Meier when there is no competing risk (NCR), and a z-test, KM and cumulative incidence function (CIF) in the presence of competing risk (CR).

The method of analysis and DGP are indicated using five different coloured line styles. It is clear that the cumulative incidence function (CIF), which estimates crude failure, estimated in the presence of CR consistently underestimates net failure by 0.5%, whereas the z-test, an estimate of net failure, in the presence of CR overestimates failure rates. Notably, the accuracies (RMSE) in the estimates of all methods are similar, and as sample size increases, they reduce. However, CIF and z-test in the presence of CR are biased estimates of net failure in comparison to 1−KM, and RMSE does not tend to 0 with increasing sample sizes. Correspondingly, coverage of the 95% CI reduces as the sample size increases for both CIF and z-test. The CIF power to detect superiority erroneously increases as a consequence of a consistent difference in the estimand (crude vs net failure) and narrowing CIs. The width of the estimated CIs, across all methods, consistently decreases as sample size increases. Despite their homogeneity, the width of CI from CIF is approximately 1% larger than that of the 1−KM estimate in the presence of CR.

Simulation results using a z-test with no CRs were compared with analytic sample size calculations assuming no CRs (see online

Results from the simulation study using a non-inferiority paradigm are presented in

Power to detect non-inferiority at 1%, 2%, 3%, 4% and 5% below a 95% benchmark performance. The data generating process and method of analysis are presented in separate panels. The sample size is indicated on the horizontal axis. CIF, cumulative incidence function.

When no CR is present, with a 3% non-inferiority margin and sample sizes of n=200 or n=800, a z-test has 46% and 94% power to detect non-inferiority, respectively, and KM has 34% and 91% power to detect non-inferiority, respectively. When a non-informative CR is present, with a 3% non-inferiority margin and sample sizes of n=200 or n=800, a z-test has 22% and 44% power to detect non-inferiority, respectively; 1−KM has 26% and 86% power to detect non-inferiority, respectively; and the CIF has 48% and 99% power to detect non-inferiority, respectively.

Mean performance estimates from the simulation study are tabulated in the online

This study investigates how and how many individuals are required to demonstrate non-inferiority in the failure rate of a medical device (arthroplasty prosthesis) compared with an external benchmark in the presence or absence of a CR.

Net failure estimated using 1−KM provides unbiased estimates in the presence or absence of a non-informative CR, whereas a simple z-test or CIF overestimate and underestimate failure in the presence of a non-informative CR, respectively. While there is reasonable agreement between analytical and simulation estimates of the sample size required to conduct a non-inferiority study, the failure to incorporate an adjustment for censoring due to mortality leads to erroneous estimates of power.

Using 1−KM to estimate failure in the presence of a non-informative CR (estimating net failure), a sample size of n=1600 will ensure coverage at nominal levels and give a CI width of approximately 2.3% and an RMSE of 0.46%, but will have 35%, 88%, 97%, 97% and 97% power to demonstrate non-inferiority at margins of 1%, 2%, 3%, 4% and 5%, respectively.

This study has a number of strengths. We have shown the well-known differences between net and crude failure. Despite the persistent suggestion in arthroplasty research that 1−KM overestimates implant failure, it is clear that it is an unbiased estimate of net failure but a biased estimate of crude failure, a patient’s personal chance of revision surgery. CIF (crude failure) in a CR model unsurprisingly underestimates net failure, assuming independence between CR and implant failure. We have demonstrated the number of individuals required at the onset of a benchmarking study for a variety of non-inferiority margins, under two different DGPs, and the width of the estimated CI. Despite the exemplar of arthroplasty, similar methods would apply to any discipline interested in conducting a formal benchmarking process.

However, this study has a number of important limitations. (1) We simulated data from a Weibull model with an uncorrelated CR, and while it provides a convenient and sensible method of generating data in an arthroplasty example, more complex models with correlated CRs may be appropriate in other areas. (2) We assume that time to revision surgery is a reasonable outcome of interest. However, alternative outcomes, for example, Patient Reported Outcome Measures, could similarly be incorporated into a non-inferiority benchmarking design. (3) The threshold for revision is assumed to be homogeneous between different surgeons in this simulation. (4) We have not considered the effect of analysing data from multiple sources or combining pre-existing data. However, methods of meta-analysis in non-inferiority settings are well documented.

Despite some authors suggesting that a 1−KM overestimates implant survival in the presence of CR,

This balancing act is very well known and demonstrated by the seminal work of Gooley

The broad success of arthroplasty, as well as currently used benchmark failure rate of 5% at 10 years, necessitates the use of small non-inferiority margins. A small non-inferiority margin of 1% in absolute risk represents a 20% increase in relative risk of failure compared with a benchmark of 5%. The minimum numbers of individuals required to demonstrate non-inferiority of a device where its true failure rate is equal to the benchmark is unsurprisingly large. Modest sample sizes (n=1600) have limited power (35%) to detect non-inferiority, and only when sample sizes become large (n=6400) does power increase substantially (90%).
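As a rough analytic check on why a 1% margin is so demanding, the sketch below computes the relative increase implied by the margin and an approximate power for a one-sample z-test. It assumes a normal approximation, a one-sided α of 0.025, a true failure rate equal to the 5% benchmark and no censoring. It is not the simulation used in this study, and because it ignores censoring due to mortality it should be expected to return somewhat higher power than the simulated 1−KM values quoted above. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_noninferiority_power(n, benchmark=0.05, margin=0.01,
                                alpha_one_sided=0.025):
    """Approximate power of a one-sample non-inferiority z-test.

    Assumes no censoring and a true failure proportion equal to the benchmark.
    """
    z = NormalDist()
    se = math.sqrt(benchmark * (1.0 - benchmark) / n)
    return z.cdf(margin / se - z.inv_cdf(1.0 - alpha_one_sided))


# A 1% absolute margin on a 5% benchmark is a 20% relative increase.
print(round(0.01 / 0.05, 2))

for n in (1600, 6400):
    print(n, round(approx_noninferiority_power(n), 2))
```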

Despite differences in the performance of estimators and interpretation of estimands, the choice of sample size at the beginning of a study should be based on the desire to obtain sufficiently precise estimates (small RMSE and narrow CIs). This mitigates type II errors for a given non-inferiority margin that is tolerable to the public, surgeons and regulators. The choice of sample size that should remain at risk at the end of the benchmarking period is somewhat more difficult to determine. Two possible reasons to ensure the numbers at risk at the end of the period are large include (1) maintaining the performance and minimising the width of CIs and (2) ensuring that sufficient numbers of patients experience 10 years of the risk of revision for the estimates and the benchmarking process to be credible.

From a conservative perspective, it is simple to request that the same number of patients for a given level of power, ^{th}patient until the benchmarking assessment is made, a process that is similar to preregistration of randomised trials. This will implicitly allow for censoring due to mortality, without prespecifying how much censoring will occur, and assumes there is no loss to follow-up (which is often assumed in arthroplasty registers).

The choice of non-inferiority margin, initial sample size or desired width of the CI in a benchmarking study are all subjective decisions and can only be chosen by balancing the risk of incorrectly awarding a benchmarking standard to an implant with a failure rate beyond the non-inferiority margin versus benchmark inflation where all devices receive benchmarks and the entire process lacks credibility. However, this study clearly demonstrates how 1−KM provides unbiased estimates of net implant failure, in a conservative scenario when the failure rate of an implant being tested is equal to the benchmark and has a CR that is uncorrelated to the event of interest.

AS conceived and designed the study. AS, MC, AJ, MW and AB interpreted the data, revised the manuscript and approved the final draft.

AS was supported by an MRC strategic skills fellowship: MRC Fellowship MR/L01226X/1.

None declared.

Not commissioned; externally peer reviewed.

No data are available to be shared.