Assessing the non-inferiority of prosthesis constructs used in total and unicondylar knee replacements using data from the National Joint Registry of England, Wales, Northern Ireland and the Isle of Man: a benchmarking study

Objectives To investigate the relative performance of knee replacement constructs compared with the best performing construct and illustrate the substantial variability in performance.
Design A non-inferiority study.
Setting England and Wales.
Participants All primary total and unicondylar knee replacements performed and registered in the National Joint Registry between 1 April 2003 and 31 December 2016.
Main outcome measures Kaplan-Meier failure function for knee replacement constructs. Failure difference between best performing construct (the benchmark) and other constructs.
Methods Using a non-inferiority analysis, the performance of knee replacement constructs by brand was compared with the best performing construct. Construct failure was estimated using the 1-Kaplan Meier method, that is, an estimate of net failure. The difference in failure between the contemporary benchmark construct and all other constructs was tested.
Results Of the 449 different knee replacement constructs used, only 27 had ≥500 procedures at risk at 10 years postprimary, 18 of which were classified as inferior to the benchmark by at least 20% relative risk of failure. Two of these 18 were unicondylar constructs that were inferior by at least 100% relative risk. In men, aged 55–75 years, 12 of 27 (44%) constructs were inferior by at least 20% to the benchmark at 7 years postprimary. In women, aged 55–75 years, 8 of 32 (25%) constructs were inferior at 7 years postprimary. Very few constructs were classified as non-inferior to the contemporary benchmark.
Conclusions There are few knee replacement constructs that can be shown to be non-inferior to a contemporary benchmark. Unicondylar knee constructs have, almost universally, at least 100% worse revision outcomes compared with the best performing total knee replacement. These results will help to inform patients, clinicians and commissioners when considering knee replacement surgery.



Strengths and limitations of this study
► Data presented from the largest joint registry in the world.
► A novel and systematic comparison of the performance of knee constructs to a contemporary benchmark knee construct.
► Unambiguous presentation of data will facilitate the consenting process for patients and allow surgeons and policy makers to be more informed with respect to success and failure of the different construct options available in knee replacement.
► Residual and unmeasured confounding factors are likely to be present.
► Potential for selection bias whereby certain constructs may be implanted for particular indications and in particular groups with different risks.

Introduction
Over 90 000 knee replacements are performed annually in England and Wales, and there is a bewildering choice of total and unicondylar knee replacement (UKR) options available from which clinicians and patients can choose. When patients are considering a knee replacement, it is understandable that many assume that the different constructs function equally. However, all constructs are not equal as evidenced by variation in revision rates between brands and knee construct types. 1 The National Joint Registry for England, Wales, Northern Ireland and the Isle of Man (NJR) is the largest arthroplasty database in the world and publishes the unadjusted cumulative failure rates of the most commonly used constructs in knee replacement surgery. This is a useful format for measuring absolute failure but does not allow easy head-to-head comparison of constructs. Benchmarking bodies such as the Orthopaedic Device Evaluation Panel (ODEP) in the UK, 2 NOV in the Netherlands 3 and the Australian superior clinical performance programme 4 compare construct performance against externally set benchmarks but do not perform head-to-head comparison. Although it is reassuring that a certain standard has been met, this simple dichotomisation does not facilitate head-to-head comparison and the sample sizes used are arbitrarily set. Sayers et al 5 recently proposed a method of comparison for joint replacement constructs using a non-inferiority design against an external benchmark. However, the primary limitation of this method remains the arbitrary requirement for an externally specified benchmark.
In a non-inferiority clinical trial, 6 investigating failure, two treatments (comparator and reference) can be directly compared to ensure that the comparator treatment is within a clinically acceptable range (non-inferiority margin) of performance at a specified point in time. 7 8 Standard methods for assessing non-inferiority could be applied in an orthopaedic benchmarking setting, assuming an appropriate comparator, non-inferiority margin and time of interest can be identified. This is a method we have applied in a medical device setting, namely, hip replacements using NJR data, in which we assessed the non-inferiority of hip replacement constructs as compared with a benchmark construct. 9
Choosing an appropriate outcome and contemporary reference is difficult. There is no single outcome, no gold standard or evidence from randomised trials that suggests any construct outperforms all others; therefore, the choice is more heuristic. Patients would like to receive the best available care and clinicians would like to provide the best possible care, or at least care that is non-inferior to the best. A binary, unambiguous, endpoint such as revision surgery represents one potential outcome of interest.
Therefore, the natural choice of reference is the construct with the lowest failure rate. However, in order to protect against chance, the construct should be used in large enough numbers to mitigate sampling variability. The failure rate of a construct is influenced by both age and gender; therefore, the choice of reference should reflect this specificity. The selection of an appropriate time and non-inferiority margin to assess construct performance is more subjective. For example, construct survivorship in the long term is less relevant to older patients with shorter life expectancy, where improved quality of life, reduction in pain or improved physical functioning may be more relevant.
The aim of this study is to investigate the relative performance of knee replacement constructs as compared with the best performing contemporary construct, the benchmark, using a non-inferiority study design and to illustrate the substantial variability in performance of widely used constructs. This research focuses on total knee replacements (TKRs) and unicondylar knee replacements (UKRs) as these are commonly used and therefore there are sufficient data to make robust comparisons. TKRs and UKRs are examined against non-inferiority margins of 20% relative risk and 100% relative risk at 3, 5, 7 and 10 years following surgery. This is predicated on our belief that patients would at least want reassurance that the construct they are to receive is not estimated to be 100% worse than the best alternatives when used in patients of the same gender and age.

Patients and data sources
We identified all patients with a primary TKR or UKR registered in the NJR between 1 April 2003 and 31 December 2016. All patients were consented to be included in the NJR as part of the standard NJR process.
Procedures were excluded if the patient age or gender was missing, or the National Health Service number was untraceable and therefore mortality unknown. Procedures where the constraint or fixation type was unknown were excluded from the analysis. Patellofemoral replacements were also removed owing to the low number in the sample.
Brands of constructs are usually subdivided by fixation, mobility of the bearing and degree of constraint. NJR Annual Report data have shown that these characteristics influence revision rates, and thus, we treated each subdivision as a separate construct.

Patient involvement
Patient representatives sit on the committee structure of the NJR. The research priorities of the NJR are identified by this committee structure and approved by the patient representatives. Patients were not involved in the setting of the research question or the outcome measures, nor were they involved in designing or implementing this work or interpretation of the results. We are unable to disseminate results of this study directly to study participants due to the anonymous nature of the data. We plan to disseminate our findings to the NJR, via their communications team, to consultations relevant to the provision of joint replacement and to the general population through the local and national press.

Statistical methods
Using a non-inferiority analysis, the performance of knee constructs was compared with an internally identified benchmark group. Construct failure was estimated using the 1-Kaplan Meier method, that is, an estimate of net failure, which is appropriate when the risk of revision is considered. 10 Failure was defined using the first linked surgical revision, where revision was defined as any addition, removal or modification of an implant to a joint that had previously undergone a TKR or UKR. Patients were censored at death or administratively censored on 31 December 2016. The differences in stratum specific failure probabilities compared with the benchmark were calculated at 3, 5, 7 and 10 years for all constructs, stratified by gender, and stratified by gender and age group (<55 years, 55-75 years and >75 years).
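The 1-Kaplan Meier estimate and its Greenwood standard error can be illustrated with a minimal Python sketch. This is not the authors' code (the analyses were run in Stata); the function name `km_failure` and its arguments are hypothetical, and the sketch assumes follow-up times in years with an event flag of 1 for revision and 0 for censoring (death or end of follow-up).

```python
import math

def km_failure(times, events, horizon):
    """Return (failure, se, n_at_risk) at `horizon`.

    failure   : 1 - Kaplan-Meier survival at the horizon (net failure)
    se        : Greenwood standard error of that estimate
    n_at_risk : procedures with follow-up beyond the horizon
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0          # running Kaplan-Meier survival
    greenwood = 0.0     # running sum of d / (n * (n - d))
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        revised = censored = 0
        # Group all records sharing this time; revisions are counted
        # against everyone still at risk just before t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                revised += 1
            else:
                censored += 1
            i += 1
        if revised:
            surv *= 1 - revised / at_risk
            if at_risk > revised:
                greenwood += revised / (at_risk * (at_risk - revised))
        at_risk -= revised + censored
    se = surv * math.sqrt(greenwood)
    return 1 - surv, se, at_risk
```

Because failure is one minus survival, the standard error of the failure estimate equals that of the survival estimate.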
The difference and 95% CI of the difference between the comparator construct and the benchmark construct were estimated at the specified time points. We used a 95% CI, as is the convention in the majority of medical research. The SE of the difference was constructed using a pooled estimate of the Greenwood SE, 11 and a z-test comparing the difference between the benchmark and test construct was then constructed.
The stratum specific contemporaneous benchmark construct was selected as the knee construct (TKR or UKR) with the lowest failure rate with at least 1000 patients at risk at the time point of interest. The choice of 1000 procedures of the same construct was based on simulation work by Sayers et al, which demonstrated that 1000 procedures at risk will give rise to a CI width of approximately 3% (±1.5%). 5 We believe this is the minimal level of accuracy to be considered a suitable reference standard.
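The test on the difference in failure described above can be written out in a standard form. What follows is a sketch under assumptions rather than the authors' exact expression: it assumes that the pooled SE combines the two Greenwood variances, that the margin is a fixed multiple of the benchmark failure, and the symbols $\hat{F}_c$, $\hat{F}_b$, $\delta$ and $m$ are introduced here purely for illustration.

```latex
\begin{aligned}
\hat{\Delta}(t) &= \hat{F}_c(t) - \hat{F}_b(t)
  && \text{difference in failure, comparator minus benchmark} \\
\widehat{SE}_{\Delta}(t) &= \sqrt{\widehat{SE}_c(t)^2 + \widehat{SE}_b(t)^2}
  && \text{combined Greenwood SEs (assumed form of pooling)} \\
\delta(t) &= m\,\hat{F}_b(t), \quad m \in \{0.2,\ 1.0\}
  && \text{margin as a relative increase over the benchmark} \\
z &= \frac{\hat{\Delta}(t) - \delta(t)}{\widehat{SE}_{\Delta}(t)}
  && \text{z statistic against the margin} \\
95\%\ \text{CI} &= \hat{\Delta}(t) \pm 1.96\,\widehat{SE}_{\Delta}(t)
\end{aligned}
```

Setting $\delta(t) = 0$ recovers a plain z-test of no difference. Treating $\delta(t)$ as fixed ignores the sampling variability of the benchmark estimate inside the margin, which is a simplification of this sketch.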
Two non-inferiority margins were chosen to illustrate how sensitive the classification is to this choice. The first margin was conservatively set at a 20% increase in relative risk of failure compared with the benchmark, in line with clinical trials using this methodology, although towards the upper end. 12 The second was a 100% increase in relative risk, that is, a doubling in cumulative probability of failure, as this is an easily interpretable outcome.
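To make the two margins concrete, a short Python sketch converts a relative margin into an absolute difference in failure. The helper `margin_thresholds` is hypothetical, and the sketch assumes that a relative margin is applied multiplicatively to the benchmark failure probability.

```python
def margin_thresholds(benchmark_failure, relative_margins=(0.20, 1.00)):
    """Map each relative margin to a pair:
    (absolute margin on the failure difference,
     comparator failure level that sits exactly on the margin)."""
    return {
        m: (benchmark_failure * m, benchmark_failure * (1 + m))
        for m in relative_margins
    }

# Worked example using the 10-year benchmark failure of 2.37% from the results.
thresholds = margin_thresholds(0.0237)
```

For a benchmark failure of 2.37%, the 20% margin corresponds to an absolute difference of 0.474 percentage points (a comparator failure of 2.844%), and the 100% margin to 2.37 percentage points (a comparator failure of 4.74%).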
Results are graphically reported for all comparator constructs with at least 500 patients still at risk at the beginning of the time point of interest. Results are also reported in a tabular format for all comparator constructs with at least 250 patients at risk at the beginning of the time point of interest (see online supplementary tables).
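The at-risk thresholds used for benchmark selection and for reporting can be sketched in Python as follows. The `Construct` record and both function names are hypothetical, and the construct data in the usage note are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Construct:
    name: str
    failure: float   # 1 - Kaplan-Meier estimate at the time point
    at_risk: int     # procedures still at risk at the time point

def select_benchmark(constructs, min_at_risk=1000):
    """Lowest-failure construct among those with enough procedures at risk.
    Returns None when no construct in the stratum meets the threshold."""
    eligible = [c for c in constructs if c.at_risk >= min_at_risk]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.failure)

def comparators(constructs, benchmark, min_at_risk=500):
    """Constructs other than the benchmark that meet a reporting threshold
    (500 for the figures, 250 for the supplementary tables)."""
    return [c for c in constructs
            if c is not benchmark and c.at_risk >= min_at_risk]
```

For example, with invented constructs A (failure 1.5%, 1910 at risk) and B (failure 1.1%, 800 at risk), A would be the benchmark because B, despite its lower failure, has fewer than 1000 procedures at risk; B would still appear as a comparator.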
Constructs were classified as non-inferior, inconclusive or inferior. If the upper confidence limit of the difference was less than or equal to the 20% non-inferiority margin, the construct was classed as non-inferior. If the lower confidence limit of the difference was greater than the non-inferiority margin at either 20% or 100%, the construct was classed as inferior at 20% or 100%, respectively. If the lower confidence limit was less than the non-inferiority margin and the upper confidence limit was greater than the non-inferiority margin, the evidence against the construct was described as inconclusive (see figure 1 for graphical representation of the classification). All analyses were carried out using Stata V.15.1.
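The classification rules above can be expressed as a small Python function. This is an illustrative sketch, not the authors' Stata code; the function name `classify` and its arguments are hypothetical, the confidence limits and margins are assumed to be on the same absolute scale (difference in failure probability), and a lower limit exactly equal to a margin is treated as inconclusive.

```python
def classify(lower, upper, margin_20, margin_100):
    """Classify a comparator from the 95% CI of its failure difference
    versus the benchmark.

    lower, upper : confidence limits of (comparator - benchmark) failure
    margin_20    : absolute margin for a 20% relative increase
    margin_100   : absolute margin for a 100% relative increase
    """
    if upper <= margin_20:
        return "non-inferior"
    if lower > margin_100:
        return "inferior by at least 100%"
    if lower > margin_20:
        return "inferior by at least 20%"
    # The interval straddles the 20% margin.
    return "inconclusive"
```

The 100% check comes before the 20% check so that a construct exceeding both margins is reported at the stricter level.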

Sensitivity analysis
We repeated all analyses using the best performing knee construct at 10 years with at least 1000 still at risk in the stratum of interest as the benchmark at the 3, 5 and 7 year time points.
[…] relevant subdivisions and sensitivity analyses, results will be described more broadly. Each stratification of age group, gender and time point of interest is provided in online supplementary material.
Constructs are described by brand, fixation, the degree of constraint for TKR and whether the bearing was fixed or mobile. Constraint types were either unconstrained (cruciate retaining) or posterior stabilised (posterior cruciate sacrificing). The vast majority of benchmark constructs were cemented. In each subdivision of our analyses, the construct that met our benchmark criteria was a TKR that was unconstrained with a fixed bearing. However, not all total knees that were unconstrained with fixed bearings performed well as there were 15 separate brands of this type that were found to be inferior to the benchmark by at least 20% relative risk in at least one subdivision analysis.

Non-inferiority: all procedures
The benchmark construct at 3 years was identified as the NexGen cemented, unconstrained TKR with a fixed bearing. There were 34 558 procedures remaining at risk at 3 years for this construct, and the failure rate was 1.10% (95% CI 1.01 to 1.20). There were 73 constructs with ≥500 procedures at risk. Thirty-nine constructs were classified as inferior to the benchmark by at least 20% relative risk of failure. Nine of the 73 were shown to be inferior by at least 100% relative risk (online supplementary figure 2). One prosthesis, the NexGen TKR with a monobloc polyethylene tibia, was non-inferior. The remaining 32 constructs were classified as non-inferiority not shown.
The benchmark construct at 5 years was identified as the Profix uncemented unconstrained TKR with a fixed bearing. There were 1910 procedures remaining at risk and the failure rate was 1.54% (95% CI 1.10 to 2.15). There were 65 constructs with ≥500 procedures at risk. Thirty-six were classified as inferior to the benchmark by at least 20% relative risk of failure. Twelve of the 36 were shown to be inferior by at least 100% relative risk (figure 2). All of the UKRs with ≥500 procedures at risk (n=8) were inferior by at least 100% relative risk. No construct could be described as non-inferior.
The benchmark construct at 7 years was identified, again, as the Profix uncemented unconstrained TKR with a fixed bearing. There were 1501 procedures remaining at risk and the failure rate was 1.77% (95% CI 1.29 to 2.43). There were 57 constructs with ≥500 procedures at risk. Thirty constructs were classified as inferior to the benchmark by at least 20% relative risk of failure. Eight of the 30 were shown to be inferior by at least 100% relative risk (online supplementary figure 3). All of the UKR constructs with ≥500 procedures at risk (n=5) were inferior by at least 100% relative risk. No construct could be described as non-inferior.
The benchmark construct at 10 years was identified as the PFC Sigma Bicondylar Knee cemented unconstrained TKR with a fixed bearing. There were 19 284 procedures remaining at risk, and the failure rate was 2.37% (95% CI 2.27 to 2.47). There were 27 constructs with ≥500 procedures at risk. Eighteen constructs were classified as inferior to the benchmark by at least 20% relative risk of failure. Two of the 18 were shown to be inferior by at least 100% relative risk (figure 3). There were only two UKRs with ≥500 procedures at risk at 10 years, both of which were inferior to the benchmark by at least 100% relative risk. Two constructs were identified as non-inferior, the NexGen cemented unconstrained TKR with a fixed bearing and the TKR PFC Sigma Bicondylar Knee hybrid uncemented with a fixed bearing.
Estimates for the difference in failure between the benchmark and comparator constructs with ≥250 procedures at risk at the time of interest are shown in online supplementary tables 1-4.

Figure 2
Difference in cumulative revision of knee constructs compared with a contemporary benchmark at 5 years, using all total knee and unicondylar replacements with ≥500 procedures remaining at risk.

Non-inferiority: gender specific
Gender specific non-inferiority analyses were also performed at 3, 5, 7 and 10 years after the primary operation.
At 3 years, there were no constructs that demonstrated non-inferiority in comparison with the benchmark prosthesis. Most striking is the general similarity in constructs used and their performance regardless of gender. There were 56 different constructs with at least 500 procedures still at risk in women versus 50 in men. There were 25 constructs with at least a 20% increase in relative risk in women versus 18 in men, although the increased number of inferior constructs demonstrated in women is likely owing to the slightly better performing benchmark group than in men. Three constructs were inferior by at least 100% in men, each a UKR, and eight constructs were inferior by at least 100% in women (six UKRs and two TKRs).
At 5 years, in women, there were 49 constructs with ≥500 procedures at risk. Twenty-nine of these constructs were classified as inferior to the benchmark by at least 20% relative risk of failure. Six of the 29 were shown to be inferior by at least 100% relative risk, four of which were UKRs. Similarly, in men, there were 43 constructs used with 19 inferior by at least a 20% increase in relative risk. Four of the 19 were inferior by at least 100% relative risk, all of which were UKRs. Results for men and women at 5 years with ≥250 procedures at risk can be seen in tables 1 and 2, respectively.
At 7 years, the results were again similar between men and women. There were 40 different constructs used in women and 36 in men. In women, there were 19 constructs that were inferior by at least 20% relative risk, two of which were inferior by at least 100%. These were both UKRs. In men, 18 of the 36 constructs with ≥500 procedures still at risk were deemed to be inferior to the benchmark by at least 20% relative risk. Three of these 18 were inferior by at least 100% relative risk, all of which were UKRs. Two constructs were demonstrated to be non-inferior to the benchmark.
At 10 years, in both men and women, there was only one brand of UKR that had at least 500 procedures still at risk. In both instances, this was inferior by at least 100% relative risk and was the only construct to be classified as such. There were no brands found to be non-inferior to the benchmark in either men or women. In women, 8 of 15 constructs were inferior to the benchmark by at least 20% relative risk. In men, 8 of 13 constructs were inferior to the benchmark by at least 20% relative risk. Data for gender specific stratification can be viewed in online supplementary figures 4, 5, 6 and 7 for men at 3, 5, 7 and 10 years, respectively, and online supplementary figures 8, 9, 10 and 11 for women at 3, 5, 7 and 10 years, respectively.

Non-inferiority: gender and age specific
Subdividing procedures by age and gender highlights the paucity of information available for male or female patients <55 years compared with those ≥55 years. There are few data at 7 years, and no suitable benchmark could be found for women at 10 years in the <55 age group. There is a similar mix of construct types in each of the age groups in both men and women, with cemented unconstrained TKRs with a fixed bearing the most popular type. Data for men under 55 years of age at 3, 5 and 7 years can be viewed in online supplementary figures 12, 13 and 14, respectively. Data for women under 55 years of age at […] (figure 4). Two of these were inferior by at least 100% relative risk and both were UKRs. No constructs were demonstrated to be non-inferior to the benchmark. In men, there were 27 constructs meeting the threshold with 12 being classified as inferior by at least 20% relative risk (figure 5). A single prosthesis, one of the two UKRs with at least 500 procedures still at risk, was classified as being inferior by at least 100% relative risk. One construct was found to be non-inferior to the benchmark.
Data for men between 55 years and 75 years of age at 3, 5 and 10 years postprimary can be viewed in online supplementary figures 18, 19 and 20, respectively. Data for women between 55 years and 75 years of age at 3, 5 and 10 years postprimary can be viewed in online supplementary figures 21, 22 and 23, respectively.
In the >75 years age group, we found only one inferior construct in both men and women at 7 years. Non-inferiority was not shown for any other prostheses, but very few met the threshold of 500 cases needed for analysis. At 10 years, only three constructs with ≥500 procedures still at risk were present in men and five in women. Non-inferiority was not demonstrated in any of these. At this time point and age group subdivision, there were no UKRs with enough procedures still at risk to be included in the analysis. Data for men >75 years of age at 3, 5, 7 and 10 years postprimary can be viewed in online supplementary figures 24, 25, 26 and 27, respectively. Data for women >75 years of age at 3, 5, 7 and 10 years postprimary can be viewed in online supplementary figures 28, 29, 30 and 31, respectively.
Data for each level of stratification for comparator constructs with at least 250 patients at risk at the beginning of the time point of interest can be viewed in the following supplemental tables. Data for all men at 3, 7 and 10 years can be seen in online supplemental tables 5, 6 and 7, respectively, and for all women at 3, 7 and 10 years in online supplemental tables 8, 9 and 10. Data for men <55 years of age at 3, 5 and 7 years can be seen in online supplemental tables 11, 12 and 13, respectively, and women <55 at 3, 5 and 7 years can be seen in online supplemental tables 14, 15 and 16, respectively. Data for men aged between 55 years and 75 years at 3, 5, 7 and 10 years can be seen in online supplemental tables 17, 18, 19 and 20, respectively, and for women aged between 55 years and 75 years at 3, 5, 7 and 10 years data can be seen in online supplemental tables 21, 22, 23 and 24, respectively. Data for men aged >75 years at 3, 5, 7 and 10 years postprimary can be seen in online supplemental tables 25, 26, 27 and 28, respectively, and for women >75 years at 3, 5, 7 and 10 years postprimary data can be seen in online supplemental tables 29, 30, 31 and 32, respectively.
It is known that TKRs as a class have lower revision rates than UKRs, which raises the question of whether these two classes should be directly compared. Since every patient who received a UKR could have received a TKR instead, this comparison is justified. There is no evidence to suggest that the subsequent observed revision rates would be any different if those receiving a UKR had received a TKR instead. Furthermore, ODEP does not currently provide a rating for UKRs, so this method provides extra transparency and previously unavailable information for patients undergoing knee replacement.

Comparison with other studies and implications in light of existing evidence
No other published study has performed a direct head-to-head comparison of all available knee replacement constructs. The closest available data are from national registry annual reports such as the NJR, 1 which reports absolute failure of the most common constructs by age and gender. This shows that low failure rates are achieved by a number of constructs, but this does not facilitate easy direct comparison between them. Using the data presented here alongside the annual report data will greatly enhance the information available to surgeons and commissioners and will help to inform patient choice.
Consideration of the difference in outcome of UKR compared with TKR is complex. Using propensity score matching of registry data, Liddle et al 13 showed that UKRs have higher revision rates than TKRs but lower risks of complications. Hunt et al and Liddle et al showed UKRs to be associated with lower early postoperative mortality. 13 14 Kleeblad et al 15 performed a systematic review and meta-analysis of 49 cohort studies and found no difference in function as measured by Hospital for Special Surgery Score, Knee Society Score, Oxford Knee Score, Visual Analog Pain Scale and Western Ontario and McMaster Universities Osteoarthritis Index Score but did show higher revision rates with UKRs. Liddle et al 16 also showed no clinically important difference in Oxford Knee Scores (one point difference favouring UKR), but UKR patients were more likely to be highly satisfied 6 months after surgery. A recent study by the same group has shown that UKR is less costly than TKR, particularly in older patients, who are less likely to require revision, and when performed by higher volume surgeons. 17 Our data add to this complex picture by clearly demonstrating the increased risk of revision associated with almost all brands of UKRs at all time points in both genders and all age groups when compared with the best performing TKRs.
The major weakness of all registry studies is selection bias whereby certain constructs may be implanted for particular indications and in particular groups with different risks. We have mitigated this by analysing data by age and gender, the two variables with the greatest association with revision rates. Furthermore, revision thresholds may be lower in certain groups or for certain modes of failure. This study has looked at a single, but extremely important, outcome measure: revision. Other outcomes of interest such as mortality and patient-reported outcome measures need to be considered when making choices about treatment. With over 900 000 subjects, this is the largest study of knee replacement published to date and comes from the largest implant registry in the world. Data entry is mandated, and data capture is extremely high (over 95%), 18 thus the findings are highly likely to be generalisable. The methods used are novel and, for the first time, allow a meaningful direct comparison between all available constructs.

Figure 4
Difference in cumulative revision of knee constructs compared with a contemporary benchmark at 7 years in women aged between 55 years and 75 years, using all total knee and unicondylar replacements with ≥500 procedures remaining at risk.

Figure 5
Difference in cumulative revision of knee constructs compared with a contemporary benchmark at 7 years in men aged between 55 years and 75 years, using all total knee and unicondylar replacements with ≥500 procedures remaining at risk.

Conclusions, policy and future research implications
The use of product benchmarking has the potential to be highly informative for patients, change the practice of surgeons and influence policy makers if presented clearly and unambiguously. The implications of this research are far reaching. We are unable to definitively state which construct is the best choice for all patients, due to the presence of selection effects and residual confounding. However, we believe that the information presented here illustrates the variability, frequency and performance of different constructs currently used in clinical practice which, in turn, should be used to further inform the consenting process between the patient and the surgeon and facilitate implant selection. We believe commissioners and policy makers should consider the variability and performance of different implants in the commissioning of healthcare providers.
Patients should be actively involved in decision making about their treatment. Here we provide, for the first time, data that allow patients and clinicians to directly compare revision rates associated with the use of different UKR and TKR constructs. The information provided should be used to inform patient choice, surgical decision making and commissioning.