Original research
Novel risk stratification algorithm for estimating the risk of death in patients with relapsed multiple myeloma: external validation in a retrospective chart review
  1. Roman Hájek1,
  2. Sebastian Gonzalez-McQuire2,
  3. Zsolt Szabo2,
  4. Michel Delforge3,
  5. Lucy DeCosta4,
  6. Marc S Raab5,
  7. Walter Bouwmeester6,
  8. Marco Campioni2,
  9. Andrew Briggs7
  1. 1Department of Haematooncology, University Hospital Ostrava, Ostrava, Czech Republic
  2. 2Amgen Europe GmbH, Rotkreuz, Switzerland
  3. 3Department of Haematology, University of Leuven, Leuven, Belgium
  4. 4Amgen, Uxbridge, UK
  5. 5Department of Internal Medicine, University Hospital Heidelberg, Heidelberg, Germany
  6. 6Pharmerit International, Rotterdam, The Netherlands
  7. 7Institute of Health and Wellbeing, University of Glasgow, Glasgow, Scotland, UK
  1. Correspondence to Professor Roman Hájek; roman.hajek{at}fno.cz

Abstract

Objectives and design A novel risk stratification algorithm estimating risk of death in patients with relapsed multiple myeloma starting second-line treatment was recently developed using multivariable Cox regression of data from a Czech registry. It uses 16 parameters routinely collected in medical practice to stratify patients into four distinct risk groups in terms of survival expectation. To provide insight into generalisability of the risk stratification algorithm, the study aimed to validate the risk stratification algorithm using real-world data from specifically designed retrospective chart audits from three European countries.

Participants and setting Physicians collected data from 998 patients (France, 386; Germany, 344; UK, 268) and applied the risk stratification algorithm.

Methods The performance of the Cox regression model for predicting risk of death was assessed by Nagelkerke’s R2, goodness of fit and the C-index. The risk stratification algorithm’s ability to discriminate overall survival across four risk groups was evaluated using Kaplan-Meier curves and HRs.

Results Consistent with the Czech registry, the stratification performance of the risk stratification algorithm demonstrated clear differentiation in risk of death between the four groups. As risk groups increased, risk of death doubled. The C-index was 0.715 (95% CI 0.690 to 0.734).

Conclusions Validation of the novel risk stratification algorithm in an independent ‘real-world’ dataset demonstrated that it stratifies patients into four subgroups according to survival expectation.

  • algorithm
  • relapsed multiple myeloma
  • survival
  • risk stratification
  • validation

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Strengths and limitations of this study

  • The risk stratification algorithm was validated in a real-world dataset containing information on patients with symptomatic multiple myeloma from France, Germany and the UK.

  • Using real-world data from the Czech Registry of Monoclonal Gammopathies, HRs for independent predictors of overall survival were derived from a Cox model and individual patient scores were calculated for total risk.

  • The performance of the Cox regression model for predicting risk of death was assessed by Nagelkerke’s R2, goodness of fit and the C-index.

  • A comparison of the HRs across validation datasets was used to indicate the extent to which scores could be reliably interpreted in different countries/settings.

  • The current analysis is limited in that it used a mix of Cox model performance (R2, C-index) and simply computed HRs to compare risk groups; currently, there are no established measures for assessing risk stratification algorithm performance.

Introduction

The management of multiple myeloma (MM) can be challenging owing to the heterogeneous nature of patients’ characteristics, the disease course and the array of treatment regimens that patients may receive.1 2 Although many patients respond well to current first-line (1L) treatments, these are not curative and most patients will relapse or become refractory.3 4 A report from a European chart review has found that while a high proportion of patients with MM received at least one line of active treatment (95%), this decreased significantly with further lines of treatment; 61% received a second line and 15% received a fourth or further line.5 The type of treatment and treatment sequence also varied considerably at each line.5

Defining the prognosis of patients with MM is increasingly challenging. Physicians may consider a range of patient-related and disease-related factors when trying to assess MM prognosis. These can include age, time to progression, best response achieved, cytogenetics and Eastern Cooperative Oncology Group (ECOG) performance status, as well as comorbidities, previous treatment history (including efficacy and tolerability) and the type of relapse.2 6 In addition to these factors, validated prognostic and predictive tools are needed in MM to standardise risk stratification of patients and ultimately improve risk assessment.

The International Staging System (ISS) and the revised ISS (R-ISS) have been developed to indicate prognosis in MM and are based on the strongest known predictors at diagnosis.7–9 The ISS uses a three-stage classification at diagnosis to predict survival based on serum albumin and serum β-2 microglobulin (Sβ2M) levels,7 8 and the R-ISS includes cytogenetic abnormalities (CA) and lactate dehydrogenase (LDH) levels in addition to Sβ2M and serum albumin levels to refine the definition of the disease stage.9 10 Neither of these tools, however, fully reflects the typical clinical approach to assessing patient prognosis at first relapse,11 nor does either address the patient’s experience during newly diagnosed MM or the patient’s characteristics at relapse. The ISS and R-ISS are therefore less relevant for holistic risk assessment in this setting. The ISS and R-ISS were designed to predict survival based on parameters measured at diagnosis.7 9 Although the prognostic value of the R-ISS has been demonstrated in both newly diagnosed patients and those with relapsed or refractory MM,12 13 it does not take account of important indicators available to physicians at first relapse such as the efficacy and safety of 1L treatment. In addition, patient characteristics at initiation of second-line (2L) treatment differ significantly from those at diagnosis (eg, many patients die during frontline therapy). Therefore, there is a need for specifically designed risk assessment tools.14

Given the need for specifically designed tools to aid medical decision-making in MM at first relapse, a novel risk stratification algorithm has been developed using real-world data from the Czech Registry of Monoclonal Gammopathies (RMG) for patients with relapsed MM initiating 2L treatment.15 16 The risk stratification algorithm uses 16 predictors to stratify patients into four risk groups with profoundly different survival expectations. It is the first tool that was designed to include both frailty assessment and disease aggressiveness in a single algorithm to reflect holistic patient-specific risk assessment.

Two manuscripts have recently been published to describe the development of the risk stratification algorithm. Hájek et al provided an overview of the algorithm development and addressed how the results can be interpreted from a clinical decision-making perspective.16 Recently, a manuscript describing in detail the methodology used to develop the risk stratification algorithm has been published.15 The methodology manuscript explains the processes involved in constructing risk stratification and prognostic tools in oncology. To provide insight into the generalisability of this new risk stratification algorithm and its potential for use in clinical practice, external validation was required to evaluate its performance in independent datasets. A bespoke retrospective chart review cohort study was conducted to collate data from patients in France, Germany and the UK for this purpose. Individual data from these three countries and the pooled validation population were compared with the development cohort and differences in disease characteristics and treatment patterns were examined. In this manuscript, we report the validation of the risk stratification algorithm in detail using real-world data sources from France, Germany and the UK, with an assessment of its performance to predict the risk of death in patients with relapsed MM.

Methods

Description of the risk stratification algorithm

A Cox model was developed using a conceptual model in MM which combined a systematic literature review and physician judgement (using a Delphi process) to select candidate predictors, followed by a backward selection process using Akaike’s information criterion to identify the independent predictors of overall survival (OS).17 HRs for OS of each predictor were derived from the Cox regression analysis. Risk scores were then calculated by multiplying the HRs for each predictor. The patient-specific score was used as a single variable to stratify patients into four risk groups; these were defined using the K-adaptive partitioning algorithm (total risk score ≤3.0, group 1; >3.0 to ≤7.0, group 2; >7.0 to ≤15.4, group 3; >15.4, group 4).
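The scoring and grouping steps described above can be sketched in a few lines. This is an illustrative sketch only: the group cut-offs are those stated in the text (with the upper bound of group 3 assumed to be inclusive), but the function names and the example HRs are hypothetical and are not the published coefficients.

```python
from math import prod

# Upper bounds of risk groups 1-3 as given in the text; any score above
# the last bound falls in group 4. Bounds are assumed to be inclusive.
GROUP_UPPER_BOUNDS = (3.0, 7.0, 15.4)


def total_risk_score(hazard_ratios):
    """Multiply the HRs of the predictors that apply to one patient."""
    return prod(hazard_ratios)


def risk_group(score):
    """Map a total risk score to risk group 1-4."""
    for group, upper in enumerate(GROUP_UPPER_BOUNDS, start=1):
        if score <= upper:
            return group
    return 4


# Hypothetical HRs for three predictors (illustration only).
example_hrs = [1.5, 1.2, 2.0]
example_score = total_risk_score(example_hrs)
print(example_score, risk_group(example_score))
```

Because the score is a product, a patient with no adverse predictors has a score of 1 and falls in group 1, while each additional adverse predictor scales the score by its HR.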

The risk stratification algorithm incorporated the following as predictors of OS that are available in routine clinical practice: age, albumin level, bone marrow plasma cell count, thrombocyte count, Sβ2M level, Sβ2M level at diagnosis, LDH level, LDH level at diagnosis, calcium level, time to next treatment, ECOG performance status, CA at diagnosis, extramedullary disease, new bone lesions (X-ray), refractory status and severe toxicities during 1L treatment (any grade 3 or 4 toxicity). All predictor values used in the risk stratification algorithm were measured at initiation of 2L treatment with the exception of CA, LDH and Sβ2M; the latter two predictors were measured at both diagnosis and initiation of 2L treatment. A corresponding frailty score (defined by age and ECOG performance status) and aggressiveness score (defined by all the identified predictors specifically linked to the disease characteristics) were calculated for each patient.

The data used to develop the risk stratification algorithm were sourced from the Czech RMG.18

Data analysis

Data for the validation were derived from patient medical chart audits in France, Germany and the UK. A bespoke retrospective chart review cohort study was designed to collect all of the real-world data required to validate the risk stratification algorithm. Participating physicians were oncologists, oncohaematologists and haematologists; in total, 60 physicians participated from France, 70 from Germany and 50 from the UK. Patients with symptomatic MM were documented if they were initiated on 2L antitumour drug treatment during 2013 (providing sufficient follow-up to collect survival outcomes). Relevant data were abstracted onto a study-specific case report form during the second and third quarters of 2017. Data from the individual countries were examined descriptively for population differences and pooled for the purpose of validation.

The baseline period was defined as the time between diagnosis and the initiation of 2L treatment. Patients were followed from diagnosis to death, end date of study inclusion (if not deceased) or date of last contact (if lost to follow-up). Recorded outcomes included OS, progression-free survival, time to disease progression and treatment response.

Real-world data were analysed on a descriptive basis. Continuous variables were summarised using descriptive statistics (number, mean, SD, median, minimum and maximum values). Categorical variables were reported as frequency counts and the percentage of individuals in corresponding categories. Survival outcomes were summarised with Kaplan-Meier curves and in terms of median (95% CI) survival and the restricted mean survival.
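For readers unfamiliar with how a Kaplan-Meier curve and median survival are obtained from censored follow-up data, the following is a minimal sketch of the product-limit estimator. It is not the software used in the study; the function names and the toy data are assumptions for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data: four patients, one censored at month 2.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(toy_curve, median_survival(toy_curve))
```

When the estimated survival never falls to 0.5 within follow-up, `median_survival` returns `None`, which corresponds to a median that is reported as not reached.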

Multiple imputation was performed for missing values,19 but only for predictors in the risk stratification algorithm, and with the exception of CA. No outcomes data were imputed. Owing to the lack of methods by which Cox model performance measures may be pooled, five rounds of imputation were conducted and data in the third imputed set were analysed.

Validation procedure

The performance of the risk stratification algorithm was evaluated in terms of the predictive performance of the Cox regression model and stratification of patients for OS.

The statistical methods used to assess performance have been described in detail previously.15 The performance of the Cox regression model for predicting the risk of death was assessed according to the extent to which the variance in OS was explained by the selected predictors (Nagelkerke’s R2) and according to its discriminative power (Harrell’s concordance index; C-index), reported as a point estimate with 95% CI.20 21 The discriminative power of the Cox regression model was regarded as accurate if the C-index was ≥0.70.
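As an illustration of what Harrell’s C-index measures, the sketch below counts, among all comparable patient pairs, the proportion in which the patient who died earlier had the higher risk score. It is a simplified version written for clarity (tied survival times are not treated as comparable, and no CI is computed) and is not the implementation used in the analysis; the function name is an assumption.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's concordance index.

    A pair (i, j) is comparable when patient i died (events[i] == 1)
    and did so before patient j's last follow-up time. The pair is
    concordant when the patient who died earlier has the higher risk
    score; ties in risk score count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier death
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: higher scores die earlier, so ranking is perfect.
print(harrell_c_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to ranking no better than chance and a value of 1 to perfect ranking of patients by survival time.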

The performance of the risk stratification algorithm for stratifying patients in groups by OS was analysed by evaluating OS by risk group (using Kaplan-Meier curves) and HRs comparing risk groups. A comparison of the HRs across validation datasets was used to indicate the extent to which scores could be reliably interpreted in different countries/settings.

Patient frailty (based on age and ECOG performance status) and disease aggressiveness (based on all other predictors in the model) in different risk groups and the relationship between them across risk groups were also investigated.

Patient and public involvement statement

Data for the validation cohort were derived retrospectively from patient medical chart audits. As such, patients and the general public were not involved in this study.

Results

Retrospective chart review

Chart data from a total of 998 patients were collected (France, 386; Germany, 344; UK, 268). The characteristics of these patients at diagnosis, including the 16 parameters used in the risk stratification algorithm, are summarised by country in table 1. Additional parameters not included in the risk stratification algorithm are described in online supplementary table S1. Certain between-country differences were observed (not compared statistically), such as a tendency for lower ISS and higher ECOG performance status scores in France versus Germany or the UK. In the validation cohort, 46.0% of patients had a prior stem cell transplant (SCT) and 11.5% of patients had an SCT at both 1L and 2L. A summary of the characteristics that were included in the risk stratification algorithm (pooled across the three countries) has been previously published alongside data from the original RMG dataset, and a number of discrepancies highlighted.16 Proportionally more patients in the validation cohort than in the Czech development cohort had elevated LDH levels, hypercalcaemia and higher bone-marrow plasma cell counts and bone lesions, all of which are associated with an increased overall risk of death. However, this was mitigated to some extent by the fact that proportionally fewer patients in the validation cohort than in the Czech development cohort had proven refractory to thalidomide or had experienced grade 3–4 toxicity during 1L treatment.16

Table 1

Patient characteristics included in the risk stratification algorithm by country

Cox model performance analysis

The point estimate for the C-index in the validation cohort was 0.715 (95% CI 0.690 to 0.734). A score of 0.5 represents totally random predictions, a score of 1 represents a perfectly discriminating model and a good discriminating model has a score of >0.70. The R2 value for the validation set was 0.283 (possible scores range between 0 and 1 for a model that explains 0%–100% of the observed variation) based on 437 events observed in 998 patients. For comparison, the R2 for the Czech development cohort was 0.253 (737 events in 1418 patients).
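For orientation, Nagelkerke’s R2 rescales the likelihood-based (Cox-Snell) R2 so that its maximum is 1. A minimal sketch of the general formula follows; the function name and the input values are hypothetical, and the choice of n (number of patients versus number of events) is an assumption that varies between software packages and is not specified here.

```python
from math import exp


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's R2 from the log-likelihoods of the null and fitted models.

    For a Cox model these would be partial log-likelihoods.
    n is the sample size used for scaling (an assumption, see lead-in).
    """
    # Cox-Snell R2: improvement of the fitted model over the null model.
    cox_snell = 1 - exp(2 * (loglik_null - loglik_model) / n)
    # Largest value Cox-Snell R2 can take for this null model.
    max_r2 = 1 - exp(2 * loglik_null / n)
    return cox_snell / max_r2


# Hypothetical log-likelihoods, for illustration only.
print(round(nagelkerke_r2(-100.0, -90.0, 50), 3))
```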

Stratification of patients

The distributions of patients across the four risk groups in the validation cohort and the original Czech RMG dataset have been described previously16; distribution was a little more even in the validation cohort than the RMG dataset, which was more skewed towards the lower risk groups. Patient characteristics at diagnosis and at the initiation of 2L treatment by risk group are summarised in online supplementary table S2. Differences between risk groups were identified for parameters such as age (a trend for increasing mean age with risk group) and transplant status (fewer transplants for patients in groups 3 or 4). Both ISS at diagnosis and ECOG performance status scores at 2L initiation showed a tendency for higher values with increasing risk group.

OS by risk group in the validation dataset is shown in table 2 (alongside data for the Czech development cohort for comparison). It is notable that OS was considerably longer in the validation dataset than in the original Czech development cohort. As was the case with the original Czech development cohort, there was clear differentiation between HRs for OS between the risk groups in the validation dataset. The HRs for differences in OS between patients in group 1 and in groups 2, 3 and 4 were 1.87, 4.61 and 8.51, respectively.

Table 2

Base case analysis (imputed dataset 3) of overall survival in the validation cohort and in the development cohort

Kaplan-Meier plots for OS for the pooled validation set (figure 1A) and by country (figure 1B), as well as by risk group for the validation (figure 1C) and development (figure 1D) cohorts, are shown. The OS curve for the pooled set is immature, as the median was not reached during follow-up (figure 1A). This was also the case in the French and German cohorts (figure 1B) and, as noted above, in risk groups 1 and 2 (figure 1C). Median OS in the UK cohort was 49.2 months.

Figure 1

Kaplan-Meier plots for overall survival in (A) the pooled validation cohort, (B) by country and (C) by risk group (imputed dataset 3); (D) by risk group in the development cohort. NA, not available; OS, overall survival; Ref, reference value.

Disease aggressiveness and patient frailty were key components of classification in the original development of the novel risk stratification algorithm (figure 2). A scatterplot of disease aggressiveness scores versus frailty scores in the overall validation cohort and Czech development cohort is shown in figure 2A and B, respectively, demonstrating clear stratification by risk group in each cohort. Considering the groups in sequence from 1 to 4 (online supplementary figure S1A–C), it seems that frailty scores were spread quite broadly from group 2 onwards (online supplementary figure S1B), whereas disease aggressiveness scores were only markedly spread in group 4 (online supplementary figure S1C). Notably, consideration of frailty and disease aggressiveness in group 4 patients shows that the frailest patients are not necessarily experiencing the most aggressive disease, and vice versa. It could be expected that patients with both a high frailty and high aggressiveness score may not have been able to survive to reach 2L therapy.

Figure 2

Frailty versus disease aggressiveness score by risk group in the pooled dataset: (A) validation cohort; (B) development cohort. RSA, risk stratification algorithm.

Univariate Cox models showed that the HRs of death associated with each unit increase in the total risk scores, aggressiveness scores and frailty scores for the validation cohort were 1.018, 1.101 and 1.341, respectively (table 3). Some differences were observed between countries in the HRs associated with the frailty and aggressiveness scores; for example, the HR per unit increase in frailty score was higher in the UK cohort than in the French or German cohorts (1.674, 1.302 and 1.428 for UK, France and Germany, respectively).
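A per-unit HR can look small when the score itself spans a wide range. Under the proportional hazards assumption of a Cox model, in which the log hazard is linear in the score, the HR for a difference of k units is the per-unit HR raised to the power k. The short sketch below illustrates this arithmetic using the pooled total risk score HR quoted above; the function name and the 10-unit difference are chosen for illustration only.

```python
def hr_for_difference(hr_per_unit, units):
    """HR implied by a Cox model for a score difference of `units`."""
    return hr_per_unit ** units


# Pooled validation cohort: HR 1.018 per unit of total risk score.
# A hypothetical 10-unit difference in total risk score:
print(round(hr_for_difference(1.018, 10), 2))
```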

Table 3

Univariate Cox model score in the pooled validation cohort and by country

Discussion

The novel risk stratification algorithm was developed using data from one of the largest existing registries of patients with MM and monoclonal gammopathies of unknown significance. The RMG contains detailed information on a large number of patient characteristics and disease-related parameters recorded at diagnosis and at first relapse. It has mature OS data and is representative of the national and international patient populations. In order to ensure generalisability of this tool to all patients with relapsed MM, however, it was important to validate the tool in an independent cohort. We therefore conducted a bespoke retrospective chart review cohort study to collate data from patients in France, Germany and the UK for this purpose. The pooled validation population from these three countries was around two-thirds of the size of the development cohort and demonstrated significant differences in disease characteristics.16 Despite these differences, results were consistent between the two datasets, which highlights the ability of the risk stratification algorithm to perform independent of patient heterogeneity.

Differences were identified between the patient characteristics at diagnosis across the three countries in the validation dataset. ISS suggested, for example, that patients in France had less severe disease than in Germany or the UK, while patients in the UK and France were more likely to have undergone transplantation than those in Germany. The heterogeneity of the validation population overall supports its value in demonstrating the generalisability of the risk stratification algorithm for use in practice. It is interesting to note that there was not complete agreement between the scoring systems; nearly one-fifth of patients classified as ISS group III fell into risk groups 1 and 2, for example, while 13% of those in the lowest risk category based on ISS at diagnosis were identified by the risk stratification algorithm as being in the top two risk groups.

The C-index for the validation dataset (0.715), which was similar to that for the Czech development cohort (0.723),15 indicated that the Cox regression model derived from the Czech development cohort had accurate discriminative power in the validation dataset; the discriminative power of the Cox regression model could be regarded as accurate if the C-index was ≥0.70.21 The Nagelkerke’s R2 values were similar, suggesting that the model was able to explain variance in OS to a comparable extent in the two cohorts.

With regard to stratification of the risk groups, the Kaplan-Meier curves were shown to separate from early on in both the development and validation cohorts. There was little overlap of the Kaplan-Meier estimates for OS in the four groups during follow-up (figure 1C), or of the 95% CIs for HRs comparing each risk group versus group 1, indicating how well the four groups were differentiated. A similar trend was observed in the development cohort (figure 1D). As risk groups increased, the HRs doubled in both the development and validation cohorts, demonstrating consistency in risk stratification of patients. In the Czech development cohort, median OS reduced by half as risk groups increased; in the validation cohort, median OS was not reached in groups 1 and 2 and reduced by half in the higher risk groups. Adjustment for 2L treatment in the validation dataset had little impact on HRs for OS (data not shown). To put the HR results into context, it is worth considering corresponding data from the R-ISS. Palumbo et al reported HRs of 3.68 and 9.95 for R-ISS groups II and III, respectively, versus group I.9

Comparing the same risk groups in the validation and Czech RMG populations revealed notable differences in median OS in the two cohorts. However, detailed consideration of outcomes in the two populations was impeded, to an extent, by the fact that median OS was not reached in several groups in the validation population during the 60-month follow-up. The higher OS values in the validation cohort may imply a somewhat healthier patient population than the Czech RMG cohort, but in terms of deaths related to relapsed MM that conclusion is not supported by some of the differences in predictive characteristics already described. Variation in OS would be expected in different populations, owing to modifications in treatment regimens or in other factors such as comorbidities and lifestyle, but importantly the trend in survival expectations associated with the risk stratification algorithm risk-group stratification was common to both populations, as can be seen by the similarity of the HRs in each risk group when estimated relative to group 1 patients in the same population (table 3). Median OS in all risk groups would be expected to increase as treatment regimens improve, but in the absence of a cure, the stratification process will be clinically useful when paired with an understanding of outcomes associated with different treatments in each risk group.

This validation process can boast a number of design strengths. Critically, the use of a large, heterogeneous validation population, with demonstrated variability in populations across three different countries, ensured robustness of the validation approach. This population differed from the Czech development population with which the tool was initially developed, so the validation represented a good test of generalisability. The chart data were also recent, in order to maximise the relevance of the validation to current clinical practice.

There were also, inevitably, certain limitations of the study. In terms of the representativeness of the sample, as mentioned earlier it appears that the external validation cohort may have been healthier than the original Czech RMG population, based on OS estimates. It is intended that the risk stratification algorithm will be validated in further groups of patients with relapsed MM in order to ensure that the tool’s performance has been evaluated across a wide spectrum of patients. Validation in Greek22 and Slovakian populations is underway; these feature relatively short OS measurements, and thus will help to address the issue of the disproportionately healthy current validation cohort. From a methodological perspective, the current analysis used a mix of Cox model performance (R2, C-index) and simply computed HRs comparing risk groups, as there are no established measures for assessing risk stratification algorithm performance. Finally, in terms of the data obtained, there were a high number of missing values for certain variables, such as Sβ2M at diagnosis and LDH and albumin at 2L. These particular parameters are not typically used to guide treatment, and thus are rarely tested in routine practice; such issues are difficult to avoid in studies relying on real-world data.

Conclusion

The risk stratification algorithm has now been validated in an independent European cohort. Consistent results in risk stratification have been demonstrated between the validation and development cohorts, in terms of HR and median OS differences across risk groups. This is the first specifically designed patient risk assessment tool that combines both frailty and aggressiveness metrics into a single score. Patient-specific risk as assessed by this tool can be used as a prognostic factor to tailor management strategies for patients, based on burden of disease, capacity to benefit, urgency to treat and to aid decisions on the intensity of therapy.

Acknowledgments

Medical writing support was provided by Kim Allcott PhD of Oxford PharmaGenesis, Oxford, UK, and was funded by Amgen Europe GmbH.

Footnotes

  • Contributors SG-M, LD and WB designed and performed the research study. All authors analysed the data and contributed to writing the paper by providing guidance and comments on its content and they critically revised the manuscript and agreed to the final version.

  • Funding This work was funded by Amgen Europe GmbH.

  • Competing interests RH has received research funding from Amgen and Celgene, consultancy fees from Amgen, Celgene and Takeda, and honoraria from Amgen, Bristol-Myers Squibb and Janssen. SG-M, ZS and MC are employees of Amgen Europe and stockholders in Amgen. MD has received research funding from Celgene and Janssen, and honoraria from Amgen, Bristol-Myers Squibb, Celgene, Janssen and Takeda. LD is an employee of Amgen and a stockholder in Amgen. MSR has received research funding from Amgen and Novartis, consultancy fees from Amgen, Bristol-Myers Squibb, Celgene, Takeda and Novartis, and has participated in advisory boards for Celgene, Bristol-Myers Squibb, Amgen and Janssen. WB is an employee of Pharmerit International, which received funding from Amgen to conduct this research. AB has received consultancy fees from Amgen in relation to the work reported here.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Patient consent for publication Not required.

  • Ethics approval The retrospective chart review used in this study was approved by the Ethics commission of the Medical Faculty Heidelberg, Germany.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available upon reasonable request. The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.