Abstract
Objective Individual patients with the same condition may respond differently to similar treatments. Our aim is to summarise the reporting of person-level heterogeneity of treatment effects (HTE) in multiperson N-of-1 studies and to examine the evidence for person-level HTE through reanalysis.
Study design Systematic review and reanalysis of multiperson N-of-1 studies.
Data sources Medline, Cochrane Controlled Trials, EMBASE, Web of Science and review of references through August 2017 for N-of-1 studies published in English.
Study selection N-of-1 studies of pharmacological interventions with at least two subjects.
Data synthesis Citation screening and data extractions were performed in duplicate. We performed statistical reanalysis testing for person-level HTE on all studies presenting person-level data.
Results We identified 62 multiperson N-of-1 studies with at least two subjects. Statistical tests examining HTE were described in only 13 (21%), of which only two (3%) tested person-level HTE. Only 25 studies (40%) provided person-level data sufficient to reanalyse person-level HTE. Reanalysis using a fixed effect linear model identified statistically significant person-level HTE in 8 of the 13 studies (62%) reporting person-level treatment effects and in 8 of the 14 studies (57%) reporting person-level outcomes.
Conclusions Our analysis suggests that person-level HTE is common and often substantial. Reviewed studies had incomplete information on person-level treatment effects and their variation. Improved assessment and reporting of person-level treatment effects in multiperson N-of-1 studies are needed.
- personalized medicine
- N-of-1 studies
- systematic review
- heterogeneity of treatment effect
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Strengths and limitations of this study
Our analysis suggests that person-level heterogeneity of treatment effects (HTE) is common and often substantial.
Our analysis was limited by the paucity of N-of-1 studies in the literature and by the low statistical power in the available studies.
Multiperson N-of-1 studies are the best design to estimate individual patient treatment effects and compare the variation in effects between individuals to variation within individuals across different periods.
Introduction
Clinicians commonly observe that individual patients given the same treatment for the same condition appear to respond differently from one another. This observation, combined with our understanding of the complex mechanisms of diseases and therapies and the potential importance of myriad patient-specific factors (eg, age, sex, illness severity, comorbidities, co-treatments and molecular differences influencing pharmacokinetics and dynamics), has led to a widely held assumption that the observed variation in treatment response seen between individuals is not merely random, but stable and potentially predictable. This assumption underpins the field of personalised medicine, which aims to determine the best treatment for an individual patient, as opposed to treating all patients with the intervention found to be most effective for the ‘average’ patient.
Nevertheless, statistical analyses aimed at discovering heterogeneity of treatment effects (HTE) among groups of individuals (eg, subgroup analyses of parallel arm randomised trials) typically fail to find compelling and reliable evidence for the presence of such heterogeneity. For example, statistically significant differences in treatment effects between men and women are often reported, but a systematic review indicates that the frequency of these interactions across studies suggests that the vast majority occur by chance.1 Similarly, the field of pharmacogenetics, also built on the assumption of stable variation in treatment responses, has largely failed to live up to its promise to broadly improve the targeting of drugs—particularly outside the special case of oncology (where studies generally depend on the subclassification of tumour tissue not on variation in germ line polymorphisms).2 3 This failure to find reproducible HTE has supported the contrarian notion that true individual effects may be a ‘myth’, an overinterpretation of random noise.4
To distinguish between these two possibilities, Kalow et al 5 have suggested that carefully designed series of N-of-1 studies could be performed for those chronic conditions amenable to this design (ie, where the disease process is relatively stable over time, treatment effects are transient and outcomes vary and are observable over time). By estimating individual patient treatment effects and comparing the variation in effects between individuals to variation within individuals across different periods, it is possible to determine the non-random component of heterogeneity in individual treatment effects—even if one is unable to identify the variables that predict this variation (ie, even in the absence of group-level HTE, such as men vs women or old vs young).
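One way to make this comparison concrete is a simple two-level variance decomposition. The notation is illustrative and is not taken from the cited work: \hat{\theta}_i is the estimated effect for person i, \theta_i is that person's stable ('true') effect, s_i^2 is the within-person sampling variance (driven by period-to-period variation) and \tau^2 is the between-person variance.

```latex
\hat{\theta}_i = \theta_i + \varepsilon_i, \qquad
\mathbb{E}[\varepsilon_i \mid \theta_i] = 0, \quad
\operatorname{Var}(\varepsilon_i \mid \theta_i) = s_i^2, \qquad
\mathbb{E}[\theta_i] = \mu, \quad
\operatorname{Var}(\theta_i) = \tau^2
\\[4pt]
\operatorname{Var}(\hat{\theta}_i) = \tau^2 + s_i^2
```

Under this sketch, \tau^2 > 0 corresponds to non-random heterogeneity. Repeated periods within each person are what allow s_i^2 to be estimated separately from \tau^2.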
A recent review summarised N-of-1 studies reported in the literature, including multiperson N-of-1 studies, but did not examine whether and how these studies provide information on person-level HTE.6 Therefore, our objectives are (1) to summarise the conduct and reporting of assessments of variation in person-level treatment effects from N-of-1 studies and (2) to extract, reanalyse and report the results from the subset of studies that provided adequate data in their published reports (ie, participant-level outcomes or effects) to examine the extent of the evidence for person-level HTE.
Methods
This review was conducted in accordance with established standards for conducting systematic reviews.7 8 We defined N-of-1 studies as crossover trials in which each patient receives two or more treatments in a predefined, often randomised, sequence.
Data sources and searches
We used two separate searches because N-of-1 studies can be indexed differently: (1) a search in Medline, Cochrane Central and EMBASE using terms related to repeated crossover studies (for publications indexed from inception to 17 August 2017) and (2) a Medline, Cochrane Central, EMBASE and Web of Science search using terms that are related to N-of-1 (for publications indexed from 2011 to 17 August 2017). For N-of-1 studies indexed before 2011, we used studies included in a prior published systematic review by Gabler et al.6 Our searches combined terms and Medical Subject Headings for N-of-1, single-subject, single-patient, randomised trials, crossover, multiperiod crossover and rotated or repeated period crossover (see online Supplementary appendix tables 1 and 2 for detailed search terms). The searches were not restricted by disease, condition, organ system or treatment.
Study selection
We selected eligible multiperson N-of-1 studies to describe the frequency of reporting of individual outcomes and effects and of documented HTE in these studies. We required a minimum of two individual subjects per study for evaluation of HTE. We excluded reviews, abstracts, protocols and studies that included non-pharmacological interventions. We included studies with placebo or ‘no treatment’ interventions. Citations were double screened by reviewers using Abstrackr, an open-source online screening tool (http://abstrackr.cebm.brown.edu/). Full-text articles of potentially relevant studies were again double screened for eligibility.
Person-level outcomes were defined as outcomes for each person at each point in time when they were measured, reported in tables, text or graphs. Person-level treatment effect was defined as contrasts of outcomes in individuals on one treatment versus the comparator. Person-level HTE was defined as quantified variation in the person-level treatment effects, whereas HTE more broadly includes any type of subgroup analysis (eg, males vs females; older vs younger) as outlined in figure 1.
Data extraction and quality assessment
One of the four reviewers extracted data from each publication; a second reviewer verified all numerical information and basic descriptors of the study design and analysis. Operational definitions for extraction items were discussed in weekly project meetings and discrepancies between extractors were resolved by consensus with senior authors (DK, GR, EB). From each study, we extracted bibliographic information, details related to study design (number of patients enrolled, selection criteria, interventions evaluated, randomisation methods, outcomes assessed, follow-up duration), information on patient characteristics and person-level measurements of outcomes or estimates of person-level treatment effects (with corresponding measures of their uncertainty). When necessary, we extracted data by digitising the graphs and the values were estimated using Engauge Digitizer V.2.14 (http://digitizer.sourceforge.net/). We assessed the methodological quality of each study based on predefined criteria, in accordance with the Agency for Healthcare Research and Quality suggested methods and the Cochrane risk of bias for clinical trials.9 10
We generated graphs showing the trajectory of response for each patient in each study and compared them against the published information. We also generated scatterplots of measurements over time for studies that did not present their data in graphical format to help us identify aberrant data points (eg, errors in data extraction). We verified potentially aberrant data points by re-examining the published data and made corrections, when needed.
Data synthesis and analyses
We examined the degree to which studies reported person-level data. This was described using the following items for each reported outcome: (1) qualitative descriptions of HTE (eg, ‘there were eight responders and four non-responders’); (2) details of person-level outcomes (ie, outcomes with each treatment within each period); (3) details of person-level treatment effect (ie, a point estimate of contrasts of outcomes in individuals on one treatment vs the comparator); (4) reporting of person-level statistical effect estimate (eg, SD, exact p values or CIs for treatment effects within individuals); (5) description of statistical tests examining HTE (ie, tests evaluating the contrast of treatment effects between individuals or groups in the study) and (6) claims of HTE. Note that qualitative descriptions of HTE for item 1 would include any description that implied that treatment effects varied, whereas item 6 required a more definite study conclusion (eg, ‘our results demonstrate significant variation across individuals in response to treatment X’), whether or not these conclusions were based on robust statistical tests.
Statistical HTE analysis of extracted study results
We performed statistical analysis testing for person-level HTE on all studies presenting person-level data. We used a consistent analytic strategy across studies, to the extent permitted by the reporting in published papers. Our strategy was different for studies that reported person-level outcome measurements and those that reported estimates of person-level treatment effects with their sampling variances (or adequate information to approximately calculate these statistics).
For studies that only reported (or allowed the calculation of) estimates of person-level treatment effects, we obtained an average effect using a fixed effect inverse variance model and estimated the variance of the person-level treatment effects using the DerSimonian and Laird method-of-moments estimator.11 12 In addition to the fixed effect model, we obtained an average effect using a random-effects model. Finally, we tested the hypothesis that all person-level treatment effects were equal using Cochran’s χ² test and quantified the proportion of observed variation due to ‘true’ person-level effect heterogeneity with the I² statistic.13
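The calculations named in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `chi2_sf` and `pool_person_level_effects` are hypothetical, the inputs are one effect estimate and one sampling variance per participant, and the chi-square tail probability is computed with a simple series so that the example needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, using the series for
    the regularised lower incomplete gamma function (adequate for moderate x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def pool_person_level_effects(effects, variances):
    """Pool person-level effects and quantify heterogeneity.

    effects   -- one treatment-effect estimate per participant
    variances -- the matching sampling variances (all > 0)
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")

    # Fixed effect (inverse variance) average.
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed effect average.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = k - 1

    # DerSimonian and Laird method-of-moments estimate of between-person variance.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # I^2: share of observed variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # Random-effects average uses weights 1 / (variance + tau2).
    re_weights = [1.0 / (v + tau2) for v in variances]
    random = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)

    return {"fixed": fixed, "random": random, "Q": q, "df": df,
            "p_value": chi2_sf(q, df), "tau2": tau2, "I2": i2}


if __name__ == "__main__":
    # Made-up numbers for two participants, purely for illustration.
    print(pool_person_level_effects([0.0, 2.0], [1.0, 1.0]))
```

With two made-up participants whose effects are 0 and 2, each with sampling variance 1, Q is 2 on one degree of freedom and I² is 0.5. A real analysis would normally use an established meta-analysis package rather than hand-written formulas.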
For studies that reported person-level outcomes, we developed a linear model (for continuous outcomes) or generalised linear model (for binary or count outcomes) using the outcome of interest as the response, the intervention(s) as a covariate and indicator variables for different study participants.14 This model estimates a common treatment effect across participants. We also derived a similar model with treatment-by-participant interactions, which allows each patient to have a different effect. The statistical significance of person-level HTE was assessed by a likelihood ratio test comparing the two models. In addition to these fixed effect models, we fit a hierarchical linear or generalised linear mixed model with a random intercept and a random slope (for the treatment effect) to estimate the average treatment effect across all patients (assuming person-level HTE). We tested the hypothesis that all person-level treatment effects were equal and quantified the proportion of observed variation due to ‘true’ person-level effect heterogeneity with the I² statistic.13 For modelling within-patient variance, we used a common variance with an uncorrelated covariance structure, as was used in a prior N-of-1 study.14 The person-level treatment effect was assumed to be equal across time periods. For the treatment effect, we used more than one random slope when more than two treatments were compared.
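The comparison of the two fixed effect models can be illustrated for the simplest case: a continuous outcome, two treatments and Gaussian errors with a common variance. The sketch below is an assumption-laden illustration rather than the authors' implementation, and the function name `lr_test_person_level_hte` and the record layout are hypothetical. It fits the common-effect model by centring within each participant (equivalent to including participant indicators) and fits the interaction model through participant-by-treatment cell means. The likelihood ratio statistic is n·ln(RSS_common / RSS_interaction), which would be referred to a chi-square distribution with (participants − 1) degrees of freedom.

```python
import math
from collections import defaultdict


def lr_test_person_level_hte(records):
    """Likelihood ratio statistic for treatment-by-participant interaction.

    records -- iterable of (participant_id, treated, outcome), treated in {0, 1}
    Returns (common_effect, lr_statistic, df).
    """
    by_person = defaultdict(list)
    for pid, treated, y in records:
        by_person[pid].append((treated, y))

    n = 0
    sxx = sxy = 0.0
    centred = []            # within-participant centred (treatment, outcome) pairs
    rss_interaction = 0.0   # residual sum of squares, person-specific effects

    for obs in by_person.values():
        n += len(obs)
        x_bar = sum(t for t, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for t, y in obs:
            xc, yc = t - x_bar, y - y_bar
            centred.append((xc, yc))
            sxx += xc * xc
            sxy += xc * yc
        # Interaction model: fitted values are the cell means for this person.
        for arm in (0, 1):
            ys = [y for t, y in obs if t == arm]
            if not ys:
                raise ValueError("each participant needs data on both treatments")
            cell_mean = sum(ys) / len(ys)
            rss_interaction += sum((y - cell_mean) ** 2 for y in ys)

    if rss_interaction <= 0.0:
        raise ValueError("no within-cell variation; the statistic is undefined")

    # Common-effect model: one slope shared by all participants.
    common_effect = sxy / sxx
    rss_common = sum((yc - common_effect * xc) ** 2 for xc, yc in centred)

    lr_statistic = n * math.log(rss_common / rss_interaction)
    df = len(by_person) - 1
    return common_effect, lr_statistic, df


if __name__ == "__main__":
    # Made-up data: participant A responds (effect 4), participant B does not.
    data = [("A", 1, 5.0), ("A", 1, 7.0), ("A", 0, 1.0), ("A", 0, 3.0),
            ("B", 1, 2.0), ("B", 1, 4.0), ("B", 0, 2.0), ("B", 0, 4.0)]
    print(lr_test_person_level_hte(data))
```

With so few observations the chi-square reference distribution is only a rough approximation, and an F test on the same two residual sums of squares may be preferable. Binary or count outcomes, more than two treatments and the mixed model described above would need a statistical modelling library.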
Patient and public involvement
Patients and the public were not involved in the design or analysis of this study.
Results
The searches for repeated crossover studies identified 11 891 citations and those for N-of-1 studies identified 3819 citations (indexed from 2011 onwards). Of these, we retrieved 407 full-text articles for review plus 100 N-of-1 trial articles (indexed before 2011) from an existing systematic review.6 On full-text screening, 62 studies (58 multiperson N-of-1 studies and four repeated period crossover studies) met eligibility criteria (online supplementary appendix table 3) and are referred to as multiperson N-of-1 studies throughout the article. An outline of the search and study selection flow is provided in figure 2.
Description of studies
Table 1 summarises the 62 multiperson N-of-1 studies, published between 1986 and 2017, which reported on a total of 1974 patients. The most common clinical domains in the multiperson N-of-1 studies were neurology (16%), arthritis/rheumatology (10%) and psychiatry (9%). Most studies were described as ‘double blind’ but details about the methods for blinding were often unclear; similarly, studies often provided unclear information about the generation of the randomisation sequence and allocation concealment (online supplementary appendix table 4). Among the studies, 93% compared a pair of treatment strategies, 5% compared three strategies and 2% compared four strategies. Studies had between three and 16 treatment periods and obtained an average of 1–42 outcome measurements per period. Of the assessed outcomes, 89% were patient reported and 11% were investigator assessed.
Reporting person-level outcomes, effects and HTE
While most studies (92%) had some qualitative acknowledgement that the treatment effects appeared to vary across individuals, formal reporting at the participant level was variable (table 2). Person-level outcomes under each treatment were reported in 52% of multiperson N-of-1 studies. Person-level treatment effects with quantitative data (comparing outcomes on each treatment) for each individual who completed the trial were available in 32%, and details on the statistical evaluation of these effects (as SDs, exact p values or CIs) were available in 13 (21%) multiperson N-of-1 studies. Only five (8%) studies described statistical tests examining any HTE. Of these, only two studies (3%) reported person-level HTE, whereas the others examined group-level HTE using conventional subgroup analysis based on observable characteristics.
Reanalysis of person-level data
Of the 62 studies, there were 36 studies that provided person-level data, either as outcomes in each treatment period or as person-level treatment effects (table 3). Of these, only 25 studies provided person-level data sufficient to support re-analysis: 14 studies provided person-level outcomes; 13 studies provided person-level treatment effects (two studies provided both). The remaining 11 studies reported either medians or means without data on variance or did not provide sufficient information on completers, so they could not be reanalysed for treatment effect or HTE.
Of 13 studies (with 27 unique comparisons) that reported analysable person-level treatment effect data (table 3), 10 studies had a placebo comparator and three studies had an active comparator. The sample size ranged from 7 to 68; average crossover periods ranged from 6 to 16 days and average outcome measures per period ranged from 1 to 21. The average treatment duration ranged from 14 to 336 days.
There were 14 studies (with 27 unique comparisons) that reported analysable person-level outcome data (table 3), including two studies also reporting person-level treatment effects. Of these, 11 compared the intervention with placebo and three studies compared two active interventions. The sample size ranged from 2 to 22; the average number of crossover periods ranged from 3 to 10 and the average number of outcome measures per period ranged from 1 to 42. The average treatment duration ranged from 9 to 210 days.
Reanalysis of studies reporting estimates of person-level treatment effects
Thirteen studies (including 27 comparisons, due to multiple outcomes in some studies) reported estimates of person-level treatment effects sufficient to analyse (online supplementary appendix figures 1–16 display graphs of the person-level treatment effect data). Average fixed effect estimates for each analysis are shown in table 4; random-effects estimates were generally similar (online supplementary appendix table 5). In 8 of the 13 studies (62%) and 15 of the 27 total unique comparisons (56%), we found evidence of statistically significant HTE for at least one outcome (table 4). Generally, the variation in individual patient effects (as seen in the range) was very large compared with the average effects. Most studies (64%) showed person-level effects that differed qualitatively from one another. Most of the variation in the observed individual effects was attributable to ‘true’ (non-random) heterogeneity of person-level effects; 11 of 27 analyses had I² >80%.
Reanalysis of studies reporting person-level outcome measurements
Because some of the 14 studies providing analysable outcome data had multiple outcomes (or multiple outcome scales), there were a total of 27 comparisons with analysable data. (Online supplementary appendix figures 17–42 display graphs of the person-level outcome results.) Average fixed effect estimates for each analysis are shown in table 5; random-effects estimates were generally similar (online supplementary appendix table 6). In 8 of the 14 studies (57%) and 17 of the 27 unique comparisons (63%), there was statistically significant person-level HTE for at least one outcome. Again, the variation in individual effects was often large compared with the average effect. However, given the lower number of participants per study and periods per participant, and the different analytic approach, estimates of I² were much less precise in these studies.
Discussion
This review documents that multiperson N-of-1 studies rarely examine HTE. Only 8% of 62 multiperson N-of-1 studies described statistical tests examining HTE, but these generally involved comparisons of treatment effects among groups of patients (eg, based on age or sex) rather than across individuals. Only two studies in the whole of the literature tested for person-level HTE.15 16 Nevertheless, analysable person-level results are sometimes reported in multiperson N-of-1 studies, as outcomes or as treatment effects, suitable for the analysis of person-level HTE. Our reanalyses of the totality of available data from these studies (n=25) suggested the presence of substantial non-random variation in treatment effects across individuals in most studies. This was evident when considering statistical tests for the variation of treatment effects among patients and also by qualitative assessment of the magnitude of effect variation. This represents the first broad empirical examination with reanalysis of person-level HTE across multiperson N-of-1 studies, and it provides some general support for the a priori assumption of individual patient variation in treatment response that broadly motivates personalised medicine.
In contrast to parallel-group studies that establish efficacy in a group of patients with a common condition, N-of-1 studies establish the effects of an intervention in an individual.17 In this respect, N-of-1 studies can be thought of as adjuncts to clinical care, where the goal is to select the right treatment for a particular patient, rather than as a research tool, where the goal is to create new generalisable knowledge.18 19 Indeed, the results of traditional N-of-1 studies may be generalisable only to the future treatment response of the patient in the trial, not to other patients. Nevertheless, using Bayesian meta-analytic techniques, Zucker et al showed how the average treatment effect at the population level can also be estimated by combining multiperson N-of-1 studies testing similar interventions in similar patients with the same outcome measures.14 Similar Bayesian methods have also been suggested for analysis of group-level HTE.20
Herein, we demonstrate yet another application of N-of-1 studies: exploring person-level HTE. This application has important research and clinical implications, even when the determinants of HTE remain unidentified. It is particularly of interest that there was apparent variation in the degree of person-level HTE found across conditions and treatments. Since the degree of variation across individuals sets the upper bound for the amount of HTE that might be explainable by observable characteristics, such as clinical or genomic variables, searching for subgroup effects in the absence of person-level HTE is a futile exercise.4 21 22
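The upper-bound argument can be sketched with the law of total variance. The symbols are assumptions introduced here for illustration: \theta_i denotes the stable treatment effect for person i and X_i a set of observable characteristics (clinical or genomic).

```latex
\operatorname{Var}(\theta_i)
  = \operatorname{Var}\!\big(\mathbb{E}[\theta_i \mid X_i]\big)
  + \mathbb{E}\!\big[\operatorname{Var}(\theta_i \mid X_i)\big]
  \;\ge\; \operatorname{Var}\!\big(\mathbb{E}[\theta_i \mid X_i]\big)
```

The first term on the right is the portion of effect variation that subgroup variables could explain. It cannot exceed the total person-level variance, so when \operatorname{Var}(\theta_i) is close to zero there is little left for any covariate to explain.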
An interesting example of how person-level HTE can vary across different conditions comes from the study of Johannessen et al (figure 3).15 These investigators conducted N-of-1 patient studies comparing cimetidine to placebo for patients presenting with dyspeptic symptoms and reported person-level effects by subgroups of disease categories. Among 46 trial completers, cimetidine had a significant effect for most patients (57%), as it did at the aggregate level. However, not only was there substantial person-level HTE, but person-level HTE varied across conditions, being much more pronounced in non-ulcer dyspepsia (I²=75%) compared with peptic ulcer disease (I²=35%) (figure 3), despite the very similar overall effects seen in these two conditions.
Finding variation in person-level response in multiperson N-of-1 studies identifies those conditions for which N-of-1 studies are likely to be clinically relevant. For condition-treatment combinations shown to have low person-level HTE, single subject studies are highly unlikely to be clinically informative, and the average results from trials (ie, ‘one-size-fits-all’ effects) are more apt to be applicable to individuals.23 24 On the other hand, N-of-1 studies may be highly clinically informative for condition-treatments with a high degree of person-level HTE. These conditions would also be potentially higher yield for examining predictors of HTE (genomic or otherwise).
Our findings also have implications for clinical practice and formulary design. For conditions marked by high person-level HTE, even when trials show that one treatment is better on average than others, having a variety of medication options would be useful to optimise outcomes across all patients, particularly for chronic conditions such as those studied here where empiric trials of alternative medications to find the best treatment for an individual might be feasible. For example, the study by March et al 25 shows that while patients with osteoarthritis on average had less pain and less stiffness with diclofenac, some patients had improved symptoms on paracetamol. This person-level HTE may not be detectable in conventional parallel-arm trials employing conventional subgroup analysis.21
While more studies combining N-of-1 studies are needed to understand the extent of person-level HTE, future studies need to apply greater methodological rigour to improve the state-of-the-science on evaluation of individual treatment effects.26 While the recently published Consolidated Standards of Reporting Trials Extension for N-of-1 trials may help improve reporting, a tabulation of all information (possibly electronically available) appears the most straightforward way to facilitate the clinical interpretation of these studies.27 Such reporting allows the inspection of trajectories over time and may reveal patterns that are not captured by regression models. Complete reporting would also facilitate the development and evaluation of methods for the analysis of single subject experiments, particularly their use to better understand the extent and importance of person-level HTE.
The limitations of this review reflect, to a large extent, the limitations of the data in primary studies. Many conditions are not amenable to the N-of-1 design (eg, because treatment effects are cumulative or because outcomes are observed only once). Further, even for conditions and treatments that are potentially amenable to this design, many important disease categories lacked published N-of-1 studies. We relied on published studies only, so our analytic cohort may underestimate the true prevalence of these studies, particularly for N-of-1 studies, which may frequently be conducted without the intention of future publication.
In addition, our conclusions regarding the ubiquity of HTE in the data we reanalysed should be interpreted in the context of several important limitations. First, there were only a limited number of available studies that reported data sufficient to analyse, and therefore we present only a very partial picture of the full scope of interindividual variation in effects across clinical conditions. Furthermore, among the studies that did have data, only a fairly small number of patients were observed over a small number of treatment periods, and we frequently had to rely on data summaries provided by the authors (eg, person-level treatment effects and their sampling variance); these data limitations precluded the use of more complex models, for example, models that account for period effects or other effects of time on the outcome.3
Our review has demonstrated that HTE remains almost totally unexplored in multiperson N-of-1 studies, which are uniquely capable of exploring variations in individual (person-level) treatment effects. Our reanalysis of the data from these studies represents the first systematic attempt to obtain empirical support for the a priori argument that treatment effects vary across individual patients, an assumption which underpins all efforts to personalise treatment selection. In this sample, person-level HTE appears to be common and large enough to be clinically meaningful; the degree of person-level HTE appears to vary across conditions and outcomes. Thus, multiperson N-of-1 studies are an under-utilised tool to identify where person-level HTE may be substantial and where efforts to find molecular or clinical predictors of response heterogeneity should be focused. In such conditions, parallel arm studies might yield results that are over-generalised for patient level decision-making.
Acknowledgments
We would like to acknowledge Issa Dahabreh, MD, MS, Assistant Professor of Health Services, Policy and Practice, Assistant Professor of Epidemiology, Brown University, for statistical advice.
We would like to acknowledge Tatum Williamson, MS, Research Assistant, Predictive Analytics and Comparative Effectiveness Center, Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, for assistance with updating literature.
References
Footnotes
Contributors GR and DMK made substantial contributions to the conception or design of the work; the acquisition, analysis or interpretation of data for the work; responsible for drafting the work or revising it critically for important intellectual content and have made an agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. All authors have given final approval of the version to be published.
Funding This work was supported by the National Pharmaceutical Council. Additional support was provided by the Patient-Centered Outcomes Research Institute (PCORI) Award (Predictive Analytics Resource Center (SA.Tufts.PARC.OCSO.2018.01.25)) and the National Institutes of Health (3UL1TR001079-04S1).
Disclaimer All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee.
Competing interests None declared.
Patient consent Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement No additional data are available.