Abstract
Objectives Previous comparisons of the ability to detect change in the Barthel Index (BI) and Functional Independence Measure motor scale (FIMm) have implied these two scales are equally responsive when examined using traditional effect size statistics. Clinically, this is counterintuitive, as the FIMm has greater potential to detect change than the BI, and it raises concerns about the validity of effect size statistics as indicators of rating scale responsiveness. To examine these concerns, this study applied a more sophisticated psychometric method, Rasch measurement, to BI and FIMm data.
Methods BI and FIMm data were examined from 976 people at a single neurorehabilitation unit. Rasch analysis was used to compare the responsiveness of the BI and FIMm at the group comparison level (effect sizes, relative efficiency, relative precision) and for each individual person in the sample by computing the significance of their change.
Results Group level analyses from both interval measurements and ordinal scores implied the BI and FIMm had equivalent responsiveness (BI and FIMm effect size ranges −0.82 to −1.12 and −0.77 to −1.05, respectively). However, individual person level analyses indicated that the FIMm detected significant improvement in almost twice as many people as the BI (50%, n=486 vs 31%, n=298), and recorded fewer people as unchanged on discharge (FIMm=4%, n=38; BI=12%, n=115). This difference was found to be statistically significant (χ2=273.81; p<0.001).
Conclusions These findings demonstrate that effect size calculations are limited and potentially misleading indicators of rating scale responsiveness at the group comparison level. Rasch analysis at the individual person level showed the superior responsiveness of the FIMm, supporting clinical expectation, and its added value as a method for examining and comparing rating scale responsiveness.
- Multiple sclerosis
- rehabilitation
- scales
Introduction
Rating scales must be able to detect clinically important change if they are to be used as outcome measures in clinical trials.1–3 The relative responsiveness of competing rating scales is a critical factor in the selection of scales for studies.4 5
This study examines the responsiveness of two widely used activity limitation rating scales, the Barthel Index (BI)6 and Functional Independence Measure motor scale (FIMm),7 in 1400 people who have undergone neurorehabilitation. Previously,8 we demonstrated problems with the BI (substantial item and scale ceiling/floor effects), which cautioned against its appropriateness in evaluating neurorehabilitation. This led us to hypothesise that the BI would be more responsive if its items had more response categories. We tested this hypothesis by comparing the BI with the FIMm, a scale that uses the same items but has more item response categories.9 Results showed that the FIMm had greater potential to detect change (smaller item and total score floor and ceiling effects than the BI) and detected change in more people undergoing rehabilitation. Despite this evidence of better potential to detect change, the FIMm and BI had almost identical effect size calculations implying the same ability to detect change at the group comparison level. This finding is counterintuitive clinically and questions the validity of effect size statistics as indicators of rating scale responsiveness.
To explore this issue, we examined the relative responsiveness of the BI and FIMm in the same dataset using a more sophisticated psychometric method, Rasch measurement.10–12 This method advances the analysis of rating scale responsiveness in three specific ways. Firstly, Rasch analysis enables interval level (linear) measurements of activity limitation to be estimated from ordinal level BI and FIMm total scores. This is valuable because fixed changes in ordinal total scores (eg, 10 points) imply variable changes in interval level measurements across the scale range.2 3 13 Thus analysing total scores may hide responsiveness differences between scales. Secondly, Rasch analysis enables a legitimate examination of changes in activity limitation at the individual person level, in addition to comparisons at the group level. In contrast, traditional psychometric analyses are not recommended for individual person decision making.3 14 The third benefit is that Rasch analysis enables scales measuring the same construct, as the BI and FIMm purport, to be equated on a common metric.2 This enables people's measurements on the BI and FIMm to be compared on an identical ‘ruler’ of activity limitation.15
Methods
Sample
Data were available for 1495 people who underwent neurorehabilitation at a single UK unit. In our analyses we included cases with complete admission and discharge data, and excluded all people who had the minimum possible score or the maximum possible score on either scale at either admission or discharge. This was to ensure that results and inferences were not confounded by floor and ceiling effects. This study was approved by the local ethics committee of the National Hospital for Neurology and Neurosurgery, London, UK. Details of the sample have been reported elsewhere.8 16
BI and FIMm
Table 1 shows the BI and FIMm. The BI has 10 items. Two items have two response categories, six items have three response categories and two items have four response categories.6 The FIMm has 13 items. All items have seven response categories.7 The BI and FIMm share eight identical items. The two remaining BI items (dressing, transferring) are represented in the FIMm by five items (dressing upper body, dressing lower body, bed transfer, toilet transfer, bath transfer).
Analysis
Rasch analysis is a method of analysing rating scale data. In brief, the analysis examines the extent to which the data satisfy the requirements of a mathematical model—the Rasch measurement model.10–12 This model articulates a theory of how rating scales must perform if the values they generate are to be considered scientific measurements.3 Thus when the data fit the requirements of the Rasch model, within reason, there is evidence that scales (here the BI and FIMm) are measurement instruments. Under these circumstances the analysis is able to transform scale scores for people, which are by necessity ordinal, into interval level measurements. These estimates, termed ‘person locations’ to distinguish them from ordinal scale scores, are in log odds units (logits). For each individual person's location the analysis also generates a bespoke SE. Rasch analysis is explained elsewhere.3 11 15 17–19
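As a rough illustration of what a logit is, the sketch below uses the simplest (dichotomous) form of the Rasch model, in which the probability of success on an item depends only on the difference between the person location and the item location. This is an assumption made for brevity: BI and FIMm items have several ordered response categories, so the analysis reported here would rely on a polytomous extension of the model. The function names are hypothetical and do not come from RUMM2020.

```python
import math


def rasch_probability(person_location: float, item_location: float) -> float:
    """Probability of success on a dichotomous item under the Rasch model.

    Both locations are expressed in logits on the same scale.
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


def logit(probability: float) -> float:
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(probability / (1.0 - probability))


# A person located 1 logit above an item has log odds of success equal to 1.
p = rasch_probability(person_location=1.5, item_location=0.5)
log_odds = logit(p)
```

When person and item locations coincide, the probability of success is 0.5, and every additional logit of difference adds one unit to the log odds, which is what makes the person locations interval level.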
Rasch analyses were performed using RUMM2020.9 We analysed BI and FIMm data together as a co-calibrated pool of items, organised in a racked (by scale) and stacked (by time point) format. We compared the responsiveness of the BI and FIMm at both the group and individual person levels.
Group level comparison
The relative responsiveness of the BI and FIMm was examined at the group level by comparing admission and discharge person locations using four standard indicators: two effect size calculations (Kazis' effect size20 and standardised response mean21), relative efficiency (pairwise squared t values from paired samples t-tests22) and relative precision (ratio of pairwise F values from one-way ANOVA).5 We compared the results of these analyses, which are derived from person locations and are interval level measurements, with the results of the same analyses undertaken on BI and FIMm total scores, which are generated by summing item scores and are ordinal level data. This was to determine if estimates of responsiveness based on interval level measurements differed, in magnitude or inference, from those based on ordinal level scores.
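The group level indicators named above can be sketched in a few lines. The version below assumes the usual definitions: Kazis' effect size as mean change divided by the SD of admission values, the standardised response mean as mean change divided by the SD of change, and relative efficiency as the squared ratio of two paired t statistics. Change is taken as discharge minus admission, so the sign may differ from the convention used in table 2. Relative precision is omitted. All function names and the example values are hypothetical and are not study data.

```python
import math
from statistics import mean, stdev


def _changes(admission, discharge):
    """Per-person change, taken as discharge minus admission."""
    return [d - a for a, d in zip(admission, discharge)]


def kazis_effect_size(admission, discharge):
    """Mean change divided by the SD of admission values."""
    return mean(_changes(admission, discharge)) / stdev(admission)


def standardised_response_mean(admission, discharge):
    """Mean change divided by the SD of the change values."""
    changes = _changes(admission, discharge)
    return mean(changes) / stdev(changes)


def paired_t(admission, discharge):
    """Paired samples t statistic for the change values."""
    changes = _changes(admission, discharge)
    return mean(changes) / (stdev(changes) / math.sqrt(len(changes)))


def relative_efficiency(t_scale_a, t_scale_b):
    """Squared ratio of the paired t statistics of two scales."""
    return (t_scale_a / t_scale_b) ** 2


# Hypothetical person locations (logits) for four people on one scale.
admission = [0.0, 1.0, 2.0, 3.0]
discharge = [1.0, 2.5, 2.5, 4.5]
es = kazis_effect_size(admission, discharge)
srm = standardised_response_mean(admission, discharge)
t_value = paired_t(admission, discharge)
```

All three statistics summarise the whole sample with a single number, which is why they cannot show how many individual people changed.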
Individual person level comparison
The relative responsiveness of the BI and FIMm was compared at the individual person level. This was achieved by computing, for each person, the significance of their own change in activity limitation measurement (‘Sig Change’). Firstly, we computed the size of the change score for each individual person (discharge location−admission location). Secondly, we computed the size of the error associated with their change (SE of the difference) as the square root of the sum of the squared SE values at admission and discharge. Thirdly, we computed the significance of the change for each individual by dividing their change score by their SE of the difference (ie, how big is their change in SE units). Finally, we categorised the significance of each person's change into one of five groups according to the size and direction of the significance of change value. The formulae are as follows:
Change score=discharge location−admission location;
SE of the difference=√(SE at admission² + SE at discharge²);
Sig Change=change score/SE of the difference.
Significance of change values obtained from this formula were categorised into five groups:
Significant improvement=Sig Change ≥+1.96;
Non-significant improvement=0<Sig Change <+1.96;
No change=Sig Change=0;
Non-significant worsening=−1.96<Sig Change <0;
Significant worsening=Sig Change ≤−1.96.
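The steps described in this section can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, the inputs are assumed to be person locations and SE values in logits, and any value whose magnitude falls below 1.96 is treated as non-significant.

```python
import math


def sig_change(admission_location, admission_se, discharge_location, discharge_se):
    """Change score divided by the SE of the difference."""
    change = discharge_location - admission_location
    se_difference = math.sqrt(admission_se ** 2 + discharge_se ** 2)
    return change / se_difference


def categorise(value, critical=1.96):
    """Assign a Sig Change value to one of the five groups."""
    if value >= critical:
        return "significant improvement"
    if value > 0:
        return "non-significant improvement"
    if value == 0:
        return "no change"
    if value > -critical:
        return "non-significant worsening"
    return "significant worsening"


# Hypothetical person: improves by 1 logit with SEs of 0.3 and 0.4.
value = sig_change(0.0, 0.3, 1.0, 0.4)
group = categorise(value)
```

In this hypothetical case the SE of the difference is 0.5 logits, so a 1 logit improvement is about two SE units and falls in the significant improvement group.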
We then counted the number of people achieving each level of significance of change, and compared the distributions for the FIMm and BI using a χ2 test and relative risk statistics.
Results
Sample
Data were available for 1495 people. Complete data at admission and discharge were available for 1396 (93% of sample), of whom 976 (70%) did not score at the floor or ceiling of either scale at either time point. In the total sample, at both admission and discharge, total score floor and ceiling effects were lower for the FIMm than the BI. This indicates that the FIMm provides an extended range of measurement.i As predicted for a sample of people undergoing an intervention aimed to improve function, the floor effects were smaller on discharge than admission, and the ceiling effects were larger on discharge than admission. These values were: FIMm admission (floor=0.8%, ceiling=0.3%), discharge (floor=0.1%, ceiling=1.7%); BI admission (floor=1.1%, ceiling=5.3%), discharge (floor=0.2%, ceiling=27.9%). Overall, 519 people were at either the floor or the ceiling, on either scale. Of these, only 30 people (5.9%) were at the floor or ceiling on both scales.
Mean age and length of rehabilitation were 49 years (SD 15) and 36 days (SD 26), respectively, and 56% of the cohort was female. The main diagnostic groups were multiple sclerosis (46%), stroke (18%) and spinal cord syndromes (17%). More details of the sample have been reported previously.16
Group level comparison of BI and FIMm relative responsiveness
The responsiveness data (table 2) generated by the analysis of both interval measurements (BI and FIMm person locations) and ordinal scores (BI and FIMm scale scores) show that both scales quantified significant changes at the group level, and that both scales had near identical responsiveness according to the four analyses. Conclusions reached about the relative responsiveness of the BI and FIMm were essentially the same for both interval measurements and ordinal scores.
Individual person level comparison of BI and FIMm relative responsiveness
Table 3 shows that the FIMm detected significant improvements in activity limitation in nearly 200 more people than the BI (50%, n=486 vs 31%, n=298) and also recorded fewer people as unchanged on discharge (FIMm=4%, n=38; BI=12%, n=115). Importantly, these analyses cannot be undertaken legitimately on ordinal rating scale data.
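The counts reported here can be turned into proportions and a simple ratio of proportions, as in the sketch below. It uses only the figures given in the text (976 people, 486 and 298 with significant improvement, 38 and 115 unchanged). The ratio is descriptive only and is not necessarily the relative risk statistic computed in the study, because both scales were applied to the same people. The function name is hypothetical.

```python
def proportion_ratio(count_a: int, count_b: int, n: int) -> float:
    """Ratio of two proportions that share the same denominator."""
    return (count_a / n) / (count_b / n)


n_people = 976

# Significant improvement detected by each scale.
fimm_improved, bi_improved = 486, 298
# People recorded as unchanged by each scale.
fimm_unchanged, bi_unchanged = 38, 115

improved_ratio = proportion_ratio(fimm_improved, bi_improved, n_people)
unchanged_ratio = proportion_ratio(fimm_unchanged, bi_unchanged, n_people)
extra_people = fimm_improved - bi_improved
```

On these figures the FIMm recorded significant improvement about 1.6 times as often as the BI, and recorded no change about one third as often.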
Discussion
The aim of this study was to explore the consistent16 23–26 but counterintuitive finding that the BI and FIMm are equally able to detect change in activity limitation; counterintuitive because every FIMm item has seven response categories whereas corresponding BI items have between two and four categories. As such, changes in activity limitation should be more easily detected by FIMm items than BI items. This greater capacity of the FIMm to detect change should result in superior responsiveness.
This study had three major findings. The first is that the FIMm was more responsive than the BI. It detected significant improvements in many more people (n=486 vs 298) and detected change in 67% of those considered unchanged by the BI. However, this clear demonstration of the superiority of the FIMm was only possible through individual person level analyses. These are only legitimately achieved using sophisticated methods, such as Rasch analysis.3 10 11 17
The explanation for the different responsiveness of the FIMm and BI can be seen by plotting the SE of measurement (y axis) for every level of activity limitation defined by the FIMm and BI (x axis) (see figure 1). At every activity limitation level, the SE associated with a FIMm measurement is smaller than the SE associated with the corresponding BI measurement. This is mainly because the FIMm has more item response categories. As a consequence, measurements made by the FIMm have narrower CIs than those made by the BI. Thus statistical significance is achieved with smaller changes in the FIMm than the BI.
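The link between a smaller SE and easier detection of change can be made concrete with a short sketch. Assuming the same significance criterion as in the Methods (change divided by the SE of the difference, compared with 1.96), the smallest change that reaches significance is 1.96 times the SE of the difference. The SE values used below are hypothetical and are not taken from figure 1.

```python
import math


def minimum_significant_change(se_admission, se_discharge, critical=1.96):
    """Smallest change (in logits) whose Sig Change value reaches `critical`."""
    return critical * math.sqrt(se_admission ** 2 + se_discharge ** 2)


# Hypothetical SEs for two scales measuring the same person.
smaller_se_scale = minimum_significant_change(0.30, 0.30)
larger_se_scale = minimum_significant_change(0.45, 0.45)
```

With these hypothetical values, the scale with the smaller SE needs a change of about 0.83 logits to reach significance, whereas the scale with the larger SE needs about 1.25 logits.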
The second important finding from this study is that the group level indicators of responsiveness (effect size, standardised response mean, relative efficiency, relative precision) did not detect the superiority of the FIMm, even when the analyses were conducted on interval measurements derived from the BI and FIMm. This finding provides further support for our suggestion16 that standard group level indicators of rating scale responsiveness are limited and may be positively misleading.
The third important finding of this study concerns the similarity of measurements generated by the BI and FIMm. One feature of Rasch analysis is that it enables rating scales measuring the same construct to be equated on an identical metric. A close look at the results in table 2 shows three things: on admission, the mean FIMm location is higher than the mean BI location; at discharge, the mean FIMm location is lower than the mean BI location; and the mean change measured by the FIMm (0.915 logits) is less than that measured by the BI (1.238 logits).ii
These three findings raise two questions. Why do the FIMm and BI produce different measurements of the same people on admission and discharge? Why does the FIMm register less mean change than the BI given that it has the greater capacity to detect change? There are a number of possible explanations. Firstly, these could occur if the FIMm and BI measured somewhat different constructs. This is unlikely as the FIMm was developed, in part, to improve on the limitations of the BI7—all items are common and Rasch analysis supports them as measures of the same construct.
A second explanation is that inherent psychometric limitations in each scale account for the findings. This is possible as Rasch analysis identifies limitations in both scales (misfitting items, disordered thresholds). A summary of the results of the co-calibrated data analysis (essentially the 10 item BI and 13 item FIMm analysed as if they were a single 23 item scale) is shown in the supplementary appendix (available online). This table shows that 10 items have disordered thresholds, most items have statistically significant misfit (examination of the item characteristic curves confirmed this misfit, revealing over and under discrimination for the items with highest negative and positive fit, respectively) and 14 items demonstrate statistically significant differential item functioning (DIF) (a combination of items with uniform and non-uniform DIF). When taken together, the items with the most concerning psychometric properties were the BI and FIMm Bowels and Bladder, Stairs, and FIMm Feeding.
Thus, at face value, the requirements of the Rasch model are not well met by the co-calibrated data, which reflect and build on the findings of others27–29 who have demonstrated a range of psychometric problems, including misfit for the BI and FIMm (largely because of the mixing of clinically different constructs, eg, activities, mobility, sphincter function) and DIF for the FIMm. However, we explored the impact of the psychometric problems by modifying the data (albeit post hoc) to overcome the weaknesses (focusing specifically on items with disordered thresholds and exclusion of the items with poorest fit) and repeating our analyses of relative responsiveness. The same conclusion was reached; that effect sizes appear to be misleading when seeking to understand the relative ability of scales to detect change (data available on request).
A third explanation is that the results were biased by the therapists who rated the patients. It is conceivable that the clinically crude response categories of the BI might encourage therapists to overestimate people's activity limitation on admission and their activity limitation change at discharge. Our data do not allow us to investigate this further. Another explanation is that our findings reflect the different precisions of the two scales. If this is the case, measures of the same construct, but with different precisions, may come to different conclusions about change associated with an intervention. This warrants further interrogation not possible within our BI and FIMm data.
Although the psychometric concerns outlined above are important considerations, we believe our examination of the two scales in this study brings to the fore some key clinical issues. We had the rare opportunity to directly compare two firmly established, widely used, highly clinically related instruments, one of which (the FIMm) was developed to improve upon the perceived insensitivity of the other (BI). Clinical experience suggests that the FIMm is more responsive than the BI. Thus we would expect our study to find the FIMm better able to measure change. However, inferences based on the widely used traditional responsiveness indicators would lead us to believe the FIMm and BI are equally responsive. We hope to have shown that the more sophisticated analysis techniques afforded by Rasch measurement methods indicate that the FIMm is indeed more sensitive to change than the BI. This is in line with clinical expectation, and has important ramifications for the use of the tools in clinical research and trials, and for the methods we use to determine and compare scale responsiveness.
At present we do not have a full explanation for our findings. However, from a clinical perspective we would expect the FIMm to be more sensitive to change than the BI. So, we believe that our inference that group based statistics are misleading has credence. Despite this, at the current time, we cannot square the circle of the issues identified by this study. There is a clear need for further work using scales that better fit the Rasch model requirements, to elaborate upon what we have uncovered here and to ultimately pin down its root cause.
Rasch measurement is not the only psychometric method available for analysing change in individual person level data. The other main new method is called Item Response Theory (IRT),30 which in contrast with Rasch measurement takes into account other sample related parameters such as item discrimination. Despite being mathematically similar, Rasch measurement and IRT have different research agendas.19 In essence, albeit a simplification, IRT models are statistical models used to explain data. When the observed data do not fit the chosen IRT model, another model is sought to better explain the data (ie, taking into account other sample dependent parameters as described above). In contrast, Rasch analysis provides a mathematical model for guiding the construction of stable linear measures from rating scale data.
The aim of a Rasch measurement analysis is to determine the extent to which observed rating scale data satisfy (fit) the measurement model.iii This is vital for measuring change as the most important measurement axiom is the ability to test for invariance (stability).11 This is achievable with Rasch measurement but not with IRT models as the presence of other parameters renders the estimates sample dependent.3 17 19 It follows that Rasch measurement enabled us to estimate interval level activity limitation measurements from ordinal BI and FIMm scores, to examine change legitimately at the individual person level rather than just the group comparison level, and to compare the BI and FIMm directly on the same activity limitation metric. We chose Rasch measurement rather than IRT for these very specific reasons.
We feel that it is vital that neurologists are aware of the key issues surrounding the use and analysis of rating scale data because rating scales have an increasingly crucial role in the determination of patient care, the guidance of clinical research directions, the evaluation of advances in basic science and the evaluation of clinician professionalism.31 32 Each of these eventually impacts on patients, clinical practice and clinicians.
One limitation of this study is that the responsiveness of the BI and FIMm was evaluated in a sample of neurorehabilitation patients, which included a large subgroup of people with MS, from one tertiary referral hospital in the south east region of the UK. Importantly, when we analysed the main clinical subgroups within our sample, the findings remained the same (data available from authors). However, to examine generalisability, it is important that others seek to replicate our analyses.
Our findings have three important implications for clinicians, clinical practice and clinical trials. Firstly, they demonstrate that group based statistics can be misleading, not of their own volition, when representing the ability, and relative ability, of scales to detect change. As such, they demonstrate the added value of using Rasch analysis and indicate that group based analyses should be complemented by legitimate analyses at the individual person level. The second implication, a consequence of the first, is that clinical investigators need to become familiar with, and apply, modern psychometric methods that enable legitimate comparisons at the individual person level. Traditional psychometric analyses, using raw scores, are not suitable for that purpose. Thirdly, although Rasch analysis does not confirm clinical change, it helps to take us further than existing approaches because the information provides us with a firm quantitative base upon which qualitative explorations of the differences between those people who report change and those who do not can be built. We believe it is these sorts of explorations that can move us towards a better understanding of the nuances of what constitutes clinical change. When considered together, the findings demonstrate the added value that Rasch analysis brings to examining and understanding the measurement of change in activity limitation.
Acknowledgments
The authors thank Professor David Andrich (University of Western Australia, Perth, Western Australia) and Dr Barry Sheridan (RUMM Laboratory, Perth, Western Australia) for their contributions towards this work, and staff at the Neurorehabilitation Unit, National Hospital for Neurology and Neurosurgery, London, UK, who routinely collect audit data.
References
Footnotes
Competing interests None.
Ethics approval This study was conducted with the approval of the National Hospital for Neurology and Neurosurgery.
Provenance and peer review Not commissioned; externally peer reviewed.
↵i At admission and discharge, total score ceiling effects were lower for the FIMm than the BI. Thus a significant proportion of patients who scored at the maximum of the BI were within the floor and ceiling of the FIM, implying that the latter does in fact have an extended range of measurement and thus a better potential to detect change.
↵ii These inferences are legitimate because Rasch analysis enables scales measuring the same construct to be equated on a common interval level metric.
↵iii In contrast, the aim of an IRT analysis is to determine the extent to which the measurement models fit the rating scale data. This fundamental difference is poorly appreciated.