Our Review is a focused critique of the literature on the basis of articles, reports, and book chapters that span more than a century of research in three areas: psychometrics, health measurement, and neurological clinical trials. These were collected as part of the general strategy in our unit (Neurological Outcome Measures Unit) during the past 15 years to develop a clear and detailed understanding of the science behind rating scales. Our search strategy included searches of electronic
Review
Rating scales as outcome measures for clinical trials in neurology: problems, solutions, and recommendations
Introduction
A recent review of UK health research funding1 emphasised the importance of translational research and highlighted an internationally recognised problem: success in basic science rarely leads to effective treatments. Why have state-of-the-art clinical trials failed to deliver treatments? Are all candidate molecules that work in controlled laboratory settings worthless when studied in human beings? Conversely, do some of the methods used to test the efficacy of treatments hinder advances in basic science?
In this Review we focus on the latter point and, in particular, the rating scales used to measure the health outcomes of trials for the treatment of neurological diseases, which are increasingly selected as primary or secondary outcome measures in clinical trials.2, 3, 4, 5, 6 Rating scales are, therefore, the main dependent variables on which decisions are made that influence patient care and guide future research; the adequacy of these decisions depends directly on the scientific quality of the rating scales.
Two developments indicate an appreciation of this fact: the increased application of the science of rating scales (psychometrics) for the measurement of health outcomes in clinical neurology; and the impending US Food and Drug Administration's (FDA) scientific requirements for patient-reported rating scales in clinical trials.7, 8 The FDA requirements are likely to be emulated by the European Medicines Agency (EMEA)9 and will be pertinent to all rating scales, not just those that are patient-reported.
Our opening remarks might suggest that we think that published data from clinical trials are littered with type-2 errors due to poor rating scales. We do not know whether this is the case; nor do we know the frequency of type-1 errors that arise from problems with rating scales. We do know, however, that the reliability, validity, and responsiveness of different scales will influence their ability to estimate accurately the effect of a disease, to detect clinical change, and will have implications for calculations of sample size.10 As such, the differences among rating scales have the potential to influence the outcome of clinical trials (panel 1).
Therefore, clinicians need to ensure that rating scales are fit for purpose, and maximising the scientific rigour of rating scales improves the chances of coming to the correct conclusion about the efficacy of a treatment. On this basis, a fundamental requirement of rigorous clinical trials is that the numbers generated by rating scales satisfy established scientific criteria as measurements of explicit, clinically meaningful variables.
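The link between reliability and sample size can be illustrated with a minimal sketch. It is not taken from the Review: it assumes classical test theory (measurement error shrinks the observed standardised effect by the square root of the scale's reliability) and a normal-approximation sample-size formula for a two-arm comparison of means. The function name `n_per_group` and the example effect size are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(true_effect_size, reliability, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-arm comparison of means.

    Assumptions (illustrative, not from the Review):
    - classical test theory: observed effect = true effect * sqrt(reliability)
    - two-sided test, equal group sizes, normal approximation
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    # Measurement error inflates the observed standard deviation,
    # which attenuates the standardised effect size.
    observed_effect = true_effect_size * math.sqrt(reliability)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / observed_effect ** 2)


# Hypothetical true effect of 0.5 standard deviations
for r in (1.0, 0.9, 0.7, 0.5):
    print(f"reliability={r:.1f}  n per group={n_per_group(0.5, r)}")
```

Under these assumptions, a drop in reliability from 1.0 to 0.7 raises the required sample from 63 to 90 patients per group for the same true effect, which is one way that the choice of scale can change the outcome of a trial.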
A review of the subject of rating scales as outcome measures is, therefore, timely. We introduce the basic principles of the mechanics of rating scales and the limitations of the data derived from them. We discuss the benefits of moving to new psychometric methods and make recommendations to bring rating scales into line with what they measure. We highlight two methodological limitations that require attention to ensure that state-of-the-art clinical trials are underpinned by state-of-the-art measurements: the first limitation is that the numbers generated by most rating scales do not satisfy criteria as rigorous measurements; the second limitation is that we do not really know what variables most rating scales are measuring. These facts have great potential to undermine clinical trials, patient care, and research. The extent to which the limitations of rating scales are to blame for the failure of clinical trials to deliver treatments is unknown. However, our review highlights the potential contribution of rating scales and the way their data are analysed.
Basis of rating scales as outcome measures
Some variables (eg, height and weight) can be measured directly. Other variables (eg, disability, cognitive function, and quality of life) are measured indirectly by how they manifest; therefore, we need a method to transform the manifestations of these “latent” variables into numbers that can be taken as measurements.13
Rating scales are a means to measure latent variables, and two types of rating scale are commonly used in neurology: single item scales (eg, Ashworth scale [figure 1],14
The requirement for rating scales to generate rigorous measurements
Phase III clinical trials need rating scales that generate rigorous measurements. Unfortunately, this is rarely achieved because most rating scales generate ordered scores that are only suitable for group comparison studies, rather than precise measurements of an individual.
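Why ordered scores fall short of measurements can be sketched numerically. The example below assumes a Rasch model for dichotomous items, a choice made here for illustration only (this section does not name a specific model), with hypothetical item difficulties. It shows that the same change in the latent variable produces different changes in the summed raw score depending on where a person sits on the scale.

```python
import math


def expected_raw_score(theta, item_difficulties):
    """Expected summed score on dichotomous items under a Rasch model.

    theta: person location on the latent variable (logits).
    item_difficulties: item locations (logits); hypothetical values here.
    """
    return sum(1 / (1 + math.exp(-(theta - b))) for b in item_difficulties)


# Hypothetical five-item scale, items spread evenly from -2 to +2 logits
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Equal one-logit steps on the latent variable give unequal raw-score steps
for theta in range(-4, 5):
    score = expected_raw_score(theta, difficulties)
    print(f"theta={theta:+d}  expected raw score={score:.2f}")
```

In this sketch a one-logit improvement near the centre of the scale moves the expected raw score by much more than the same improvement near the ceiling, so a one-point change in raw score does not represent a fixed amount of change in the underlying variable.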
The requirement to know precisely what variables are measured
Clinical trials require rating scales that actually measure the health constructs that they claim to (ie, the scales are valid) and health constructs that are clinically meaningful and can be interpreted. Unfortunately, current methods to establish the validity of a rating scale rarely meet these goals.
Recommendations
The FDA draft recommendations for patient-reported rating scales in clinical trials highlight the importance of “conceptually sound, reliable, and valid measures”.8 Such an acknowledgment is a vital, albeit first, step. Surprisingly, the document barely mentions new psychometric methods, despite their clear advantages and increased use;104, 105, 106, 107, 108, 109 furthermore, despite its emphasis on the improvement of methods to establish validity, it does not provide detailed guidance on how
Conclusions
In this Review we posed a question: why have state-of-the-art clinical trials in neurology failed to deliver treatments? Our aim was to highlight the potential contribution to this failure of the currently available rating scales and the way their data are analysed. However, rating scales are not always to blame. Indeed, the extent to which rating scales undermine inferences from clinical trials is difficult to determine. Our message is simple: when rating scales are used, they must be fit for
Search strategy and selection criteria
References (118)
- et al. Cannabinoids for treatment of spasticity and other symptoms related to multiple sclerosis (CAMS study): multi-centre randomised placebo-controlled trial. Lancet (2003)
- et al. TCH346 as a neuroprotective drug in Parkinson's disease: a double-blind, randomised, controlled trial. Lancet Neurol (2006)
- FDA draft guidance and health-outcomes research. Lancet (2007)
- et al. Assessing the clinical significance of single items relative to summated scores. Mayo Clin Proc (2002)
- et al. Effect of early versus delayed interferon beta-1b treatment on disability after a first clinical event suggestive of multiple sclerosis: a 3-year follow-up analysis of the BENEFIT study. Lancet (2007)
- The axioms and principal results of classical test theory. J Math Psychol (1966)
- Clinical interpretation and use of stroke scales. Lancet Neurol (2006)
- A framework relating outcomes based education and the taxonomy of educational objectives. Stud Educ Eval (2002)
- Implication and applications of modern test theory in the context of outcomes based research. Stud Educ Eval (2002)
- Measurement