Rating scales as outcome measures for clinical trials in neurology: problems, solutions, and recommendations

doi:10.1016/S1474-4422(07)70290-9

The Lancet Neurology

Volume 6, Issue 12, December 2007, Pages 1094-1105

https://doi.org/10.1016/S1474-4422(07)70290-9

Summary

Have state-of-the-art clinical trials failed to deliver treatments for neurodegenerative diseases because of shortcomings in the rating scales used? This Review assesses two methodological limitations of rating scales that might help to answer this question. First, the numbers generated by most rating scales do not satisfy the criteria for rigorous measurements. Second, we do not really know which variables most rating scales measure. We use clinical examples to highlight concerns about the limitations of rating scales, examine their underlying rationales, clarify their implications, explore potential solutions, and make some recommendations for future research. We show that improvements in the scientific rigour of rating scales can improve the chances of reaching the correct conclusions about the effectiveness of treatments.

Introduction

A recent review of UK health research funding¹ emphasised the importance of translational research and highlighted an internationally recognised problem: success in basic science rarely leads to effective treatments. Why have state-of-the-art clinical trials failed to deliver treatments? Are all candidate molecules that work in controlled laboratory settings worthless when studied in human beings? Conversely, do some of the methods used to test the efficacy of treatments hinder advances in basic science?

In this Review we focus on the latter point and, in particular, on the rating scales used to measure health outcomes in trials of treatments for neurological diseases. These scales are increasingly selected as primary or secondary outcome measures in clinical trials.²⁻⁶ Rating scales are, therefore, the main dependent variables on which decisions are made that influence patient care and guide future research; the adequacy of these decisions depends directly on the scientific quality of the rating scales.

Two developments indicate an appreciation of this fact: the increased application of the science of rating scales (psychometrics) to the measurement of health outcomes in clinical neurology; and the impending scientific requirements of the US Food and Drug Administration (FDA) for patient-reported rating scales in clinical trials.⁷,⁸ The FDA requirements are likely to be emulated by the European Medicines Agency (EMEA)⁹ and will be pertinent to all rating scales, not just those that are patient-reported.

Our opening remarks might suggest that we think that published data from clinical trials are littered with type-2 errors due to poor rating scales. We do not know whether this is the case; nor do we know the frequency of type-1 errors that arise from problems with rating scales. We do know, however, that the reliability, validity, and responsiveness of different scales influence their ability to estimate accurately the effect of a disease and to detect clinical change, and have implications for calculations of sample size.¹⁰ As such, the differences among rating scales have the potential to influence the outcome of clinical trials (panel 1).
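The link between scale reliability and sample size can be illustrated with a rough calculation. The sketch below is not taken from the Review. It assumes classical test theory (observed-score variance equals true-score variance divided by reliability), a two-arm parallel trial, and a normal-approximation sample-size formula. The function name `n_per_group` and the example effect size of 0.5 are illustrative assumptions.

```python
import math
from statistics import NormalDist


def n_per_group(true_effect_size, reliability, alpha=0.05, power=0.80):
    """Approximate participants needed per arm (two-sided, two-sample z-test).

    true_effect_size: standardised mean difference on the error-free variable.
    reliability: proportion of observed-score variance that is true variance.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    # Measurement error inflates the observed SD by 1/sqrt(reliability),
    # so the standardised effect seen on the scale shrinks by sqrt(reliability).
    observed_effect_size = true_effect_size * math.sqrt(reliability)
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / observed_effect_size ** 2)


# Hypothetical comparison: same underlying effect, scales of differing reliability.
for r in (1.0, 0.9, 0.7, 0.5):
    print(f"reliability={r:.1f}  n per group={n_per_group(0.5, r)}")
```

Under these assumptions, lowering reliability from 1.0 to 0.7 raises the requirement from 63 to 90 participants per arm for the same underlying effect, which is one way in which the choice of scale could affect whether a trial detects a real treatment effect.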

Therefore, clinicians need to ensure that rating scales are fit for purpose, and maximising the scientific rigour of rating scales improves the chances of coming to the correct conclusion about the efficacy of a treatment. On this basis, a fundamental requirement of rigorous clinical trials is that the numbers generated by rating scales satisfy established scientific criteria as measurements of explicit, clinically meaningful variables.

A review of the subject of rating scales as outcome measures is, therefore, timely. We introduce the basic principles of the mechanics of rating scales and the limitations of the data derived from them. We discuss the benefits of moving to new psychometric methods and make recommendations to bring rating scales into line with what they measure. We highlight two methodological limitations that require attention to ensure that state-of-the-art clinical trials are underpinned by state-of-the-art measurements: the first limitation is that the numbers generated by most rating scales do not satisfy criteria as rigorous measurements; the second limitation is that we do not really know what variables most rating scales are measuring. These facts have great potential to undermine clinical trials, patient care, and research. The extent to which the limitations of rating scales are to blame for the failure of clinical trials to deliver treatments is unknown. However, our review highlights the potential contribution of rating scales and the way their data are analysed.

Section snippets

Basis of rating scales as outcome measures

Some variables (eg, height and weight) can be measured directly. Other variables (eg, disability, cognitive function, and quality of life) are measured indirectly by how they manifest; therefore, we need a method to transform the manifestations of these “latent” variables into numbers that can be taken as measurements.¹³

Rating scales are a means to measure latent variables, and two types of rating scale are commonly used in neurology: single item scales (eg, Ashworth scale [figure 1],¹⁴

The requirement for rating scales to generate rigorous measurements

Phase III clinical trials need rating scales that generate rigorous measurements. Unfortunately, this is rarely achieved because most rating scales generate ordered scores that are only suitable for group comparison studies, rather than precise measurements of an individual.
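The difference between ordered scores and measurements can be made concrete with a small numerical sketch. It is not the authors' analysis: it assumes a dichotomous Rasch model as one example of the newer psychometric methods, and the ten item difficulties are hypothetical. For each summed raw score it finds the location (in logits) whose expected score equals that raw score, then reports the distance between adjacent raw scores.

```python
import math


def expected_raw_score(theta, difficulties):
    """Expected summed score at location theta under a dichotomous Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def theta_for_raw_score(raw, difficulties, lo=-10.0, hi=10.0, tol=1e-9):
    """Bisection search for the theta whose expected score equals `raw`."""
    if not 0 < raw < len(difficulties):
        raise ValueError("extreme scores have no finite estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Expected score increases with theta, so move the bracket accordingly.
        if expected_raw_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def logit_steps(difficulties):
    """Logit distance between each pair of adjacent non-extreme raw scores."""
    thetas = [theta_for_raw_score(r, difficulties)
              for r in range(1, len(difficulties))]
    return [b - a for a, b in zip(thetas, thetas[1:])]


# Hypothetical item difficulties (logits) for an illustrative 10-item scale.
DIFFICULTIES = [-2.0, -1.5, -1.0, -0.5, -0.25, 0.25, 0.5, 1.0, 1.5, 2.0]

for raw, step in enumerate(logit_steps(DIFFICULTIES), start=1):
    print(f"raw {raw} -> {raw + 1}: {step:.2f} logits")
```

With these assumed difficulties, a one-point change near the ends of the score range corresponds to a larger change on the logit scale than a one-point change in the middle, so equal differences in summed scores do not imply equal differences in the underlying variable.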

The requirement to know precisely what variables are measured

Clinical trials require rating scales that actually measure the health constructs that they claim to (ie, the scales are valid) and health constructs that are clinically meaningful and can be interpreted. Unfortunately, current methods to establish the validity of a rating scale rarely meet these goals.

Recommendations

The FDA draft recommendations for patient-reported rating scales in clinical trials highlight the importance of “conceptually sound, reliable, and valid measures”.⁸ Such an acknowledgment is a vital, albeit first, step. Surprisingly, the document barely mentions new psychometric methods, despite their clear advantages and increased use;¹⁰⁴⁻¹⁰⁹ furthermore, despite the emphasis on the improvement of methods to establish validity, it does not provide detailed guidance on how

Conclusions

In this Review we posed a question: why have state-of-the-art clinical trials in neurology failed to deliver treatments? Our aim was to highlight the potential contribution to this failure of the currently available rating scales and the way their data are analysed. However, rating scales are not always to blame. Indeed, the extent to which rating scales undermine inferences from clinical trials is difficult to determine. Our message is simple: when rating scales are used, they must be fit for

Search strategy and selection criteria

Our Review is a focused critique of the literature on the basis of articles, reports, and book chapters that span more than a century of research in three areas: psychometrics, health measurement, and neurological clinical trials. These were collected as part of the general strategy in our unit (Neurological Outcome Measures Unit) during the past 15 years to develop a clear and detailed understanding of the science behind rating scales. Our search strategy included searches of electronic

References (118)

J Zajicek et al. Cannabinoids for treatment of spasticity and other symptoms related to multiple sclerosis (CAMS study): multi-centre randomised placebo-controlled trial. Lancet (2003)
C Olanow et al. TCH346 as a neuroprotective drug in Parkinson's disease: a double-blind, randomised, controlled trial. Lancet Neurol (2006)
D Revicki. FDA draft guidance and health-outcomes research. Lancet (2007)
JA Sloan et al. Assessing the clinical significance of single items relative to summated scores. Mayo Clin Proc (2002)
L Kappos et al. Effect of early versus delayed interferon beta-1b treatment on disability after a first clinical event suggestive of multiple sclerosis: a 3-year follow-up analysis of the BENEFIT study. Lancet (2007)
MR Novick. The axioms and principal results of classical test theory. J Math Psychol (1966)
S Kasner. Clinical interpretation and use of stroke scales. Lancet Neurol (2006)
D Andrich. A framework relating outcomes based education and the taxonomy of educational objectives. Stud Educ Eval (2002)
D Andrich. Implication and applications of modern test theory in the context of outcomes based research. Stud Educ Eval (2002)
GW Bohrnstedt. Measurement
D Cooksey. A review of health research funding (2006)
P Aisen et al. Effects of rofecoxib or naproxen vs placebo on Alzheimer's disease progression: a randomized controlled trial. JAMA (2003)
J Fairbank et al. Randomised controlled trial to compare surgical stabilisation of the lumbar spine with an intensive rehabilitation programme for patients with chronic low back pain: the MRC spine stabilisation trial. BMJ (2005)
K Lees et al. NXY-059 for acute ischemic stroke. N Engl J Med (2006)
Patient reported outcome measures: use in medical product development to support labelling claims, 2006
Reflection paper on the regulatory guidance for the use of the health-related quality of life (HRQL) measures in the evaluation of medicinal products (2006)
JC Hobart et al. How responsive is the MSIS-29? A comparison with other self report scales. J Neurol Neurosurg Psychiatr (2005)
DATATOP: a multicenter controlled clinical trial in early Parkinson's disease. Arch Neurol (1989)
F Stocchi et al. Neuroprotection in Parkinson's disease: clinical trials. Ann Neurol (2003)
BD Wright et al. Rating scale analysis: Rasch measurement (1982)
B Ashworth. Preliminary trial of carisoprodol in multiple sclerosis. Practitioner (1964)
JF Kurtzke. Rating neurological impairment in multiple sclerosis: an expanded disability status scale (EDSS). Neurology (1983)
J Rankin. Cerebral vascular accidents in patients over the age of 60: II. Prognosis. Scott Med J (1957)
S Hauser et al. Intensive immunosuppression in progressive multiple sclerosis: a randomised three-arm study of high dose intravenous cyclophosphamide, plasma exchange, and ACTH. N Engl J Med (1983)
MM Hoehn et al. Parkinsonism: onset, progression, and mortality. Neurology (1967)
FM Collen et al. The Rivermead Mobility Index: a further development of the Rivermead Motor Assessment. Int Disabil Stud (1991)
FI Mahoney et al. Functional evaluation: the Barthel Index. Maryland State Med J (1965)
CV Granger et al. Advances in functional assessment for medical rehabilitation. Topics Geriatr Rehab (1986)
JC Nunnally. Psychometric theory (1967)
W Manning et al. The status of health in demand estimation: or beyond excellent, good, fair, and poor
EuroQoL: a new facility for the measurement of health-related quality of life. Health Policy (1990)
B Haas et al. The inter rater reliability of the original and of the modified Ashworth scale for the assessment of spasticity in patients with spinal cord injury. Spinal Cord (1996)
JC Hobart et al. Kurtzke scales revisited: the application of psychometric methods to clinical intuition. Brain (2000)
M Blackburn et al. Reliability of measurements obtained with the modified Ashworth scale in the lower extremities of people with stroke. Phys Ther (2002)
N Clopton et al. Interrater and intrarater reliability of the Modified Ashworth Scale in children with hypertonia. Pediatric Physical Therapy (2005)
J Wilson et al. Reliability of the modified Rankin Scale across multiple raters: benefits of a structured interview. Stroke (2005)
W Yam et al. Interrater reliability of Modified Ashworth Scale and Modified Tardieu Scale in children with spastic cerebral palsy. J Child Neurol (2006)
P New et al. Critical appraisal and review of the Rankin scale and its derivatives. Neuroepidemiology (2006)
CA McHorney et al. The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts. Med Care (1992)
JC Hobart. Rating scales for neurologists. J Neurol Neurosurg Psychiatr (2003)
C Vaney et al. Efficacy, safety and tolerability of an orally administered cannabis extract in the treatment of spasticity in patients with multiple sclerosis: a randomized, double-blind, placebo-controlled, crossover study. Mult Scler (2004)
M Uyttenboogaart et al. Measuring disability in stroke: relationship between the modified Rankin scale and the Barthel index. J Neurol (2007)
JC Hobart et al. Measuring the impact of MS on walking ability: the 12-item MS Walking Scale (MSWS-12). Neurology (2003)
EL Thorndike. An introduction to the theory of mental and social measurements (1904)
LL Thurstone. Theory of attitude measurement. Psychol Rev (1929)
C Merbitz et al. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil (1989)
BD Wright et al. Observations are always ordinal: measurements, however, must be interval. Arch Phys Med Rehabil (1989)
R Massof. The measurement of vision disability. Optom Vis Sci (2002)
J Michell. Measurement: a beginner's guide. J Appl Meas (2003)
J Michell. An introduction to the logic of psychological measurement (1990)
