Commentary: Overview of Classical Test Theory and Item Response Theory for the Quantitative Assessment of Items in Developing Patient-Reported Outcomes Measures
Introduction
The publication of the US Food and Drug Administration’s guidance for industry on patient-reported outcomes (PRO)1 has generated discussion and debate on the methods used for developing, and establishing the content validity of, PRO instruments. The guidance outlines the information that the FDA considers when evaluating a PRO measure as a primary or secondary end point to support a claim in medical product labeling. The PRO guidance highlights the importance of establishing evidence of content validity, defined as “the extent to which the instrument measures the concept of interest” (p. 12).1
Content validity is the extent to which an instrument covers the important concepts of the unobservable, or latent, attribute (eg, depression, anxiety, physical functioning, self-esteem) that the instrument purports to measure. It is the degree to which the content of a measurement instrument is an adequate reflection of the construct being measured. Hence, qualitative work with patients is essential to ensure that a PRO instrument captures all of the important aspects of the concept from the patient’s perspective.
Two reports from the International Society for Pharmacoeconomics and Outcomes Research Good Research Practices Task Force2,3 detail the qualitative methodology and 5 steps that should be employed to establish content validity of a PRO measure: (1) determine the context of use (eg, medical product labeling); (2) develop the research protocol for qualitative concept elicitation and analysis; (3) conduct the concept elicitation interviews and focus groups; (4) analyze the qualitative data; and (5) document concept development, elicitation methodology, and results. Essentially, the inclusion of the entire range of relevant issues in the target population embodies adequate content validity of a PRO instrument.
Although qualitative data from interviews and focus groups with the targeted patient sample are necessary to develop PRO measures, qualitative data alone are not sufficient to document the content validity of the measure. Along with qualitative methods, quantitative methods are needed to develop PRO measures with good measurement properties. Quantitative data gathered during earlier stages of instrument development can serve as: (1) a barometer to see how well items address the entire continuum of the targeted concept of interest; (2) a gauge of whether to go forward with psychometric testing; and (3) a meter to mitigate risk related to Phase III signal detection and interpretation.
Specifically, quantitative methods can support the development of PRO measures by addressing several core questions of content validity: What is the range of item responses relative to the sample (ie, the distribution of item responses and their endorsement)? Are the response options used by patients as intended? Does a higher response option imply a greater health problem than a lower response option does? And what is the distance between response categories in terms of the underlying concept?
Also relevant is the extent to which the instrument reliably assesses the full range of the target population (scale-to-sample targeting), ceiling or floor effects, and the distribution of the total scores. Does the item order with respect to disease severity reflect the hypothesized item order? To what extent do item characteristics relate to how patients rank the items in terms of their importance or bother?
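These descriptive checks can be computed directly from an item-response matrix. The sketch below uses simulated Likert-type data in place of real patient responses; the sample size, number of items, and response range are illustrative assumptions:

```python
import numpy as np

# Hypothetical data: 200 patients answering 6 items on a 0-4 Likert scale.
rng = np.random.default_rng(0)
responses = rng.integers(0, 5, size=(200, 6))

# Endorsement: proportion of patients selecting each response option per item.
for item in range(responses.shape[1]):
    counts = np.bincount(responses[:, item], minlength=5)
    print(f"Item {item + 1} endorsement:", np.round(counts / len(responses), 2))

# Floor/ceiling effects: proportion of patients at the minimum or maximum
# possible total score, a basic check of scale-to-sample targeting.
totals = responses.sum(axis=1)
floor = np.mean(totals == 0)
ceiling = np.mean(totals == responses.shape[1] * 4)
print(f"Floor: {floor:.1%}, Ceiling: {ceiling:.1%}")
```

Response options that are rarely endorsed, or large proportions of patients at the floor or ceiling, flag items and scales that may not cover the full range of the target population.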
This article reviews the classical test theory and the item response theory (IRT) approaches to developing PRO measures and to addressing these questions. These content-based questions and the 2 quantitative approaches to addressing them are consistent with construct validity, now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity.4 The use of quantitative methods early in instrument development is aimed at providing descriptive profiles and exploratory information about the content represented in a draft PRO instrument. Confirmatory psychometric evaluations, occurring at the later stages of instrument development, should be used to provide more definitive information regarding the measurement characteristics of the instrument.
Classical Test Theory
Classical test theory is a conventional quantitative approach to testing the reliability and validity of a scale based on its items. In the context of PRO measures, classical test theory assumes that each observed score (X) on a PRO instrument is a combination of an underlying true score (T) on the concept of interest and nonsystematic (ie, random) error (E). Classical test theory, also known as true-score theory, assumes that each person has a true score, T, that would be obtained if there were no errors of measurement (ie, X = T + E).
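The true-score decomposition can be illustrated with a small simulation. The sketch below generates observed item scores as X = T + E and estimates internal-consistency reliability with Cronbach's alpha; the sample size and variance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 500, 8

# Each person's true score T on the latent concept, plus independent
# random error E for each item, yields the observed scores X = T + E.
T = rng.normal(0.0, 1.0, size=(n_persons, 1))
E = rng.normal(0.0, 0.8, size=(n_persons, n_items))
X = T + E

# Cronbach's alpha: internal-consistency reliability of the summed scale,
# computed from item variances and the variance of the total score.
item_vars = X.var(axis=0, ddof=1)
total_var = X.sum(axis=1).var(ddof=1)
k = n_items
alpha = k / (k - 1) * (1.0 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With these simulation settings the items share a common true score, so alpha is high; shrinking the error variance (or adding items) raises it further, consistent with classical test theory.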
Item Response Theory
Item response theory (IRT) is a collection of measurement models that attempt to explain the connection between observed item responses on a scale and an underlying construct. Specifically, IRT models are mathematical equations describing the association between subjects’ levels on a latent variable and the probability of a particular response to an item, using a nonlinear monotonic function.14 As in classical test theory, IRT requires that each item be distinct from the others, yet similar and consistent enough that all items measure the same underlying concept.
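As a sketch of such a nonlinear monotonic function, the widely used two-parameter logistic (2PL) model gives the probability of endorsing a dichotomous item as a function of the latent trait; the item parameters below are hypothetical:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve: probability of
    endorsing an item given latent trait theta, discrimination a, and
    difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Evaluate the curve across a range of latent-trait levels for a
# hypothetical item with moderate discrimination and average difficulty.
theta = np.linspace(-3.0, 3.0, 7)
probs = icc_2pl(theta, a=1.5, b=0.0)
print(np.round(probs, 3))  # monotonically increasing in theta
```

At theta = b the endorsement probability is exactly 0.5; the discrimination parameter a controls how steeply the probability rises around that point, which is how IRT orders items by difficulty along the latent continuum.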
Discussion
Classical test theory and IRT provide useful methods for assessing content validity during the early development of a PRO measure. IRT requires several items so that there is adequate opportunity to have a sufficient range for levels of item difficulty and person attribute. Single-item measures, or measures with too few items, are not suitable for IRT analysis (or, for that matter, for some analyses in classical test theory). In both IRT and classical test theory, each item should be distinct from the others, yet similar and consistent in measuring the same concept.
Conclusions
This article has presented an overview of classical test theory and IRT in the quantitative assessment of items and scales during the content-validity phase of PRO-measure development. Depending on the particular type of measure and the specific circumstances, either approach or both approaches may be useful to help maximize the content validity of a PRO measure.
Conflicts of Interest
J. Cappelleri is an employee of, and holds stock options in, Pfizer Inc. The opinions expressed here do not reflect the views of Pfizer Inc or any other institution. The authors have indicated that they have no other conflicts of interest with regard to the content of this article.
Acknowledgments
The authors gratefully acknowledge comments from Dr. Stephen Coons (Critical Path Institute) on the manuscript and also the comprehensive set of comments from 2 anonymous reviewers, all of which improved the quality of the article.
Dr. Lundy is an employee of the Critical Path Institute, which is supported by grant No. U01FD003865 from the United States Food and Drug Administration. Dr. Hays was supported in part by funding from the Critical Path Institute and by grants from the Agency for
References (23)
- et al. Content validity—establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force Report: Part 1—eliciting concepts for a new PRO instrument. Value Health. 2011.
- et al. Content validity—establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force Report: Part 2—assessing respondent understanding. Value Health. 2011.
- Guidance for industry. Patient-reported outcome measures: use in medical product development to support labeling claims. Fed Reg. 2009.
- et al. Construct validity: advances in theory and methodology. Ann Rev Clin Psychol. 2009.
- et al. Psychological Testing. 1997.
- et al. Approaches and recommendations for estimating minimally important differences for health-related quality of life measures. COPD. 2005.
- et al. Multitrait scaling program: MULTI. Proceedings of the Seventeenth Annual SAS Users Group International Conference. 1992.
- et al. Evaluating multi-item scales.
- et al. Patient-Reported Outcomes: Measurement, Implementation and Interpretation. 2014.
- et al. Some standard errors in item response theory. Psychometrika. 1982.
- Using Multivariate Statistics.