Commentary: Overview of Classical Test Theory and Item Response Theory for the Quantitative Assessment of Items in Developing Patient-Reported Outcomes Measures
Introduction
The publication of the US Food and Drug Administration’s guidance for industry on patient-reported outcomes (PRO)1 has generated discussion and debate on the methods used for developing, and establishing the content validity of, PRO instruments. The guidance outlines the information that the FDA considers when evaluating a PRO measure as a primary or secondary end point to support a claim in medical product labeling. The PRO guidance highlights the importance of establishing evidence of content validity, defined as “the extent to which the instrument measures the concept of interest” (p. 12).1
Content validity is the extent to which an instrument covers the important concepts of the unobservable, or latent, attribute (eg, depression, anxiety, physical functioning, self-esteem) that the instrument purports to measure. It is the degree to which the content of a measurement instrument is an adequate reflection of the construct being measured. Hence, qualitative work with patients is essential to ensure that a PRO instrument captures all of the important aspects of the concept from the patient’s perspective.
Two reports from the International Society for Pharmacoeconomics and Outcomes Research Good Research Practices Task Force2,3 detail the qualitative methodology and 5 steps that should be employed to establish content validity of a PRO measure: (1) determine the context of use (eg, medical product labeling); (2) develop the research protocol for qualitative concept elicitation and analysis; (3) conduct the concept elicitation interviews and focus groups; (4) analyze the qualitative data; and (5) document concept development, elicitation methodology, and results. Essentially, the inclusion of the entire range of relevant issues in the target population embodies adequate content validity of a PRO instrument.
Although qualitative data from interviews and focus groups with the targeted patient sample are necessary to develop PRO measures, qualitative data alone are not sufficient to document the content validity of the measure. Along with qualitative methods, quantitative methods are needed to develop PRO measures with good measurement properties. Quantitative data gathered during earlier stages of instrument development can serve as: (1) a barometer to see how well items address the entire continuum of the targeted concept of interest; (2) a gauge of whether to go forward with psychometric testing; and (3) a meter to mitigate risk related to Phase III signal detection and interpretation.
Specifically, quantitative methods can support the development of PRO measures by addressing several core questions of content validity: What is the range of item responses relative to the sample (ie, the distribution of item responses and their endorsement)? Are the response options used by patients as intended? Does a higher response option imply a greater health problem than a lower response option does? And what is the distance between response categories in terms of the underlying concept?
Also relevant is the extent to which the instrument reliably assesses the full range of the target population (scale-to-sample targeting), ceiling or floor effects, and the distribution of the total scores. Does the item order with respect to disease severity reflect the hypothesized item order? To what extent do item characteristics relate to how patients rank the items in terms of their importance or bother?
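These descriptive checks can be computed directly from an item-response matrix. The sketch below uses simulated Likert-type data in place of real patient responses; the sample size, number of items, and response range are illustrative assumptions:

```python
import numpy as np

# Hypothetical data: 200 patients answering 6 items on a 0-4 Likert scale.
rng = np.random.default_rng(0)
responses = rng.integers(0, 5, size=(200, 6))

# Endorsement: proportion of patients selecting each response option per item.
for item in range(responses.shape[1]):
    counts = np.bincount(responses[:, item], minlength=5)
    print(f"Item {item + 1} endorsement:", np.round(counts / len(responses), 2))

# Floor/ceiling effects: proportion of patients at the minimum or maximum
# possible total score, a basic check of scale-to-sample targeting.
totals = responses.sum(axis=1)
floor = np.mean(totals == 0)
ceiling = np.mean(totals == responses.shape[1] * 4)
print(f"Floor: {floor:.1%}, Ceiling: {ceiling:.1%}")
```

Response options that are rarely endorsed, or large proportions of patients at the floor or ceiling, flag items and scales that may not cover the full range of the target population.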
This article reviews the classical test theory and the item response theory (IRT) approaches to developing PRO measures and to addressing these questions. These content-based questions and the 2 quantitative approaches to addressing them are consistent with construct validity, now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity.4 The use of quantitative methods early in instrument development is aimed at providing descriptive profiles and exploratory information about the content represented in a draft PRO instrument. Confirmatory psychometric evaluations, occurring at the later stages of instrument development, should be used to provide more definitive information regarding the measurement characteristics of the instrument.
Classical Test Theory
Classical test theory is a conventional quantitative approach to testing the reliability and validity of a scale based on its items. In the context of PRO measures, classical test theory assumes that each observed score (X) on a PRO instrument is a combination of an underlying true score (T) on the concept of interest and nonsystematic (ie, random) error (E). Classical test theory, also known as true-score theory, assumes that each person has a true score, T, that would be obtained if there were no errors of measurement (ie, X = T + E).
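The true-score decomposition can be illustrated with a small simulation. The sketch below generates observed item scores as X = T + E and estimates internal-consistency reliability with Cronbach's alpha; the sample size and variance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 500, 8

# Each person's true score T on the latent concept, plus independent
# random error E for each item, yields the observed scores X = T + E.
T = rng.normal(0.0, 1.0, size=(n_persons, 1))
E = rng.normal(0.0, 0.8, size=(n_persons, n_items))
X = T + E

# Cronbach's alpha: internal-consistency reliability of the summed scale,
# computed from item variances and the variance of the total score.
item_vars = X.var(axis=0, ddof=1)
total_var = X.sum(axis=1).var(ddof=1)
k = n_items
alpha = k / (k - 1) * (1.0 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With these simulation settings the items share a common true score, so alpha is high; shrinking the error variance (or adding items) raises it further, consistent with classical test theory.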
Item Response Theory
Item response theory (IRT) is a collection of measurement models that attempt to explain the connection between observed item responses on a scale and an underlying construct. Specifically, IRT models are mathematical equations describing the association between subjects’ levels on a latent variable and the probability of a particular response to an item, using a nonlinear monotonic function.14 As in classical test theory, IRT requires that each item be distinct from the others, yet similar and consistent enough that all items measure the same underlying concept.
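As a sketch of such a nonlinear monotonic function, the widely used two-parameter logistic (2PL) model gives the probability of endorsing a dichotomous item as a function of the latent trait; the item parameters below are hypothetical:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve: probability of
    endorsing an item given latent trait theta, discrimination a, and
    difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Evaluate the curve across a range of latent-trait levels for a
# hypothetical item with moderate discrimination and average difficulty.
theta = np.linspace(-3.0, 3.0, 7)
probs = icc_2pl(theta, a=1.5, b=0.0)
print(np.round(probs, 3))  # monotonically increasing in theta
```

At theta = b the endorsement probability is exactly 0.5; the discrimination parameter a controls how steeply the probability rises around that point, which is how IRT orders items by difficulty along the latent continuum.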
Discussion
Classical test theory and IRT provide useful methods for assessing content validity during the early development of a PRO measure. IRT requires several items so that there is adequate opportunity to have a sufficient range for levels of item difficulty and person attribute. Single-item measures, or measures with too few items, are not suitable for IRT analysis (or, for that matter, for some analyses in classical test theory). In both IRT and classical test theory, each item should be distinct from the others, yet similar and consistent in measuring the same concept.
Conclusions
This article has presented an overview of classical test theory and IRT in the quantitative assessment of items and scales during the content-validity phase of PRO-measure development. Depending on the particular type of measure and the specific circumstances, either approach or both approaches may be useful to help maximize the content validity of a PRO measure.
Conflicts of Interest
J. Cappelleri is an employee of, and holds stock options in, Pfizer Inc. The opinions expressed here do not reflect the views of Pfizer Inc or any other institution. The authors have indicated that they have no other conflicts of interest with regard to the content of this article.
Acknowledgments
The authors gratefully acknowledge comments from Dr. Stephen Coons (Critical Path Institute) on the manuscript and also the comprehensive set of comments from 2 anonymous reviewers, all of which improved the quality of the article.
Dr. Lundy is an employee of the Critical Path Institute, which is supported by grant No. U01FD003865 from the United States Food and Drug Administration. Dr. Hays was supported in part by funding from the Critical Path Institute and by grants from the Agency for
References (23)
- et al. Content validity—establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force Report: Part 1—eliciting concepts for a new PRO instrument. Value Health. 2011.
- et al. Content validity—establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force Report: Part 2—assessing respondent understanding. Value Health. 2011.
- Guidance for industry. Patient-reported outcome measures: use in medical product development to support labeling claims. Fed Reg. 2009.
- et al. Construct validity: advances in theory and methodology. Ann Rev Clin Psychol. 2009.
- et al. Psychological Testing. 1997.
- et al. Approaches and recommendations for estimating minimally important differences for health-related quality of life measures. COPD. 2005.
- et al. Multitrait scaling program: MULTI. Proceedings of the Seventeenth Annual SAS Users Group International Conference. 1992.
- et al. Evaluating multi-item scales.
- et al. Patient-Reported Outcomes: Measurement, Implementation and Interpretation. 2014.
- et al. Some standard errors in item response theory. Psychometrika. 1982.
- Using Multivariate Statistics.