Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination

G Regehr; H MacRae; R K Reznick; D Szalay

doi:10.1097/00001888-199809000-00020

Acad Med. 1998 Sep;73(9):993-7. doi: 10.1097/00001888-199809000-00020.

Authors

G Regehr¹, H MacRae, R K Reznick, D Szalay

Affiliation

¹ Centre for Research in Education, Faculty of Medicine, University of Toronto, Ontario, Canada. g.regehr@utoronto.ca

PMID: 9759104

Abstract

Purpose: To compare the psychometric properties of checklists, global rating scales preceded by a checklist, and global rating scales alone in assessing surgery residents' performances on an OSCE-like technical skills examination.

Method: In 1996, 53 general surgery residents with one to six years of postgraduate training participated in a performance-based examination of technical skills consisting of eight 15-minute stations (bench-model simulations of operative procedures in general surgery). Two qualified surgeons marked at each station, one using a task-specific checklist (C) and a subsequent global rating scale (Gc), the other using a global rating scale only (G).

Results: Interstation reliabilities measured by Cronbach's alpha were .79 for C, .89 for Gc, and .85 for G. A series of multiple regressions predicting level of training from test scores revealed an R² of .584 for C alone, which increased to .711 when Gc was entered after C (p < .001) and to .704 when G was entered after C (p < .001). However, R² for Gc alone was .711 and for G alone was .704, neither of which changed when C was entered into the prediction (p > .10). The R² for Gc and G together predicting level of training (.725) was not significantly greater than that of either Gc or G alone. A very similar pattern of results was seen when C, Gc, and G were used to predict independent evaluations of the operative outcomes.
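For readers unfamiliar with the two statistics reported above, the following is a minimal Python sketch of how Cronbach's alpha across stations and the F statistic for a change in R² between nested regressions are conventionally computed. It is not the authors' analysis code. The function names `cronbach_alpha` and `f_change` are hypothetical, and the example call assumes that all 53 residents entered the regression and that a single predictor was added at each step, which the abstract does not state explicitly.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one row per examinee, one column per station.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(scores[0])  # number of stations (items)
    if k < 2:
        raise ValueError("alpha needs at least two stations")
    # Variance of each station's scores across examinees
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of each examinee's total score
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def f_change(r2_reduced, r2_full, n, p_full, q=1):
    """F statistic for the R-squared gain from adding q predictors.

    n: number of examinees; p_full: predictors in the larger model.
    Degrees of freedom are (q, n - p_full - 1).
    """
    gain_per_predictor = (r2_full - r2_reduced) / q
    residual_per_df = (1 - r2_full) / (n - p_full - 1)
    return gain_per_predictor / residual_per_df


if __name__ == "__main__":
    # Assumed setup: n = 53, checklist score entered first, Gc added second.
    print(f_change(0.584, 0.711, n=53, p_full=2))
```

Under these assumptions, a gain from .584 to .711 yields an F statistic of roughly 22 on (1, 50) degrees of freedom, whereas an unchanged R² (for example, .711 before and after adding C) yields an F of 0. Both are consistent in direction with the significance levels the abstract reports.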

Conclusions: Global rating scales scored by experts showed higher interstation reliability, better construct validity, and better concurrent validity than did checklists. Further, the presence of the checklists did not improve the reliability or validity of the global rating scale over that of the global rating scale alone. These results suggest that global rating scales administered by experts are a more appropriate summative measure when assessing candidates on performance-based examinations.

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Educational Measurement / methods*
General Surgery / education*
Internship and Residency*
Psychometrics*
United States