Order effects in high stakes undergraduate examinations: an analysis of 5 years of administrative data in one UK medical school

Objective: To investigate the association between student performance in undergraduate objective structured clinical examinations (OSCEs) and the examination schedule to which students were assigned.
Design: Analysis of routinely collected data.
Setting: One UK medical school.
Participants: 2331 OSCEs of 3 different types (obstetrics OSCE, paediatrics OSCE and simulated clinical encounter examination OSCE) between 2009 and 2013. Students were not quarantined between examinations.
Outcomes: (1) Pass rates by day examination started, (2) pass rates by day station undertaken and (3) mean scores by day examination started.
Results: We found no evidence that pass rates differed according to the day on which the examination was started by a candidate in any of the examinations considered (p>0.1 for all). There was evidence (p=0.013) that students were more likely to pass individual stations on the second day of the paediatrics OSCE (OR 1.27, 95% CI 1.05 to 1.54). In the cases of the simulated clinical encounter examination and the obstetrics and gynaecology OSCEs, there was no (p=0.42) or very weak evidence (p=0.099), respectively, of any such variation in the probability of passing individual stations according to the day they were attempted. There was no evidence that mean scores varied by day apart from the paediatric OSCE, where slightly higher scores were achieved on the second day of the examination.
Conclusions: There is little evidence that different examination schedules have a consistent effect on pass rates or mean scores: students starting the examinations later were not consistently more or less likely to pass or score more highly than those starting earlier. The practice of quarantining students to prevent communication with (and subsequent unfair advantage for) subsequent examination cohorts is unlikely to be required.

Strengths and limitations of this study
▪ The study data set, containing 5 years of data from three separate objective structured clinical examinations (OSCEs) in one medical school, is the largest to date to be analysed to investigate the impact of examination schedule on examination performance.
▪ The varying nature of examinations between medical schools makes it challenging to conduct cross-institutional analyses, but the inclusion of only one institution may limit the generalisability of our findings.
▪ Ideally, we would consider the impact of examiners on variations in examination performance according to day: however, this is not straightforward to satisfactorily accomplish as examiner effects may be confounded by subject and station difficulty, and could change across different years.

INTRODUCTION
High-stakes undergraduate medical assessments determine whether a student may or may not progress to medical qualification. As such, it is essential that the examination processes are valid, reliable, transparent, and fair. Medical schools worldwide use objective structured clinical examinations (OSCEs) 1 2 in this context to assess students' clinical and communication skills. Such examinations aspire to ensure robust procedures by requiring all candidates to undertake the same clinical stations, to be completed within predetermined time limits and assessed using the same marking scheme. 2 To accommodate all candidates for these examinations, many medical schools are compelled to run an OSCE repeatedly, with students scheduled to undertake the examination in sequential groups over a number of days. This leads to two concerns for students and medical schools.
The first concern relates to security breaches: it is possible that students in earlier scheduled examination times might tell those in later times about the content of the examination, advantaging (or potentially disadvantaging) those students who come later. To reduce the potential for this, some medical schools routinely quarantine earlier student examination groupings from later ones until an overall OSCE is complete. This involves challenges for the medical schools in accommodating quarantined students (especially if an OSCE runs over more than 1 day) and in ensuring effective quarantine, such as restricting the use of smart phones/watches and other mobile devices. In addition, it imposes a significant burden on quarantined students, which is not shared by their colleagues in later cohorts.
With OSCEs' focus on assessment of skills, rather than knowledge, some have argued that the short time lag between examination groupings is insufficient for any briefing about examination content to lead to an improvement in the performance of candidates in later examination times. 3 4 For example, a 1989 study reported no evidence of information sharing affecting the performance among US fourth year medical students in OSCEs (which, in this instance, took place over a period of several weeks). 5 Others remain concerned about the likely impact of security breaches. 6 Collusion between third year medical students on the content of OSCEs has previously been identified through monitoring of discussions between students on an electronic discussion board, in which concerns were expressed about this taking place. 7 An experimental study modelling the effect of a severe security breach (ie, the leaking of checklists or the provision of coaching for three of the six clinical stations) found that students who had received additional information outperformed the control group by around 7%. 8
The second concern relates to whether examination grouping, in itself and independent of possible security breaches, may influence a candidate's performance. One explanation for this is that students in different examination groups may perform variably as a result of psychological pressures relating to the timing of their examination or changes in aspects of examination process between groups. An additional or alternative explanation is that examiners' scoring may change over time, becoming either more or less generous across examination groupings. The core question here is whether there is something inherent about being in the first versus subsequent examination groups that places candidates at an advantage or disadvantage in comparison to their peers.
Small but inconsistent effects have previously been found for the time of day (morning or afternoon) and day on which second-year medical students undertake OSCEs in a US medical school, assessed on a pass/fail basis. 9 However, other research has demonstrated little impact of the day of the examination (in dental OSCEs in the Netherlands, 10 in undergraduate medical OSCEs in Spain 11 ) or the timing of the examination (in undergraduate medical OSCEs in Canada 12 ). These studies were relatively small in scale (with 772, 9 463, 10 172 11 and 69 12 students, respectively) and examined a variety of OSCE set-ups, including the use of parallel streams 12 and non-consecutive days with no quarantining of students. 10 Given continuing concern about this issue, there is a need for larger-scale and up-to-date examination of the impact of examination order on performance in modern OSCE settings. The purpose of this study was, therefore, to investigate the association between the scores achieved by students in high-stakes OSCEs in one UK medical school, the University of Cambridge School of Clinical Medicine, and the examination grouping to which they were assigned.

METHODS
We conducted a retrospective analysis of three high-stakes OSCEs conducted in the School of Clinical Medicine between 2009 and 2013. Students in sequential examination groupings were not quarantined in any of these examinations. Data were retrieved by a single member of the research team, with the assistance of the School examinations administration team. Student and examiner identities were removed on data retrieval and replaced by a unique, anonymised study identifier. Our data covered:
1. Year 5 obstetrics and gynaecology OSCE: for this examination, individual candidates sat all stations in 1 day and in each year the OSCE was completed over 2 consecutive days.
2. Year 5 paediatrics OSCE: for this examination, individual candidates sat all stations in 1 day and in each year the OSCE was completed over 2 consecutive days.
3. Year 6 simulated clinical encounter examination (SCEE) OSCE: for this examination, individual candidates sat stations over 2 days and in each year the OSCE was completed in 3 days. Candidates were evenly split between sitting stations on days 1 and 2, days 1 and 3, and days 2 and 3, with the same number of candidates taking each station on each day.
For all OSCEs, the content of each station, including the question wording, did not vary between circuits or between days.

Data
For each OSCE examination, data were obtained for individual candidates and individual stations, alongside the overall pass mark for each examination in each year. Data were obtained for each OSCE for 2009-2013. For each candidate, we knew whether they passed the examination, their mean score, whether they passed each individual station of the examination and their score for each individual station. We had further information on the subject of each station, a pseudonymous code for the examiner for each station and the day and time of each station. There were no missing data.
In the case of all three examinations, in order to pass, students were required (1) to meet the overall examination pass mark, as defined by the borderline group method, 13 and (2) to additionally pass a minimum of 50% of individual stations: this ensures that poor performance in several stations cannot be compensated for by exceptionally high performance in one or two other stations.
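The two-part pass rule described above can be sketched as a small function. This is an illustration only: the function name, argument names and example marks are hypothetical, and treating "meet the pass mark" as greater-than-or-equal is an assumption rather than something stated in the text.

```python
from typing import Sequence


def passes_osce(overall_score: float, pass_mark: float,
                station_passed: Sequence[bool]) -> bool:
    """Return True only if both pass criteria are satisfied.

    Assumption: "meeting" the pass mark means overall_score >= pass_mark.
    """
    meets_pass_mark = overall_score >= pass_mark
    # "A minimum of 50% of individual stations" is read as at least half.
    enough_stations = 2 * sum(station_passed) >= len(station_passed)
    return meets_pass_mark and enough_stations


# Hypothetical candidate: a high overall score driven by two excellent
# stations, but only 2 of 6 stations passed, so criterion (2) fails.
print(passes_osce(72.0, 60.0, [True, True, False, False, False, False]))
```

The second criterion is what stops one or two exceptional stations from compensating for poor performance elsewhere.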

Statistical analysis
We used a series of models to investigate various aspects of potential order effects in OSCEs, which we applied separately to each of the three examinations under consideration ((1) Year 5 obstetrics and gynaecology OSCE, (2) Year 5 paediatrics OSCE and (3) Year 6 simulated clinical encounter examination OSCE). Each model included data from all candidates and all years for that OSCE.
Our primary question was whether the probability of passing each OSCE varied according to the day on which the examination was undertaken. We used a logistic regression model, adjusting for the year of examination, to investigate whether the probability of passing each OSCE varied according to the day on which each candidate started the examination (Model 1). We recognise, however, that the very high overall pass rate (>97%) limits the power of this approach. For example, if the true pass rate on day 1 was 96.5%, and the true pass rate on day 2 was 98.5%, the power to detect this difference, an OR of 2.4, would be <50%.
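The power example above can be checked approximately with a short calculation. The sketch below reproduces the odds ratio from the two stated pass rates and then uses a simple normal approximation for comparing two proportions. It is not the authors' calculation: it ignores the adjustment for year, and the group size of 390 candidates per start day is an illustrative assumption, not a figure reported in the text.

```python
from math import sqrt
from statistics import NormalDist


def odds_ratio(p_reference: float, p_comparison: float) -> float:
    """Odds of passing in the comparison group relative to the reference group."""
    return (p_comparison / (1 - p_comparison)) / (p_reference / (1 - p_reference))


def approx_power(p1: float, p2: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided test of two proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error under the null (pooled) and under the alternative.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return norm.cdf((abs(p2 - p1) - z_crit * se_null) / se_alt)


# Pass rates of 96.5% (day 1) and 98.5% (day 2), as in the example above.
print(round(odds_ratio(0.965, 0.985), 1))
# n_per_group=390 is a hypothetical group size chosen for illustration.
print(approx_power(0.965, 0.985, n_per_group=390))
```

Under these assumptions the approximate power falls below 50%, in line with the statement above.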
Our second question (less limited by power than our primary question) was whether the probability of passing individual stations varied according to the day on which those stations were attempted. This differs from Model 1, which considered the first day on which candidates attempted any stations (remembering that, in some OSCEs, candidates were required to be examined over 2 days). We used a mixed-effects logistic regression model adjusting for year (fixed effect) and for clustering of individual station results within candidates using a random effect (Model 2).
Our third question was whether any observed effects of the day on which stations were attempted were consistent from year to year: to consider this, we augmented Model 2 by including an interaction between day and the year-cohort (Model 3).
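For readers who prefer a formal statement, Models 1 to 3 as described above can be written out as follows. The notation is ours and is only a sketch under the assumption that day and year enter as categorical terms: $Y_i$ indicates whether candidate $i$ passed the OSCE, $Y_{ij}$ whether candidate $i$ passed station $j$, $s(i)$ is the day on which candidate $i$ started, $d(i,j)$ the day on which station $j$ was attempted, $t(i)$ the examination year, and $u_i$ the candidate-level random effect.

```latex
\begin{align*}
\text{Model 1:}\quad
  & \operatorname{logit}\Pr(Y_i = 1)
    = \alpha + \beta_{s(i)} + \gamma_{t(i)} \\
\text{Model 2:}\quad
  & \operatorname{logit}\Pr(Y_{ij} = 1 \mid u_i)
    = \alpha + \beta_{d(i,j)} + \gamma_{t(i)} + u_i,
    \qquad u_i \sim N(0, \sigma_u^2) \\
\text{Model 3:}\quad
  & \operatorname{logit}\Pr(Y_{ij} = 1 \mid u_i)
    = \alpha + \beta_{d(i,j)} + \gamma_{t(i)}
      + \delta_{d(i,j),\,t(i)} + u_i,
    \qquad u_i \sim N(0, \sigma_u^2)
\end{align*}
```

In this notation the day effect of interest is $\beta$, and Model 3 differs from Model 2 only through the day-by-year interaction term $\delta$.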
The above analyses focus on the nature of these OSCEs as pass or fail exams; as long as candidates do well enough to pass, their actual overall score is not of great importance. However, in the University of Cambridge Medical School, as elsewhere, candidates may be awarded a Pass with Distinction if they score particularly highly, with the potential for subsequent impact on their career opportunities. We, therefore, used a linear regression model to investigate whether there were differences in the overall mean score between start days of the OSCE (as in Model 1), adjusting for year (Model 4).

RESULTS
Overall pass rates were very high: of 2331 total examinations taken across all three OSCEs, 2273 (97%) were passed (table 1). Pass rates appeared broadly similar across start days, as did mean scores. Pass rates for [...]
The size of these differences can be contextualised by comparing them to the year-to-year differences and to the variability between candidates. The differences between the day on which the stations were attempted were smaller than the differences between year-cohorts (table 3), and much smaller than the variability between candidates (captured by the random effect). Using the SD of the random effect, we estimate ORs comparing the 'best' candidates (defined as those who are better at passing OSCE stations than 97.5% of other candidates) to the 'worst' candidates (defined as those worse at passing OSCE stations than 97.5% of other candidates) at between 16.48 (paediatrics OSCE) and 23.18 (SCEE OSCE).

Probability of passing individual stations according to the day on which those stations were attempted, within each examination year (Model 3)
All three OSCEs had at least 1 year in which we found a large, highly statistically significant difference in the odds of passing individual stations according to the day on which stations were attempted (table 4).
For the SCEE and the obstetrics and gynaecology OSCEs, a higher probability of passing on a particular day in 1 year was matched by a lower probability of passing on that day in a different year. For example, in the 2010 SCEE, candidates' odds of passing stations on day 2 were twice those on day 1, but in the 2012 SCEE candidates' odds of passing stations on day 2 were half those on day 1. For the paediatrics OSCE there was a significant increase in pass rate for day 2 (p<0.001) in 2010 only: for all other years there was no clear evidence of a difference in pass rates between days.
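The 'best' versus 'worst' candidate ORs quoted earlier in the results can be related to the SD of the candidate random effect. The text does not give the formula used, so the sketch below rests on a conventional reading: with a normally distributed random effect on the log-odds scale, candidates at the 97.5th and 2.5th percentiles differ by 2 × 1.96 SDs. The implied SDs it prints are back-calculated for illustration and are not values reported in the source.

```python
from math import exp, log
from statistics import NormalDist

# 97.5th percentile of the standard normal distribution (about 1.96).
Z_975 = NormalDist().inv_cdf(0.975)


def best_vs_worst_or(sd_random_effect: float) -> float:
    """OR comparing candidates at the 97.5th and 2.5th percentiles."""
    return exp(2 * Z_975 * sd_random_effect)


def implied_sd(or_best_vs_worst: float) -> float:
    """Random-effect SD (log-odds scale) implied by a best-vs-worst OR."""
    return log(or_best_vs_worst) / (2 * Z_975)


# ORs as reported in the text; the SDs are derived under the assumption above.
for label, reported_or in [("paediatrics OSCE", 16.48), ("SCEE OSCE", 23.18)]:
    print(label, round(implied_sd(reported_or), 2))
```

On this reading, a between-candidate OR of 16 to 23 dwarfs a day effect with an OR close to 1.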

Differences in overall mean scores between start days of the OSCE, adjusting for year (Model 4)
Candidates who sat the paediatrics OSCE on the second day of the examination tended to have higher scores, with a mean score difference of 1.8 (95% CI 0.8 to 2.8) (table 5). We note, however, that this is small compared to the SD of paediatrics OSCE scores (8.2).
There was no evidence that mean scores on either the SCEE or obstetrics and gynaecology OSCE varied between candidates starting on the first and second days of the examination (table 5).
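To make the comparison between the paediatrics score difference and the score SD concrete, the difference can be expressed in SD units. This is a rough illustration rather than an analysis from the study: it simply divides the reported difference and its confidence limits by the reported SD, treating the SD as fixed.

```python
def in_sd_units(value: float, sd: float) -> float:
    """Express a score difference as a multiple of the score SD."""
    return value / sd


SD_PAEDIATRICS = 8.2                        # SD of paediatrics OSCE scores
DIFF, CI_LOW, CI_HIGH = 1.8, 0.8, 2.8       # mean difference and 95% CI

standardised = [round(in_sd_units(x, SD_PAEDIATRICS), 2)
                for x in (DIFF, CI_LOW, CI_HIGH)]
print(standardised)
```

The second-day advantage amounts to roughly a fifth of an SD of candidate scores.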

DISCUSSION
While our analyses identified some potential order effects in the OSCEs under investigation, these were inconsistent in direction across the three examinations and relatively small. For all OSCEs, there was no evidence that overall pass rates varied according to the day on which the examination was started, although our certainty is limited by the very high overall pass rates. We did not seek to see if overall order of examinations taken made a difference; however, if this had been the case, we would expect it to have been highlighted in our analysis by the day the examination was started.
There was some evidence for variations in pass rates for individual stations according to the day on which stations were attempted. In particular, we found that in a number of years, there were strong differences in the chance of passing individual stations according to when they were taken. However, these effects were highly inconsistent across years and tended to cancel each other out when considering the 5 years of data together. Only the paediatric OSCE provided any substantive evidence of an order effect over the 5-year period, and even then it was dominated by a single year. The inconsistencies between examinations and years suggest that differences were unlikely to represent a true order effect, but rather other factors that varied between years. These could potentially be attributable to different examiners, but could also be attributable to uncontrollable factors such as the weather or the traffic (see below). It is also noteworthy that the differences by day were smaller than the year-to-year differences and much smaller than the differences between candidates.
A particular strength of this analysis is the inclusion of large numbers of candidates and stations, across 5 years of the conduct of three separate examinations with no differences in content between days for each examination within each year. The completeness and level of detail of data available on candidates, stations, and examinations enabled us to undertake a comprehensive study of the association between the time of examination and overall pass rates and mean scores. The varying nature of examinations between medical schools makes it challenging to conduct cross-institutional analyses, although the inclusion of only one institution may limit the generalisability of our findings. We acknowledge that differences between examiners may explain some of the observed differences between days, and ideally we would consider the impact of potential confounding by examiner. However, this is not straightforward to satisfactorily accomplish: examiner effects may be confounded by subject and station difficulty, and could change across different years. Additionally, the simulated patients and patients used within the examinations may vary between circuits and days; we were not able to investigate the potential impact of this, as we did not hold information about the simulated patients and patients involved in these examinations. However, we note that all simulated patients are trained to a high standard and discuss each station in detail in advance to minimise variation in performance.
Our findings reflect those of the next largest analysis of OSCEs conducted on this issue to date, in which the effect of day and time of examination was examined over 4 years of second-year medical pass/fail examinations (1990-1993) in one US medical school. 9 Here, while differences in the pass rate were identified according to the day of examination and the time of examination (morning or afternoon), none of these were consistent between years. 9 The authors concluded that there was little evidence that test security between repeated examinations was a concern. 9 Two other studies examining the effect of day on OSCE pass rates found no strong evidence of variation: one, an examination of dental OSCEs in 463 students across 4 days of the week, 10 the second involving 172 final year medical students from three medical schools across 8 days. 11
We can only speculate about the causes of inconsistent within-year, by-day variations in examination performance in our analyses. However, we would suggest that issues such as variations in the weather according to examination days are not entirely far-fetched as potential explanations. For example, psychological experiments have demonstrated that weather may affect a wide range of moods and behaviours, including risk-aversion, 14 memory 15 and concentration, 16 all of which might affect the performance of either students or examiners. The variations in the odds of passing between years are more likely to result from differences in cohort ability.
These findings demonstrate clearly that there is little need to quarantine students across different cohorts of OSCE examinations. Such quarantining is recommended in some quarters, 17 yet it is costly in time and money. It is apparent that, even though students in later cohorts may have the opportunity to discuss and review examination content with those who have already undertaken the examination, such discussions, if they occur, do not significantly affect performance in examinations.

Contributors JBu designed the study, oversaw data collection, contributed to the interpretation of the data as well as drafted and revised the paper. GA designed the study, wrote the statistical analysis plan, contributed to the interpretation of the data and revised the paper. MB analysed the data, and drafted and revised the paper. RE carried out data collection and data entry, and commented on draft versions of the paper. JBe designed the study, oversaw data collection, contributed to the interpretation of the data and revised the paper. MG contributed to the design of the study and interpretation of the data, and revised the paper. JBu is the guarantor.

Funding This work was funded by the University of Cambridge School of Clinical Medicine. The authors experienced no influence from the funding institution regarding the execution and analyses of this study, the interpretation of the data or the decision to submit the study findings.
Ethics approval The study was approved in this form by the University of Cambridge Psychology Research Ethics Committee.

Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Statistical code and data set are available from the authors on request: contact JBu (jab35@medschl.cam.ac.uk).
Open Access This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/