Original research
Measuring differential attainment: a longitudinal analysis of assessment results for 1512 medical students at four Scottish medical schools
  1. David Hope1,
  2. Avril Dewar1,
  3. Eleanor J Hothersall2,
  4. John Paul Leach3,
  5. Isobel Cameron4,
  6. Alan Jaap1
  1. 1 Medical Education Unit, The University of Edinburgh College of Medicine and Veterinary Medicine, Edinburgh, UK
  2. 2 School of Medicine, University of Dundee, Dundee, UK
  3. 3 School of Medicine, Dentistry, and Nursing, University of Glasgow, Glasgow, UK
  4. 4 School of Medicine and Dentistry, University of Aberdeen, Aberdeen, UK
  1. Correspondence to Dr David Hope; david.hope{at}


Objective To measure Differential Attainment (DA) among Scottish medical students and to explore whether attainment gaps increase or decrease during medical school.

Design A retrospective analysis of undergraduate medical student performance on written assessment, measured at the start and end of medical school.

Setting Four Scottish medical schools (universities of Aberdeen, Dundee, Edinburgh and Glasgow).

Participants 1512 medical students who attempted (but did not necessarily pass) final written assessment.

Main outcome measures The study modelled the change in attainment gap during medical school for four student demographical categories (white/non-white, international/Scottish domiciled, male/female and with/without a known disability) to test whether the attainment gap grew, shrank or remained stable during medical school. Separately, the study modelled the expected versus actual frequency of different demographical groups in the top and bottom decile of the cohort.

Results The attainment gap grew significantly for white versus non-white students (t(449.39)=7.37, p=0.001, d=0.49 and 95% CI 0.34 to 0.58), for internationally domiciled versus Scottish-domiciled students (t(205.8) = −7, p=0.01, d=0.61 and 95% CI –0.75 to −0.42) and for male versus female students (t(1336.68)=3.54, p=0.01, d=0.19 and 95% CI 0.08 to 0.27). International, non-white and male students received higher marks than their comparison group at the start of medical school but lower marks by final assessment. No significant differences were observed for disability status. Students with a known disability, Scottish students and non-white students were over-represented in the bottom decile and under-represented in the top decile.

Conclusions The tendency for attainment gaps to grow during undergraduate medical education suggests that educational factors at medical schools may—however inadvertently—contribute to DA. It is of critical importance that medical schools investigate attainment gaps within their cohorts and explore potential underlying causes.

  • medical education & training
  • statistics & research methods
  • audit
  • education & training (see medical education & training)
  • health services administration & management

Data availability statement

No data are available. Due to the sensitivity of the dataset—including confidential information on student demographics and assessment scores—we are unable to share raw data.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This is the largest study to date investigating longitudinal attainment gaps within undergraduate medical education.

  • By evaluating differential attainment longitudinally, the study tests whether attainment gaps are due to pre-existing differences or emerge during medical school.

  • The study has sufficient power to detect small/medium effects by pooling data from multiple cohorts and institutions.

  • All contributing schools were based in Scotland, and care should be taken when generalising to other contexts.

  • The study methodology cannot fully explain the mechanisms behind such attainment gaps.


Promoting fairness in assessment is a key priority. Success in medicine should be determined by ability rather than background characteristics like ethnicity, sex or socioeconomic status (SES).1 There is an increasing emphasis on educational processes being ‘fair’ to candidates of diverse backgrounds: besides the legal and regulatory requirements,2 there is growing acceptance that evaluating fairness should be a routine part of test construction and assessment.3

Despite this, candidates continue to experience different outcomes in medical education and training because they have characteristics that lead to them being treated differently by staff, students and patients. The tendency for outcomes to vary in this fashion is usually termed differential attainment (DA). It influences every stage of medical education and is a global phenomenon with similar problems manifesting in a range of contexts.4 5 The varying treatment of some groups influences the likelihood of candidates completing medical school and affects selection methods.6–8 Performance on measures of success at or just beyond graduation shows a similar pattern,9 10 and for example, ethnically white UK graduates are given higher marks than non-white UK graduates in postgraduate examinations with typically moderate (d=0.22) effects.11 After graduation, ethnically non-white and female doctors experience barriers to success on a range of professional and educational outcomes.12–14 Students from underrepresented backgrounds are substantially less likely to be awarded high ratings from their clerkship directors, less likely to be given honours and less likely to be given honour society membership.15

Such compelling evidence has led to calls to establish the mechanisms of DA, but this is challenging. Many historical assumptions—such as the idea that examiners are biased against some candidate groups—remain commonly cited despite evidence to the contrary.16 17 Examiner bias does not appear to explain DA in postgraduate clinical examinations18 or written assessment.19 Qualitative research has emphasised a range of possible factors that can contribute to DA, including trust between trainers and trainees and the process by which those in difficulty are identified and referred to support networks.20–22 Other research has suggested that unconscious biases may alter training pathways or assessment in the workplace.4 13 23 24 Some authors now recommend a programmatic approach whereby each component of training is separately reviewed.25

As a result, evidence for the existence of DA is very strong, but we have so far only a limited understanding of the mechanisms by which it operates or even whether DA increases or decreases with time spent in medical education. Compounding this, while a great deal of research has been carried out on access to medical school and postgraduate assessment, relatively little work has evaluated DA on assessment during medical school. In a large meta-analysis, eleven of fourteen published studies examining undergraduate medical education used a single site, and two of the remaining studies used only two sites.11 Combined with the tendency to monitor attainment at only a single time point (typically finals), we know little of whether DA is of similar magnitude for different medical schools or remains stable during medical school.

This is an obvious limitation given the role of medical schools in providing the foundation of medical education and training. Due to the diversity of intakes, assessment choices, curriculum design and performance on postgraduate assessment,26 27 investigating DA at medical schools may help in several ways. By comparing different institutions, the effect of different recruitment strategies, curriculum types and policies on fairness in medical education can be explored. If the magnitude of DA is highly variable across institutions, it argues for a relatively larger role for medical school policy in creating DA. If DA remains consistent despite varying institutional contexts, it argues either that DA is explained by factors outside of medical school control or that no current approaches are identifiably superior or inferior. By examining the data longitudinally, it becomes possible to explore whether DA increases or decreases over time. If DA is present from the earliest part of medical education, this suggests different mechanisms than if DA is minimally present at the beginning but then grows with time. Such work can therefore significantly improve medical education and support a fairer experience for doctors.

In this study, we used data from four Scottish medical schools operating within a common regulatory framework. Our aim was to evaluate longitudinal DA across undergraduate medical education in 1512 medical students, exploring disability status, domicile, ethnicity and gender. Here, we report on the longitudinal effects of DA for these groups and the impact of DA on student rank.



Participants were undergraduate medical students who had attempted (but not necessarily passed) a major written (multiple choice question) assessment near the end of medical school. All institutions operated under the UK medical education system,2 and new graduates typically embarked on a 2-year foundation training programme as a doctor.

In total, 1512 medical students were eligible for inclusion in the study. To be eligible, a student had to (a) have attempted (but not necessarily passed) the final written assessment, (b) have made the attempt by the end of data collection and (c) have provided demographical information.

The 1512 students represented 74% of all available participants within the period of this study. Excluded subjects were typically those who had exited medical school before final assessment, experienced an interruption of study or intercalated close to the end of the study period and so had not yet sat finals. Due to the complexity of discontinuation, it is theoretically possible for a student to graduate up to 9 years after starting a 5-year programme, which makes confirmation of discontinuation challenging. Candidates who did not attempt final assessment prior to the end of the period of data collection are not included in any analyses presented here.

Table 1 summarises the partner schools, total sample sizes and assessments used. All schools offered 5-year MBChBs (Bachelor of Medicine and Surgery). The first 2 years of each programme involved an introduction to the fundamentals of medicine, anatomy, social issues around healthcare and working with peers. Each programme offered an opportunity to intercalate, whereby candidates spent an additional year studying a topic in greater depth before returning to the core programme. In the later years, candidates rotated through a series of clinical placements to develop the skills and knowledge necessary to work as a junior doctor.

Table 1

Participants, data ranges and assessments used

In each school, candidates sat a written assessment at the end of their first year. These featured multiple choice questions (MCQs) and, for two schools, short answer questions (SAQs). For each question, candidates were presented with a scenario and question. For MCQs, candidates selected the correct answer from a list, whereas for SAQs, candidates provided a short, written answer. The assessment was blueprinted based on programme learning outcomes and standard set by experts familiar with the curriculum.

Near the end of medical school, candidates sat another written assessment. Three schools delivered this in the final year, while one (the University of Aberdeen) delivered it at the very end of the prefinal year. The blueprinting and standard setting process was the same as in the early assessment.

In each case, the assessments acted as a progression barrier: candidates needed to achieve a satisfactory mark to progress to either second year or graduation. A review by the authors identified that although there were some variations in curricula and teaching methods, there were no significant differences in content and structure of assessments between programmes that would impact cross-school comparisons of DA.

Table 2 describes the participants according to important demographical characteristics. We report whether the candidate did or did not have a known disability, where they were domiciled before starting medical school, their ethnicity and their gender. All recorded data were self-reported. For ethnicity and domicile, we aggregate data across many subcategories into broad groups such as ‘Scottish domicile’ or ‘white.’ While a more detailed breakdown would be helpful, the small numbers in many groups prohibit this. The demographical characteristics selected for study are based partly on the concept of a ‘protected characteristic’ for which there is a legal obligation to promote equality within the UK,28 partly on demographical characteristics known to be important from past research and partly on availability of data. To give two examples of data availability, marital status and sexual orientation had levels of missingness that were too high to achieve necessary levels of power. The four categories described here (known/no known disability, international, non-EU/Scottish domicile, non-white/white and female/male) represent all those selected for full analysis, and all analyses have sufficient power to detect medium effects. We selected Scottish (as opposed to the whole UK) domicile due to Scottish-domiciled candidates having already experienced the Scottish legislative and educational framework and having selected a medical school relatively close to home. Furthermore, differences in the funding approach in Scotland compared with the rest of the UK made merging the two groups less defensible. Non-Scottish-domiciled UK students were included in the other comparisons, and so for example, an English-domiciled student who provided valid information on gender would have been reported for that analysis.

Table 2

Demographical characteristics of the study sample

SES was recorded in the dataset in two forms. First, candidates had the opportunity to list parental occupation. Over 90% of candidates did not fill this in. A second proxy for SES was candidate postcode, which can be converted into an index of multiple deprivation.29 However, it was not possible to effectively compare Scottish, non-Scottish UK and international measures of SES within a single dataset. As such we did not explore this covariate further in the present study.

Data protection and ethics

This project represented a considerable challenge under data protection legislation and required a careful and thorough evaluation of ethical issues. To ensure data protection, a designated team member undertook an honorary contract with each partner and worked in tandem with a data custodian at that school. This meant individualised data were never transferred outside of the school servers, and a thorough anonymisation protocol was used to verify that no ‘unique’ combinations could identify candidates from their data patterns. Ethical approval was granted by the ethics committee for the College of Medicine and Veterinary Medicine at the University of Edinburgh (reference: 2018/7) and then separately approved by an ethics board and a data protection officer at each of the other schools. All participants gave informed consent. Prior to data analysis, all partners agreed to disseminate the results in public and to representatives of the study population: in this case, medical student organisations.

When describing inequities, researchers must ensure individuals are described fairly and appropriately, without discriminatory language. Throughout this paper, we have used language that shows that group membership itself does not cause an attainment gap and is never a direct determinant of performance and instead likely reflects systemic societal issues. We have provided some additional references that may be helpful in exploring language choice when describing historically under-represented groups.4 20

Patient and public involvement

The study was carried out exclusively on medical students and did not involve patients in any way. As such, there was no patient or public involvement.

Statistical analyses

Each medical school has a locally designed curriculum and assessment environment. We investigate written assessment as the most comparable form of assessment, as the available clinical examinations vary considerably across the schools in both timing and format. To allow like-for-like comparisons across different written assessments, we converted each cohort of data to z-scores.30

A z-score is a standardised measurement, where a score of zero indicates the candidate has received exactly the mean mark on the assessment and a score of +/−1 indicates they have received a mark one SD above or below the mean, respectively. This is analytically helpful because it allows for comparisons where relative (rather than absolute) differences are important. If a candidate from one medical school receives a mark of 75 and a candidate from another medical school receives a mark of 70 on two different assessments, it is difficult to know who is more capable. But if the z-score for each candidate is zero, this indicates they are of the same level of ability relative to their peers and that they are both average.
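The standardisation described above can be illustrated with a minimal sketch. This is not the authors' code (the analyses were carried out in R); the function name, the marks and the use of the sample SD are assumptions made purely for illustration.

```python
from statistics import mean, stdev


def z_scores(marks):
    """Standardise one cohort's raw marks so the cohort has mean 0 and SD 1.

    Assumption: the sample SD is used; the source does not state which
    SD estimator was applied.
    """
    mu = mean(marks)
    sd = stdev(marks)
    return [(m - mu) / sd for m in marks]


# Hypothetical raw marks from two schools sitting different assessments.
school_a = [60, 75, 90]
school_b = [55, 70, 85]

z_a = z_scores(school_a)
z_b = z_scores(school_b)

# The candidate with 75 at school A and the candidate with 70 at school B
# both sit exactly at their cohort mean, so both have a z-score of zero.
print(z_a[1], z_b[1])
```

Standardising within each cohort before pooling means that only relative standing within a cohort is compared, which is the property the cross-school analysis relies on.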

We applied the Shapiro-Wilk test to the residuals to test for normality.31 Although the normality assumption was violated (W=0.99 and p<0.001), further investigation suggested that parametric testing would still be more appropriate as parametric tests are more effective at minimising the risk of false positives where the group sample sizes and SD vary across groups.32 Sample sizes were sufficient to detect small effects at 80% power for ethnicity, gender and domicile, whereas for disability status, the unequal group sizes and small numbers of students self-reporting a disability allowed for only medium effects at 80% power.33 Due to the low sample sizes within each medical school, it was not feasible to compare variability between medical schools with sufficient power. Likewise, it was not possible to compare intersectional DA (eg, ethnicity and gender). We used Welch’s t-test for significance testing as a more robust alternative to other t-tests.34 All analyses were carried out using R.35
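As an illustration of the significance testing described above, the following sketch computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom and an effect size for two groups of z-score changes. It is a sketch under stated assumptions rather than the authors' R code: the function names and data are hypothetical, and the pooled-SD form of Cohen's d is an assumption because the source does not specify which formula was used. No p-value is computed, as that requires the t distribution, which is not in the Python standard library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean, group a
    vb = variance(b) / len(b)  # squared standard error of mean, group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled sample SD (an assumed choice of formula)."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Hypothetical z-score changes (final minus first year); not study data.
group_a = [0.2, -0.1, 0.4, 0.0, 0.3]
group_b = [-0.5, -0.2, -0.6, -0.1, -0.4, -0.3]

t_stat, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(t_stat, df, d)
```

Because the Welch-Satterthwaite formula does not assume equal variances, it generally yields non-integer degrees of freedom, which is why the reported tests carry values such as t(449.39).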

Design choices

We made several design choices that influence the final dataset. Most importantly, by only including candidates who reach final assessment, we exclude the majority of those who experienced major difficulties early in their studies. However, the only alternative is to either measure graduation rates, which prevents granular analyses as the overwhelming majority of students pass medical school,36 or attempt some form of imputation to estimate final performance of candidates who never reached that stage of education, with significant uncertainty over the accuracy of such estimates. We opt for a simple approach of reporting data only where fully available. One consequence of this is that variability is higher in final assessment than in first year, with more candidates performing poorly, so most z-score change values were negative. For example, it would be possible for a candidate to receive an A in the first year and an F in the final year and participate in our study, but it would not be possible for the reverse to be true—unless the student successfully resat assessment and then completed within the specified timeframe. This can be considered a form of ‘survival bias’, and approaches to the problem always require trade-offs.37

To investigate survival bias, we compared the ratios of those who did to those who did not provide final year assessment results for each group. For example, we compared the ratio of non-white/white completers to non-white/white non-completers. No differences in the ratios were detected for any studied group. This likely reflects the fact that non-completion (by the end of the present study) was due to a variety of factors and did not in itself indicate academic difficulty.
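The ratio comparison can be sketched as follows. The counts are invented for illustration and are not taken from the study, the function names are hypothetical, and the source does not state which statistical test was applied to the ratios, so only the descriptive comparison is shown.

```python
def group_ratio(n_group, n_comparison):
    """Ratio of one demographic group to its comparison group."""
    return n_group / n_comparison


def ratio_of_ratios(completers, non_completers):
    """Compare the group ratio among completers with that among non-completers.

    Each argument is a (group, comparison) pair of counts. A value near 1
    suggests the group is not disproportionately lost before finals.
    """
    return group_ratio(*completers) / group_ratio(*non_completers)


# Hypothetical counts, NOT study data: (non-white, white).
completers = (300, 1200)      # provided final assessment results
non_completers = (100, 400)   # did not provide final assessment results

rr = ratio_of_ratios(completers, non_completers)
print(rr)  # equal ratios in both strata give a value of 1.0
```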

Following this, we carried out a number of comparisons. First, we calculated the z-score for each student in their first year and then in the final assessment. Second, we explored the equivalence of the schools. Third, we compared z-score change between groups to see whether attainment gaps were growing or shrinking during medical school. Finally, we ranked all candidates to see who would appear in either the top or bottom decile for the final assessment.


We first tested whether the performance profiles of each school were sufficiently similar to pool data into a single sample. We compared the shapes of the distributions, frequencies of outliers and overall variability of each cohort. After confirming the equivalence of the cohorts, we pooled all data into a combined sample of 1512 students.

Table 3 provides a summary of (a) the z-score for each demographical characteristic per assessment, (b) the relative change in z-score over time and (c) whether the z-score change between groups is significant. For the present study, we are not interested in the attainment gap at either the start or end of medical school—but whether the magnitude of the gap changes over time. We found that the gap grew significantly for white versus non-white students (t(449.39)=7.37, p=0.001, d=0.49 and 95% CI 0.34 to 0.58), for internationally domiciled versus Scottish-domiciled students (t(205.8) = −7, p=0.01, d=0.61 and 95% CI −0.75 to −0.42) and for male versus female students (t(1336.68)=3.54, p=0.01, d=0.19 and 95% CI 0.08 to 0.27). No significant differences were observed for candidates with versus without a known disability.

Table 3

Z-score change during medical school study

For the three significant analyses, non-white, internationally domiciled and male candidates were awarded a relatively higher score at the start of medical school. By the end of medical school, they were respectively awarded a lower score than white, Scottish-domiciled and female students. The effect size was medium when testing ethnicity and domicile and small for testing gender. In summary, non-white, internationally domiciled and male students experienced a relative decline in their achieved marks at medical school, which cannot be explained by low attainment before or in the first year of medical school.

Finally, we estimated how often medical students of different demographics would appear in the top and bottom decile based on their z-scores versus their expected frequencies based purely on how many existed in each category. Table 4 summarises the details.

Table 4

Rankings of top and bottom decile by demographical characteristic

Decile 1 is the highest-scoring decile, and decile 10 is the lowest-scoring decile. Students with a known disability, Scottish students and non-white students are over-represented in the bottom decile and under-represented in the top decile. Students with no known disability and white students are over-represented in the top decile and under-represented in the bottom decile. International students and male students are over-represented in both the top and bottom decile. Female students are under-represented in the top and bottom decile.

This analysis shows that many groups exhibit differences not just in mean performance but also in variability, with some candidates being under-represented and over-represented at the extremes of the distribution.
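The comparison of observed and expected decile membership can be sketched as below. The data and function name are hypothetical, and the expected count is assumed to be the group's share of the cohort multiplied by the decile size, following the description of expected frequencies as based purely on how many students existed in each category.

```python
def decile_counts(records, label):
    """Observed versus expected membership of the top and bottom decile.

    records: list of (z_score, group_label) pairs for the final assessment.
    Decile 1 (top) holds the highest scores; decile 10 (bottom) the lowest.
    """
    if len(records) < 10:
        raise ValueError("need at least 10 records to form deciles")
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n = len(ranked) // 10  # students per decile
    share = sum(1 for _, g in ranked if g == label) / len(ranked)
    return {
        "expected": share * n,  # count expected from group size alone
        "top": sum(1 for _, g in ranked[:n] if g == label),
        "bottom": sum(1 for _, g in ranked[-n:] if g == label),
    }


# Hypothetical cohort of 20 students; group "X" sits at both extremes.
scores = [2.1, 1.8, 1.2, 0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1,
          0.0, -0.1, -0.2, -0.3, -0.4, -0.6, -0.8, -1.0, -1.7, -2.2]
labels = ["X", "X"] + ["Y"] * 16 + ["X", "X"]
result = decile_counts(list(zip(scores, labels)), "X")
print(result)
```

In this invented cohort, group X exceeds its expected count in both the top and the bottom decile, the same pattern of greater variability that the text describes for international and male students.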


Statement of principal findings

DA exists within Scottish medical schools, with small to medium effects. The analysis described here demonstrates both the considerable difficulty in organising datasets to longitudinally investigate DA and the ongoing importance of such work. Even among successful medical students—and the overwhelming majority of those described in the present dataset have become doctors—DA exists. The fact that many attainment gaps grow during medical school suggests educational factors within medical schools may promote DA.

Strengths and weaknesses of the study

It is important not to overstate the findings. Small to medium effect sizes are consequential and impact student education, but there remains considerable variance between students of all groups. In this dataset, candidates across the attainment continuum were present in every group. In addition, the core purpose of medical education—graduating a safe doctor—has been met for almost all participants in the dataset. The gaps observed here must be placed in this context. Finally, as until now we have operated in an environment with almost no published data, there is a risk that organisations that attempt to directly engage with the problem of DA are criticised for the differences they reveal, which may in turn drive reluctance to explore the issue in depth. It is important that stakeholders support the exploration of DA across the sector.

This study represents a novel attempt to understand DA not as a fixed factor, but as a changing influence on student performance and behaviour. The sample size and range suggest we can be confident the findings are potentially generalisable to other UK medical schools. By opting for a straightforward methodology, we believe the findings are robust and can inform future policy.

Despite this, there are limitations. The challenges of organising a longitudinal study using data from a range of institutions with varying outcome measures should not be understated. We have made design choices (such as excluding those who failed before reaching finals) which may influence the pattern of results. Due to the relatively small sample sizes of some groups, it was not possible to explore ‘intersectional’ DA among, for example, candidates who were non-white and female.38 Due to the nature of the available data on SES, we were not able to include SES as a covariate in the present study. All candidate demographics were self-reported, and so some information could theoretically be inaccurate. While we consider the curricula and assessment of the institutions to be sufficiently similar to allow for a combined analysis, it is possible that local factors may have created some unidentified sources of variance.

The lack of a shared, standardised assessment across schools required the use of z-scores (or an equivalent method), and the presence of a standardised assessment, such as the forthcoming UK Medical Licensing Assessment, would have greatly simplified the analysis.39

Data collection was challenging, and it was clear that there was no expectation during data creation that assessment-level data would be required 5 or 10 years after the assessment was sat. Medical education data should be thought of as ‘perishable’—it is possible that even relatively recent data are being lost, overwritten or rendered inaccessible. If medical educators wish to investigate DA across time, it is critical that better data collection practices are implemented, and historic data sources should be secured and documented in national-level databases.40 The alternative is that we may establish excellent prospective analyses for which we will have no useful data for up to a decade.

Comparison with other studies and unanswered questions

DA exists across medical education systems across the world and should always be considered when designing teaching and assessment.4 5 Our findings support and extend past work exploring DA in postgraduate medical education9 12 13 21 and at medical school.15 24 Importantly, our study also confirms that we remain unclear, as a sector, on the mechanisms behind DA.18 19 All organisations involved in medical education must proactively consider how they approach fairness in medical education and evaluate the impact of DA.

The limitations described above are logical opportunities for future work. Exploring the impact of SES, analysing intersectional characteristics and studying those who do not graduate may offer insights into both the scope and mechanisms of DA. Exploring candidate domicile in a more granular fashion (such as measuring the distance between home and their selected medical school) may be helpful, especially alongside measurements of SES. Importantly, the design challenges highlighted here will persist until institutions develop rigorous frameworks to investigate long-term changes in student performance.

Implications and conclusions

The present study demonstrates DA changes in magnitude during undergraduate medical education. Combined with evidence that candidates of some groups are less likely to be given awards15 and more likely to experience prejudice,24 it is very plausible that some of the mechanisms of DA are located in, or caused by, aspects of medical education within medical schools. As such, institutions must consider the possibility that their actions contribute to DA and develop appropriate policies for investigation and correction.14

Ethics statements

Patient consent for publication

Ethics approval

Ethical approval was granted by the ethics committee for the College of Medicine and Veterinary Medicine at the University of Edinburgh (reference: July 2018) and then separately approved by an ethics board and a data protection officer at each of the other schools. All participants gave informed consent. Prior to data analysis, all partners agreed to disseminate the results in public and to representatives of the study population: in this case, medical student organisations. This information is reproduced in the main text.



  • Twitter @e_hothersall

  • Contributors Dr IC, Dr EJH and Professor JPL were each responsible for sourcing data, describing the context and exploring the results in their institutions. AD was responsible for sourcing data at her institution and then collating all the data and running the initial analyses. Dr DH organised the project, designed the analyses, was primarily responsible for writing the paper and is the guarantor for the content. Dr AJ acted as supervisor for all the project work and reviewed the analyses. All authors have separately reviewed the manuscript and provided input in developing the final analyses and paper.

  • Funding The Scottish Medical Education Research Consortium (SMERC) provided funding to allow the research project to take place. The funding was used to pay for administrator and researcher time to collate and analyse the data. The funder had no direct input into the analyses chosen or the reporting of the results. The researchers were independent from the funder, and all researchers had access to the data and can take responsibility for the integrity of the data and the accuracy of the data analysis.

  • Competing interests All authors have completed the ICMJE Uniform Disclosure Form at and declare that all authors had financial support from the Scottish Medical Education Research Consortium (SMERC) for the submitted work, no financial relationships with any organisations that might have an interest in the submitted work in the previous three years and no other relationships or activities that could appear to have influenced the submitted work

  • Patient and public involvement Patients and/or the public were not involved in the design, conduct, reporting or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.