Introduction Predicting medical science students’ performance on high-stakes examinations has received considerable attention. Machine learning (ML) models are well-established approaches for improving the accuracy of such predictions. Accordingly, we aim to provide a comprehensive framework and systematic review protocol for applying ML to predict medical science students’ performance on high-stakes examinations. A better understanding of the input and output features, preprocessing methods, settings of ML models and required evaluation metrics is essential.
Methods and analysis A systematic review will be conducted by searching the electronic bibliographic databases MEDLINE/PubMed, EMBASE, SCOPUS and Web of Science. The search will be limited to studies published from January 2013 to June 2023. Studies explicitly predicting student performance on high-stakes examinations with reference to learning outcomes and using ML models will be included. Two team members will first screen the literature against the inclusion criteria at the title, abstract and full-text levels. Second, the included literature will be rated using the Best Evidence Medical Education quality framework. Two team members will then extract data, including the studies’ general characteristics and the details of the ML approach. Finally, consensus on the extracted information will be reached and the data submitted for analysis. The synthesised evidence from this review will provide helpful information for medical education policy-makers, stakeholders and other researchers adopting ML models to evaluate medical science students’ performance on high-stakes exams.
Ethics and dissemination This systematic review protocol summarises findings of existing publications rather than primary data and does not require an ethics review. The results will be disseminated in publications of peer-reviewed journals.
- health informatics
- education & training (see medical education & training)
- general medicine (see Internal Medicine)
- medical education & training
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
STRENGTHS AND LIMITATIONS OF THIS STUDY
Our systematic review will be the first to focus explicitly on machine learning methods for predicting medical science students’ performance on high-stakes examinations.
The systematic review will use the rigorous methodology outlined in the Cochrane Handbook. The results will be reported per the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement.
The Best Evidence Medical Education quality framework will be used to assess the study’s quality.
A potential limitation is the exclusive use of students’ learning outcomes to predict their performance on high-stakes examinations; other predictors, such as university admissions tests and entrance examinations, are not addressed.
Medical sciences universities, schools, faculties and academic hospitals worldwide seek methods that accurately predict students’ performance on high-stakes examinations based on their demographics and scores in related courses. High-stakes examinations are one of the primary methods for assessing the competency of thousands of medical sciences students. For instance, the Comprehensive Osteopathic Medical Licensing Examination of the United States is designed to assess osteopathic medical students and residents.1 Another high-stakes examination is the Medical Licensing Assessment, which evaluates the core knowledge, skills and behaviours required of medical practitioners in the UK.
Although high-stakes examinations are widely administered in various countries, concerns have been raised about their utility as the sole assessment method. Several studies have indicated that these tests are susceptible to potential sources of error.2 Hence, a standard with high validity and reliability is needed. Establishing this standard, however, requires first identifying and reducing measurement errors and biases. Furthermore, improving test design or varying test items can enhance the quality of judgements and the overall examination procedure.
Meanwhile, new techniques, such as artificial intelligence and machine learning (ML), are emerging to extract valuable information and discover new patterns in databases.3 4 Many educational data mining (EDM) methods have been proposed to predict students’ achievements in comprehensive tests. However, higher education authorities have been slow to adopt predictive ML methods for student outcomes such as graduation, dropout, licensure exam passage rates and acceptance into academic and career positions. Evaluating existing ML models for predicting high-stakes exam results is therefore critical to persuading higher education authorities to adopt them.
Several systematic reviews have addressed predicting medical science students’ performance on high-stakes examinations. Al-Alawi et al5 systematically reviewed the pre-admission variables and selection criteria that predict student success in nursing programmes in the USA, focusing on cognitive predictors in the nursing school admission procedure.5 Velez6 conducted a search using only the terms ‘ABSITE’ and ‘American Board of Surgery In-Training Examination’.6 In their systematic review and meta-analysis, Wolden et al7 determined the relationships between first-attempt performance on the National Physical Therapy Examination (NPTE), physical therapist applicant variables and physical therapist student variables.7
Rationale for this study
The previous systematic reviews concentrated on the admissions process or special examinations, and none explicitly focused on ML methods for predicting medical science students’ performance on high-stakes examinations. Moreover, a consensus about the essential features, practical preprocessing, ML models and their setting and evaluation metrics is required. As a result, it seems critical to provide a comprehensive framework and systematic review of ML applications in this field.
A high-stakes examination is a licensing exam in which thousands of students participate to measure their competency.3 The assessment and accountability policies attached to these exams affect many people and institutions, such as universities, embassies, students, educators, politicians and educational stakeholders. Moreover, tests are considered high-stakes if their results are used to make important decisions about students, teachers, administrators and schools.4 According to Johnson et al, a test or test programme is high-stakes if its results determine a student’s failure or success in entering a higher grade, enable or prevent a graduate student from attending a higher educational school, or determine fund management.8 The concept of ‘high-stakes examinations’ therefore refers to tests administered nationwide and worldwide in a standardised way.9 10 These tests carry significant weight in educational policy and are essential for future workers in the healthcare system. Since healthcare systems’ quality directly relates to the quality of medical education,11 the primary concern is to ensure the safety and quality of healthcare systems in a highly complex 21st century.3 As a result, many international medical training programmes use high-stakes, gate-keeping examinations.2
Predictive analytics is the ‘brain’ behind automated decision-making tools for validating students’ performance on various exams.12 This data-driven approach focuses on identifying patterns that may yield accurate predictions of students’ examination outcomes while keeping the underlying mechanisms in a ‘black box’.13 Higher education institutes and universities inevitably have full access to students’ information. Although some studies specify input features for predicting student success rates on high-stakes tests in the medical sciences,14 assembling and using such information remains critical to developing an accurate predictive model for higher education students.
One of the main approaches for predicting the outcome of various educational processes is ML, which could serve as an alternative means of evaluating students’ performance on high-stakes examinations. ML provides opportunities for predicting student performance based on student characteristics and progress throughout the study. Moreover, ML can provide administrators with helpful information for strengthening supportive interventions.15 Numerous ML models have been proposed, including traditional models (logistic regression) and modern ones (such as Bagging,16 AdaBoost,17 Random Forest18 and Extreme Gradient Boosting19). In addition, recent ML trends, such as deep learning (DL), will be considered.20 All these models, with different settings, can be used to predict students’ achievements on various exams.
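As a minimal illustration of how such candidate model families might be compared, the following Python sketch uses scikit-learn with synthetic data. The sample size, feature count and pass/fail target are our own placeholders rather than values drawn from any reviewed study, and GradientBoostingClassifier stands in for Extreme Gradient Boosting:

```python
# Illustrative sketch only: comparing candidate ML model families on
# synthetic data. All data here are placeholders, not study values.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for learning-outcome features (eg, course grades, ranks)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "bagging": BaggingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    # stand-in for Extreme Gradient Boosting (XGBoost)
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                      # train on held-in split
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

In the reviewed studies, the synthetic features would be replaced by actual learning outcomes, and accuracy by whichever evaluation metrics each study reports.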
According to the above definitions and motivations, this Best Evidence Medical Education (BEME) review aims to explore, analyse and synthesise the proposed ML models to predict the results of medical sciences students on high-stakes examinations. This systematic review will evaluate and compare various aspects of these modelling approaches to identify the ML models and settings with the best predictive performance.
Objectives and questions
The most common model in evidence-based studies, the PICO framework, breaks research questions (RQs) into searchable components and has four elements: population/problem, intervention, comparison and outcome.21 22 Figure 1 indicates the PICO components of our study.
This BEME review primarily aims to explore, analyse and synthesise evidence using ML techniques to predict the medical sciences students’ performance on high-stakes examinations based on their learning outcomes, such as class standings (eg, ranks) and achievement scores (eg, grades). The following are the specific objectives.
Investigate the ML approaches developed to predict medical science students’ performance on high-stakes examinations.
Specify the most appropriate sample size, essential features, practical ML methods, settings of models and best evaluation metrics.
Compare the existing ML models’ performance.
Recognise the challenges and limitations of ML models for predicting students’ performance in high-stakes examinations.
Highlight future research areas based on learning outcomes and ML models to predict students’ performance on high-stakes examinations.
Consider whether high-stakes examinations could be replaced with the proposed ML models for assessing students’ performance.
The following key RQs are determined based on our main objectives:
RQ1: what ML models are being devised to predict medical science students’ performance on high-stakes examinations based on their learning outcomes?
RQ2: what datasets are used for training, validating, and testing ML models?
RQ3: what are the appropriate sample size and dominant features?
RQ4: which preprocessing methods, ML models and settings can achieve the highest performance?
RQ5: what are the primary evaluation and success metrics?
RQ6: to what extent can the prediction of high-stakes examination results based on learning outcomes be used as a substitute for these exams?
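RQ5 asks which evaluation and success metrics studies report. As a minimal illustration (with made-up labels and scores, not data from any reviewed study), the metrics most often reported for classification-style predictors can be computed as follows:

```python
# Hedged sketch: common evaluation metrics for a pass(1)/fail(0) predictor,
# computed with scikit-learn. All labels and scores below are hypothetical.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical outcomes for ten students: true labels, hard predictions,
# and predicted pass probabilities
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.95, 0.85, 0.1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_score),  # uses probabilities, not labels
}
```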
This systematic review will follow the rigorous methodology outlined in the Cochrane Handbook. The results will be reported per Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. The completed PRISMA-P Checklist is available as online supplemental additional file 1.
An interdisciplinary team of scholars with expertise in Medical Education, Medical Informatics, Information Technology, and Medical Library and Information Sciences developed the current protocol. Figure 2 depicts the main steps of our systematic review.
The investigation’s key terms are directly linked to PICO elements. Keywords and topics have been provided in consultation with team members. The selected keywords will be combined as presented in table 1.
In addition, we ran a preliminary PubMed search to help us decide on the final search strategy and keywords. An academic medical sciences librarian with over 15 years of experience at the University of Medical Sciences developed the search strategy (online supplemental additional file 2). The search string syntax will be trialled multiple times and slightly modified for each database to retrieve all relevant results.
According to preliminary search results, various studies used high-stakes examination keywords, such as high-stakes tests and licensure targeting diverse populations (eg, undergraduate and postgraduate students), conducted in different contexts of medical sciences (eg, medicine and nursing). Moreover, the studies differed regarding input features, target features, preprocessing methods, ML models and evaluation metrics. The ML models have been used for various purposes, including student performance prediction, automated scoring and predicting dropout risk. All information presented above will be used to develop our final search strategy and incorporated into the data extraction and analysis steps.
Study inclusion and exclusion criteria
We will conduct the searches concentrating on studies published within one decade, from January 2013 to June 2023. This systematic review will include all studies using ML-based prediction of medical science students’ performance on high-stakes examinations based on their learning outcomes. Two team members will independently evaluate the extracted studies according to the inclusion and exclusion criteria. A third reviewer will be consulted on any remaining conflicts. Finally, the consistency between reviewers will be checked by calculating the kappa coefficient for each item. Table 2 summarises the prominent inclusion and exclusion criteria.
Search sources and strategies
Two approaches will be used to ensure the comprehensiveness of our search: (1) the electronic bibliographic databases, including MEDLINE/PubMed, EMBASE, SCOPUS and Web of Science, will be explored, and (2) in search of any additional relevant studies, forward and backward searches will be performed by checking the references and citations.
Procedure for extracting data
In the first step of the screening process, studies will be chosen based on their title, abstract and inclusion and exclusion criteria. The second step will be to review studies based on their full text. EndNote software will also be used to manage the references and remove duplicates.
A data extraction form is designed to extract data from the selected primary studies. This form contains the general study data and the details of the ML approach, such as input features, settings of models and evaluation metrics. Before data extraction, the reliability of the proposed form will be evaluated by two reviewers using 10 randomly selected articles. Table 3 presents the preliminary version of the data extraction form, which will be completed according to the information available in various articles.
Data will be extracted independently by two coders. In case of disagreement, the coders will discuss the issue; if it remains unresolved, a third reviewer will be consulted. Finally, the consistency of the coders will be assessed by calculating the kappa coefficient for each article. Papers with kappa values less than 0.7 will be revised or rechecked.
Study quality assessment
The BEME quality framework23 24 will guide the team members in evaluating the methodological quality of studies (online supplemental additional file 3) with 11 indicators. Each indicator will be rated as ‘met’, ‘unmet’ or ‘unclear’. Studies must meet a minimum of seven indicators to be considered high quality. Two team members will carry out this quality assessment independently. It is important to note that no study will be excluded due to methodology flaws, and their quality will be discussed later.
Synthesis of extracted evidence
The following steps will be taken to answer our RQs:
Determine the characteristics, settings and context of the studies included.
Describe the included studies’ characteristics, settings and context.
Synthesise the findings to discuss the ML utility in predicting medical science students’ performance on high-stakes examinations to answer the review questions.
Furthermore, we will present and discuss our findings in five areas: (1) applying various ML models and settings, (2) the dominant features, (3) the evaluation metrics, (4) comparing the performance of ML models and (5) the possibility of becoming a replacement for the high-stakes examinations.
A meta-analysis might not be appropriate due to variations in study design and methodology. However, we will use the I2 and χ2 statistics to test for statistical heterogeneity and investigate it visually through a forest plot. If the findings can be synthesised quantitatively, we will use a random-effects model and subgroup analysis to assess the studies’ quality. Otherwise, we will synthesise the study findings narratively and explain differences in the evidence where heterogeneity exists. A funnel plot and Begg’s or Egger’s tests will be applied to check for publication bias in relevant outcomes.
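For illustration, the I2 statistic mentioned above is derived from Cochran’s Q under inverse-variance weighting; a minimal sketch with hypothetical per-study effect sizes and variances (not data from this review):

```python
# Illustrative sketch: Cochran's Q and I^2 for quantifying between-study
# heterogeneity. Effect sizes and variances below are hypothetical.
def heterogeneity(effects, variances):
    """Return Cochran's Q and I^2 (%) for per-study effect estimates."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2 = (Q - df)/Q, floored at 0 when Q does not exceed its df
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Made-up example: five studies' effect estimates and their variances
effects = [0.2, 0.8, 0.35, 0.7, 0.5]
variances = [0.01, 0.02, 0.015, 0.02, 0.01]
q, i2 = heterogeneity(effects, variances)
```

A high I2 (commonly interpreted as above roughly 75%) would argue against pooling and for the narrative synthesis route described above.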
Discussion and conclusions
High-stakes exams are one of the universal methods for medical licensing assessment. These examinations have been designed to evaluate medical sciences students’ knowledge, skill, ethics and professionalism. However, concerns have been raised about some unintended consequences. For instance, participation in these tests takes considerable time and effort and can be costly for trainees, organisations and universities.2 These exams can also increase stress levels among trainees and put their mental health at risk. Furthermore, single-exam performance may not fully reflect students’ competence during medical training and later as medical practitioners. Therefore, new techniques such as EDM and ML can effectively overcome existing methods’ limitations.
Despite the growing number of studies using ML models to predict students’ performance on high-stakes exams, a systematic review evaluating the effectiveness of these models and summarising the best and most accurate settings is lacking.
In this study, the authors will use a systematic approach to evaluate ML utility for predicting medical science students’ performance on high-stakes examinations. It will provide a fundamental understanding of the proposed ML models and outline complementary information for improving the ML methods. We will attempt to present the most dominant features, appropriate sample size, preprocessing techniques, ML models (classifications, regressions and DLs), models’ settings, evaluation metrics and success criteria by comparing the available models.
We anticipate this review will encourage policymakers and stakeholders to use ML models as a substitute for high-stakes examinations or as a part of the evaluation system in medical education programmes. Furthermore, analysing ML models’ challenges and limitations can aid in identifying knowledge gaps in this field and suggest new areas for future research.
A limitation of our review is its exclusive focus on students’ learning outcomes, such as class standings (ranks) and achievement status or scores, for predicting performance on high-stakes examinations. Further studies are therefore suggested on applying ML models with other predictors, such as university admissions tests and entrance exams, to predict students’ performance on high-stakes exams.
Ethics and dissemination
This systematic review protocol summarises findings from existing publications rather than primary data and does not require an ethics review. The results will be disseminated through peer-reviewed journals.
Patient and public involvement
Patients or the public were not involved in our research’s design, conduct, reporting or dissemination plans.
Patient consent for publication
We thank the National Agency for Strategic Research in Medical Education for funding this project. In addition, the authors appreciate Mashhad University of Medical Sciences for supporting us.
Contributors Study concept and design: HM and TD. Development of the search strategy: MZ and MJ. Drafting of the manuscript: HM, TD, MJ, EM and MZ. Critical revision of the manuscript for important intellectual content: SE. Study supervision: HM, TD and SE. Guarantor of the review: TD. All authors contributed to revising and approving the manuscript.
Funding The National Agency for Strategic Research in Medical Education in Iran (grant number 994079 awarded to HM as the principal investigator) funded this project.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.