Multivariate Modeling of Student Performance on NBME Subject Exams

Aim This study sought to determine whether statistical models could be developed to accurately predict student performance on clinical subject exams from National Board of Medical Examiners (NBME) self-assessment performance and other variables, described below, as such tools are not currently available. Methods Students at a large public medical school were provided fee vouchers for NBME self-assessments before clinical subject exams. Multivariate regression models were then developed relating self-assessment performance to student success on the subsequent subject exam (Medicine, Surgery, Family Medicine, Obstetrics-Gynecology, Pediatrics, and Psychiatry) while controlling for the proximity of the self-assessment to the exam, USMLE Step 1 score, and the academic quarter. Results The variables analyzed satisfied the requirements of linear regression. The correlation strength of individual variables and overall models varied by discipline and outcome (equated percent correct or percentile; model R² range: 0.1799-0.4915). All models were statistically significant on the omnibus F-test (p < 0.001). Conclusion The correlation coefficients demonstrate that these models have weak to moderate value in predicting student performance, varying widely by subject exam. The next step is to use these models to identify struggling students, to determine whether their use reduces failure rates, and to further improve model accuracy by controlling for additional variables.


Introduction
The National Board of Medical Examiners' (NBME) subject exams remain frequent tools in determining clerkship grades and differentiating among students within an institution [1,2]. Concerns have been raised that the transition to Pass/Fail grading of the United States Medical Licensing Examination (USMLE) Step 1 will emphasize clerkship grades and the USMLE Step 2 exam as key differentiators in residency selection [3][4][5][6].
With the continued use of graded assessments in medical education, educators must consider how to identify and then support students who encounter academic difficulties. Methods such as tutoring, improved access to neuropsychological evaluation, and USMLE postponements have all been cited as ways to help students who are falling behind their peer group [7,8]. However, there is no consensus on how to identify which students may be struggling and would benefit from these services. Given the weight of subject exam scores in course grading, many institutions (including our own) use self-assessment cutoffs that have not been previously validated to identify struggling students. To better predict student success on other forms of assessment (e.g., licensing exams), some groups have turned to complex data modeling, incorporating progress testing, Medical College Admission Test (MCAT) scores, and more, as evidence-driven means of predicting student scores and proactively engaging with struggling students [9][10][11]. This approach and the use of predictive analytics are complicated by the fact that the NBME states that its self-assessments are not predictive tools and does not provide sufficient guidance on score interpretation [12]. As part of their subject exam preparation, students are encouraged to take at least one self-assessment during each clerkship. Faculty specify, for each clerkship, when these should be completed (ranging from 2-4 weeks prior to the subject exam); however, students ultimately take the assessment at their convenience. When students used the vouchers funded by the school to take the practice exam, the scores were transmitted to the students and to faculty in the School of Medicine's (SOM) Office of Academic Excellence (OAE). The OAE is an academic support office that both remediates students experiencing academic difficulties and proactively gives students information and resources to succeed in medical school.
In this case, OAE faculty review students' self-assessment scores to determine whether academic interventions, such as coaching, working with a student tutor, or delaying the subject exam to allow more time to study, are needed.
Methods
The self-assessments and subsequent subject exam performance data were analyzed using STATA BE Version 17.0 (StataCorp LLC, College Station, TX) to create multivariate linear regression models. Student success on NBME subject exams (the equated percent of questions correct or percentile on the subject exam) was the outcome variable of these models. Variables considered in the models included self-assessment score, how many days before the subject exam the self-assessment was taken, the quarter in which the self-assessment and subject exams were taken, and prior performance on the USMLE Step 1. Each model was evaluated to confirm that it did not grossly violate the assumptions of multivariate regression, including normality, linearity, homoscedasticity, and lack of collinearity.
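The model fitting described above was performed in Stata; purely as an illustration, an equivalent ordinary least squares fit can be sketched in plain Python. The rows below are hypothetical toy data, not study data, and the variable layout (four predictors plus an intercept) merely mirrors the description in the text.

```python
# Sketch of a four-predictor OLS fit analogous to the models described above.
# All data values here are invented for illustration; the study used Stata.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """OLS coefficients [intercept, b1, ..., bk] via the normal equations."""
    Z = [[1.0] + list(row) for row in X]      # prepend intercept column
    p = len(Z[0])
    XtX = [[sum(Z[i][a] * Z[i][b] for i in range(len(Z))) for b in range(p)]
           for a in range(p)]
    Xty = [sum(Z[i][a] * y[i] for i in range(len(Z))) for a in range(p)]
    return solve(XtX, Xty)

# Hypothetical rows: [self-assessment score, days before exam, quarter, Step 1]
X = [[68, 14, 1, 230], [75, 21, 2, 212], [80, 10, 3, 248],
     [60, 28, 4, 205], [72, 7, 1, 241], [85, 14, 2, 226]]
y = [70, 76, 82, 62, 73, 86]                  # equated percent correct
beta = ols(X, y)                              # [intercept, b1, b2, b3, b4]
```

A real fit would of course use the full cohort (a six-row toy sample is nearly saturated with five parameters) and would be followed by the assumption checks named in the text.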
A separate model was created for each of the six NBME subject exams to determine which model best accounts for the variability in student performance based on the four variables above, using the omnibus F-test as a marker of each model's statistical significance.
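The omnibus F statistic used as the significance marker can be computed directly from a model's R², the number of predictors, and the sample size. The cohort size below is an assumption for illustration only, since the study's n is not reported in this excerpt; a p-value would additionally require the F-distribution CDF, which statistical packages provide.

```python
def omnibus_f(r2, n, k):
    """Omnibus F statistic for a model with k predictors fit on n
    observations: tests H0 that all slope coefficients are zero.
    F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# The weakest model in the reported range (R^2 = 0.1799) is still large
# relative to typical F critical values for a plausible cohort size
# (n = 180 here is an assumption, not a study figure):
f_low = omnibus_f(0.1799, 180, 4)
```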

Results
Regression models were developed from subject exam scores and the four variables: self-assessment score, the number of days before the subject exam the self-assessment was taken, the quarter in which the exams were taken, and USMLE Step 1 score.
In univariate regression models across disciplines, the self-assessment score (R² range 0.1072-0.3522) and USMLE Step 1 score (R² range 0.1682-0.2965) were the strongest predictors of final subject exam scores, and both were consistently statistically significant (p < 0.05). The remaining variables were only intermittently significant in the linear models. Academic quarter was dropped as a variable from models in which subject exam percentile was the outcome because the NBME norms percentiles by academic quarter. An example of the multivariate graphs produced by the models can be seen in Figure 1 and in Figure 5 of the supplemental digital material.
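For a single predictor, the univariate R² reported above is simply the squared Pearson correlation between that predictor and the subject exam score. A minimal sketch, using invented scores rather than study data:

```python
def univariate_r2(x, y):
    """R^2 of a simple linear regression of y on one predictor x.
    With a single predictor this equals the squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical self-assessment scores vs. subject exam equated percent correct:
sa = [68.0, 75.0, 80.0, 60.0, 72.0, 85.0]
exam = [70.0, 76.0, 82.0, 62.0, 73.0, 86.0]
r2 = univariate_r2(sa, exam)
```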

Discussion
While both the percentile and the equated percent correct outcomes were statistically significant in multivariate regression, the variables in the equated percent correct models accounted for more variability than those in the percentile models, as seen in the R² ranges (R² 0.1072-0.3522). While many programs, including UNC SOM, may use percentile cutoffs to determine academic advancement, these scores are re-normed by the NBME annually [13]. For this reason, the equated percent correct models are favored: they are statistically more robust and remain easily interpretable despite changes in national norms. Although imperfect, this study showed that regression modeling of self-assessment scores can serve as a tool for predicting subject exam scores in a way that has not previously been described.
Prediction improves slightly when the Step 1 score is incorporated into the linear model. Given preexisting data suggesting that standardized test scores predict future standardized test performance [11], the predictive value of Step 1 was not an entirely surprising finding. However, the authors acknowledge concerns that the USMLE Step 1 is a potentially biased tool [14] and that some students who score lower on USMLE Step 1 are often coached to achieve a passing score.
Step 1 scores are no longer a viable metric for most students given the transition to pass/fail grading [3]; however, as shown, the inclusion of additional variables may improve the model further.
Following the development of statistical models that predict student success, the logical next step is to apply them to students actively preparing for the subject exams, to determine which students faculty can proactively approach with academic support structures such as tutoring or coaching. In doing so, these statistical models must demonstrate that their predictive value exceeds that of more conventional, subjective means. Once verified, these models can be used to drive decisions regarding individual students based on their predicted scores, and to drive institutional decisions regarding policies on academic advancement and standardized testing delays for students considered at high risk of subject exam failure.
It is also important to note that the interpretation of these models has significant limitations. In addition to the previously mentioned variables that are difficult to account for in a simple linear regression model, these models are based on a single-center, retrospective study. Additional data collection from other institutions may help improve the validity of the models by including a larger cohort of students. One of the most predictive variables, performance on the USMLE Step 1 exam, is also no longer available as a metric given the change in scoring to a pass/fail system in 2022 [15] and therefore cannot be used in applying some of the specific models described herein. There are also other variables, such as the availability of study time in each clerkship, that may be important to student performance but are not easily quantifiable and may not meet the requirements of linear regression (e.g., homoscedasticity). Medical educators will need to think critically about other metrics to help identify students who may need support and/or who could be at risk of future academic difficulty.
Linear regression is also limited in the scope of variables that can be included in the model. Variables that could impact the reliability of the models, such as self-assessment form version and more abstract life circumstances, are particularly difficult to account for in a statistical model. More robust modeling methods such as neural network models and other machine learning models could help further improve the accuracy of predicted outcomes.

Conclusions
Given the high-stakes nature of subject exams, programs should adopt a mechanism by which they can leverage available data to predict which students are at highest risk of failure. This study showed that multivariate regression, while imperfect, can serve as a potential tool to predict student performance. At the same time, it is important to recognize and consider the limitations of these models in the context of each individual student's academic and career advising. Using statistical analysis, institutions may be able to create simple yet powerful predictive tools to help guide conversations around student support and to inform national conversations regarding grading, student advancement, and other policy initiatives.

Additional Information
Disclosures