Educational Limitations of ChatGPT in Neurosurgery Board Preparation

Objective: This study evaluated the potential of Chat Generative Pre-trained Transformer (ChatGPT) as an educational tool for neurosurgery residents preparing for the American Board of Neurological Surgery (ABNS) primary examination.

Methods: Non-imaging questions from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) online question bank were input into ChatGPT. Accuracy was evaluated and compared to human performance across subcategories. To quantify ChatGPT's educational potential, the concordance and insight of explanations were assessed by multiple neurosurgical faculty. Associations among these metrics as well as question length were evaluated.

Results: ChatGPT had an accuracy of 50.4% (1,068/2,120), with the highest and lowest accuracies in the pharmacology (81.2%, 13/16) and vascular (32.9%, 91/277) subcategories, respectively. ChatGPT performed worse than humans overall, as well as in the functional, other, peripheral, radiology, spine, trauma, tumor, and vascular subcategories. There were no subjects in which ChatGPT performed better than humans, and its accuracy was below that required to pass the exam. The mean concordance was 93.4% (198/212) and the mean insight score was 2.7. Accuracy was negatively associated with question length (R2=0.29, p=0.03) but positively associated with both concordance (p<0.001, q<0.001) and insight (p<0.001, q<0.001).

Conclusions: The current study provides the largest and most comprehensive assessment of the accuracy and explanatory quality of ChatGPT in answering ABNS primary exam questions. The findings demonstrate shortcomings regarding ChatGPT's ability to pass, let alone teach, the neurosurgical boards.


Introduction
Evaluating neurosurgical standards of care necessitates rigorous examination to uphold the integrity and quality of patient treatment and safety [1]. The American Board of Neurological Surgery (ABNS) is the critical entity that oversees the certification of neurosurgeons in the United States. One component of this certification process is the ABNS primary exam, a written exam that assesses the fundamental knowledge and skills necessary for a resident physician to proceed with neurosurgery training. It emphasizes clinical knowledge, judgment, and decision-making in the context of neuroanatomy, neuropathology, clinical neurology, neuroradiology, neurocritical care, and neurosurgical techniques [2]. This annual exam functions as a milestone on the pathway to becoming board-certified and has evolved over time to enhance its objectivity, validity, and relevance to contemporary neurosurgical practice [1,3].
In the past year, advances in artificial intelligence, specifically natural language processing, have led to the emergence of state-of-the-art large language models, which are able to perform a variety of text-based tasks, including solving math problems, taking standardized tests, coding, and even writing poetry [4][5][6][7]. The release of perhaps the most well-known of these models, the Chat Generative Pre-trained Transformer (ChatGPT) (OpenAI, 2022), sent the medical community into a frenetic search for ways to incorporate this innovative technology [8].
There has been great interest in benchmarking the level of medical "understanding" of large language models, resulting in the publication of numerous studies evaluating ChatGPT's performance on the United States Medical Licensing Examination and many medical subspecialty exams [9][10][11][12][13][14][15][16].
Additionally, ChatGPT has shown potential in explaining medical reasoning and analyzing clinical cases, introducing unprecedented possibilities for medical education [10,17]. The use of ChatGPT in the context of neurosurgical education, however, is still a nascent field. A recent systematic review of ChatGPT's potential as an educational tool raised both ethical and practical concerns [18]. While a couple of initial studies have investigated the performance of ChatGPT on practice neurosurgery board questions, these studies used only a fraction of available questions and did not explore ChatGPT's ability to explain its reasoning, a critical component of evaluating its potential use as an educational tool [19,20].
The current study provides the largest and most comprehensive assessment of the accuracy and explanatory quality of ChatGPT in answering ABNS primary exam questions. The findings demonstrate shortcomings regarding ChatGPT's ability to pass, let alone teach, the neurosurgical boards.

Materials and methods

Data collection
The Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) online question bank, parts one through four, was used for this study. This question bank comprises questions that were previously used in the ABNS primary exam and are divided into subjects, including Accreditation Council for Graduate Medical Education (ACGME), anatomy, functional, fundamentals, neurobiology, other, pain, pathology, pediatrics, peripheral, pharmacology, radiology, spine, statistics, trauma, tumor, and vascular. For each subject, aggregate human accuracy statistics were obtained from the SANS website; however, individual data were not publicly available.
Questions that included imaging, as well as those with a number of answer choices other than five, were removed. The remaining questions were input into ChatGPT with the preceding prompt "Select the single best answer to the following multiple-choice question." The response was recorded and compared with the correct answer for accuracy.
ChatGPT was then prompted to give an explanation with "Why did you choose that answer?" A 10% (212/2,120) subject-stratified random sample of explanations was evaluated for concordance and insight as described previously in the literature [10]. Both metrics serve as proxies for language model "understanding" and may also serve as indicators of educational potential. A concordant explanation is defined as one that is not self-contradictory, a highly desirable educational quality, as self-contradictory explanations cause confusion and necessarily include false information. An insight is defined as a factually true statement that does not simply define a term in the question, requires deduction or information not listed in the question, and is distinct from other insights in the explanation. The quantity of insights may serve as a measure of the number of potential learning opportunities provided by the explanation. For each sampled explanation, two neurosurgical faculty independently evaluated whether it was concordant and counted the number of insights. A senior, board-certified neurosurgeon arbitrated any score mismatches.
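Agreement between the two faculty raters was later quantified with Cohen's kappa (see Statistical analysis). As a minimal sketch of how the statistic works, the following Python function computes kappa from two raters' categorical labels; the example labels are hypothetical, and the study's actual analyses were run in R.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of items on which the raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical concordance labels (1 = concordant, 0 = discordant)
kappa = cohens_kappa([1, 1, 0, 1, 0], [1, 1, 0, 1, 0])
```

Identical label sequences yield a kappa of 1.0; agreement no better than chance yields 0.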
This study also explored the role of prompt length on ChatGPT's accuracy. Language models interpret prompts as sequences of standardized "tokens," which are words, word fragments, or individual symbols [21].
To calculate question length, each question was input into a tokenizer, which separates prompts into tokens, similar to how ChatGPT parses prompts [22].
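As a rough illustration of what tokenization does, the sketch below splits text on words and punctuation. This is only a crude stand-in: the tokenizer actually used by GPT models is based on byte-pair encoding, so true token counts differ from this approximation.

```python
import re

def rough_token_count(text):
    """Approximate a token count by splitting on words and punctuation.

    Real GPT tokenizers use byte-pair encoding, which can split words
    into sub-word fragments, so this is only an illustrative proxy.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))

prompt = "Select the single best answer to the following multiple-choice question."
length = rough_token_count(prompt)
```

Under this scheme, a hyphenated term like "multiple-choice" counts as three tokens (two words plus the hyphen), mirroring how tokenizers treat punctuation as separate symbols.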
This study did not require institutional review board approval, as there were no human or animal study participants. Additionally, this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Statistical analysis
ChatGPT and human performance were compared using chi-squared tests. Question length and accuracy were compared for each subject, with the overall association evaluated by subject-stratified linear regression. The inter-rater reliability of concordance and insight scores was evaluated using Cohen's kappa coefficient. Concordance and insight were compared between correctly and incorrectly answered questions using Mann-Whitney U tests. Correction for multiple hypothesis testing was performed for each set of tests using the Benjamini-Hochberg procedure with a predetermined false discovery rate of 0.05. Adjusted p-values are denoted as q-values. All analyses were implemented using R (v4.2.1; R Core Team, 2022).
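The Benjamini-Hochberg procedure converts raw p-values into q-values by rescaling each p-value by the number of tests divided by its rank, then enforcing monotonicity from the largest p-value down. A minimal Python sketch of the procedure (the study itself used R; the example p-values are made up):

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, capping each q-value at the
    # q-value of the next-larger p (monotonicity step-up constraint)
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        q[i] = prev
    return q

# Hypothetical raw p-values from four tests
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A finding is then declared significant when its q-value falls below the predetermined false discovery rate (0.05 here).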

Results
Of an initial 2,816 questions, 690 (24.5%) involved imaging and were discarded. A further six (0.2%) questions had a number of answer choices other than five and were removed, leaving a total of 2,120 (75.3%) questions for input into ChatGPT. For both ChatGPT (13.1%, 277/2,120) and humans (12.6%, 355/2,816), the plurality of questions was in the vascular subcategory.

FIGURE 2: ChatGPT's Accuracy Compared to Question Length, Stratified by Subject
ChatGPT's percent accuracy is compared to token length and stratified across subjects, which are denoted by points that are proportional in size to the number of questions ChatGPT answered within those subjects. The black dotted line represents the subject-stratified, linearly approximated relationship between question length and ChatGPT's percent accuracy, while the gray-shaded region represents the 95% confidence interval of this linearly approximated relationship.
ChatGPT: Chat Generative Pre-trained Transformer

A total of 212 explanations were randomly selected with subject-proportionate representation for independent evaluation by two neurosurgical attendings. Only eight (3.8%) responses required arbitration by a third neurosurgical attending due to differences in assigned concordance (κ=0.93) or insight (κ=0.96) scores.
The rate of concordance was 93.4% (198/212), while the mean insight score was 2.7. Correctly answering a question demonstrated a significant positive association with both concordance (p<0.001, q<0.001) and insight (p<0.001, q<0.001) (Table 2). Mean concordance percentages and mean insight scores, each with 95% confidence intervals, are shown overall as well as for correctly and incorrectly answered questions. Significant differences are shown with p- and respective corrected q-values.

Discussion

Accuracy
When assessed on more than 2,000 questions using all parts of the CNS SANS question bank, ChatGPT achieved a fairly unimpressive overall accuracy of 50.4% (1,068/2,120). Our findings corroborate those of Hopkins et al., who found a similar accuracy of 54.9% (262/477) using non-imaging questions from another question bank [19]. In contrast, Ali et al. reported a much higher accuracy of 73.4% (367/500) using both imaging and non-imaging questions from part one of the CNS SANS question bank [20]. Notably, part one of the question bank has disproportionately few vascular questions, which we found was the subject in which ChatGPT had the worst performance; however, other factors, such as the version of ChatGPT used and prompt engineering, likely also play a role in the difference in reported performance. High sensitivity to the specific wording of input questions reveals a key limitation in ChatGPT's educational utility, as learners may not always pose questions in a way that triggers an optimal response.
The accuracy required to pass the neurosurgery board exam in 2023 was 72%. The overall mean human accuracy of 70.3% (counts unavailable) across all parts of the question bank was similar to this cutoff and significantly higher than that of ChatGPT. This average human performance includes that of more junior residents who may be several years from taking the board exam for credit. Additionally, residents use the question bank as a study tool, meaning that the expected level of performance at the end of using the question bank and at the time of the actual exam would be higher than the average performance measured while using the question bank. In contrast, we would not expect ChatGPT to perform differently on the real board exam. Beyond this performance gap, the version of ChatGPT used in this study cannot interpret images, which are present in nearly a quarter of the prompts in the CNS SANS question bank and represent a critical component of both the real exam and neurosurgical practice.
In addition to performing worse overall, ChatGPT also performed significantly worse than humans in the functional, other, peripheral, radiology, spine, trauma, tumor, and vascular categories. ChatGPT did not perform significantly differently than humans in the ACGME, fundamentals, pain, pharmacology, and statistics categories, as each of these subjects had fewer than 20 questions, providing minimal statistical power to detect such differences. ChatGPT also did not perform significantly worse in the anatomy, neurobiology, pathology, or pediatrics categories, possibly because information regarding these subjects was better represented in its training data.
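The subject-level comparisons above rest on chi-squared tests of two proportions, whose power depends heavily on question counts. A minimal sketch of such a test (the study used R; the counts below are hypothetical, not study data):

```python
import math

def chi2_two_proportions(k1, n1, k2, n2):
    """Pearson chi-squared test (1 df, no continuity correction)
    comparing two proportions k1/n1 vs k2/n2."""
    # Pooled proportion under the null hypothesis of equal rates
    p_pool = (k1 + k2) / (n1 + n2)
    observed = [k1, n1 - k1, k2, n2 - k2]
    expected = [n1 * p_pool, n1 * (1 - p_pool),
                n2 * p_pool, n2 * (1 - p_pool)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-squared variable with 1 df
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical: 80/100 correct vs 50/100 correct
chi2, p = chi2_two_proportions(80, 100, 50, 100)
```

With only a handful of questions per subject, even large accuracy gaps produce small chi-squared statistics, which is why the low-count subjects above showed no significant differences.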
Corroborating previous reports, subjects with longer average question lengths were generally those on which ChatGPT performed worse [20]. As with humans, ChatGPT's differential performance across subjects may reveal differences in the ability to handle subject-specific complexity. Differences in question length across subjects may also in part reflect stylistic differences among question writers.
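The length-accuracy association was summarized by a linear fit and its R². As an illustration of the underlying computation (the study's regression was subject-stratified and run in R; this is a plain ordinary least-squares sketch on made-up points):

```python
def ols_fit(x, y):
    """Ordinary least-squares line y = a + b*x, returning (a, b, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope from centered cross- and self-products
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # R^2: proportion of variance in y explained by the fit
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical (mean token length, percent accuracy) pairs per subject
lengths = [40, 55, 70, 85, 100]
accuracy = [62, 58, 50, 44, 41]
intercept, slope, r2 = ols_fit(lengths, accuracy)
```

A negative slope with a moderate R², as reported in the Results (R2=0.29), indicates that question length explains part, but far from all, of the variation in accuracy across subjects.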

Explanatory quality
A recent report by Mannam et al. introduced a novel scoring system for explanatory quality when answering neurosurgery board questions; however, the significance and reproducibility of these scores remain unclear given their recent publication [23]. Concordance and insight are metrics that have been used previously to assess the explanatory quality of large language models [10]. The high inter-rater reliability of these scores demonstrates their reproducibility in the current study. The positive association found between correctly answering a question and both concordance and insight indicates that ChatGPT has a better understanding of correctly answered questions. The number of insights provides an important benchmark against which to compare the educational value of future language models.
Although concordance is generally a desirable quality, even incorrectly answered questions were associated with high concordance (87.1%, 88/101) in this study. The lack of self-contradiction in a concordant explanation may seem "more confident," illustrating ChatGPT's potential to mislead trainees who do not know the correct answer with a superficially confident, but ultimately incorrect, explanation.

Strengths and limitations
The current study faces several limitations. The exclusion of imaging questions reduces the generalizability of the results due to the difference in complexity from the real board exam. The smaller sample of explanations evaluated for concordance and insight scores also precludes analysis of subject-specific educational quality. Finally, different results may be obtained by using different ChatGPT versions, alternative prompt engineering, or other large language models such as GPT-4 (OpenAI, 2023) or Bard (Google, 2023) [5,24].
Despite these limitations, this study contributes significantly to the existing literature on this topic. The inclusion of all parts of the CNS SANS question bank more than quadruples the sample sizes of previous studies, yielding more precise performance metrics across a wider variety of questions as well as allowing for more granular, subject-specific analyses. Using token, rather than word, counts more accurately reflects ChatGPT's representation of question length. Most notably, the evaluation of explanatory quality through concordance and insight provides a deeper understanding of ChatGPT's capabilities and shortcomings as an educational tool beyond assessments of accuracy alone. This process was conducted by corroborating the scores of multiple neurosurgical faculty to ensure consistency and mitigate bias. Finally, the use of non-parametric statistical tests and correction for multiple hypothesis testing minimizes the likelihood of spurious positive findings [25].

Conclusions
While future improvements will undoubtedly, and perhaps sooner than we expect, allow successors of ChatGPT to assist in neurosurgical education, present models suffer from substantial deficiencies. The current study provides the most extensive analysis to date of ChatGPT's accuracy on ABNS primary exam questions as well as an exploration of its explanatory quality, a key facet in the evaluation of its ability to educate neurosurgical trainees.
Future research will likely involve multimodal models that can synthesize multiple types of input data, such as imaging and text, to simulate more closely the clinical reasoning performed by neurosurgeons. Extending our evaluation of explanatory quality to a larger sample of questions can reveal the neurosurgical subcategories in which ChatGPT would most benefit from interacting with additional training material.
Although much excitement surrounds recent advancements in artificial intelligence, particularly with respect to large language models, studies such as ours underscore the current gaps in both accuracy and explanatory quality within highly specialized domains such as neurosurgery that limit their educational utility.

FIGURE 1: Comparison of ChatGPT and Human Accuracy, Stratified by Subject

ChatGPT (red) and mean human (blue) percent accuracy are compared across subjects, with the error bars representing 95% confidence intervals. The dotted green line represents the minimum passing percentage in 2023 (72%). The dotted black line represents the expected percent performance obtained through random selection (20%).

TABLE 1 : Comparison of ChatGPT and Human Accuracy, Stratified by Subject
*Percentages may not sum to 100 due to rounding. †Significant without correction for multiple hypothesis testing (p<0.05). ‡Significant following correction for multiple hypothesis testing (q<0.05). Subject distributions are represented by question counts with corresponding percentages. Mean accuracy is represented by percentages with 95% confidence intervals. Significant differences are represented by p- and respective corrected q-values. ACGME: Accreditation Council for Graduate Medical Education, ChatGPT: Chat Generative Pre-trained Transformer

TABLE 2 : Explanatory Quality, Stratified by Accuracy
*Significant without correction for multiple hypothesis testing (p<0.05). †Significant following correction for multiple hypothesis testing (q<0.05).