Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level

Introduction: Artificial intelligence (AI) models built on large language models (LLMs) operating in non-specific domains have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance based on question type, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents.

Methods: We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis compared the number of questions answered correctly by each AI model, each model's performance within the categorized sub-specialty designations, and each model's performance against results from orthopedic residents classified by post-graduate year (PGY) level.

Results: BARD answered more questions correctly overall (58% vs 54%, p<0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p<0.05). Both AIs performed better on recall questions than on application-of-knowledge questions (p<0.05). Based on previous resident performance data, AI performance ranked in the 42nd-96th percentile for post-graduate year ones (PGY1s), 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s.

Discussion: ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in the basic science sub-specialty and poorly on application-of-knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although the AIs reached the level of a second-year orthopedic resident, they fell short of the passing threshold for the American Board of Orthopedic Surgery (ABOS) examination. Their strength in recall-based inquiries highlights their potential as orthopedic learning and educational tools.


Introduction
In recent years, the fields of machine learning, deep learning, and artificial intelligence (AI) have seen exceptional growth, revolutionizing various sectors such as manufacturing, consumer products, and healthcare. Neural networks, in particular, have advanced the detection of fractures and orthopedic implants, among other medical applications [1][2][3][4][5][6]. Nevertheless, these AI systems are often domain-specific, requiring significant time, resources, and specialized data for their respective fields, which limits their broad applicability and versatility.
Large language models (LLMs), by contrast, are trained on vast amounts of data to generate responses that are more akin to natural language [7]. Operating in non-domain-specific or few-shot contexts, they need minimal data to perform specific tasks. LLMs have the potential to understand, process, analyze, and reason through a diverse array of questions. Recently, two AI models named Chat Generative Pre-trained Transformer (ChatGPT) and BARD, which utilize LLMs in non-specific domain areas, have attracted considerable attention.
Medical education and technology are experiencing a transformation with the emerging application of AI through computer-based models, virtual reality simulations, and tailored learning platforms [8,9]. With the expanding capabilities of AI, it is imperative to consistently evaluate the competence of AI-powered tools, especially generative AI models that can produce flawed reasoning or misinformation. Because these tools are publicly available and can serve as additional informational aids, verifying their accuracy is crucial, especially within the field of orthopedic surgery, where mistakes could be detrimental to patients or lead to the spread of misinformation.
The premise of this study was to explore the percentage of Orthopedic In-Training Examination (OITE) questions the generative pre-trained transformer chatbots, ChatGPT and BARD, could answer correctly. Second, the study investigated whether AI performance varied depending on the sub-specialty subject matter or the taxonomy of the questions (recall, interpretation, and application of knowledge). Third, the study compared the performance of the two LLMs to one another. Lastly, the study investigated how the performance of the LLMs compared with that of orthopedic residents at various training levels, particularly focusing on the likelihood of the LLMs yielding a passing score on the orthopedic surgery written boards, for which the 10th percentile for fifth-year residents is typically considered the passing benchmark.

Materials And Methods
In this experimental study, we used commercially available LLMs named ChatGPT 3.5 (OpenAI, San Francisco, CA) and BARD (Alphabet Inc, Mountain View, CA), which incorporate self-attention mechanisms and a vast array of training data to produce natural language responses in conversational contexts. These models excel at managing long-range dependencies in text, resulting in coherent and contextually relevant responses. Self-attention mechanisms are critical in natural language processing tasks such as language translation and text generation, helping to discern the relationships between words or elements within sentences or entire documents. The synergy of long-range dependencies and self-attention enables the models to understand and generate accurate responses. ChatGPT 3.5 operates as a closed system, confined to a server without internet access, and relies on intrinsic word relationships within its neural network to generate responses. This differentiates it from other chatbots or domain-specific AI that utilize internet-based searches. Conversely, BARD operates similarly but is permitted internet access, potentially enhancing its informational reach.
We selected 757 questions from the actual OITE from the years 2015-2016 and 2022. The 2022 exam served as a benchmark since its questions and answers were not included in the training dataset. We excluded 48% (367 of 757) of the questions because they incorporated images, figures, tables, or charts, leaving 390 questions for BARD. Additionally, three questions were removed from ChatGPT's set due to the AI's inability to provide a definitive answer, resulting in 387 questions for ChatGPT. As ChatGPT is a text-only program, it cannot process questions with non-textual data such as images or figures. We entered each question into ChatGPT's interface in separate chat sessions to prevent any memory retention, which could occur through the LLM's recurrent neural network learning processes.
For evaluation, we entered each question into the chat session and requested the LLM to select an answer. If the LLM failed to choose a single answer or provided multiple answers, we re-prompted with "Select the single best answer." If the LLM still failed to select one, we recorded the question as "did not answer." ChatGPT struggled to provide a single best answer for 0.7% (three of 390) of the questions, which we then excluded. In contrast, BARD managed to respond to all its applicable questions.

Primary and secondary study outcomes
The primary goal was to determine the percentage of questions each LLM accurately answered. Secondary aims included a detailed comparison of performance across ten sub-specialties and three taxonomy classes of questions, benchmarking against orthopedic residents' training levels, and evaluating against a pass rate threshold for the American Board of Orthopedic Surgery (ABOS).
We employed the Buckwalter taxonomic schema to classify question difficulty levels [10]. Among the 390 included questions, 62% (242 of 390) were Tax I (recognition and recall), 13% (52 of 390) were Tax II (comprehension and interpretation), and 25% (96 of 390) were Tax III (application of knowledge). We used the mean and standard deviation of OITE scores by year and post-graduate year (PGY) level to compare the LLMs to orthopedic residents. This included analyzing mean scores, standard deviations, and calculated percentiles for each PGY level [11]. We also assessed the likelihood of the LLMs passing the ABOS written exam based on a correlation between OITE scores in the 10th percentile and ABOS exam failure rates [12].
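The percentile comparison described above can be sketched with a normal approximation: given a cohort's mean raw score and standard deviation, a raw score converts to a z-score and then to a percentile via the standard normal CDF. The function below is an illustrative sketch of this calculation; the numeric values in the usage comment are hypothetical placeholders, not the actual OITE statistics.

```python
from math import erf, sqrt

def percentile_vs_cohort(raw_score: float, cohort_mean: float, cohort_sd: float) -> float:
    """Approximate percentile of a raw score within a resident cohort,
    assuming cohort scores are roughly normally distributed.

    Converts the score to a z-score, then maps it through the standard
    normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    """
    z = (raw_score - cohort_mean) / cohort_sd
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical example: a score of 150 against a cohort with mean 155, SD 12
pct = percentile_vs_cohort(150, 155, 12)
```

A score equal to the cohort mean maps to the 50th percentile, and scores one standard deviation below map to roughly the 16th, matching the standard normal distribution.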

Ethical approval
Since the study did not involve human or animal participants, ethics committee approval was not required.

Statistical analysis
We applied chi-squared tests to assess performance differences between the LLMs. ANOVA contrast deviation was used to evaluate the variance between each sub-specialty and the cohort average. We used omnibus likelihood ratio tests to detect response correctness concerning question subject types and taxonomy classes. If differences were found, binomial logistic regression tests compared the correct and incorrect answers within these categories. Estimated marginal means were computed with 95% confidence intervals, and visual representations were created for both subject types and taxonomy classes. All statistical analyses were performed using Jamovi software version 2.3.21.0 (Sydney, Australia).
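The chi-squared comparison of the two LLMs' correct/incorrect counts amounts to a test on a 2x2 contingency table, which has a well-known shortcut formula. The sketch below illustrates the mechanics only; the counts in the usage example are hypothetical, and the study's actual analysis was run in Jamovi.

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic (1 df, no continuity correction) for the
    2x2 table [[a, b], [c, d]], via the shortcut formula
    n * (ad - bc)^2 / ((a+b) * (c+d) * (a+c) * (b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical correct/incorrect counts for two models:
# model A: 230 correct, 160 incorrect; model B: 205 correct, 185 incorrect
stat = chi2_2x2(230, 160, 205, 185)
```

The resulting statistic is compared against the chi-squared distribution with one degree of freedom (the 5% critical value is about 3.84) to decide whether the two proportions differ significantly.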

Results

Percentage of OITE questions answered correctly
ChatGPT correctly answered 54% (210 out of 387) of the questions and incorrectly answered 46% (177 out of 387). Three questions received no response from the AI; these were excluded because they elicited multiple answers without a clear "single best answer." BARD correctly answered 58% (227 out of 390) of the questions and incorrectly answered 42% (163 out of 390), responding to all posed questions.

Performance in relation to sub-specialty knowledge
ChatGPT's performance varied by sub-specialty. It performed best in sports medicine (73%, 27 of 37) and worst in hand surgery (28%, nine of 32). ANOVA analysis showed sports medicine and basic science scores above average (p=0.006 and p=0.009, respectively), while hand surgery was below average (p=0.007) (Tables 1, 2). BARD's performance was more consistent across sub-specialties, with the highest scores in basic science (71%, 83 of 117) and the lowest in oncology (29%, five of 17). ANOVA indicated basic science above average (p<0.001), with sports medicine, hand surgery, and oncology near average (p=0.076, p=0.057, and p=0.059, respectively) (Table 3).

Performance comparison with orthopedic residents
ChatGPT's performance ranked between the 42nd-95th percentile for PGY1s and between the 1st-7th for PGY5s across different OITE years. However, it likely would not pass the ABOS examination based on the PGY5 10th percentile benchmark (Table 6) [11,12].
BARD performed slightly better, ranking between the 61st-96th percentile for PGY1s and between the 1st-17th for PGY5s. Despite occasionally surpassing the 10th percentile mark, overall performance suggested it also would likely not pass the ABOS examination (Table 6) [11,12].

Performance comparison between ChatGPT and BARD
BARD had a higher percentage of correct answers than ChatGPT (58% vs 54%, p<0.001).

Discussion
AI has become increasingly prevalent in medicine over the past few years, with potential applications in education, interpretation, and information management expanding [4]. Furthermore, it may ultimately enhance our precision in an array of sub-specialty diagnostics and therapeutics [13]. As new AI tools are developed, it is essential to assess, evaluate, and update their competency. In our study, ChatGPT, an AI LLM chatbot, correctly answered 54% of the questions on modern OITE-style exams, and BARD answered 58% correctly. While this places both AIs within the average percentile range for a second-year orthopedic resident, they would be unlikely to pass the ABOS given performance below the 10th percentile of upper-level residents. This result may be attributed to the chatbots' limited ability to answer higher taxonomic-level questions, suggesting the models have difficulty integrating, synthesizing, generalizing, and applying factual knowledge in more nuanced, practical ways. Furthermore, the AIs would likely struggle to pass the ABOS because they cannot interpret and analyze image-based questions, which make up roughly half of the test questions.
There are likely practical benefits and applications of AI in this context. One advantage of AI is its ability to manage large volumes of data, which users can quickly access as knowledge. This study demonstrated that the AI LLMs performed better on recognition, recall, comprehension, and interpretation tasks than on problem-solving and knowledge application. This weakness in applying knowledge has been highlighted in previous publications [14]. Interestingly, in another study, this difference with regard to hierarchical question type was not seen with dermatology knowledge questions [15]. Other research has shown opportunities for AI to use big data for insights and strategies in managing specific diseases, such as opioid use disorders [4]. For example, Liu et al. found that AI and orthopedic surgeons had similar accuracy in identifying tibial plateau fractures [16]. These applications could enhance efficiency and precision in diagnosis and treatment, ultimately improving patient outcomes.
AI can also make educational resources more accessible to patients. A recent study showed that ChatGPT successfully revised complex patient education materials on spine surgery and joint replacement, making them readable at fifth- to sixth-grade levels [17]. Another study proposed that AI could enable educators to transition to mentorship roles by compiling the best learning strategies from top educators, allowing students to enhance their learning experiences independently and efficiently [18]. Furthermore, AI can offer personalized learning experiences tailored to individual students' needs and abilities, potentially improving engagement and knowledge retention for more effective learning. However, more research is needed to determine the extent and degree of these benefits.
This study has several limitations, particularly the inability to incorporate visual identification, interpretation, and integration within the questions. Almost half of the questions contained images, figures, or charts, leading to their exclusion. The actual ABOS and OITE exams include images, and many aspects of musculoskeletal care necessitate interpreting and analyzing images, radiographs, and tactile feedback from physical examinations. The exclusion of image-based questions may have biased the results by potentially omitting more challenging or application-focused questions for the LLM. Moreover, the basic science sub-specialty contained more recall-based questions, which could have inflated the LLM's performance in that area.
Although images play a crucial role in orthopedic surgery, these LLMs rely solely on text input. While AI for image analysis is advancing rapidly, future iterations may be able to assess images. Nonetheless, this preliminary study of text-based questions was sufficient to reveal the LLMs' capabilities and limitations in this context. General limitations of AI models include potential biases or inaccuracies in the datasets they are trained on, which can reflect or amplify existing societal biases or inequalities and may contain outdated information.
Lastly, limitations specific to these LLMs stem from their training on broad, non-specific information. While they excel in summarization, translation, and text generation, they might struggle with context or nuanced language in specialized knowledge areas, leading to inaccurate or misleading responses.

Conclusions
Though ChatGPT and BARD might not pass the ABOS written exam at this point, they offered well-structured explanations for correct answers, achieving results comparable to around the 50th percentile of PGY2 orthopedic residents. Furthermore, the models demonstrated learning capabilities when incorrect answers were corrected, as they retained and consistently applied the corrected information throughout the chat session. Overall, the ability to return well-structured, insightful explanations and correctly answer questions, combined with demonstrated learning capabilities, suggests AI's potential to support and enhance medical education and healthcare in the future.
The LLMs exhibited strengths in recalling facts but faced challenges in applying knowledge. As AI technology advances, particularly in areas like image-based recognition, interpretation, and domain-specific knowledge application, it will be fascinating to observe the ongoing improvements in AI and explore its optimal application in orthopedic education.

TABLE 2 : Analysis of variance performed for ChatGPT sub-specialty question types. Contrasts show differences between variables and their group means, specifically that basic science and sports medicine performed better, while hand surgery performed worse than group averages (p<0.05). Pediatrics trended towards performing better (p=0.089).
BS: basic science, TR: trauma, SM: sports medicine, SP: spine, HK: hip and knee reconstruction, PE: pediatrics, OC: oncology, SE: shoulder and elbow, HA: hand surgery, FA: foot and ankle, AN: anatomy, SE: standard error, t: t-value for analysis of variance contrasts, ChatGPT: Chat Generative Pre-trained Transformer.

TABLE 6 : OITE individual and combined percentile ranking. This table presents the percentile rank for each post-graduate year (PGY). The OITE provides specific mean raw scores and standard deviations for each PGY, enabling the calculation of percentiles for OITE 2015, 2016, and 2022. Based on previous OITE years (2014-2017), a mean raw score and standard deviation can be applied to non-specific OITE questions, such as those from AAOS SAE and all combined questions in testing, as shown below.
OITE: Orthopedic In-Training Examination, PGY: post-graduate year, AAOS: American Academy of Orthopedic Surgeons, SAE: self-assessment examination.