Evaluating the Artificial Intelligence Performance Growth in Ophthalmic Knowledge

Objective: We aim to compare the capabilities of Chat Generative Pre-Trained Transformer (ChatGPT)-3.5 and ChatGPT-4.0 (OpenAI, San Francisco, CA, USA) in addressing multiple-choice ophthalmic case challenges. Methods and analysis: Both models' accuracy was compared across different ophthalmology subspecialties using multiple-choice ophthalmic clinical cases provided by the American Academy of Ophthalmology (AAO) "Diagnose This" questions. Additional analysis was based on image content, question difficulty, the character length of the models' responses, and the models' alignment with responses from human respondents. χ2 test, Fisher's exact test, Student's t-test, and one-way analysis of variance (ANOVA) were conducted where appropriate, with p<0.05 considered significant. Results: GPT-4.0 significantly outperformed GPT-3.5 (75% versus 46%, p<0.01), with the most noticeable improvement in neuro-ophthalmology (100% versus 38%, p=0.03). While both models struggled with uveitis and refractive questions, GPT-4.0 excelled in other areas, such as pediatric questions (82%). On image-related questions, GPT-4.0 also displayed superior accuracy that trended toward significance (73% versus 46%, p=0.07). GPT-4.0 performed better on easier questions (93.8% (least difficult) versus 76.2% (middle) versus 53.3% (most difficult), p=0.03) and generated more concise answers than GPT-3.5 (651.7±342.9 versus 1,112.9±328.8 characters, p<0.01). Moreover, GPT-4.0's answers were more in line with those of AAO respondents (57.3% versus 41.4%, p<0.01), showing a strong correlation between its accuracy and the proportion of AAO respondents who selected GPT-4.0's answer (ρ=0.713, p<0.01). Conclusion and relevance: Our study demonstrates that GPT-4.0 significantly outperforms GPT-3.5 in addressing ophthalmic case challenges, especially in neuro-ophthalmology, with improved accuracy even on image-related questions. These findings underscore the potential of advancing artificial intelligence (AI) models to enhance ophthalmic diagnostics and medical education.


Introduction
The intersection of artificial intelligence (AI) and healthcare presents opportunities for improved clinical decision-making and medical education. The rising competency of AI in diagnosing diverse ocular conditions underscores its potential within ophthalmology [1,2]. In this field, natural language processing (NLP) models, notably the Generative Pre-trained Transformer (GPT) series developed by OpenAI (San Francisco, CA, USA), have brought promising potential for language understanding and healthcare diagnostics [3,4]. GPT models are trained on a textual database, and they learn to generate coherent and contextually appropriate responses based on the abstract relationships between words (tokens) within the neural network [5]. Previous studies show that GPT-3.5 achieves over 50% accuracy on the United States Medical Licensing Examination (USMLE), nearing the 60% pass mark. Furthermore, over 90% of AI responses provided substantial insights, demonstrating deductive reasoning that could be beneficial for human learners [6]. The GPT models' capabilities extend beyond general medical education. For example, when tested with practice ophthalmic board questions, ChatGPT was able to answer 46% of them correctly in January 2023 [7].
Since then, the GPT model has been updated from GPT-3.5 to GPT-4.0, showcasing potential enhancements in AI-enabled medical education. The new model developed better contextual understanding, enhanced topic consistency, and markedly increased accuracy, as evidenced by its success on professional and academic assessments, improving from the 10th percentile to the 90th percentile on the Uniform Bar Exam [8]. Recent studies utilizing the Basic and Clinical Science Course (BCSC) Self-Assessment Program for the Ophthalmic Knowledge Assessment Program (OKAP) examination have shown vast improvements in recent ChatGPT iterations that can match the accuracy rate of human respondents [9]. Yet, the AI performance growth seen in the GPT models, particularly on non-board-style ophthalmology clinical cases, remains empirically under-examined. Our study compares the capabilities of ChatGPT-3.5 and ChatGPT-4.0 using ophthalmic case challenges.
Clinical ophthalmology questions were sourced from the 2022 release of the American Academy of Ophthalmology (AAO) "Diagnose This," a compendium of weekly clinical ophthalmology multiple-choice questions [10]. We chose AAO "Diagnose This" due to its focus on clinical decision-making rather than conventional standardized testing. The 2022 release ensured that the questions could not have been indexed by the GPT models, whose training data predate January 1, 2022. Because the study exclusively utilized public data and did not involve human participants, it adhered to the American Association for Public Opinion Research (AAPOR) reporting guidelines and did not require ethical clearance.
The primary objective of the study was to compare the accuracy of GPT-3.5 and GPT-4.0 across different subspecialties of ophthalmology. The secondary objectives included evaluating performance on image-related questions, varying question difficulty, the character length of model responses, and alignment with responses from AAO respondents.
All 52 questions from the 2022 AAO "Diagnose This" were categorized based on the specified ophthalmology subspecialty: cornea/anterior segment, glaucoma, neuro-ophthalmology, ocular pathology/oncology, oculoplastic/orbit, pediatric, refractive, retina, and uveitis. By opening and utilizing a premium OpenAI account with no prior usage, we ensured zero preceding conversation history. Questions and their corresponding choices were inputted directly into GPT-3.5 and GPT-4.0 without initial prompts. To maintain the integrity of responses, we cleared each prior ChatGPT conversation, since the model's replies can be shaped by ongoing dialogues. Example inputs and outputs can be referenced in the Results. Responses were assessed for accuracy by checking the selected answer in the GPT output against AAO's correct answer. Any ambiguity in a model's choice triggered a re-prompt instructing it to "choose one of the following options."

For the secondary analysis, we aimed to analyze ChatGPT's performance based on not only the subject but also the level of difficulty, the type of question, and agreement with human respondents. As the AAO cases lacked formal difficulty labels, we inferred difficulty from the accuracy of AAO respondents: least (≥80%), middle (51%-79%), and most difficult (≤50%). Questions referencing images were tagged as "image" cases. For these questions, during an "image-aided" cycle, GPT model inputs were supplemented with descriptions based on the image interpretations published in the AAO answers. The "combination" runs combined raw text inputs for non-"image" queries with the "image-aided" inputs. Further analysis examined the complexity of GPT model responses through the verbosity of the models' responses, measured by output character length; we sought to determine whether there was any correlation between output length and the inherent difficulty of the question. Moreover, we conducted a segment-based analysis, categorizing questions into those pertaining to the anterior segment of the eye and those related to the posterior segment, and we differentiated between purely textual questions and those that referenced or were based upon images. Lastly, to gauge the alignment of the GPT models with human responses, we quantified the proportion of AAO respondents' selections that matched the GPT models' choices.

χ2 test, Fisher's exact test, Student's t-test, and one-way analysis of variance (ANOVA) were conducted where appropriate [11]. The homogeneity of variance was assessed using Levene's statistic; where homogeneity was absent, Welch's ANOVA was used instead. Correlation was calculated using Spearman's coefficient. All statistical analyses were conducted using Statistical Package for the Social Sciences (SPSS) version 29 (IBM SPSS Statistics, Armonk, NY, USA), with p < 0.05 considered significant. Patients and the public were not involved in the design, conduct, reporting, or dissemination plans of the research.
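For illustration, the core comparisons above can be reproduced with pandas and SciPy; the analysis itself was performed in SPSS, so the following is only a minimal sketch, and the file name, column names, and tier labels are hypothetical placeholders.

```python
# Minimal sketch of the statistical comparisons described above (the study
# used SPSS v29). The CSV and column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

df = pd.read_csv("diagnose_this_2022.csv")  # hypothetical: one row per question per model

def difficulty_tier(aao_pct_correct: float) -> str:
    # Inferred difficulty from the accuracy of AAO respondents, per the tiers above
    if aao_pct_correct >= 80:
        return "least"
    return "middle" if aao_pct_correct >= 51 else "most"

df["difficulty"] = df["aao_pct_correct"].apply(difficulty_tier)

# Overall accuracy, GPT-4.0 vs GPT-3.5: chi-square on a 2x2 contingency
# table, falling back to Fisher's exact test if any expected count is < 5
table = pd.crosstab(df["model"], df["correct"])
chi2, p, _, expected = stats.chi2_contingency(table)
if (expected < 5).any():
    _, p = stats.fisher_exact(table)

# Response verbosity: independent-samples t-test on character counts
gpt35 = df.loc[df["model"] == "GPT-3.5", "char_length"]
gpt40 = df.loc[df["model"] == "GPT-4.0", "char_length"]
t, p_len = stats.ttest_ind(gpt40, gpt35)

# Accuracy across difficulty tiers: Levene's test for homogeneity of
# variance, then one-way ANOVA (the study substituted Welch's ANOVA
# where homogeneity failed; not shown here)
tiers = [g["correct"].astype(int) for _, g in df.groupby("difficulty")]
levene_stat, levene_p = stats.levene(*tiers)
f, p_anova = stats.f_oneway(*tiers)

# Agreement with AAO respondents: Spearman correlation between GPT-4.0's
# correctness and the share of respondents choosing its answer
g4 = df[df["model"] == "GPT-4.0"]
rho, p_rho = stats.spearmanr(g4["correct"].astype(int), g4["aao_agreement_pct"])
print(f"chi2 p={p:.3f}, t-test p={p_len:.3f}, ANOVA p={p_anova:.3f}, rho={rho:.3f}")
```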
Among the 30 (58%) image-related questions, GPT-4.0 demonstrated greater accuracy than GPT-3.5 (73% versus 46%, p=0.07). There was no notable difference between the two models in performance across subspecialty categories. The performance gap between the models remained consistent when they were provided with image interpretations: GPT-3.5 improved its accuracy by 14%, while GPT-4.0 improved by 16%. When presented with the combination of text-based questions and image questions with interpretations, GPT-4.0 still outperformed GPT-3.5 by a large margin (83% versus 54%, p<0.01), with a significant improvement for neuro-ophthalmology cases (p=0.03). Examples of the GPT models' inputs and outputs for the text-only and image questions can be found in Table 2.
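As an illustration of how such inputs can be assembled, the sketch below builds a "combination" run prompt from a question record; the dictionary keys and helper function are hypothetical, not the study's actual tooling (questions were entered into the ChatGPT interface manually).

```python
# Hypothetical prompt assembly for the "combination" run: raw text for
# text-only questions; for "image" questions, the AAO-published image
# interpretation is appended as text. All keys below are placeholders.
def build_prompt(question: dict) -> str:
    prompt = question["stem"] + "\n" + "\n".join(question["choices"])
    if question.get("image_interpretation"):  # present only for "image" cases
        prompt += "\nImage findings: " + question["image_interpretation"]
    return prompt

example = {
    "stem": "A 7-year-old presents with ... What is the most likely diagnosis?",
    "choices": ["A) ...", "B) ...", "C) ...", "D) ..."],
    "image_interpretation": "Fundus photograph shows ...",
}
print(build_prompt(example))
```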

TABLE 2: Sample AAO question inputs and GPT model outputs
Abbreviations: AAO: American Academy of Ophthalmology, GPT: Generative Pre-trained Transformer

When the questions were categorized by difficulty according to the percentage of AAO respondents who answered each correctly, GPT-3.5 did not show any difference in accuracy across difficulty levels (p=0.30), while GPT-4.0 performed better on easier questions relative to harder ones (94% (least difficult) versus 76% (middle) versus 53% (most difficult), p=0.03) (Table 3). There was no significant difference between the two models in terms of accuracy or response length for anterior versus posterior segment questions or image versus text-only questions. Overall, GPT-4.0 generated shorter answers than GPT-3.5 (651.7±342.9 versus 1,112.9±328.8 characters, p<0.01). The official AAO explanations were longer than GPT-4.0's answers by 310.2±147.5 characters on average (p<0.01).

TABLE 3: GPT model analysis based on response complexity (character length) and AAO respondent agreement
Complexity analysis was performed using character counts excluding spaces. AAO agreement was reported as the percentage of AAO respondents who chose the same answer as the GPT model.
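For concreteness, the complexity metric reduces to a character count with spaces stripped; below is a minimal sketch, where the response list is a hypothetical placeholder.

```python
import statistics

def char_count_no_spaces(text: str) -> int:
    # Character count excluding spaces, matching the complexity metric above
    return len(text.replace(" ", ""))

# Hypothetical placeholders for the collected GPT-4.0 outputs
gpt40_responses = ["The correct answer is B. ...", "The most likely diagnosis is ..."]
lengths = [char_count_no_spaces(r) for r in gpt40_responses]
print(f"{statistics.mean(lengths):.1f} ± {statistics.stdev(lengths):.1f} characters")
```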

Discussion
Our study highlights significant advancements in the application of AI for ophthalmology, particularly in the performance metrics between GPT-4.0 and GPT-3.5. Specifically, GPT-4.0 significantly outperformed GPT-3.5 in overall accuracy and in the neuro-ophthalmology subspecialty, as detailed in Table 1. Equally important, GPT-4.0 showed strength in handling image-related questions, maintaining its advantage even when supplied with image interpretations, as shown in Table 3.
The effectiveness of NLP models in addressing AAO clinical scenarios is further evidenced by their improving track record on ophthalmic board practice questions [7,9,12]. Mihalache et al.'s study highlighted a significant improvement in ChatGPT's performance on the OphthoQuestions question bank for OKAP examinations, increasing from 46% to 58% over a one-month period [7]. Remarkably, within just a few months of the release of GPT-4.0, it achieved an average accuracy of 71.6% in ophthalmology, nearly matching the human respondents' accuracy of 72.2% on a dataset of 250 questions [9]. The advancements in NLP model performance on specialized knowledge are further corroborated by a neurosurgery study that assessed GPT-4.0's performance on oral board examinations; there, it achieved an impressive accuracy rate of 82.6%, far exceeding GPT-3.5's 62.4% [13]. This upward trajectory underscores the dynamic evolution of NLP models, which are becoming robust and reliable in a relatively short period. Our study, covering ophthalmology clinical case questions, reaffirms this rapid advancement, emphasizing the speed at which conversational AI systems are mastering complex, specialized tasks.
OpenAI's release of GPT-4.0 brought several improvements. When applied to specialized domains such as ophthalmology, these advancements lead to increased accuracy. One of the most notable changes is sheer scale. The original GPT model had 117 million parameters, GPT-3.0 had 175 billion, and GPT-4.0 reportedly boasts a staggering 170 trillion [14]. The massive increase in model size allowed the later models to store more information and recognize more nuanced patterns. Using more extensive and diverse datasets, including books, articles, and websites, the GPT models capture a broader spectrum of ophthalmic knowledge. Aside from the scale of training, newer models employ more advanced training methodologies. GPT-3.0 introduced few-shot learning: unlike traditional machine learning models that require extensive labeled data for specific tasks, the most recent GPT systems can employ meta-learning to complete new tasks based on the patterns they identify from a limited number of examples [15]. A recent study has shown that GPT-4.0 increasingly excels at zero-shot learning and can perform tasks with no prior training [16]. This process leverages semantic relationships between categories. For instance, if the model knows that a zebra is like a horse but has stripes, it can still recognize one, even if it has never seen the animal.

Although OpenAI explicitly stated in their GPT-4.0 technical report that they would not disclose the exact internal architecture, potentially due to the growing competition in the AI field, they did provide some insight into the updates from their previous model. OpenAI indicates that GPT-4.0 introduced a rule-based reward model (RBRM), complementing the reinforcement learning with human feedback (RLHF) used for GPT-3.5 [8]. The RBRM approach enhances the fine-tuning process by ranking various system responses and providing a reward signal to the top output, based not only on user feedback but also on preset parameters. This steers the language model toward generating correct content and reduces "hallucinations," unwarranted yet confident answers [17,18]. Furthermore, the context window of GPT-4.0 is four to six times longer than that of GPT-3.5 [8]. The context window refers to the number of previous tokens or words the model uses to formulate its response, allowing for greater relevance, coherence, and quality in its outputs. This enhanced contextual understanding boosts the network's capacity to detect intricate patterns, which can help interlink symptoms or systemic conditions.
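To make the zero-shot versus few-shot distinction concrete, the sketch below shows both prompting styles through the OpenAI Python SDK. The v1-style client, the model identifier, and the question text are assumptions for illustration only; our protocol used the ChatGPT web interface, not the API.

```python
# Illustrative zero-shot vs few-shot prompting; the model name and all
# message content are placeholders, not the study's actual inputs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A 62-year-old presents with ... Which is the most likely diagnosis? A) ... B) ... C) ... D) ..."

# Zero-shot: the model answers with no worked examples in the prompt
zero_shot = client.chat.completions.create(
    model="gpt-4",  # placeholder model identifier
    messages=[{"role": "user", "content": question}],
)

# Few-shot: prior question/answer pairs let the model infer the task
# format and reasoning pattern before it sees the new question
few_shot = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Example case 1 ... A) ... B) ..."},
        {"role": "assistant", "content": "Answer: B"},
        {"role": "user", "content": "Example case 2 ... A) ... B) ..."},
        {"role": "assistant", "content": "Answer: A"},
        {"role": "user", "content": question},
    ],
)
print(few_shot.choices[0].message.content)
```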
Regarding the conciseness of GPT-4.0's answers, this attribute may have significant practical implications. Given the widespread use of practice questions among medical students and the proven association between the number of completed practice questions and board examination performance [19], the ability to provide explanations to students gives NLP models an applicable role in medical education. With responses shorter than both ChatGPT-3.5's and the AAO-provided explanations, ChatGPT-4.0's changes could translate into succinct feedback for students seeking abbreviated explanations. Moreover, more concise responses could indicate better computational efficiency in data processing, saving both time and resources [20]. Furthermore, the increased alignment of GPT-4.0's answers with those of human respondents from the AAO is particularly promising. This greater congruence suggests that GPT-4.0 is becoming more attuned to medical consensus and established clinical reasoning pathways [21], raising the possibility that future versions of the GPT series could align even more closely with clinical best practices.
While the rapid evolution of the GPT models offers substantial benefits for medical training, there are risks of misinformation. As highlighted in Table 2, GPT models are prone to "hallucinations," which could lead to misdiagnosis if not caught by a knowledgeable expert [9,17]. These instances call into question the reliability of the model, particularly where precise clinical decisions are essential. Additionally, the differential performance of GPT-4.0 across varying levels of question difficulty illuminates nuanced challenges that constrain its utility in comprehending and responding to ophthalmology cases. While GPT-4.0 performs admirably on straightforward, single-step questions, its decreased efficacy with image-based or increasingly complex queries points to potential limitations in multistep reasoning and inference, as noted by Cai et al. [9]. The viability of GPT models in an inherently visual field such as ophthalmology is hampered by their inability to natively process image data [12]. This is particularly concerning for real-world clinical applications, where medical decision-making often involves a labyrinth of interrelated variables, from patient histories and diagnostic imaging to multifaceted treatment algorithms. Furthermore, the model's struggle with image questions that require inference highlights its shortcomings in nuanced clinical settings, where physicians routinely make probabilistic decisions based on ambiguous visual data [20]. While GPT-4.0 may serve as a resource for answering basic queries, reliance on it for more complex scenarios or visual-based diagnostics would be premature. Although our analysis did not evaluate biases in these models, there are known shortcomings: the algorithms and data labeling used for AI learning may introduce or emphasize certain features or data points [22]. This becomes especially problematic in healthcare, where biased models can lead to unequal treatment for specific patient groups. Training these models on data predominantly sourced from particular populations might associate certain diseases with specific demographic factors, furthering potentially detrimental stereotypes. Despite the advancements of AI, it is critical to consider these limitations and use these models as complementary resources rather than absolute substitutes for human expertise.
While our methodology was designed to broadly assess GPT performance, the investigation had certain limitations. Most notably, the use of multiple-choice questions does not fully capture the intricate dynamics of clinical decision-making. Although these questions can test factual knowledge effectively, they oversimplify scenarios and present only the relevant parts of the presentation. In practice, physicians do not simply select from predefined options; they must consider a broad spectrum of symptoms, patient history, and individual factors that may affect potential outcomes. This depth and complexity necessitate evaluating the patient in their entirety, after which clinicians parse out the key clinical findings that lead to the appropriate diagnosis and treatment plan. This level of nuanced understanding and experience cannot be fully tested by the multiple-choice format. Moreover, the restricted pool of post-training AAO cases made it difficult to conduct a granular analysis of the GPT models' performance across different types of questions. This limitation is particularly relevant when considering the possibility that these models might excel or falter on specific question categories.

Conclusions
GPT-4.0 outperformed its predecessor GPT-3.5 in answering the AAO "Diagnose This" set of multiple-choice ophthalmic clinical cases, especially in neuro-ophthalmology, with improved accuracy even on image-related questions. Achieved within the span of a couple of months, these vastly improved GPT-4.0 results not only signify the rapid pace of technological advancement in machine learning but also underscore the transformative potential AI holds for ophthalmic education and medical practice. AI models could serve as a supplementary educational tool for ophthalmology students, providing instantaneous, data-driven feedback that could enrich traditional learning environments. However, the technology is not without limitations, including the risk of generating fictional information and weakness on more complex, inference-based cases. Future studies should not only focus on accuracy but also assess the level of concordance between AI response rationales and expert consensus. This could help refine the algorithms and ensure they align more closely with established medical knowledge and clinical best practices.
As we contemplate the increasing role of AI in medicine, it is crucial to remember the ethical and professional obligations that physicians carry. Technology should be seen as a complement to, not a replacement for, the expert. Doctors are ultimately responsible for patient outcomes, and thus they need to be cautious and selective in the kinds of technology they incorporate into their practice and training regimens.