Artificial Versus Human Intelligence in the Diagnostic Approach of Ophthalmic Case Scenarios: A Qualitative Evaluation of Performance and Consistency

Purpose: To evaluate the efficiency of three artificial intelligence (AI) chatbots (ChatGPT-3.5 (OpenAI, San Francisco, California, United States), Bing Copilot (Microsoft Corporation, Redmond, Washington, United States), Google Gemini (Google LLC, Mountain View, California, United States)) in assisting the ophthalmologist in the diagnostic approach and management of challenging ophthalmic cases and compare their performance with that of a practicing human ophthalmic specialist. The secondary aim was to assess the short- and medium-term consistency of ChatGPT’s responses. Methods: Eleven ophthalmic case scenarios of variable complexity were presented to the AI chatbots and to an ophthalmic specialist in a stepwise fashion. Advice regarding the initial differential diagnosis, the final diagnosis, further investigation, and management was asked for. One month later, the same process was repeated twice on the same day for ChatGPT only. Results: The individual diagnostic performance of all three AI chatbots was inferior to that of the ophthalmic specialist; however, they provided useful complementary input in the diagnostic algorithm. This was especially true for ChatGPT and Bing Copilot. ChatGPT exhibited reasonable short- and medium-term consistency, with the mean Jaccard similarity coefficient of responses varying between 0.58 and 0.76. Conclusion: AI chatbots may act as useful assisting tools in the diagnosis and management of challenging ophthalmic cases; however, their responses should be scrutinized for potential inaccuracies, and by no means can they replace consultation with an ophthalmic specialist.


Introduction
Despite being developed as conversational artificial intelligence (AI) systems, large language model (LLM)-based chatbots are currently the focus of intense interest in the field of medicine, particularly in ophthalmology. For example, Chat Generative Pre-trained Transformer (ChatGPT) (OpenAI, San Francisco, California, United States), released in late 2022, has already shown remarkable ability in providing general information and advice for glaucoma [1] and age-related macular degeneration patients [2], answering ophthalmology StatPearls questions [3], and even helping triage ophthalmic emergency cases [4].
AI chatbots have been able to diagnose ophthalmic cases when presented with the full case description, for example, cases extracted from publicly available online clinical databases [5]. This is relatively easy when the description includes all relevant data and certain keywords of the eye condition in question. However, in real life, it rarely happens that the ophthalmologist has the full clinical and laboratory data available; most often, a stepwise approach is followed in order to reach the diagnosis (i.e., first listening to the patient's description of symptoms and forming an initial differential diagnosis list, then proceeding to clinical examination and refining the differential diagnosis, then deciding which laboratory and/or imaging tests are needed to narrow the differential diagnosis list further, etc.). Thus, we conducted this qualitative descriptive study with the aim to assess and comparatively evaluate the efficiency of three of the most widely used AI chatbots (ChatGPT-3.5, Bing Copilot (Microsoft Corporation, Redmond, Washington, United States), Google Gemini (Google LLC, Mountain View, California, United States)) in assisting the ophthalmologist in this stepwise process to diagnose and manage certain, sometimes challenging, ophthalmic cases, and to compare their responses with those of a human ophthalmic specialist. Especially with regard to ChatGPT, we were also interested in assessing its short-term (same day) and medium-term (one month apart) consistency in responses, as well as its ability to improve its diagnostic performance over time.

Materials And Methods
This study was conducted between November 2023 and February 2024 at the General Hospital of Karditsa, Karditsa, Greece. No identifiable personal data were included in the case scenarios, and thus no ethics approval was necessary. Of note, at the time of conducting our study, Google Gemini was still known as Google Bard.
Eleven ophthalmic case scenarios (either factual or fictional) of variable diagnostic difficulty, encompassing various ophthalmic subspecialties, were presented to all three AI chatbots and to a human ophthalmologist with more than 10 years of post-fellowship clinical experience who was blinded to the diagnoses, following this methodology: (i) the patients' complaints were presented and an initial differential diagnosis was requested, and (ii) clinical examination and relevant laboratory findings were presented and a refined diagnosis was requested. If the diagnosis provided by the chatbot was incorrect, further suggestions were offered to rectify it. If the correct diagnosis was provided by the chatbot, depending on the case, advice on further investigation and management was requested (if not already offered by the chatbot).
A conversational language was used in the interaction with the chatbots, trying to mimic human-to-human interaction. To make the process more challenging for both the chatbots and the human ophthalmologist, not all typical symptoms and/or signs of the condition in question were presented in every case, and confounding factors or equivocal findings were even introduced in certain cases. We wanted to mimic real life and evaluate the ability of AI to contribute to the diagnostic thought process based on available and sometimes confounding, rather than complete and fully fitting, clinical data.
The above process was repeated one month later for ChatGPT only, using as input the exact same phrasing as the first time. If the correct diagnosis was offered at an earlier stage, no further prompts were given. The exact same process was repeated once again on the same day (i.e., at the one-month time point) to assess the short-term consistency of ChatGPT responses.
The medium-term consistency of ChatGPT responses was evaluated by critically comparing its initial responses to those given one month later. We also evaluated whether it was able to reach the correct diagnosis earlier on that second occasion. Where ChatGPT's responses could be described as items (i.e., lists of initial differential diagnoses and diagnostic workup), the similarity of responses was assessed using the Jaccard similarity coefficient. Otherwise, due to the descriptive nature of the study and the small number of case scenarios involved, no formal statistical analysis was performed.
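For clarity, the Jaccard similarity coefficient between two sets of response items A and B (e.g., the differential diagnosis lists produced on two separate occasions) is defined as:

```latex
J(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
```

As a purely illustrative (hypothetical) example, if a first list contained five diagnoses, the repeat list contained six, and four diagnoses were shared, then J = 4 / (5 + 6 - 4) = 4/7 ≈ 0.57; a coefficient of 1 indicates identical lists and 0 indicates no overlap.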

Results
Details of the case scenarios and diagnoses are presented in Table 1. When asked for an initial differential diagnosis, both human and AI participants provided a reasonable average number of provisional diagnoses, albeit with a wide variation in the number of diagnoses, depending on the scenario presented (Table 4). Remarkably, only the ophthalmologist included the correct diagnosis in the initial list in all cases. In many cases the chatbots proposed additional appropriate potential diagnoses; however, they also occasionally included inappropriate or highly improbable diagnoses (for example, ChatGPT included retinal detachment in the differential diagnosis of the sudden painful loss of vision after trabeculectomy). When confronted with the task of providing a definite diagnosis, the human ophthalmologist performed better than the AI chatbots, missing only one diagnosis due to atypical clinical presentation and equivocal laboratory findings (Table 5). Cases 6 (CSR) and 11 (marginal keratitis) usually required further prompting and were not always correctly diagnosed by the chatbots.

Case 1 (Aqueous Misdirection After Trabeculectomy)
ChatGPT showed better diagnostic performance at one month.While the first time it needed further prompting to reach the correct diagnosis, in our second interaction, it included it in the initial list of potential diagnoses and was able to correctly diagnose the pathology upon presentation of clinical findings.

Case 2 (Diabetic Sixth Nerve Palsy)
Despite the microischemic nature of the nerve palsy and the inappropriateness of any immediate diagnostic imaging in this case (as correctly identified by the human ophthalmologist), all three AI chatbots advised prompt brain MRI. In relation to blood tests, ChatGPT offered an appropriate laboratory workup similar to the one proposed by the human ophthalmologist, whereas the Bing and Gemini responses were incomplete and contained unnecessary tests (e.g., autoimmune tests or tests for syphilis, Lyme disease, and tuberculosis).

Case 3 (Anterior Uveitis Secondary to Sarcoidosis)
Apart from an initial differential diagnosis, Gemini refused to provide any further input in this case, repeatedly referring to its restrictions on providing medical advice. By contrast, both ChatGPT and Bing offered generally correct advice regarding treatment for anterior uveitis and complemented the laboratory workup proposed by the human ophthalmologist. Furthermore, both of them reached the correct systemic diagnosis (i.e., sarcoidosis) when presented with the laboratory findings.

Case 4 (Central Retinal Artery Occlusion (CRAO) Secondary to Giant Cell Arteritis (GCA))
In this case, all three AI chatbots provided a reasonable diagnostic workup for CRAO without major omissions, which complemented the already quite comprehensive plan proposed by the human ophthalmologist. With regard to the systemic diagnosis underlying CRAO, responses varied. ChatGPT gave a clear-cut diagnosis of GCA and proposed further management (i.e., temporal artery biopsy and initiation of high-dose systemic steroids). Gemini acknowledged GCA as a possibility but stated that the C-reactive protein (CRP) value was indecisive at this point. By contrast, the human ophthalmologist and Bing offered non-arteritic central retinal artery ischemia as the final diagnosis. Bing justified its response by invoking the equivocal CRP value, while the human specialist added the absence of relevant GCA symptoms (e.g., temporal headache, jaw claudication) as additional factors affecting his diagnostic thinking. Interestingly, when we later returned to our discussion with Bing and added symptoms such as jaw claudication and shoulder pain to the case scenario, it changed its diagnosis to arteritic CRAO.

Case 5 (Charles Bonnet syndrome)
All three chatbots recognized the clinical entity of Charles Bonnet syndrome and included it in the differential diagnosis list.

Case 6 (Central Serous Retinochoroidopathy (CSR))
This case proved to be a tough diagnostic challenge for ChatGPT and Gemini, as they both needed further prompting to reach the correct diagnosis, while Bing provided the correct diagnosis as soon as the clinical findings were presented. By comparison, the human ophthalmologist was the only one to include CSR in the initial differential diagnosis and offered it as the final diagnosis after the presentation of clinical findings. When asked for evidence-based advice on treatment for CSR, all three chatbots provided correct answers, although they did not always mention that acetazolamide and aldosterone antagonists are still off-label and do not form part of the standard treatment.

Case 7 (Idiopathic Intracranial Hypertension (IIH))
In general, the "questions to ask the patient" were common across the human and chatbot responses (e.g., weight gain, medications, pregnancy, nausea-vomiting, changes in vision, diabetes, systemic hypertension). The AI chatbots complemented the human ophthalmologist with some additional questions to ask the patient (e.g., onset, duration, and characteristics of headaches and their association with posture or activities).
Regarding final diagnosis, IIH was offered as the most probable (but not exclusive) diagnosis by ChatGPT and Gemini, whereas Bing acknowledged high intracranial pressure as the cause of the patient's symptoms and clinical signs, but did not provide any specific diagnosis.

Case 8 (Third Nerve Palsy Due to Posterior Communicating Artery (PCA) Aneurysm)
The AI chatbots offered correct advice regarding diagnostic workup (e.g., brain imaging, blood tests), but their investigation plan was not as comprehensive as that of the human ophthalmologist and, most importantly, they did not always recognize the urgency of the situation. In fact, Bing, but not Gemini, emphasized the need for urgent imaging, while ChatGPT did not advise urgent investigation until our second interaction one month later.

Case 9 (Orbital Pseudotumor)
The chatbots provided a reasonable, albeit incomplete, diagnostic workup plan, which nevertheless complemented the one proposed by the human ophthalmologist (e.g., orbital biopsy and thyroid function tests were not included in the ophthalmologist's response but were advised by all three chatbots).

Case 10 (Acute Angle Closure Glaucoma Attack (AACG))
Of the three chatbots, ChatGPT provided the most comprehensive management plan, including medication doses and timing, while the other two offered general advice regarding medical and surgical management. This was particularly true for Gemini, which clearly stated that it could only provide general information about potential treatment approaches. However, even ChatGPT was not as comprehensive in its response one month later. Of note, all three chatbots omitted the use of topical steroids. Upon further prompting, they acknowledged their omission and provided a rectified plan. Furthermore, they were able to engage in a conversation regarding the potential inappropriateness of using topical pilocarpine in the acute phase of AACG with a very high intraocular pressure (IOP) and a paralyzed iris sphincter muscle, providing justified arguments. However, it should also be noted that this may not always be the case, as ChatGPT's response upon testing its short-term consistency at the one-month time point was deemed inaccurate: in our first interaction at one month, it acknowledged that pilocarpine may be ineffective at very high IOP, whereas in our second interaction on the same day, it insisted that pilocarpine may stimulate the paralyzed iris sphincter muscle and that its effectiveness is limited only in cases of complete angle closure or significant corneal haze.

Case 11 (Marginal Keratitis)
Of the chatbots, only ChatGPT was able to correctly diagnose this case, but its performance at one month was inferior.

Consistency of ChatGPT responses
The consistency of ChatGPT responses one month apart is shown in Table 6. The Jaccard similarity coefficient showed, on average, a fair overlap of items included in the initial differential diagnosis and diagnostic workup on these two separate occasions. With regard to the final diagnosis, ChatGPT offered the correct diagnosis (either as a stand-alone diagnosis or as part of a further differential diagnosis) in all cases, both the first time and one month later. However, it occasionally needed further prompting to reach the correct diagnosis. Case 6 (CSR) turned out to be the most challenging case to diagnose, as it needed additional prompting at both time points.

Discussion
In the current study, AI chatbots, especially ChatGPT-3.5 and Bing Copilot, demonstrated remarkable ability as assisting tools in diagnosing and managing challenging ophthalmic cases across various subspecialties. Their individual performance was inferior to that of an ophthalmologist with more than 10 years of post-fellowship clinical experience, but they still provided useful complementary input in the diagnostic thinking process. This is particularly impressive given that they have not been developed as medical diagnostic tools and quite often make this clear to the reader before proceeding with their response to a medical query; Gemini, in particular, refused on some occasions to provide medical insight, invoking its restrictions against serving as a medical tool. For this reason, as well as their occasional so-called "artificial hallucinations" [6], their responses should be examined thoroughly for potential inaccuracies and errors and cross-checked with a specialist in case of doubt.
There is an emerging literature comparing AI chatbots, mainly ChatGPT, with humans with regard to their diagnostic accuracy in ophthalmology. For example, ChatGPT-3.5 had similar or better accuracy than senior ophthalmology residents in diagnosing primary and secondary glaucoma cases retrieved from a public online database [7]. Similarly, ChatGPT-4 outperformed glaucoma specialists and was comparable with retina specialists in the diagnostic and treatment accuracy of glaucoma and retina cases [8]. By contrast, ChatGPT exhibited reasonable diagnostic accuracy, albeit inferior to that of human experts, in cornea [9], uveitis [10,11], and neuro-ophthalmology [12] cases. Furthermore, in another study, the performance of ChatGPT-3.5 in diagnosing hospitalized ophthalmic patients with various, sometimes complex, eye conditions was poorer than that of residents and attending ophthalmologists [13]. In our study, we used the free-of-charge, readily accessible 3.5 version of ChatGPT and compared it against a quite experienced ophthalmologist. Given the superior performance of ChatGPT-4 over the 3.5 version, incorporating this newer version into the diagnostic toolbox could potentially increase the diagnostic accuracy of ophthalmologists even further, but it should be emphasized that any chatbot should be used simply as an assistant, not as the definitive diagnostic tool.
When it comes to comparing different AI chatbots, current literature suggests better performance of ChatGPT than Google Bard/Gemini in triaging and diagnosing simulated ophthalmic patient complaints [14] or in providing an accurate and coherent surgical plan for glaucoma [15] and vitreoretinal [16] cases.Furthermore, ChatGPT-3.5 was found to be more accurate than Bing and Google Bard/Gemini in answering patients' questions about age-related macular degeneration [2].Moreover, in a study evaluating the performance of ChatGPT (versions 3.5 and 4) and Google Bard/Gemini in answering common inquiries regarding ocular symptoms, it was found that ChatGPT-4 outperformed ChatGPT-3.5 and Google Bard; however, all chatbots exhibited only moderate self-awareness capabilities and modest self-improving capabilities over time [17].
With regard to consistency of responses in our study, ChatGPT-3.5 exhibited reasonable short- and medium-term consistency. We think that some variability is actually to be expected, given that LLM chatbots have been developed as conversational AI counterparts and not as medical software tools, from which one would require high repeatability.
Given the rapidly expanding literature on the subject, at the time of writing, and to the best of our knowledge, ours was the first study to comparatively assess three of the most widely used LLM chatbots with regard to their ability to diagnose challenging ophthalmic case scenarios and to provide advice on further investigation and management. Importantly, we tried to imitate real life, in which not all typical clinical symptoms and signs of a condition are always present, and to follow the diagnostic thinking process of human and AI in a stepwise fashion rather than present them with a full case description containing all the relevant clinical and laboratory/imaging data. In addition, we are the first to evaluate the consistency of ChatGPT's responses on the same day and over a period of a few weeks in the above context.
On the other hand, we need to acknowledge our subjectivity in evaluating the chatbots' responses, as well as the fact that we used a limited number of case scenarios, which, despite touching on various ophthalmic subspecialties, may not accurately capture the complexity of other ophthalmic clinical entities. Therefore, we cannot generalize our findings to all eye conditions. Moreover, we used the ChatGPT-3.5 version which, although free to use, is inferior to the more advanced, subscription-based GPT-4 version. However, the Bing AI chatbot benefits from incorporating GPT-4 in its responses; therefore, we may have indirectly involved it as well in our study. Finally, we should keep in mind the time-sensitive nature of our findings, given the constant updates and improvements of LLM chatbots. Future studies could thus indicate even better diagnostic performance of chatbots and provide insight into their consistency over the longer term.

Conclusions
ChatGPT-3.5 and Bing Copilot chatbots proved to be useful assisting tools in diagnosing and managing certain challenging ophthalmic cases.They both outperformed Google Gemini, although they were inferior to a fellowship-trained ophthalmic specialist.ChatGPT provided fairly consistent responses in the short and medium term.Our findings underscore the potential of LLM chatbots to assist in the diagnostic thinking process; however, they cannot and should not substitute the clinician.
Case 8 (acute third nerve palsy): "...year-old man complains of sudden-onset diplopia and headache. On examination left eye is squinting outwards and downwards, there is limitation of adduction. Also left eye lid ptosis. Left eye pupil is larger. Diagnosis and further investigation and management?"

Case 9 (orbital pseudotumor): "38-year-old man complains of sudden-onset painful proptosis of right eye with diplopia. On examination there is right eye proptosis with limitation of eye movement especially abduction. Also pain around the eye. No known thyroid disease. Differential diagnosis?" "Based on clinical examination and previous history I suspect he has orbital pseudotumor/Tolosa Hunt syndrome. What investigation should I do to confirm?"

Case 10 (acute angle closure glaucoma): "...year-old female complains of acute pain around the right eye, blurry vision and nausea for the last few hours. In the last few days occasionally haloes around lights lasting for a few minutes. Diagnosis?" "Ophthalmologist checked IOP=55 mmHg, hazy cornea, fixed mid-dilated pupil, not possible to do gonioscopy due to hazy cornea. Shallow anterior chamber, Van Herrick 1:4. Diagnosis?" "Can you give me a management plan for AACG?"

Case 11 (marginal keratitis): "27-year-old contact lens wearer complains of red right eye and pain. Differential diagnosis?"

Case Diagnosis Description/Questions used as input to chatbots
Case 1 (aqueous misdirection after trabeculectomy): "Man who underwent trabeculectomy 3 days ago calls to say that he has sudden painful loss of vision. Differential diagnosis?" "On examination IOP=50 mmHg, almost flat AC. Fundus looks ok. Diagnosis?" "The bleb is flat and the IOP is definitely high, so no hypotony. Also macula was ok on fundoscopy, so no maculopathy. Any other thoughts?"

Case 7 (idiopathic intracranial hypertension): "...questions would you further ask the patient to clarify diagnosis?" "Headaches started 2 weeks ago, in the morning and worse when bending down, come and go. Nausea yes, but no vomiting. No other neurological symptoms. On diet pills (she is overweight). No previous history of hypertension, today blood pressure 120/80 mmHg. Diagnosis?"

TABLE 1: Case scenarios included in the study (diagnosis and sequence of description/questions used as input to chatbots). Examples of further prompts given to chatbots in case of erroneous response are presented in italics. IOP: intraocular pressure; AC: anterior chamber; PMH: previous medical history; MRI: magnetic resonance imaging; ACE: angiotensin converting enzyme; ESR: erythrocyte sedimentation rate; CRP: C-reactive protein; CBC: complete blood count; VA: visual acuity; CRVO: central retinal vein occlusion; AMD: age-related macular degeneration

2024 Mandalos et al. Cureus 16(6): e62471. DOI 10.7759/cureus.62471

Examples of human and AI responses are presented in Tables 2, 3.

TABLE 4: Descriptive analysis of initial differential diagnosis.
* Gemini refused to give any DDx on three occasions; on another three occasions, we needed to insist on getting a DDx. DDx: differential diagnosis; Dx: diagnosis; N/A: not applicable

Short-term consistency of ChatGPT responses (Table 7), as measured by the Jaccard similarity coefficient, seemed to be slightly better than medium-term consistency. However, concerning the final diagnosis, there were two occasions (case 6 and case 7) where, although it offered the correct diagnosis the first time, upon repeating the process on the same day, it failed to accurately diagnose the condition. For example, in case 7 (IIH), ChatGPT recognized the high intracranial pressure issue but did not offer IIH as the potential cause.

TABLE 7: Short-term consistency of ChatGPT responses.
* Required further prompting; ** correct Dx offered as part of further DDx; *** recognized high intracranial pressure problem, but did not give the definite Dx (idiopathic intracranial hypertension)