Performance of Artificial Intelligence (AI)-Powered Chatbots in the Assessment of Medical Case Reports: Qualitative Insights From Simulated Scenarios

Introduction With the expanding awareness and use of AI-powered chatbots, it seems possible that an increasing number of people could use them to assess and evaluate their medical symptoms. If chatbots that have not previously undergone a thorough medical evaluation for this specific use are employed for this purpose, various risks might arise. The aim of this study is to analyze and compare the performance of popular chatbots in differentiating between severe and less critical medical symptoms described from a patient's perspective and to examine the variations in substantive medical assessment accuracy and empathetic communication style among the chatbots' responses. Materials and methods Our study compared three different AI-supported chatbots: OpenAI's ChatGPT 3.5, Microsoft's Bing Chat, and Inflection's Pi AI. Three exemplary case reports of medical emergencies as well as three cases without an urgent reason for emergency medical admission were constructed and analyzed. Each case report was accompanied by identical questions concerning the most likely suspected diagnosis and the urgency of an immediate medical evaluation. The respective answers of the chatbots were qualitatively compared with each other regarding the medical accuracy of the differential diagnoses mentioned and the conclusions drawn, as well as regarding patient-oriented and empathetic language. Results All examined chatbots were capable of providing medically plausible and probable diagnoses and of classifying situations as acute or less critical. However, their responses varied slightly in the level of their urgency assessment. Clear differences could be seen in the level of detail of the differential diagnoses, the overall length of the answers, and how the chatbots dealt with the challenge of being confronted with medical issues. All given answers were comparable in terms of empathy level and comprehensibility. Conclusion Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical medical emergency indications but should always be provided with a disclaimer. In responding to medical queries, characteristic differences emerge among chatbots in the extent and style of their respective answers. Given the lack of medical supervision of many established chatbots, subsequent studies and experiences are essential to clarify whether more extensive use of these chatbots for medical concerns will have a positive impact on healthcare or rather pose major medical risks.


Introduction
Managing the rapid expansion of medical knowledge poses a central challenge for physicians: while the doubling time of medical knowledge was estimated at 50 years in 1950, this interval has decreased significantly to approximately 73 days in 2020 [1]. At the same time, the widespread adoption of AI-powered chatbots has revolutionized various domains and continues to progress at considerable speed. This phenomenon is exemplified by OpenAI's ChatGPT attaining a user base of one million users within a span of merely five days [2]. AI-powered chatbots, fueled by sophisticated natural language processing and machine learning algorithms, offer various potential applications in the medical field, such as identifying research topics, assisting professionals in clinical and laboratory diagnosis, providing updates to healthcare professionals, and developing virtual assistants for patient health management [3]. Furthermore, AI demonstrates substantial promise within research-focused domains such as rare diseases, encompassing target identification, biomarker discovery, preclinical optimization, patient recruitment, real-world data analysis, and precision medicine approaches across developmental stages [4]. Nevertheless, their integration into medical decision-making processes necessitates rigorous evaluation to ensure patient safety and effective communication. Physicians recognize the potential benefits of chatbots in health care for supporting patients and streamlining tasks. Yet, concerns persist regarding their limitations in understanding human emotions and providing expert medical intelligence [5]. For instance, 94% of queried physicians in a Swiss survey rejected a diagnosis solely provided by intelligent software [6]. Rather than focusing on employing AI technology to supplant fundamental human aspects of healthcare, the current emphasis is on advancing toward a collaborative model that integrates AI chatbots and medical professionals. There is only a limited likelihood of complete substitution due to the intricate nature of healthcare requiring human involvement [7]. While the factual knowledge retrieval and citation accuracy of AI chatbots currently appear to leave room for improvement, the potential of these systems is already evidenced by their capability to pass the United States Medical Licensing Examination [8,9]. Furthermore, a spectrum of diverse chatbots tailored and authorized for health-related inquiries has already been successfully established [10]. However, in line with an expected trend of increasing queries concerning medical matters, particularly on platforms not explicitly designed or authorized for this purpose, several questions remain unanswered: How accurately do these chatbots differentiate between severe and less critical medical symptoms? How effectively do they convey diagnostic information to users? Moreover, what role does empathy play in their interactions with patients? Our study aims to evaluate three popular and emerging AI-supported chatbots in analyzing case reports representing both medical emergencies and non-urgent cases. Shedding light on the capabilities and limitations of AI chatbots, we assess and compare their performance in diagnosing and triaging patient-reported symptoms and explore how they communicate their recommendations, considering both medical accuracy and patient-oriented language.

Materials And Methods
The surveyed AI chatbots in our study were OpenAI's ChatGPT 3.5, Bing Chat (operating with ChatGPT 4.0, using a "balanced" setting), as well as Inflection's Pi AI. All chatbots were accessed with Microsoft's Edge browser between November 15 and 17, 2023, using new chats for each query to avoid memory effects. A total of six fictional medical case studies were compiled for the purpose of this study. These case studies were deliberately constructed to represent realistic emergency diagnoses in the group of typical users of AI chatbots. All six case studies were described to the three AI chatbots in an identical formulation and formatting style, choosing patient-related language to describe the respective symptoms without using medical terms. Three typical cases representing medical emergencies from different medical disciplines were analyzed: acute appendicitis, acute coronary syndrome, and acute suicidal tendencies (Table 1).

Original Prompt

Acute appendicitis: "I'm a 29-year-old man and I've been having really bad stomach pains for a few hours now, mostly on the right side. Feels like my stomach's cramping up, and the pain keeps getting worse. I also feel like I'm gonna throw up even though I could hardly eat anything."

TABLE 1: Case reports representing medical emergencies

In the same manner, three cases without an urgent reason for emergency medical admission were constructed (Table 2).

Original Prompt

Uncomplicated respiratory infection/bronchitis: "I've had a persistent cough since the day before yesterday, with some mucus coming out, and I feel tired and exhausted. I also have a slight sore throat and a bit of a fever."

Unclear skin lesion: "I've noticed a skin change here on my arm. It looks like a red spot that I didn't have a week ago. It sometimes itches and feels a bit uncomfortable, but it doesn't hurt. I'm not sure what it is, but it worries me."

Uncomplicated bruise: "I tripped yesterday and hit the edge of the table with my hip. Meanwhile, I have a bruise in this area and slight pain when I press there. I'm not sure if this is normal or if it could be something worse."

TABLE 2: Case reports without an urgent reason for emergency medical admission

Each of these six case studies was directly followed by an identical question to each of the three examined AI chatbots (the formulation of prompts was intentionally constructed in a manner resembling interactions that a less experienced AI user might engage in with the programs): "Could you assist me in identifying what might be causing my symptoms? Is it urgent for me to go to the emergency room today, or is it sufficient to see my general practitioner tomorrow?" The answers of the chatbots to these questions were then qualitatively evaluated and compared with each other along different guiding questions within the following categories (Table 3). A minimal programmatic sketch of this query protocol is provided at the end of this section.

Differential diagnoses
Has a sufficiently comprehensive list of realistic diagnoses been compiled and prioritized with regard to the specific case?

Further instructions
Have appropriate statements been made as to which further options for action and behavior are recommended (e.g., suggestions for initial therapy options or additional contact points for help)?

Urgency
Has the urgency for emergency medical evaluation been correctly assessed and communicated to the patient?

Medical disclaimer
Has it been explicitly emphasized that a chatbot is not a substitute for a thorough assessment by medical professionals?

Empathy
Has a sensitive and supportive language style been chosen?

TABLE 3: Evaluation metrics for chatbot responses with criterion illustrations
Both authors jointly assessed all cases in a qualitatively comparative manner and evaluated the response quality of the respective chatbot according to the gradations given in Table 4. Due to the characteristic response styles of the examined chatbots, blinding of the studied cases was not implemented.

High
The response adequately demonstrated the desired quality (illustration: distinctive level of empathy and avoidance of incomprehensible medical terminology)

Moderate
The response showed some elements of the desired quality but required […]

TABLE 4: Gradations for the evaluation of response quality
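
To make the query protocol described above concrete, the following is a minimal sketch of how each case description and the standardized follow-up question could be submitted in a fresh session. It is an illustration only: the study itself used the chatbots' browser interfaces, and the OpenAI Python client, the model name, and the helper function shown here are assumptions, not part of the original setup.

```python
# Minimal sketch of the query protocol, assuming the OpenAI Python client
# (openai>=1.0) as a stand-in for the browser interfaces actually used.
# The model name "gpt-3.5-turbo" and the CASE_PROMPTS mapping are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Standardized follow-up question appended to every case description.
FOLLOW_UP = (
    "Could you assist me in identifying what might be causing my symptoms? "
    "Is it urgent for me to go to the emergency room today, or is it "
    "sufficient to see my general practitioner tomorrow?"
)

# Case descriptions in patient-related language (see Tables 1 and 2);
# abbreviated here, with the remaining cases filled in analogously.
CASE_PROMPTS = {
    "acute_appendicitis": (
        "I'm a 29-year-old man and I've been having really bad stomach "
        "pains for a few hours now, mostly on the right side. [...]"
    ),
    # ... the five further case descriptions
}

def query_case(case_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Submit one case plus the follow-up question as a fresh,
    single-message conversation, mirroring the study's use of a new
    chat per query to avoid memory effects."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{case_text} {FOLLOW_UP}"}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for name, prompt in CASE_PROMPTS.items():
        print(f"--- {name} ---")
        print(query_case(prompt))
```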

Results
By employing the described approach and applying the corresponding prompts, the following results were obtained (Table 5).

TABLE 5: Full-text responses of chatbots to six case reports
The original formatting of the responses was adjusted to fit into this table. For the sake of clarity, additional images, icons, and references that were included in the responses have been omitted from this table.
Case 1, focusing on "Acute appendicitis," reveals that all three chatbots received moderate ratings for the identification of differential diagnoses. In terms of further instructions, urgency, and adherence to medical disclaimers, all three chatbots consistently demonstrated high levels across these criteria. However, there were variations in the expression of empathy, with Bing Chat and Pi AI exhibiting high empathy, while ChatGPT 3.5 displayed a moderate level of empathy.
In Case 2, centered around "Acute coronary syndrome," differences emerge in the evaluation of differential diagnoses: Bing Chat was rated moderate, while ChatGPT 3.5 and Pi AI both received low ratings. Similar patterns are observed for further instructions and urgency, with all chatbots scoring high, except for Pi AI in further instructions. Notably, there is a discrepancy in the medical disclaimer criterion, where Pi AI received a low rating compared to the high ratings of ChatGPT 3.5 and Bing Chat. Empathy levels varied, with ChatGPT 3.5 displaying moderate empathy, while Bing Chat and Pi AI exhibited high empathy.
Case 3, addressing "Acute suicidal tendency," highlights moderate ratings for differential diagnoses for ChatGPT 3.5 and Bing Chat, while Pi AI received a low rating. Further instructions, urgency, and empathy were rated differently for all three chatbots, with ChatGPT 3.5 receiving the highest rating three times, Bing Chat receiving a "high" rating twice and a "moderate" rating once, and Pi AI receiving a "high" rating once and a "moderate" rating twice. Furthermore, there were variations in the medical disclaimer criterion, with Pi AI scoring high compared to moderate ratings for ChatGPT 3.5 and Bing Chat.
Case 4, focusing on "Uncomplicated respiratory infection/bronchitis," shows moderate ratings for differential diagnoses for all chatbots except Bing Chat. High ratings are consistently observed in further instructions and urgency, while only Bing Chat scored high on the medical disclaimer. Empathy, however, displayed no variability, with all chatbots receiving moderate ratings.
Case 5, addressing the "Unclear skin lesion," reveals moderate ratings for differential diagnoses for ChatGPT 3.5 and Pi AI. Bing Chat, by contrast, received a high rating in this category. High ratings for all chatbots are observed in further instructions and urgency, while only Bing Chat scored high on the medical disclaimer. Empathy ratings differ, with ChatGPT 3.5 and Pi AI displaying high empathy, while Bing Chat received a moderate rating.
Finally, Case 6, centered around a "Bruise," demonstrates a high rating for differential diagnoses for ChatGPT 3.5, while Bing Chat and Pi AI only received moderate ratings. Regarding further instructions, urgency, and medical disclaimers, Bing Chat scored high three times, followed by ChatGPT 3.5, which achieved a "high" rating twice, and Pi AI, which obtained a "high" rating once. Empathy varied as well, with ChatGPT 3.5 and Bing Chat displaying moderate empathy, while Pi AI received a high rating (Table 6).
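
As an illustration of how these qualitative ratings can be organized for comparison, the following minimal sketch encodes the criteria of Table 3 and the gradations of Table 4 in a simple data structure and tallies grades per chatbot. Only Case 1 is populated, using the values reported above; the structure and naming are illustrative assumptions, not part of the study's methodology.

```python
# Illustrative sketch: tabulating the qualitative ratings per chatbot.
# Criteria follow Table 3 and gradations follow Table 4; only Case 1 is
# populated here, with the values reported in the text. The remaining
# cases would be filled in analogously from Table 6.
from collections import Counter

GRADES = ["high", "moderate", "low"]

# ratings[case][chatbot][criterion] -> grade
ratings = {
    "case_1_acute_appendicitis": {
        "ChatGPT 3.5": {
            "differential_diagnoses": "moderate",
            "further_instructions": "high",
            "urgency": "high",
            "medical_disclaimer": "high",
            "empathy": "moderate",
        },
        "Bing Chat": {
            "differential_diagnoses": "moderate",
            "further_instructions": "high",
            "urgency": "high",
            "medical_disclaimer": "high",
            "empathy": "high",
        },
        "Pi AI": {
            "differential_diagnoses": "moderate",
            "further_instructions": "high",
            "urgency": "high",
            "medical_disclaimer": "high",
            "empathy": "high",
        },
    },
    # ... Cases 2-6 analogously
}

# Count how often each chatbot achieved each grade across all cases
# and criteria.
tallies: dict[str, Counter] = {}
for case in ratings.values():
    for bot, scores in case.items():
        tallies.setdefault(bot, Counter()).update(scores.values())

for bot, counts in tallies.items():
    print(bot, {grade: counts.get(grade, 0) for grade in GRADES})
```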

Quality and accuracy of the provided medical information and assessments
Overall, it can be observed that all three examined chatbots demonstrate a high quality in breaking down the case report into the most relevant diagnoses. All stated conclusions are medically justifiable and presented in factually correct and understandable language for patients. This circumstance is particularly noteworthy considering that none of the three examined chatbots is a medically supervised or specifically developed application for this purpose, such as symptom checkers like Ada or WebMD, for which a higher diagnostic performance compared to ChatGPT has already been demonstrated [11]. All presented responses by the chatbots were devoid of "hallucinations". However, given the brevity of case descriptions and follow-up queries, this might change in more detailed case reports or dialogues with multiple subsequent inquiries. Comparing this result with another study that also examined ChatGPT 3.5 and Bing Chat, among others, regarding their problem-solving accuracy in hematological case reports, it becomes evident that in that study, ChatGPT 3.5 achieved higher accuracy despite comparatively inferior technical specifications. From our perspective, this deviation could be associated with the distinct nature of hematological laboratory data compared to the more qualitative and verbal case descriptions in our study [12]. Furthermore, it was noticeable that all three examined chatbots occasionally provided information regarding potential acute therapy, even when not explicitly requested in the prompt. This "intelligent expansion" of the original inquiry was particularly evident in the responses from Bing Chat. Overall, Bing Chat provided the most comprehensive presentation of medically probable differential diagnoses, a factor that could be attributed to the utilization of ChatGPT 4.0. This might explain the slight discrepancy compared to the responses generated by ChatGPT 3.5. In contrast, Inflection's Pi AI consistently employed a shorter and more concise style of response, which, within the scope of this analysis, occasionally compromised a comprehensive medical differential diagnosis. The range of presented differential diagnoses varied in detail. For instance, Pi AI mentioned only a single, albeit medically accurate, diagnosis in Case Report 2 (acute coronary syndrome) without delving into further potential causes. Conversely, Bing Chat provided a comprehensive differential diagnosis in the dermatological case example (No. 5) and uniquely included COVID-19 in its list of possible diagnoses in Case Report 4. This could be attributed to the enhanced and more current research capabilities of Bing Chat leveraging ChatGPT 4.0. The visual presentation of the respective chatbot response was not separately evaluated as a criterion in this study. Nonetheless, varying approaches were observed among the considered chatbots in this regard. The concise responses from Pi AI were displayed as a single text block, aligning with the conversational opening style of a personal assistant, as described by Pi AI's manufacturer. However, this format compromised readability and clarity. In contrast, ChatGPT 3.5 and Bing Chat generated responses divided into thematic sections, enhancing readability and overall structure. This presentation of responses amplifies the factual and objective impact of the conveyed information. Within this context, an evolution from ChatGPT 3.5 to 4.0 is evident, as Bing Chat (4.0) integrates source references, visual elements, and icons into responses, a feature absent in ChatGPT 3.5.

Accuracy in differentiating between medical emergencies and less critical situations
All three examined chatbots accurately classified the somatic emergency situations (Case Reports No. 1 and No. 2) as well as all less critical scenarios (Case Reports No. 4-6). It should be noted that all case studies represent typical example scenarios with fairly simple risk stratification, since difficult decisions or borderline scenarios were deliberately avoided in the case selection process. Nevertheless, the psychiatric emergency involving acute suicidal risk (Case Report No. 3) was underestimated in its medical significance and urgency by Bing Chat and Pi AI. The risk of acute suicidal tendencies in the patient was clearly articulated in the case report and should have been duly acknowledged. While all chatbots indicated the necessity for further medical assessment, only ChatGPT directly addressed the expressed self-harming tendency. Additionally, in this case example, it was notable that only Pi AI utilized a clear disclaimer stating its inability, as a chatbot, to provide diagnoses or medical recommendations. This circumstance could indicate that, at present, chatbots find it more challenging to assess psychosocial-mental situations compared to somatic scenarios, thus potentially underestimating their urgency or providing poorer evaluations overall. This finding is complemented by another study specifically focused on the analysis of suicide risks by chatbots, revealing that ChatGPT 3.5 occasionally underestimates these risks compared to ChatGPT 4.0, which aligns more closely with assessments made by medical professionals [13].

Handling the situation of being consulted for medical queries
For all the examined case reports, it can be observed that none of the three tested chatbots would have 'presumed' to independently provide a definitive medical diagnosis without referring to the necessity of a physical medical consultation and symptom assessment. This aspect can be considered a minimum quality standard for every case. However, the clarity with which this message was conveyed varied among the chatbots. While ChatGPT 3.5 and Bing Chat in particular consistently emphasized that their statements did not come from qualified medical personnel, this statement was sometimes less explicit in Pi AI's responses and could only be deduced between the lines or inferred from the reference to seeking medical consultation. Our finding that ChatGPT reacts rather 'sensitively' in this context is also consistent with another study examining the diagnostic accuracy of ChatGPT in distinguishing rheumatic diseases from other pathological processes. This study revealed that ChatGPT exhibits notable sensitivity, surpassing even that of human rheumatologists [14].

Patient-centeredness, empathy, and "human touch"
In comparing response styles and their orientation towards the needs of potential patients and medical laypersons, Pi AI exhibited the most empathetic response style among the six examined case scenarios. Pi often explicitly conveyed compassion and understanding, packaging its response in an understandable and everyday language style, which distinguished it from the somewhat more factual and evidence-based responses of ChatGPT 3.5 and Bing Chat. This aspect, too, could be attributed to Pi AI's intended orientation as a "personal assistant." Additionally, it was noticeable that the generally high level of empathy expressed by all chatbots was more pronounced in emergency scenarios (Case Reports No. 1, No. 2, and No. 3) than in less urgent cases, such as respiratory symptoms. This highlights an intriguing parallel to the responsibility of medical professionals, particularly in emergency situations, to exhibit a calming and compassionate demeanor toward patients. Regarding the comprehensibility of medical information, at a formal level, the previously described differences in text length and formatting between Pi AI on the one hand and ChatGPT 3.5 and Bing Chat on the other were apparent. However, comprehensibility is subject to individual preferences regarding a factual bullet-point style versus a verbalized conversational style. On the substantive level, all three chatbots predominantly utilized layman-friendly and understandable language, though certain technical terms like "appendix ruptures" (ChatGPT 3.5) and "GI issues" (Pi AI) might not be immediately comprehensible to readers of varying educational backgrounds. Evidence indicates that laypersons frequently rate chatbot responses to medical queries as more empathetic and of higher quality compared to those provided by human physicians, possibly attributable to the more detailed response structure commonly found in chatbots [15].

Further directions for future research
The pivotal role of human empathy and trust as essential factors for sensitive communication and competent treatment prompts an inquiry into the extent to which these "human" factors can be addressed or impacted by technological advancements. Furthermore, the use of chatbots may raise ethical and legal issues, such as data privacy, biases, or liability for the algorithm's databases and its recommendations. The ethical and legal implications of using AI chatbots for emergency care may not yet be fully understood or regulated by the relevant authorities or stakeholders. Against this backdrop, it is crucial to examine the integration of medically reviewed, authorized, and supervised applications into the healthcare system and its workflows. The future applications of AI-supported chatbots may extend beyond aiding in clinical assessments to alleviating the operational workload of medical staff and contributing to research-related tasks [16].

Limitations of this study
Only three different chatbots were examined in this study, which does not provide a comprehensive overview of the rapidly expanding market of these programs. Moreover, the development of the analyzed chatbots is dynamic and swiftly progressing. For instance, Bing Chat operates on the foundation of ChatGPT 4.0, thus being technically more advanced and utilizing a broader database compared to the publicly accessible version of OpenAI, which employs ChatGPT 3.5. In November 2023, it was announced that OpenAI is currently working on ChatGPT-5 [17]. To avoid exceeding the scope of this investigation, only a total of six medical case studies were selected for analysis. These cases are fictional, constructed scenarios involving patients capable of summarizing their symptoms briefly. As such, this setup does not allow for extrapolation to hyper-acute emergency situations. The analysis of chatbot response quality was based on a single answer statement. There was no extensive dialogue with in-depth interaction between the patient and the chatbot, nor was there an opportunity for further inquiries. Using the latter technique in a follow-up study, for instance, the technical weaknesses of the tools could be analyzed in more depth. As a result of the single-response design, some strengths of the respective applications, such as Pi AI's orientation as a "personal assistant," might not have been adequately highlighted. The assessment and evaluation of response quality along the described evaluation categories are based on the subjective medical perspective of the authors and were conducted in a qualitative and comparative manner. Therefore, drawing a universally applicable conclusion regarding the overall diagnostic precision of AI-assisted chatbots is not possible based solely on this study.

Conclusions
Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical emergency medical indications, differentiating them from less critical cases and delivering relevant information in a comprehensible and empathetic way. Within our analysis, Microsoft's Bing Chat and, to a lesser extent, OpenAI's ChatGPT 3.5 provided the most comprehensive and medically detailed responses to typical case scenarios. In contrast, Inflection's Pi AI tended to adopt a more concise and empathetic dialogue style. An explicit medical disclaimer was provided in the majority of cases, which should be a non-negotiable criterion for any medical-related request.
With increasing usage figures of AI-supported chatbots in general, the quality of their assessment of medical issues becomes increasingly relevant. In this respect, the healthcare system may need to adjust to the reality of more patients being sent to the emergency room "by ChatGPT". Given the lack of medical supervision of many established chatbots, it is not yet clear whether extended use of these chatbots for medical issues will have a beneficial effect on healthcare or rather pose major medical risks. Potential benefits of well-functioning programs could include optimized patient risk stratification, while poor programs could lead to dangerous consequences such as overlooked emergencies and delayed referral of patients to the treatment they require.
