Accuracy and Repeatability of ChatGPT Based on a Set of Multiple-Choice Questions on Objective Tests of Hearing

Introduction: ChatGPT has been tested in many disciplines, but only a few studies have involved hearing diagnosis, and none has addressed physiology or audiology more generally. Nor has the consistency of the chatbot's responses to the same question posed multiple times been well investigated. This study aimed to assess the accuracy and repeatability of ChatGPT 3.5 and ChatGPT 4 on test questions concerning objective measures of hearing. Of particular interest was the short-term repeatability of responses, tested here on four separate days spread over one week. Methods: We used 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of testing hearing. The questions were posed five times to both ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version) on each of four days (two days in one week and two days in the following week). The accuracy of the responses was evaluated against a response key. To evaluate the repeatability of the responses over time, percent agreement and Cohen's Kappa were calculated. Results: The overall accuracy of ChatGPT 3.5 was 48-49%, while that of ChatGPT 4 was 65-69%. On average, ChatGPT 3.5 failed to reach the threshold of 50% correct responses. Within a single day, the percent agreement was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4 (Cohen's Kappa 0.67-0.71 and 0.81-0.84, respectively). The percent agreement between responses from different days was 75-79% for ChatGPT 3.5 and 85-88% for ChatGPT 4 (Cohen's Kappa 0.65-0.69 and 0.80-0.85, respectively). Conclusion: ChatGPT 4 outperforms ChatGPT 3.5 in both accuracy and repeatability over time. However, the considerable variability of the responses casts doubt on possible professional applications of either version.


Introduction
Chatbots, conversational systems based on large language models (LLMs), have recently revolutionized both public and professional spheres across various disciplines [1,2]. Efforts to explore their applications are growing, with significant interest in sectors such as healthcare, where chatbots are being tested as patient support tools and as educational aids in scenario-based training [3,4]. Among these chatbots, ChatGPT, developed by OpenAI (San Francisco, CA, USA), is perhaps the most well known, prompting much interest in chatbot technology [5].
Presently, two versions of ChatGPT are accessible to the general public: the freely available ChatGPT 3.5, based on an earlier LLM, and the more advanced, subscription-based ChatGPT 4. Research has shown that ChatGPT 4 outperforms its predecessor, demonstrating superior accuracy in various fields such as dermatology, where it achieved 80-85% accuracy compared to 60-70% for ChatGPT 3.5 [6]. Similar improvements have been observed in orthopedic assessments and in general medical examinations, with ChatGPT 4 consistently outperforming ChatGPT 3.5 [7,8].
Despite broad testing across scientific and medical fields, audiology remains underexplored. A search on PubMed for "chatgpt [Title/Abstract] AND audiology [Title/Abstract]" returned no results, compared with 35 papers found for otolaryngology and even more in fields like dermatology and ophthalmology (as of April 5, 2024). Preliminary studies in audiology suggest that while ChatGPT, alongside other chatbots like Google Bard (now Gemini) and Bing Chat (now Copilot), shows promise, it also exhibits errors and inaccuracies that underscore the need for careful oversight when it is used in specialized fields [9]. This is particularly evident in some audiology subtopics such as tinnitus, where the responses, although quite impressive, occasionally stray from the topic and, crucially, totally lack citations [10]. These latter two studies suggest that ChatGPT has the potential to provide information in more specialized medical fields like audiology and on specific topics like tinnitus, but it still requires improvement before being reliable enough for serious applications.
While the correctness of ChatGPT's responses seems quite well researched, there is little information on test-retest repeatability, with only a few studies on that aspect available so far. For example, in one study exploring how ChatGPT might be used to create norming data, the reliability was about 85% when results from different days were compared [11]. Similarly, another study on ChatGPT's performance in prosthodontics showed a repeatability of 88% [12]. However, it is unlikely that results from dentistry provide a good estimate of reliability in audiology or otorhinolaryngology.
This study aims to fill these gaps by assessing the accuracy and repeatability of ChatGPT versions 3.5 and 4 in answering test questions about the physiology of hearing. Objective measures such as tympanometry [13], middle ear muscle responses [14], otoacoustic emissions [15], and auditory brainstem responses [16] are paramount in audiology, and here they served as the basis for our investigation. By focusing on the short-term repeatability of responses (within a single day, across two days, and over a week), this study sought to determine the feasibility of deploying ChatGPT in a clinical audiology setting, evaluating its potential as a reliable diagnostic or educational aid.

Materials And Methods
The responses of two versions of OpenAI's chatbot ChatGPT to a set of single-choice questions were evaluated. The versions were ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version). We used a set of 30 single-answer, multiple-choice questions developed by one of the authors (KK). The questions were selected from exams given to students undertaking a one-year course on objective hearing testing which has been conducted at the Institute of Physiology and Pathology of Hearing, Poland, over the last 20 years. The questions relate to physiological measurements of hearing status, in particular tympanometry, middle ear muscle responses, otoacoustic emissions, and auditory brainstem responses. They are based on common audiological knowledge described in research papers as well as in books, both in English (e.g. [17]) and in our native Polish (e.g. [18]). We did not use questions from any published book, as copyright would make it difficult to publish the questions and the key. All questions had a choice of four answers, of which one was correct. Table 1 presents two example questions, while all the questions, together with the key, are provided in the Appendices. Although our native language is Polish, we presented the questions to the chatbots in English translation in order to make the study international. All analyses were made in Matlab (version 2023b, MathWorks, Natick, MA, USA). Percent agreement and Cohen's Kappa [19] were used to evaluate agreement, correctness of responses, and test-retest repeatability.

We used both numerical measures since percent agreement is more accessible to the general reader, whereas Cohen's Kappa has the advantage that it also accounts for the possibility of agreement occurring by chance. Values of Cohen's Kappa can be interpreted as follows: <0.0, poor; 0.0-0.2, slight; 0.2-0.4, fair; 0.4-0.6, moderate; 0.6-0.8, substantial; and 0.8-1.0, almost perfect agreement [20]. Repeated measures analysis of variance (rmANOVA) was used to assess the effect of testing at different moments. For pairwise comparisons, a t-test or the nonparametric Mann-Whitney U-test was used. In all analyses, a 95% confidence level (p < 0.05) was taken as the criterion of significance.
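To illustrate the two agreement measures, here is a minimal Python sketch with invented data (the study's actual analyses were done in Matlab); each trial is represented as a list of single-letter answers, one per question:

```python
# Minimal sketch of the two agreement measures used in the study.
# Hypothetical data; the original analyses were done in Matlab.

def percent_agreement(trial_a, trial_b):
    """Percentage of questions answered identically in two trials."""
    matches = sum(a == b for a, b in zip(trial_a, trial_b))
    return 100.0 * matches / len(trial_a)

def cohens_kappa(trial_a, trial_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(trial_a)
    observed = sum(a == b for a, b in zip(trial_a, trial_b)) / n
    # Chance agreement estimated from the marginal frequency of each answer letter
    letters = set(trial_a) | set(trial_b)
    expected = sum((trial_a.count(c) / n) * (trial_b.count(c) / n) for c in letters)
    return (observed - expected) / (1 - expected)

# Two hypothetical trials of ten single-letter answers
t1 = list("abcdabcdab")
t2 = list("abcdabcdba")
print(percent_agreement(t1, t2))        # 80.0
print(round(cohens_kappa(t1, t2), 3))   # 0.73
```

Note how Kappa is lower than raw percent agreement: part of the observed agreement is what two random answer sequences with the same letter frequencies would produce by chance, and Kappa discounts it.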

Results
A snapshot of one sample of ChatGPT 3.5 and ChatGPT 4 responses to the first five questions is shown in Figure 1. First, the percentage of questions receiving a correct answer on all 20 trials was evaluated for both versions of ChatGPT (Figure 2). For ChatGPT 3.5, the percentage of questions that received correct answers on all trials was 30%, while for ChatGPT 4 the corresponding figure was 50%. There was also a fraction of questions that received varied responses, sometimes correct and sometimes incorrect, and there were questions that were always answered incorrectly. The average accuracy of responses (i.e., correctness with respect to the response key) was 48-49% for ChatGPT 3.5 and 65-69% for ChatGPT 4 (Table 2). Looking at the minimum and maximum scores (bottom row of Table 2), there is quite a large spread in the results: version 3.5 scored 40-60% across all trials, while version 4 scored 60-76.7%. rmANOVA of the accuracy results revealed a significant effect of ChatGPT version (F(1,8) = 152.1, p < 0.001), but no effect of day and no interaction between day and ChatGPT version. The criterion for a pass on the exams from which the questions were derived was 65%. ChatGPT 3.5 did not reach this level even once, its highest score being 60%; it did, however, surpass the 50% level in 25% of trials (five trials out of 20). ChatGPT 4, on the other hand, surpassed the 65% criterion in 50% of trials (10 trials out of 20).
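The three-way breakdown underlying Figure 2 (questions always answered correctly, sometimes correctly, or never correctly) can be sketched as follows. This is an illustrative Python reconstruction with toy data, not the authors' Matlab code:

```python
# Illustrative reconstruction (toy data, not the authors' Matlab code) of the
# per-question breakdown shown in Figure 2: always correct, variable, never correct.

def classify_questions(grades):
    """grades: one list of booleans per trial (True = correct answer).
    Returns the percentage of questions that were always correct,
    sometimes correct ("variable"), and never correct across all trials."""
    n_questions = len(grades[0])
    counts = {"always": 0, "variable": 0, "never": 0}
    for q in range(n_questions):
        outcomes = {trial[q] for trial in grades}  # set of outcomes for question q
        if outcomes == {True}:
            counts["always"] += 1
        elif outcomes == {False}:
            counts["never"] += 1
        else:
            counts["variable"] += 1
    return {k: 100.0 * v / n_questions for k, v in counts.items()}

# Toy example: three trials of four questions
grades = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, False],
]
print(classify_questions(grades))  # {'always': 25.0, 'variable': 50.0, 'never': 25.0}
```

In the study itself the input would be a 20-trial by 30-question grid of graded responses for each ChatGPT version.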

TABLE 2: Accuracy of responses of the two versions of ChatGPT at each testing day (weeks 1 and 2).
Next, we analyzed test-retest repeatability. Each single set of responses was compared with the other sets from the same day or from different days, irrespective of their correctness.
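The comparison scheme just described, in which each set of responses is compared with every other set, amounts to averaging percent agreement over all pairs of trials. A minimal sketch with hypothetical data (the original analysis was done in Matlab):

```python
# Sketch of the pairwise comparison scheme: every set of responses is compared
# with every other set, and percent agreement is averaged over all pairs.
# Hypothetical data; the original analysis was done in Matlab.
from itertools import combinations

def mean_pairwise_agreement(trials):
    """Average percent agreement over all pairs of trials (answer-letter lists)."""
    scores = [
        100.0 * sum(a == b for a, b in zip(t1, t2)) / len(t1)
        for t1, t2 in combinations(trials, 2)
    ]
    return sum(scores) / len(scores)

# Three hypothetical trials from one day, five answers each
day1_trials = [list("abcda"), list("abcdb"), list("abcda")]
print(round(mean_pairwise_agreement(day1_trials), 1))  # 86.7
```

Within-day repeatability averages the pairs drawn from a single day's trials; between-day repeatability averages pairs drawn from different days.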
The repeatability of the two versions of ChatGPT within the same day is given in Table 3. The average percent agreement within one day was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4.

Discussion
Neither of the ChatGPT versions we tested performed satisfactorily on questions about auditory physiology. Nonetheless, ChatGPT 4 surpassed ChatGPT 3.5 in both accuracy and consistency over time, demonstrating a 20% improvement in accuracy and a 10% improvement in consistency. Notably, ChatGPT 3.5 failed to gain a pass mark on any of the 20 trials, while ChatGPT 4 achieved a pass mark in 50% of them. Despite these differences, the response variability remained consistent across different time intervals: minutes, days, and a week. We observed that responses were either consistently correct, consistently incorrect, or variable. For ChatGPT 4, 50% of responses were consistently correct, compared to 30% for ChatGPT 3.5. A notable example of an incorrect response is that for question 3, which asked for the frequency of the measuring tone in tympanometry for a child aged three months. While the typical frequency for adults is 226 Hz (and that was known by ChatGPT), it is well established that for children up to six months of age the test frequency should be 1000 Hz [21,22]. This has been known for many years and is basic information in contemporary audiology, so it is hard to know why ChatGPT made this mistake.
Both versions of ChatGPT occasionally deviated from our instruction to provide only a letter corresponding to the correct answer, sometimes elaborating with a full sentence. It also happened that ChatGPT did not respond for several minutes, something we experienced only with version 4. These idiosyncrasies, and the instances of delayed responses with ChatGPT 4 in particular, underscore the chatbot's limitations in strictly adhering to instructions.
When our results are placed alongside studies from other disciplines that have examined ChatGPT's performance on single-answer, multiple-choice questions, we observe a similar trend: a roughly 20% performance improvement of ChatGPT 4 over ChatGPT 3.5 [6,23-25]. Studies in otolaryngology, closely related to audiology, have also shown lower performance scores for ChatGPT 3.5, supporting our finding of improved outcomes with the newer version [26,27].
A separate concern with ChatGPT, and potentially other chatbots, lies in their reliability. Accuracy can vary significantly depending on the topic and the complexity of the questions [28]. For instance, studies have shown that the accuracy of ChatGPT 3.5 can range from as high as 70% [6] to as low as 43% [27]. A critical question then arises: are the responses consistent across different trials at various times? Interestingly, the variability in responses appears to be fairly independent of their correctness. For example, as illustrated in Figure 2, although the percentage of correct responses differs by 20% between versions 3.5 and 4, the percentage of changeable responses is relatively similar: 30% for version 3.5 and 26.7% for version 4. These variations in response are important as they might significantly affect an outcome. For instance, with ChatGPT 4, the chance of passing an exam might be as low as 50% or as high as 77%, depending on when the system was queried. This variability underscores the importance of understanding, and possibly mitigating, inconsistencies so as to enhance the reliability of a particular chatbot application.
Despite the chatbots achieving reasonable accuracy in some areas, their performance is not good enough for professional use or patient support, where nearly perfect accuracy is called for. This study set out to test repeatability using a set of standardized questions, and we observed significant errors and variability in the responses. This is particularly concerning when ChatGPT generates a narrative or answers an open-ended question without providing reliable sources or, in some cases, cites non-existent references [10]. The ability to track sources and verify responses is invaluable, particularly when responses can vary.
Our findings highlight the critical importance of repeatability as well as accuracy. Even if responses are sometimes correct, this is not enough, especially when users lack the means to verify what is provided and so may be misled. Our study casts a broader light on the utility of ChatGPT and similar AI-powered chatbots, underscoring their considerable variability in responses, which is a significant limitation. Such variability could pose serious challenges in a clinical setting, misleading patients who rely on receiving correct answers.
Previous studies have indicated that ChatGPT and similar chatbots could play useful roles in a number of different healthcare settings, including hearing [29]. Potential applications for clinicians include generating reports, creating document templates, aiding clinical decision-making, simplifying complex information for patient communication, and supporting education [30]. However, in order to fully realize these capabilities, improvements in ChatGPT's reliability are necessary.

Conclusions
ChatGPT 4 consistently outperforms ChatGPT 3.5 in terms of accuracy and repeatability over time. However, neither version reached a satisfactory standard in answering questions related to auditory physiology, highlighting a significant limitation on their use in specialized fields. This study further contributes to the discussion on the repeatability of ChatGPT responses, underlining that repeatability is as vital as accuracy.
The high variability of responses observed from one time to another raises concerns about the suitability of ChatGPT for professional applications. It suggests that improvements in the precision and consistency of these AI systems are needed before they can be reliably integrated into professional settings.

Appendices
Questions used to test ChatGPT: Please provide responses to the following questions, giving only the question number and letter of the appropriate answer.
1. The middle ear with advanced otosclerosis shows compliance that is:

TABLE 1: Examples of two of the questions posed to the chatbots.

Question 3. The frequency of the measuring tone for tympanometry in a child aged 3 months should be: (a) 220 Hz (b) 1000 Hz (c) 50 Hz (d) 226 Hz

Question 30. Auditory brainstem responses have the following clinical applications: (a) for newborn hearing screening (b) for hearing threshold testing (c) for hearing threshold testing, newborn hearing screening, and differential diagnosis of hearing disorders (d) only for the diagnosis of retrocochlear disorders

FIGURE 1: Snapshots of one of the ChatGPT 3.5 (A) and ChatGPT 4 (B) trials. Answers to the first five questions. All answers can be found in the Appendices.

FIGURE 2: Percentage of the 30 questions receiving correct responses across all 20 trials for the two versions of ChatGPT. The bars show the percentage of questions that were answered correctly on all trials (dark blue), sometimes correctly and sometimes incorrectly (light blue), and always incorrectly (pink).

2. Disruption of the ossicular chain: (a) causes an increase in compliance (b) causes a decrease in compliance (c) increases pressure in the eardrum cavity (d) does not change the compliance of the middle ear.
3. The frequency of the measuring tone for tympanometry in a child aged 3 months should be: (a) 220 Hz (b) 1000 Hz (c) 50 Hz (d) 226 Hz.
4. In a patient with recruitment and sensorineural hearing loss of 50 dB HL, the threshold of the ipsilateral stapedius reflex is: (a) much lowered (b) slightly lowered (c) the same as normal or slightly elevated (d) no reflex at all.
5. If an ear with normal hearing sensitivity is stimulated and the middle ear muscle reflex is absent contralaterally, then it means that there is: (a) retrocochlear damage (b) central facial nerve palsy on the side of the ear where the probe is located (c) conductive damage in the ear where the probe was placed (d) all of the above.
6. Which of the following factors affects the amplitude of the click evoked otoacoustic emission signal: (a) the condition of the auditory nerve (b) the condition of the outer hair cells (c) the state of the inner hair cells (d) the state of the auditory cortex.
7. In sensorineural hearing loss, the amplitude of otoacoustic emissions is: (a) the same as the norm (b) greater than normal (c) less than normal (d) variable over time.
8. A DP-gram is: (a) a graph of hearing thresholds as a function of frequency (b) a graph of tinnitus amplitude as a function of frequency (c) a plot of the amplitude of distortion product otoacoustic emissions as a function of frequency (d) a plot of the amplitude of primary tones as a function of frequency.
9. Wave V of auditory brainstem responses is generated by: (a) the dorsal cochlear nuclei (b) nuclei of the superior olive complex (c) nuclei of the lateral lemniscus (d) inferior thalamus.
10. Wave III of auditory brainstem responses is generated by: (a) dorsal cochlear nuclei (b) nuclei of the superior olive complex (c) nuclei of the lateral lemniscus (d) inferior thalamus.
11. Wave I of auditory brainstem responses is generated by: (a) dorsal cochlear nuclei (b) nuclei of the superior olive complex (c) nuclei of the lateral lemniscus (d) the auditory nerve.
12. An auditory brainstem response evoked by a tone pip of 500 Hz at 100 dB nHL represents cochlear activity: (a) in the entire cochlea (b) in the basal turn (c) in the apex (d) in the middle.
13. The average error of hearing threshold determination in auditory brainstem response testing is:
14. In which of the following types of audiograms does the auditory brainstem response at 500 Hz have the highest frequency specificity for high intensities?
15. In conductive hearing disorders, a graph of the latency-intensity function is: (a) shifted upward with respect to the norm (b) shifted downward with respect to the norm (c) shifted to the right with respect to the norm (d) has the same course as in the norm.
16. In retrocochlear hearing disorders, a graph of the latency-intensity function has the following course: (a) has the same waveform as in the norm (b) is shifted upward with respect to the norm (c) is shifted to the right with respect to the norm (d) is shifted downward with respect to the norm.
17. For a sensorineural loss of 60 dB in the frequency range of 2000-4000 Hz, a plot of the latency-intensity function for a click stimulus has the following characteristics: (a) it is shifted to the right
(c) otoacoustic emissions are almost always present with a type B tympanogram (d) otoacoustic emissions are almost always present with a type Ad tympanogram.
30. Auditory brainstem responses have the following clinical applications: (a) for newborn hearing screening (b) for hearing threshold testing (c) for hearing threshold testing, newborn hearing screening, and differential diagnosis of hearing disorders (d) only for the diagnosis of retrocochlear disorders.

2024 Kochanek et al. Cureus 16(5): e59857. DOI 10.7759/cureus.59857
Of the two versions, ChatGPT 4 is the more advanced. The knowledge cutoff for ChatGPT 3.5 was early 2021, meaning it was last trained on new data around that time; ChatGPT 4 has a knowledge cutoff of April 2023. In practical terms, the cutoffs represent the most recent data included in the training set before the model was finalized. While version 3.5 is free, version 4 requires a subscription (a monthly cost of 24.60 USD at the time we tested it).

rmANOVA of the within-day percent agreement revealed a significant effect of ChatGPT version (F(1,18) = 57.2, p < 0.001), but no effect of day and no interaction between day and ChatGPT version. Cohen's Kappa had average values of 0.67-0.71 for ChatGPT 3.5 and 0.81-0.84 for ChatGPT 4, representing substantial agreement for version 3.5 and almost perfect agreement for version 4. In all cases Cohen's Kappa had p < 0.0001, indicating that the observed agreement was not accidental.

TABLE 3: Average repeatability of the responses of the two versions of ChatGPT within each testing day. Percent agreement and Cohen's Kappa were calculated between responses (without taking their accuracy into account); means and standard deviations (in brackets) are given. In all cases Cohen's Kappa had p < 0.0001, indicating that the observed agreement was not accidental.
The repeatability of the two versions of ChatGPT between different days is given in Table 4. The percent agreement between different days was 75-79% for ChatGPT 3.5 and 85-88% for ChatGPT 4.
rmANOVA revealed a significant effect of ChatGPT version (F(1,48) = 75.8, p < 0.001), but no effect of day and no interaction between day and ChatGPT version. Average Cohen's Kappa was similar to the within-day values for both versions of ChatGPT. In all cases Cohen's Kappa had p < 0.0001, indicating that the observed agreement was not accidental.