Analysing the Applicability of ChatGPT, Bard, and Bing to Generate Reasoning-Based Multiple-Choice Questions in Medical Physiology

Background: Artificial intelligence (AI) is evolving within the medical education system. ChatGPT, Google Bard, and Microsoft Bing are AI-based models that can solve problems in medical education. However, the applicability of AI to creating reasoning-based multiple-choice questions (MCQs) in medical physiology is yet to be explored.

Objective: We aimed to assess and compare the applicability of ChatGPT, Bard, and Bing in generating reasoning-based MCQs for MBBS (Bachelor of Medicine, Bachelor of Surgery) undergraduate students on the subject of physiology.

Methods: The National Medical Commission (NMC) of India has developed an 11-module physiology curriculum with various competencies. Two physiologists independently chose a competency from each module. The third physiologist prompted all three AIs to generate five MCQs for each chosen competency. The two physiologists who provided the competencies rated the MCQs generated by the AIs on a scale of 0-3 for validity, difficulty, and the reasoning ability required to answer them. We analyzed the average of the two scores using the Kruskal-Wallis test to compare the distribution across the total and module-wise responses, followed by a post-hoc test for pairwise comparisons. We used Cohen's Kappa (Κ) to assess the agreement in scores between the two raters. We expressed the data as medians with interquartile ranges and considered a p-value <0.05 statistically significant.

Results: ChatGPT and Bard each generated 110 MCQs for the chosen competencies. However, Bing provided only 100 MCQs, as it failed to generate them for two competencies. The validity of the MCQs was rated as 3 (3-3) for ChatGPT, 3 (1.5-3) for Bard, and 3 (1.5-3) for Bing, a significant difference among the models (p<0.001). The difficulty of the MCQs was rated as 1 (0-1) for ChatGPT, 1 (1-2) for Bard, and 1 (1-2) for Bing, also a significant difference (p=0.006). The reasoning ability required to answer the MCQs was rated as 1 (1-2) for all three models, with no significant difference (p=0.235). Κ was ≥0.8 for all three parameters across all three AI models.

Conclusion: AI still needs to evolve to generate reasoning-based MCQs in medical physiology. ChatGPT, Bard, and Bing all showed certain limitations: Bing generated the least valid MCQs, while ChatGPT generated the least difficult ones.


Introduction
A precise definition of artificial intelligence (AI) is not available [1]; broadly, AI refers to the design and development of computer systems that emulate human cognitive functioning to solve problems or perform complex tasks. AI is currently the subject of extensive research across various scientific fields and is rapidly advancing in healthcare and biomedical research [2,3]. AI's cognitive abilities include, but are not limited to, problem-solving, reasoning, and decision-making [4,5]. ChatGPT, a noteworthy AI-based conversational tool, is freely available for research; other such tools include Microsoft Bing and Google Bard (the latter currently in an experimental phase).

[Table 1: Module names and the prompts used for MCQ generation]

Three AI models, namely, ChatGPT, Bard, and Bing, simultaneously provided responses on June 2, 2023. The first response from each AI was taken as final, and no regeneration option was used. The MCQs generated by the AI models were copied and pasted into an Excel spreadsheet for subsequent analyses.
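The models were prompted through their public web interfaces, so no code was involved in the study itself. For readers who wish to reproduce the prompting step programmatically, the following is a minimal hypothetical sketch using the OpenAI Python client; the prompt wording and model name are illustrative assumptions, not the study's exact inputs.

```python
# Hypothetical sketch of scripting the MCQ-generation prompt via the OpenAI
# Python client. The study used the public web interfaces of ChatGPT, Bard,
# and Bing; the prompt wording and model name here are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

competency = "Physiology of blood grouping"  # one chosen competency (example)
prompt = (
    "Generate five multiple-choice questions, each with one stem and four "
    f"options, for MBBS physiology students on the competency: {competency}."
)

# Take the first response as final, mirroring the study's no-regeneration rule.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```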
The two physiologists who provided the competencies were tasked with rating the three sets of 110 MCQs each on three parameters: validity, difficulty, and reasoning ability. As shown in Table 2, the rating scale ranged from 0 to 3, where 0 indicated the lowest and 3 the highest validity, difficulty, or reasoning ability.

[Table 2: Rating scale (0-3) for the validity, difficulty, and reasoning ability of each MCQ]

The physiologists were blinded, which meant they were unaware of which AI model generated each set of MCQs. After a week of analysis, the physiologists returned the results to the third physiologist, who compiled the data and applied statistical methods for further analysis. The ratings provided by the two physiologists were averaged to obtain the final results for validity, difficulty, and reasoning ability. Additionally, Turnitin software was used to check the originality of the content for all sets of MCQs. Figure 1 briefly outlines the method used in the study.

Statistical analysis
We initially entered the data using Microsoft Excel 365 and then performed the statistical analysis in IBM SPSS Statistics for Windows Version 27.0. Since the data were ordinal, we employed non-parametric tests and presented the data as medians with interquartile ranges (Q1-Q3). To compare the distribution across the total and module-wise responses, we utilized the independent-samples Kruskal-Wallis test, followed by a post-hoc test for pairwise comparisons. We assessed the agreement in scores between the two raters using Cohen's Kappa (Κ). We considered a p-value <0.05 statistically significant.
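As an illustration of this workflow, the sketch below reproduces the same tests in Python on hypothetical rating data. The study used SPSS; the array contents, the use of SciPy/scikit-learn, and the choice of Mann-Whitney U with Bonferroni correction as the pairwise post-hoc are our assumptions, not the authors' exact procedure.

```python
# A minimal sketch (not the authors' SPSS analysis) of the statistical
# workflow described above, run on hypothetical 0-3 ratings.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Averaged validity ratings per MCQ for each AI model (hypothetical data).
chatgpt = rng.integers(0, 4, 110).astype(float)
bard = rng.integers(0, 4, 110).astype(float)
bing = rng.integers(0, 4, 110).astype(float)

# Median with interquartile range (Q1-Q3), as the data are ordinal.
for name, scores in [("ChatGPT", chatgpt), ("Bard", bard), ("Bing", bing)]:
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{name}: {med} ({q1}-{q3})")

# Independent-samples Kruskal-Wallis test across the three models.
h, p = stats.kruskal(chatgpt, bard, bing)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")

# Pairwise post-hoc comparisons; Mann-Whitney U with Bonferroni correction
# stands in here for SPSS's built-in post-hoc procedure.
pairs = [("ChatGPT", chatgpt, "Bard", bard),
         ("ChatGPT", chatgpt, "Bing", bing),
         ("Bard", bard, "Bing", bing)]
for name_a, a, name_b, b in pairs:
    _, p_raw = stats.mannwhitneyu(a, b)
    print(f"{name_a} vs {name_b}: adjusted p = "
          f"{min(p_raw * len(pairs), 1.0):.4f}")

# Cohen's Kappa for agreement between the two raters (hypothetical ratings).
rater1 = rng.integers(0, 4, 110)
rater2 = rater1.copy()
rater2[:5] = (rater2[:5] + 1) % 4  # a few disagreements for illustration
print(f"Cohen's Kappa = {cohen_kappa_score(rater1, rater2):.2f}")
```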

Results
ChatGPT and Bard each provided 110 MCQs for the 22 competencies chosen from the 11 modules of the NMC CBME curriculum for physiology. However, Bing provided only 100 MCQs, as it failed to generate them for two competencies: the physiology of blood grouping and the ionic and molecular basis of skeletal muscle contraction. For these two competencies, Bing was given a rating of '0' for all three parameters: validity, difficulty, and reasoning ability. Overall, Bing generated the least valid MCQs, and ChatGPT generated the least difficult MCQs, as shown in Table 3.

[Table 3: Validity, difficulty, and reasoning-ability scores of the MCQs generated by the three AI models (N = 110)]

Although the timings were not recorded, ChatGPT seemed to be the slowest of the three AIs at generating MCQs. Bard generated 47 MCQs whose stems included the phrase 'Which of the following is the most important,' and 54 MCQs that had 'all of the above' as an option. Bard was the only AI that provided answers to the MCQs with an explanation. In all the MCQs generated by Bing, the stem consistently included the phrase 'Which of the following is not,' a negative construction.
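The frequency counts above came from manual review by the raters. As a rough illustration only, a pattern check like the following hypothetical sketch could flag the same stem and option constructions automatically; the mcqs data structure and its contents are assumptions for illustration, not the study's dataset.

```python
# Hypothetical sketch: flagging negatively worded stems, 'most important'
# phrasing, and 'all of the above' options in a list of generated MCQs.
mcqs = [
    {"stem": "Which of the following is not a mechanism of heat production?",
     "options": ["Metabolism", "Shivering", "Sweating",
                 "Hormonal regulation"]},
    {"stem": "Which of the following is the most important factor that "
             "regulates body temperature?",
     "options": ["The hypothalamus", "The thyroid gland",
                 "The adrenal glands", "All of the above"]},
]

negative_stems = sum("is not" in m["stem"].lower() for m in mcqs)
most_important = sum("most important" in m["stem"].lower() for m in mcqs)
all_of_the_above = sum(
    any(opt.lower() == "all of the above" for opt in m["options"])
    for m in mcqs
)
print(negative_stems, most_important, all_of_the_above)  # -> 1 1 1
```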

Discussion
The current study assessed the applicability of ChatGPT, Bard, and Bing in generating valid, difficult, and reasoning-based MCQs in medical physiology. The results show that ChatGPT generated the most valid MCQs but the least difficult ones. None of the three AIs could generate a considerable number of MCQs that required a high level of subject understanding (reasoning ability).
During the analysis of the MCQs, we identified a few shortcomings. The MCQs created by Bing and Bard did not fulfill the criteria of an ideal MCQ: a good MCQ should contain neither negative words in the stem nor 'all of the above' among the options [14,15]. Moreover, a stem asking for the 'most important' option often emphasizes factual knowledge rather than comprehension. We also observed that ChatGPT generated MCQs with the lowest text similarity index, while Bing had the highest.
Among the three AIs, we observed significant differences in a few scores across the competency modules, which could be attributed to the limited training of the AI systems. The choice of words in the prompts could be another reason for the differing levels of AI performance. There is scope for further improvement in AI models to make them more suitable for educational use.
Recent studies conducted by medical professionals in India have shown that ChatGPT is a reliable tool for solving problems that require higher-level thinking, interpretation, analysis, evaluation, or the formulation of opinions and predictions based on evidence in pathology, biochemistry, and microbiology [8-10]. In addition, ChatGPT was proficient in answering straightforward queries seeking factual information in microbiology [8]. Another study evaluated ChatGPT's ability to answer a 100-mark NMC CBME-based medical physiology question paper that included two essays (15 marks each), 10 short notes (5 marks each), and 20 MCQs. Overall, ChatGPT performed exceptionally well in the physiology university examination, achieving a distinction by obtaining more than 75% of the marks [16].
Other previous studies investigating the potential of ChatGPT for medical education have reported that it not only provides accurate responses to medical inquiries but also performs comparably to a third-year medical student in the United States [11]. ChatGPT not only passed the USMLE independently, without any human assistance, but also showcased comprehensible reasoning and provided valid clinical insights in its responses [11,12]. ChatGPT performed well at the German state licensing exam level in the Progress Test Medicine, correctly answering two-thirds of the MCQs and outperforming most medical students in their first to third years of study [17]. According to an assessment by 33 physicians spanning 17 specialties, ChatGPT mostly generated accurate and complete information in response to 284 varied medical questions [18].
However, in the current study, when tasked with creating reasoning-based MCQs that require a deep understanding of physiology, the capabilities of AI fell short of human intelligence. Consistent with our results, a recent study indicated that ChatGPT's performance was lower than that of Korean medical students on a parasitology examination [19].
The integration of technology into medical education presents exciting opportunities for innovation. AI can offer answers and explanations related to medical topics in an easily accessible and comprehensible manner. Refraining from considering the use of AI in medical education is no longer a practical choice. The focus has shifted from whether students and faculty will use AI to when and how they will use it [20].
We suggest that medical colleges should take advantage of AI and curate carefully designed, developed, and validated AI systems to extract accurate and trustworthy information. The utilization of automatically generated reasoning-based MCQs by AI during live lectures has the potential to revolutionize medical education. This approach would improve the interaction between teachers and students as both would encounter the questions simultaneously, which could transform the lecture into a more dynamic and engaging learning experience.
However, despite AI's current advantages and future potential, ensuring the accuracy and reliability of the information provided by these systems is still a primary challenge. At present, Bard and ChatGPT outperform Bing in creating valid MCQs, but the same cannot be said for the future as AIs are evolving rapidly. A recent study has indicated that ChatGPT was superior to Bard in answering higher-order questions for neurosurgery oral board preparation [21]. Compared to Bard and Bing, ChatGPT produced more accurate and consistent responses to non-expert queries about lung cancer prevention, screening, and terminology [22].

Limitations
This study had several limitations. We focused solely on evaluating the applicability of AI systems to generate MCQs related to medical physiology; therefore, the findings and conclusions drawn from this study might not be applicable or generalizable to other subjects or domains. Another limitation is the reliance on a single user for conversational interactions with the AI models. Conversations with different users may yield varying responses, particularly at different time points. Additionally, we acknowledge that paraphrasing questions can introduce variations in the responses provided by the models, which could impact the overall evaluation of their performance. The scoring of the AI-generated responses in our study was subjective, as it relied on human evaluators. Despite our efforts to mitigate bias by averaging the scores of the two evaluators, the presence of evaluation bias cannot be completely ruled out. Finally, we could not involve MBBS students in the item analysis of the MCQs [23].

Conclusions
Our study provides valuable insights into the applicability of AI systems to generate reasoning-based MCQs in the medical physiology domain. The results indicate that AI's ability to create reasoning-based MCQs is not yet fully developed. All three AI models, namely, ChatGPT, Bard, and Bing, showed certain limitations, and there is still room for improvement. ChatGPT and Bard were marginally better than Bing at MCQ creation. However, the current results should be interpreted with caution because of the limitations of the research design. Future studies should address these limitations and expand the scope of the investigation to yield more robust and generalizable findings.

General Physiology
Which of the following is not a type of passive transport?
Which of the following transport mechanisms requires the use of energy?
Which of the following statements accurately describes the effect of the parasympathetic nervous system on the heart rate? a) Parasympathetic stimulation increases heart rate. b) Parasympathetic stimulation decreases heart rate. c) Parasympathetic stimulation has no effect on heart rate. d) Parasympathetic stimulation causes irregular heart rate.
Which of the following is not a function of the sympathetic nervous system? a) Increased heart rate b) Bronchoconstriction c) Pupil dilation d) Increased blood pressure
Which of the following divisions of the autonomic nervous system is responsible for the "fight-or-flight" response? A) Sympathetic nervous system B) Parasympathetic nervous system C) Enteric nervous system
Which of the following is the most important factor that regulates body temperature? A) The hypothalamus B) The thyroid gland C) The adrenal glands D) All of the above
Which of the following physiological mechanisms is primarily responsible for maintaining core body temperature during exposure to cold environments? a) Shivering thermogenesis b) Vasodilation c) Sweating d) Piloerection
Which of the following is not a mechanism of heat production? a) Metabolism b) Shivering c) Sweating d) Hormonal regulation
Which of the following mechanisms is responsible for heat loss from the body? A) Radiation B) Conduction C) Convection D) Evaporation

Additional Information
Disclosures
Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.

Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work. Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.