Comparing Vision-Capable Models, GPT-4 and Gemini, With GPT-3.5 on Taiwan’s Pulmonologist Exam

Introduction The latest generation of large language models (LLMs) features multimodal capabilities, allowing them to interpret graphics, images, and videos, which are crucial in medical fields. This study investigates the vision capabilities of the next-generation Generative Pre-trained Transformer 4 (GPT-4) and Google’s Gemini. Methods To establish a comparative baseline, we used GPT-3.5, a model limited to text processing, and evaluated the performance of both GPT-4 and Gemini on questions from the Taiwan Specialist Board Exams in Pulmonary and Critical Care Medicine. Our dataset included 1,100 questions from 2013 to 2023, with 100 questions per year. Of these, 1,059 were in pure text and 41 were text with images, with the majority in a non-English language and only six in pure English. Results For each annual exam consisting of 100 questions from 2013 to 2023, GPT-4 achieved scores of 66, 69, 51, 64, 72, 64, 66, 64, 63, 68, and 67, respectively. Gemini scored 45, 48, 45, 45, 46, 59, 54, 41, 53, 45, and 45, while GPT-3.5 scored 39, 33, 35, 36, 32, 33, 43, 28, 32, 33, and 36. Conclusions These results demonstrate that the newer LLMs with vision capabilities significantly outperform the text-only model. With a passing score set at 60, GPT-4 passed most exams and approached human performance.


Introduction
Artificial intelligence (AI) has been widely applied in healthcare in recent years through deep learning, neural networks, and image processing. Notable applications include medical image diagnosis and models predicting mortality rates for specific diseases [1,2]. The emergence of large language models (LLMs) such as OpenAI's Chat Generative Pre-trained Transformer (ChatGPT; OpenAI, San Francisco, CA, United States), which debuted in 2022, has opened up a new field of applications in healthcare.
Accuracy and minimal error margins are paramount in medical diagnosis, making it important to evaluate LLMs' effectiveness. Some studies have used statistical methods such as receiver operating characteristic curves, precision-recall curves, or confusion matrices for assessment. Others have assessed LLMs using real medical examination texts. ChatGPT, as the first widely available LLM, was tested on exams for medical staff, primarily in English. In 2023, ChatGPT showed significant improvement in natural language processing (NLP), performing at or near the passing threshold on various medical exams without specialized training [3]. It achieved scores equivalent to those of a third-year medical student [4]. In non-English texts, ChatGPT's performance varies [5]. It has shown proficiency in both basic science medical knowledge and applied clinical knowledge [6].
The next generation of language models adds multimodal capabilities, retrieval-augmented generation, and enhanced processing of images, audio, and video. These capabilities are crucial for the medical field, which relies heavily on images and sound. GPT-4 and Gemini, as next-generation LLMs, offer vision features, handling both text and images. In subspecialties such as gynecology, thoracic surgery, radiology, and diagnostic imaging, GPT-4 outperformed GPT-3, but its image processing capabilities remain less explored [7][8][9]. Because Gemini was only released in mid-December 2023, no comparable information is available yet [10].
In non-English settings, ChatGPT scores lower than medical students on simplified Chinese-language medical exams [11]. Chest medicine, gastroenterology, and general medicine scored relatively well on medical exams in Taiwan. A key limitation is the reliance on non-English text, which may impair performance because the model was trained primarily in English [12]. In subspecialties such as family medicine in Taiwan, the results were not satisfactory [13].
Taiwan's pulmonary specialist board exam emphasizes key areas such as infectious diseases (e.g., pneumonia and tuberculosis), cancer, respiratory disorders, intensive care, sleep medicine, and esophageal diseases. Building on these earlier studies, we focused on non-English texts, specifically the chest subspecialty, and used next-generation models with vision features. This approach allowed us to incorporate both textual and graphical data into our research.

Materials And Methods
We sourced pulmonary specialist exam questions and answers from 2013 to 2023 from the Taiwan Society of Pulmonary and Critical Care Medicine (TSPCCM) website [14], categorizing them into text- and image-based sections. Two pulmonologists reviewed and subdivided these 1,100 questions into specific topics: infection (bacterial, fungal, and viral origin); tuberculosis (lung and extrapulmonary origin); esophageal disease; thoracic anatomy; sleep; pharmacology; lung neoplasms (lung cancer and other thoracic origins); critical care medicine; pathophysiology; mechanical ventilation and oxygen therapy; interstitial lung disease; surgery; pulmonary embolism and vascular disease; asthma; chronic obstructive pulmonary disease (COPD) including bronchiectasis; lung function tests; pulmonary vasculitis; autoimmune disease; sarcoidosis and lymphangioleiomyomatosis; pneumothorax and chylothorax; bronchoscopy and imaging examination; musculoskeletal disease; tracheal disease; pleural disease; diaphragmatic disease; and miscellaneous.
We organized the pulmonary exam questions into a text file, storing the images separately. Because these are single-answer multiple-choice questions, the prefix "Please give me only one answer in a single letter form" was prepended to each question before it was fed to the LLM; the two parts were combined into a single prompt for the model.
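The prompt assembly described above can be sketched as follows. This is an illustrative reconstruction, not the study's actual code: the function names and the answer-letter extraction step are assumptions, while the instruction prefix is quoted from the text.

```python
import re

# Instruction prefix quoted from the study's methods.
PREFIX = "Please give me only one answer in a single letter form"

def build_prompt(question_text: str) -> str:
    """Prepend the fixed single-letter instruction to a question."""
    return f"{PREFIX}\n{question_text}"

def extract_answer(reply: str) -> str:
    """Pull the first standalone choice letter (A-E) out of a model reply."""
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else ""
```

A scoring pipeline would then compare `extract_answer(reply)` against the official answer key for each question.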
For evaluation, we employed GPT-3.5, which lacks image processing capabilities, alongside GPT-4 Vision and Gemini, both equipped with image functionality. The text components were analyzed using the APIs provided by OpenAI and Google. In this study, we selected specific models by setting the model name in the API call: for ChatGPT, "gpt-3.5-turbo" and "gpt-4" (the versions as of December 20, 2023); for Gemini, "gemini-pro" for text-only questions and "gemini-pro-vision" for questions with both text and images (the December 25, 2023 version). For the visual elements in GPT-4, we used the web interface to input text and upload images. The Gemini API, on the other hand, accepted both text and image input. The flowchart of the study is presented in Figure 1.

TABLE 1: Scores in text-only and text-and-image questions by year
Across the entire set of exam questions, categorized by number of questions, the analysis is as follows: for lung neoplasms (lung cancer and other thoracic origins), there are 172 questions in total; GPT-4 answered 124 correctly, Gemini 91, and GPT-3.5 62. For infections (bacterial, fungal, and viral origin), there are 120 questions, with GPT-4, Gemini, and GPT-3.5 giving 77, 65, and 35 correct answers, respectively. For critical care medicine, there are 98 questions, with scores of 66, 45, and 34, respectively. For mechanical ventilation and oxygen therapy, there are 91 questions, with scores of 60, 39, and 28, respectively. For tuberculosis (lung and extrapulmonary origin), there are 71 questions, with scores of 38, 32, and 26, respectively. For the esophageal topic, there are 64 questions, with scores of 41, 35, and 23, respectively. For asthma, there are 63 questions, with scores of 40, 27, and 26, respectively. For COPD, including bronchiectasis, there are 63 questions, with scores of 33, 30, and 20, respectively. The details of these results are listed in Table 2.

For questions involving both text and images, the analysis by number of questions is as follows: for mechanical ventilation and oxygen therapy, there are 22 questions in total, with GPT-4 answering 15 correctly and Gemini 9. In the sleep category, there are seven questions, with GPT-4 and Gemini scoring 3 and 2, respectively. For the other areas, both GPT-4 and Gemini provided correct answers, with details available in Table 3.

TABLE 3: Scores in categories with both text-only and text-with-image questions
We used the total number of questions as the denominator and the number of correct answers as the numerator. We excluded six categories that either contained fewer than 10 questions or were classified as miscellaneous: pulmonary vasculitis, pleural disease, tracheal disease, musculoskeletal disease, diaphragmatic disease, and miscellaneous. We then sorted the categories by correct-answer ratio. With 0.6 as the threshold, the ratios for each category for GPT-4, Gemini, and GPT-3.5 are shown in Figure 3. The three models showed different strengths across question categories.
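The per-category accuracy computation above reduces to a simple ratio with a 0.6 passing threshold. The sketch below is illustrative; the example counts are taken from the results reported in the text (GPT-4: 124/172 on lung neoplasms, 77/120 on infection, 3/7 on sleep image questions), and the variable names are assumptions.

```python
THRESHOLD = 0.6  # the 60% passing line used in Figure 3

def accuracy(correct: int, total: int) -> float:
    """Correct-answer ratio for one category."""
    return correct / total

# Example GPT-4 counts quoted from the results section: (correct, total)
counts = {
    "lung neoplasms": (124, 172),
    "infection": (77, 120),
    "sleep": (3, 7),
}
rates = {name: accuracy(c, n) for name, (c, n) in counts.items()}
passing = sorted(name for name, r in rates.items() if r >= THRESHOLD)
```

Categories with fewer than 10 questions would be filtered out before this step, as described above.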

FIGURE 3: Answer rates in different categories
The dashed line represents the 60% passing threshold. For categories with more than 60 questions, the answer rates from highest to lowest in GPT-4 are as follows: lung neoplasm, critical care medicine, mechanical ventilation, infection, esophageal disease, asthma, tuberculosis, and COPD and bronchiectasis. In Gemini, the order is esophageal disease, infection, lung neoplasm, COPD and bronchiectasis, critical care medicine, tuberculosis, mechanical ventilation, and asthma. For GPT-3.5, it is asthma, tuberculosis, lung neoplasm, esophageal disease, critical care medicine,

Discussion
AI capable of understanding human language has been a research focus for decades. Owing to the complexity of human language, significant progress in this field remained elusive until the development of ChatGPT, built on the GPT-3.5 architecture [15]. It was trained on massive amounts of text data from the internet, enabling it to comprehend and respond to human language with remarkable accuracy and efficiency [16].
GPT-4's vision capability represents a significant evolution in language models. Traditionally, such models were constrained to text-based inputs, limiting their application scope. GPT-4 incorporates image processing, enhancing the model's utility and applicability across diverse scenarios that require multimodal understanding [17]. Gemini, an advanced AI model from Google DeepMind, was introduced on December 6, 2023. It is designed for multimodality, processing text, images, videos, audio, and code. It also stands out as the first model to surpass human experts on massive multitask language understanding, a key benchmark for AI knowledge and problem-solving [10].
In this study, we categorized the exam questions by topic and observed that the accuracy rates of the three models varied across subjects. However, for common thoracic conditions such as neoplasms, infections, critical care medicine, and asthma, accuracy rates were above average, likely because of the abundance of data available for these conditions. Notably, despite growing evidence linking sleep to various internal medicine conditions, the accuracy rates for this topic were consistently low across all three models. We speculate that this may stem from the general public's limited awareness of the importance of sleep medicine, leading to insufficient training data on the part of the model developers. When all available LLMs have insufficient knowledge on a particular topic, there is a risk of misleading the public. This phenomenon warrants further investigation.
The examination questions span 11 years, from 2013 to 2023; GPT-3.5, GPT-4, and Gemini showed no clear differences in answer trends across years. For categories with more than 60 questions, the answer rates also vary among the three LLMs, which may reflect differences in their training datasets. Having a professional medical team prepare relevant data for training could be a direction for future medical LLMs.
Interestingly, the next generation of LLMs, in addition to being multimodal with multimedia input and recognition, also incorporates retrieval-augmented generation (RAG). RAG is an NLP framework that combines search retrieval with generative capabilities [18]. Through this architecture, models can retrieve relevant information from external databases and use it to generate responses or complete specific NLP tasks, thereby enhancing the accuracy and reliability of generative AI models [19].
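The retrieve-then-generate flow of RAG can be sketched minimally as follows. This is a toy illustration under stated assumptions, not any vendor's implementation: retrieval here is naive word overlap (production systems use embedding search), the corpus is invented, and the prompt template is an assumption.

```python
def retrieve(query: str, corpus: list[str]) -> str:
    """Return the document sharing the most words with the query (toy scorer)."""
    query_words = set(query.lower().split())
    return max(corpus, key=lambda doc: len(query_words & set(doc.lower().split())))

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    """Attach the retrieved context to the question before generation."""
    context = retrieve(query, corpus)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

# Hypothetical two-document corpus for illustration only.
docs = [
    "Tuberculosis is treated with isoniazid and rifampin.",
    "Asthma is managed with inhaled corticosteroids.",
]
prompt = build_rag_prompt("How is tuberculosis treated?", docs)
```

The resulting `prompt` would then be passed to a generator LLM, which is the step that grounds the model's answer in the retrieved text rather than in its parametric memory alone.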
To determine whether the above answers were generated from the models' original training (their knowledge bases extend only to January 2022), we entered "102-year chest and critical care medicine specialist physician examination questions" in Chinese into the ChatGPT web interface. GPT-3.5 responded that it could not answer questions about the year-102 exam (year 102 in the Taiwan calendar corresponds to 2013 in the Gregorian calendar). GPT-4 was able to search for information through the integrated Microsoft Bing search engine. It successfully found the exam question and answer files (in PDF format) on the TSPCCM website and returned the correct file links, but it subsequently failed to find them in later searches, indicating possible inconsistencies in search results. Gemini exhibited what is known as the hallucination problem, responding with content unrelated to the query [20].
Although GPT-4 and Gemini both possess basic RAG capabilities, they have not yet demonstrated the ability to search medical websites for relevant exam questions and directly analyze the corresponding answers. However, we believe similar capabilities are highly likely to appear in the next generation of language models, making it challenging to assess whether a language model has comprehensive knowledge. Medical models require highly accurate data and responses from professionals. Because of integration with search engines, RAG capabilities might introduce problematic information from the internet, compromising answer accuracy.
Over the last year, language models have evolved remarkably thanks to advances by various companies, and they now feature both text- and image-based analysis. Language models specialized for medicine outperform general models in the medical domain; Google's Med-PaLM has already demonstrated this superiority [21]. The latest version, Med-PaLM 2, shows even more significant progress on the US Medical Licensing Exam, scoring 86.2 versus Med-PaLM's 67.2, nearly reaching expert level [22]. Our study suggests that current LLMs may have made notable progress in highly specialized areas and non-English domains. Not only GPT-4 but also Gemini, upon its release, had already surpassed GPT-3.5; this could be because Gemini is the successor model to Med-PaLM.

TABLE 2: Number of correct answers by category
COPD: chronic obstructive pulmonary disease