A Blinded Comparison of Three Generative Artificial Intelligence Chatbots for Orthopaedic Surgery Therapeutic Questions

Objective: To compare the quality of responses from three chatbots (ChatGPT, Bing Chat, and AskOE) across a range of orthopaedic surgery therapeutic treatment questions.
Design: We identified a series of treatment-related questions spanning several orthopaedic surgery subspecialties. Each question was entered verbatim into each of the three chatbots (ChatGPT, Bing Chat, and AskOE), and the responses were reviewed using a standardized rubric.
Participants: Orthopaedic surgery experts affiliated with McMaster University and the University of Toronto reviewed all responses in a blinded fashion.
Outcomes: The primary outcomes were scores on a five-item assessment tool covering clinical correctness, clinical completeness, safety, usefulness, and references. The secondary outcome was the reviewers' preferred response for each question. We performed a mixed-effects logistic regression to identify factors associated with selecting a preferred chatbot.
Results: Across all questions and answers, reviewers preferred AskOE to a significantly greater extent than both ChatGPT (P<0.001) and Bing Chat (P<0.001). AskOE also received significantly higher total evaluation scores than both ChatGPT (P<0.001) and Bing Chat (P<0.001). Further regression analysis showed that clinical correctness, clinical completeness, usefulness, and references were significantly associated with a preference for AskOE. Across all responses, four were considered to contain major errors: three from ChatGPT and one from AskOE.
Conclusions: Reviewers significantly preferred AskOE over ChatGPT and Bing Chat across a variety of measures for orthopaedic therapy questions. This technology has important implications in healthcare settings, as it provides access to trustworthy answers in orthopaedic surgery.


Introduction
Artificial intelligence (AI) and machine learning are transforming scientific research and healthcare. Specifically, generative AI is a form of AI that creates new content based on patterns and information learned from input training data [1]. Current generative large language models (LLMs) have been trained on massive corpora of text such as Common Crawl (a dataset of over 250 billion webpages) and thus have both a general knowledge of the world and the capacity to recapitulate human language [2]. Chat Generative Pre-trained Transformer (ChatGPT) is one of the most popular generative AI chatbots and was the fastest-growing consumer application in history, reaching over 100 million active users in January 2023, just two months after its launch [3,4]. Since ChatGPT, several other LLMs have become publicly available, including its successor GPT-4, Anthropic's Claude, and Google's Bard.
Generative AI has garnered significant interest for its capacity to respond automatically to medical questions, standardized medical tests, and medical licensing exams. In orthopaedics, the use of AI has seen a 10-fold increase since 2010, according to a systematic review published in 2018 [5]. Chatbots also have many other potential applications in orthopaedics, including supporting education, suggesting medical treatment, and performing case analyses during surgery [6]. To meet these goals, however, chatbots must be shown to provide correct responses. In this respect, newer chatbot services have attempted to improve their reliability and transparency by combining existing LLMs with traditional search engines. For example, Bing Chat leverages GPT-4 to integrate information from Bing's top search results into a referenced answer. Similarly, AskOE is a chatbot connected to a database of published orthopaedic randomized controlled trials. In doing so, it promises to provide trustworthy, referenced answers based solely on current clinical research.
It remains unclear which of the more commonly used chatbots aligns best with the needs of physicians in the field of orthopaedics. The current study compares the quality and comprehensiveness of responses from three chatbots (ChatGPT, Bing, and AskOE) across a range of orthopaedic surgery therapeutic treatment questions.

Materials And Methods
For this cross-sectional study, we performed a blinded comparison of three generative AI chatbots on a series of therapy questions. Reviewers assessed the quality, comprehensiveness, correctness, and usefulness of the responses using a previously published standardized rubric [7].

Description of chatbots
Introduced in 2022, ChatGPT (GPT-3.5) is a proprietary autoregressive transformer-based LLM developed by OpenAI, based in San Francisco, California, USA. While the exact details of its construction are unknown, it is likely a larger extension of the InstructGPT and GPT-3 frameworks, with additional fine-tuning using reinforcement learning from human feedback [2,4,8,9]. The free version of ChatGPT (i.e., GPT-3.5-Turbo) was used for answer generation.
Bing Chat is a proprietary chatbot service developed and freely deployed by Microsoft, located in Redmond, Washington, USA. Bing Chat is based on a version of GPT-4, the successor to ChatGPT [10]. Unlike ChatGPT, it is indexed to the Bing search engine, allowing it to inform its answers with web searches [11]. As such, Bing Chat can natively cite the web pages used to derive its answers. The balanced version of Bing Chat was used for answer generation.
AskOE is a proprietary chatbot service based on GPT-3.5-Turbo that is indexed to the OrthoEvidence database. The OrthoEvidence database is a proprietary collection of human-extracted and validated data from over 10,000 published randomized controlled trials in the field of orthopaedics. In addition to extracting data, OrthoEvidence summarizes the key findings of published works into clinical summaries. As such, AskOE informs and references its answers using high-quality, human-annotated summaries of published randomized controlled trials.

Identification of therapy articles
A selection of 25 questions related to orthopaedic surgery therapies was identified from reviews of recent randomized controlled trials and meta-analyses published between 1997 and 2023. A random sample of questions was selected across the following themes to ensure generalizability across subspecialty fields in orthopaedic surgery: upper extremity, foot and ankle, trauma, sports medicine, hip and knee arthroplasty, medical management, spine, and osteoarthritis. For example, questions were framed as follows: "Are multi or single injections of platelet-rich plasma for knee osteoarthritis more effective?" The full list of questions is shown in Table 4 in Appendices.

Querying the chatbots
The questions were entered exactly as written into fresh sessions for each of the three chatbots, and the first response from each chatbot was saved and documented. For each question, the order of the three chatbots was randomized and blinded. The order in which questions were presented was also randomized. All responses were stripped of any identifying information, and their formatting (font, size, etc.) and citations (if applicable) were standardized to remove any bias. Chatbot responses were labelled as "Response A," "Response B," and "Response C" when presented to reviewers in an online survey platform, Google Forms. The reviewers were aware that each response was generated by a different generative AI chatbot and were shown the original question provided to each chatbot. However, reviewers were unaware of the identities of the three chatbots being tested.

Reviewers
We identified six reviewers who met the following eligibility criteria: (1) domain expertise in orthopaedic surgery (at least 10 years); (2) a formal degree (MSc or PhD) in the critical appraisal of evidence; and (3) lack of familiarity with the chatbots, based on screening questions about their prior use of generative AI chatbots and their preferences.

Outcome assessment
The primary outcome of this study was the score on a four-item assessment tool, with each item rated from 0 (poor) to 100 (best) [7]. The reviewers provided a score for each of the following four variables: clinical correctness, clinical completeness, safety, and usefulness. We added a fifth item, references, as a separate measure of evaluation based on initial outcome assessment feedback from our six expert reviewers. The definitions of each variable are highlighted in Table 5 in Appendices. Reviewers were also provided with definitions of all variables.
As a secondary outcome, reviewers were also asked to choose their overall "preferred" response for each question.

Statistical analysis
All statistical comparisons were conducted using R (v4.2.2; The R Foundation for Statistical Computing, Vienna, Austria) and were considered statistically significant at P<0.05. The comparison of mean scores for each variable in Table 5 in Appendices was conducted using analysis of variance (ANOVA). Mean and standard error values were reported for each chatbot's scores on the assessed variables, along with the corresponding P-value from the ANOVA. Post-hoc Tukey-Kramer tests were conducted for any statistically significant ANOVA result to determine which chatbots had significantly different scores for each variable.
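The analysis above was performed in R; as an illustrative sketch of the same omnibus-then-post-hoc workflow, the following Python example runs a one-way ANOVA followed by a Tukey test on synthetic 0-100 scores (the group means and sample sizes are invented for illustration and do not reflect the study's data):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic 0-100 scores for one rubric item, 50 evaluations per chatbot.
# The means and spreads are illustrative assumptions, not the study's data.
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(70, 10, 50),  # "ChatGPT"
    rng.normal(72, 10, 50),  # "Bing Chat"
    rng.normal(85, 10, 50),  # "AskOE"
])
chatbot = ["ChatGPT"] * 50 + ["Bing"] * 50 + ["AskOE"] * 50

# Omnibus one-way ANOVA across the three chatbots
f_stat, p_value = stats.f_oneway(scores[:50], scores[50:100], scores[100:])

# Post-hoc pairwise Tukey comparisons, only if the ANOVA is significant
if p_value < 0.05:
    print(pairwise_tukeyhsd(scores, chatbot))
```

Running the post-hoc test only after a significant omnibus result mirrors the gatekeeping approach described above and limits the number of pairwise comparisons performed.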
A mixed-effects logistic regression was conducted to determine the variables most associated with selecting a preferred chatbot. Selection of the most preferred chatbot (AskOE) was assessed as the dependent variable, categorized as "chosen as preferred chatbot" vs "not chosen as preferred chatbot." Each variable in Table 5 in Appendices was assessed as a fixed-effects independent variable. The reviewer and question were included as random-effects variables within the model. Results were reported as odds ratios (ORs) with corresponding 95% confidence intervals and P-values. The marginal R² was reported for the mixed-effects model, indicating the variance explained by the fixed-effects variables. Mixed-effects modelling was conducted using the lme4 package in R.

Results
Overall, 150 separate evaluations were made across six blinded expert reviewers for a broad range of therapy questions in orthopaedic surgery. Agreement between reviewers across the five separate items was good, with an intraclass correlation coefficient (ICC) of 0.71, ranging from 0.49 to 0.86 across items.

Reviewers' preferred chatbot
AskOE was chosen as the preferred response to a significantly greater extent than either ChatGPT (93 vs 26 votes, 62% vs 17%, P<0.001) or Bing (93 vs 31 votes, 62% vs 21%, P<0.001; Figure 2). We did not identify any difference in endorsement between ChatGPT and Bing (Figure 2). Regression analysis showed that clinical correctness (OR: 1.23, 95% CI, P<0.001), clinical completeness (OR: 1.41, 95% CI, P<0.001), usefulness (OR: 1.36, 95% CI, P<0.001), and references (OR: 1.20, 95% CI, P=0.003) were all significantly associated with preference for AskOE over ChatGPT and Bing (Table 3). We identified four instances in which chatbot responses were considered to contain major errors (see Table 7 in Appendices). Three occurred with ChatGPT and involved an answer clearly focused on the wrong topic, or no answer at all. One error occurred with AskOE, in which five of the 11 references given were irrelevant to the question.

TABLE 3: Variables Associated With Choosing AskOE as the Preferred AI Chatbot
Mixed-effects regression of predictors of AskOE being selected as the preferred response, across 150 observations. The marginal R² was 0.797. Predictors were considered significant at P<0.05.

Discussion
In this study, reviewers were asked to blindly score responses to a variety of orthopaedic questions from three different chatbots (ChatGPT, Bing, and AskOE). AskOE was preferred three-fold more frequently than either ChatGPT or Bing Chat.
In their native form, LLMs attempt to provide reasonable-sounding answers based on the information they were trained on. As such, the content of a response, while sounding reasonable, may be incorrect or misleading [1]. Similarly, in generating responses, chatbots often cannot reliably reference the sources from which they derived their response, and some information presented as factual can come from less trustworthy sources such as online blogs [1]. Indeed, when asked for references, only 7% of references in ChatGPT-generated articles were authentic, with the rest being factually incorrect [12]. This poses dangers in healthcare, as inaccurate information can negatively affect patient outcomes. Thus, rather than relying on LLMs to "remember" information from their training data, an alternative approach is to have LLMs search for the correct answer in an external repository of information [13]. This approach, used by both Bing Chat and AskOE, advantageously allows for the curation of data. As such, AskOE's improved performance in the quality and comprehensiveness of responses (including references) may be the result of the database of randomized trials from which it synthesized its responses. Given that validity and trustworthiness are paramount in medical practice, we saw that drawing information from focused datasets of high-quality data performed better than a broader search-engine approach (i.e., Bing Chat).
While, to our knowledge, the specific question of whether chatbots can be used to support orthopaedic surgeons has not been previously explored, previous work has examined the utility of chatbots for supporting patients. Kuroiwa et al. assessed the potential for ChatGPT to diagnose common orthopaedic conditions [14]. They found that the accuracy and reproducibility of responses were inconsistent, and few answers included strong recommendations to seek medical attention. Although a direct comparison cannot be made, this generally aligns with our results, as ChatGPT had the lowest performance in our testing. This supports the idea that ChatGPT is not as reliable a source of orthopaedic information. Similarly, Dwyer et al. evaluated the use of a novel AI chatbot for hip arthroplasty patients following surgery [15]. The chatbot handled 79% of questions appropriately, either by addressing the question itself or by directing the question to a healthcare professional; independently, it was able to address the question 31% of the time [15].
Our study has several strengths. First, all responses were blinded, and any identifying information in the responses was scrubbed to ensure there was no bias towards a particular chatbot. Second, to mitigate any order effects, the order of the chatbot responses was randomized. Third, we aimed to include diverse perspectives by having six expert reviewers evaluate the responses; these experts brought varied experiences and insights to the evaluation process, contributing to a more comprehensive understanding of the strengths and weaknesses of each chatbot. Finally, the use of a five-item rubric allowed us to systematically assess the chatbot responses, revealing the factors that significantly influence reviewers' preferences.
Nevertheless, there are a few limitations to this study. First, we did not investigate how an AI chatbot could support physicians in their work. Future studies could investigate surgeons' perceptions of AI to more comprehensively understand the impact of these tools in real-world medical settings. Second, while we tried to sample a representative cross-section of the orthopaedics literature, the relatively small number of questions may have limited the range of complexity of the topics. Nevertheless, our range of questions was broad enough to identify important differences in perceptions across chatbots. Finally, while AskOE did produce one example that we classified as an error, it did not answer the question incorrectly; rather, it summarized extraneous information that was not related to the question. We speculate this resulted from the chatbot considering articles that were not directly relevant to the answer. Overall, we believe this carries limited danger of misinforming the user.

Conclusions
In conclusion, we showed that AskOE performed significantly better than Bing Chat and ChatGPT in providing clinically relevant responses to practicing orthopaedic surgeons. Specifically, AskOE received higher total scores and was preferred by reviewers to a significantly greater extent than both Bing Chat and ChatGPT. Further analysis showed that clinical correctness, clinical completeness, usefulness, and references were significantly associated with a preference for AskOE.
Rapid access to trustworthy answers in orthopaedic surgery has important implications at the bedside, in the operating room, and in the follow-up of patients following surgery.Ensuring high-quality data sources keep up with the pace of novel innovations in generative AI will remain an important facet of the usefulness of surgical chatbots.

AskOE
The use of dexamethasone as an adjuvant in fascia iliaca compartment block (FICB) for surgeries and hip fracture has been studied in several studies. In [Ref1], a study on patients undergoing shoulder arthroscopy, it was found that the addition of dexamethasone to low concentration local anesthetics in supraclavicular brachial plexus nerve block significantly increased analgesia duration and motor blockade [Ref1]. However, no significant differences were observed between low and high doses of dexamethasone when added to bupivacaine. In [Ref2], patients scheduled for total knee arthroplasty received dexamethasone in femoral nerve block. It was found that dexamethasone significantly increased the duration of the sensory block by 7 hours compared to saline [Ref2]. However, there was no significant reduction in morphine consumption over the first 24 hours. In [Ref3], patients with femoral neck, intertrochanteric, or sub-trochanteric fractures received a pre-operative fascia iliaca compartment blockade (FICB). The study found that while the FICB group had significantly greater patient satisfaction, there were no significant differences in acetaminophen, tramadol, and morphine consumption between the FICB and control groups [Ref3]. Overall, the use of dexamethasone as an adjuvant in fascia iliaca compartment block for surgeries and hip fracture has shown mixed results. While it may increase the duration of the sensory block and patient satisfaction, it does not consistently reduce morphine consumption. Further research is needed to fully understand the benefits and limitations of using dexamethasone in FICB.

TABLE 1: Quality and Correctness of Responses Between ChatGPT, AskOE, and Bing
Data are presented as mean ± standard error (SE). * indicates a 100-point scale, with 100 being the maximum score.

TABLE 2: Head-to-Head Comparisons Between ChatGPT, AskOE, and Bing
Pairwise post hoc statistical tests were performed, with comparisons considered significant if P<0.05.

TABLE 6: Responses From Three Chatbots to the Question "How is dexamethasone used as an adjuvant to fascia iliaca compartment block for surgeries of hip fracture?"

TABLE 7: Major Errors in Chatbot Responses
Chatbot | Type of error | Description of error
ChatGPT | Incorrect focus | Discussed the efficacy of the treatment method for an incorrect region (e.g., the question asked about the elbow but the answer discussed the shoulder)
AskOE | Irrelevant references | Five of 11 references did not directly relate to the response or question
ChatGPT | No answer provided | Did not have information on the topic, so referred the reader to a medical professional or stated that it is a subject of debate
ChatGPT | Incorrect focus | Did not address the question, discussing a broader/more general topic than requested