Artificial intelligence tools, amidst excitement, sensationalism, and fear, are revolutionizing various aspects of daily life, including medicine. ChatGPT has been explored in medical fields, science writing, and medical education. In this study, we aimed to evaluate ChatGPT’s efficacy in neurosurgical education by integrating it into the annual exam for neurosurgical residents at Geneva University Hospitals. These residents undergo this exam as part of their training, preparing for further certifications.
ChatGPT, an AI language model, was used both to generate questions for and to sit a written exam for neurosurgical residents. The exam contained 50 questions, a mix of multiple-choice and open-ended items, 4 of which were generated by ChatGPT and the rest by attending neurosurgeons. The attending neurosurgeons provided the correct answers, and some questions referenced images or institution-specific protocols. Eleven participants (10 residents plus ChatGPT) took the exam within 60 minutes. ChatGPT answered 42 questions; those involving images or institutional protocols were excluded. ChatGPT's questions and answers were assessed against predefined criteria to compare its performance with that of residents and attending neurosurgeons. The assessment focused on how ChatGPT generated and answered questions, how its scores compared to the residents', and how residents performed on the ChatGPT-generated questions.
Generative language models, like ChatGPT, operate in complex, proprietary ways, but the basic principles behind them are public and accessible (Stollnitz, 2023). They’ve been used extensively in various medical fields, such as neurooncology and medical writing, raising concerns about transparency and scientific accuracy (Tools such as ChatGPT threaten, 2023).
A recent study found that ChatGPT can achieve scores equivalent to a third-year medical student on the United States Medical Licensing Examination, with its ability to provide logical and contextual answers being highlighted (Gilson et al., 2023).
However, when asked to generate questions for a neurosurgical resident exam, ChatGPT's output was less usable than expected. The AI-generated questions were well formulated but lacked clearly specified correct answers, leaving no criteria against which responses could be graded.
To refine the process, the human request was reformulated to specify the type of question and to require that the correct answer be stated. This produced more precise questions, though some still needed further refinement.
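The kind of reformulated request described above can be sketched as a simple prompt-building function. This is a hypothetical illustration, not the prompt used in the study: the topic, the required fields, and the function name are all assumptions chosen to show how specifying the question type and correct answer up front constrains the model's output.

```python
# Hypothetical sketch of a structured question-generation request.
# The topic, requirements, and field layout are illustrative assumptions,
# not the exact wording used in the study.

def build_exam_prompt(topic: str, question_type: str, n_options: int = 4) -> str:
    """Compose a request that forces the model to state the correct answer."""
    return (
        f"Write one {question_type} exam question for neurosurgical residents "
        f"on the topic of {topic}.\n"
        f"Requirements:\n"
        f"- Provide exactly {n_options} answer options labeled A to D.\n"
        f"- State which option is correct.\n"
        f"- Justify the correct answer in one sentence."
    )

prompt = build_exam_prompt(
    "management of acute subdural hematoma", "multiple-choice"
)
print(prompt)
```

Compared with an open-ended request ("write an exam question about subdural hematoma"), fixing the format and demanding the correct answer in the same reply gives the examiner something that can be checked and graded directly.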
While ChatGPT's speed in generating questions is unmatched, human oversight remains necessary to filter out unsuitable questions and retain those fit for the exam.
Human-generated questions, though potentially less precise, may be better understood by residents due to shared language and training experiences.
Chomsky emphasizes the difference between human and AI minds, stating that humans create explanations rather than merely infer correlations (Chomsky et al., 2023).
AI tools like ChatGPT have significant potential in medicine, aiding in standardization and improving care globally. They can assist in medical education and research by generating answers and processing information from vast databases, reducing time and effort for humans.
Future experiments may compare resident performance in exams generated fully by AI versus those by humans, exploring the effectiveness of each approach.
It is clear that AI is poised to become a valuable tool in medical education and in specific medical tasks, though its exact applications still need to be defined. Getting AI to generate coherent, gradeable questions requires precisely formulated requests that detail the content, the question type, and the expected correct answers.
Source:
Bartoli, A., May, A. T., Al-Awadhi, A., & Schaller, K. (2023). Probing artificial intelligence in neurosurgical training: ChatGPT takes a neurosurgical residents written exam. Brain & spine, 4, 102715. https://doi.org/10.1016/j.bas.2023.102715