


Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents

Melih Can Gül1

  • 1Department of Gastrointestinal Surgery, Afyonkarahisar State Hospital, Afyonkarahisar, Türkiye. opdrmelihcangul@gmail.com.

BMC Medical Education | August 24, 2025


Summary
This summary is machine-generated.

Artificial intelligence (AI) models show lower accuracy than human surgeons on board exam questions. AI tools are best used to supplement, not replace, expert clinical judgment in surgical education.

Keywords:
Artificial intelligence, Board examinations, Human-AI comparison, Medical education, Surgical training


Area of Science:

  • Medical Education
  • Artificial Intelligence
  • Surgical Assessment

Background:

  • Artificial intelligence (AI) is increasingly used in medical education and assessment.
  • AI models like GPT-4o show variable performance on high-stakes medical examinations.
  • This study evaluates AI performance against human experts in surgical board testing.

Purpose of the Study:

  • To compare the accuracy of four AI models (Llama-3, Gemini, GPT-4o, Copilot) against specialists and residents on European General Surgery Board questions.
  • To analyze AI performance based on question format, length, and difficulty.
  • To determine the current role of AI in surgical education and assessment.

Main Methods:

  • A set of 120 multiple-choice questions from the European General Surgery Board examination was used.
  • Four AI models and 30 surgeons (specialists and residents) answered questions under timed conditions.
  • Questions were categorized by length and difficulty, with accuracy compared using ANOVA.
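
The summary does not say which ANOVA design was used, so the following is only a minimal sketch of a one-way ANOVA F-statistic in plain Python. The function name `one_way_anova_f` and the accuracy values in `hypothetical_groups` are made up for illustration and are not data from the study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-participant accuracies (NOT study data), one list per group.
hypothetical_groups = [
    [0.80, 0.84, 0.82],  # e.g. specialists
    [0.68, 0.72, 0.70],  # e.g. residents
    [0.60, 0.64, 0.62],  # e.g. repeated runs of one AI model
]

f_stat, df_b, df_w = one_way_anova_f(hypothetical_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value indicates that the group means differ by more than the within-group scatter would suggest; turning it into a p-value requires the F distribution, which is left out of this sketch.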

Main Results:

  • Board-certified surgeons achieved 81.6% accuracy, residents 69.9%.
  • Llama-3 performed best among AI models at 65.8%, Copilot lowest at 51.7%.
  • AI performance decreased with question length and difficulty, unlike human performance.
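
To make the AI percentages concrete, the sketch below converts them back into approximate counts of correct answers. It assumes each model answered all 120 questions once and was scored as a simple fraction correct, which the summary does not state explicitly; `implied_correct` is a hypothetical helper name.

```python
TOTAL_QUESTIONS = 120  # number of multiple-choice questions given in the summary

# Reported accuracies (percent) for the best and worst AI models.
reported_accuracy = {"Llama-3": 65.8, "Copilot": 51.7}


def implied_correct(pct, total=TOTAL_QUESTIONS):
    """Nearest whole number of correct answers implied by a percentage."""
    return round(pct / 100 * total)


for model, pct in reported_accuracy.items():
    n = implied_correct(pct)
    print(f"{model}: ~{n}/{TOTAL_QUESTIONS} correct ({n / TOTAL_QUESTIONS:.1%})")
```

Under that assumption, 65.8% corresponds to about 79 of 120 questions and 51.7% to about 62 of 120. The surgeon and resident figures are group-level accuracies, so they do not map to a single whole-number count in the same way.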

Conclusions:

  • Current large language models (LLMs) underperform human specialists in high-level medical knowledge assessments.
  • LLMs are valuable supplementary tools in surgical education.
  • AI should not replace expert clinical judgment in surgical practice.