Updated: Jun 28, 2026

Methodological Considerations in Evaluating Large Language Models for Anatomy Education.

Ismail Sivri¹, Furkan Mehmet Ozden¹, Gamze Gul¹

  • ¹Department of Anatomy, Faculty of Medicine, Kocaeli University, İzmit, Kocaeli, Türkiye.

Clinical Anatomy (New York, N.Y.) | June 27, 2026
Summary
This summary is machine-generated.

Evaluating large language models (LLMs) in medical education requires careful consideration of methodological factors. Future studies should enhance transparency for improved reproducibility and fairness in LLM assessments.

Area of Science:

  • Medical Education
  • Artificial Intelligence
  • Anatomy

Background:

  • Large language models (LLMs) show potential in anatomy education, clinical decision support, and knowledge assessment.
  • Evaluating LLMs with non-English anatomy questions is crucial due to English-dominated datasets.
  • Comparative LLM studies face methodological challenges impacting reproducibility, fairness, and interpretation.

Purpose of the Study:

  • To highlight critical methodological factors for future large language model evaluation studies.
  • To emphasize the need for transparency and standardized reporting in LLM research.
  • To ensure the reliability and comparability of LLM assessments in medical fields.

Main Methods:

  • Categorization of methodological factors into technical, session-related, and experimental design.
  • Identification of specific variables within each category (e.g., model version, prompt wording, scoring).
  • Recommendation for adherence to emerging reporting frameworks like TRIPOD-LLM, CONSORT-AI, and MI-CLEAR-LLM.

Keywords:
ChatGPT, artificial intelligence, data leakage, large language models, medical education, memory bias, reporting guidelines, reproducibility
Main Results:

  • Technical factors include model version, interface, access date, browser, device, OS, connection, and timing.
  • Session-related factors encompass memory, personalization, chat history, custom instructions, and testing session types.
  • Experimental design factors involve dataset source, question selection, prompt wording, input language, response format, attempts, order, scoring, expert evaluation, and inter-rater agreement.
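
Inter-rater agreement, the last experimental design factor listed above, is commonly summarized with Cohen's kappa when two experts score the same set of responses. The summary does not name a specific statistic, so the choice of kappa here is an assumption made for illustration. The function name `cohens_kappa` and the sample ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores from two experts for six LLM answers.
expert_1 = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
expert_2 = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3))  # prints 0.667
```

In this sample the raters agree on five of six answers (0.833), but half of that agreement is expected by chance, which is why kappa is lower than the raw agreement rate.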

Conclusions:

  • Detailed methodological reporting is essential for future LLM comparison studies in anatomy, education, and clinical fields.
  • Adopting standardized frameworks will enhance the transparency, reproducibility, and comparability of LLM evaluations.
  • Addressing these factors will strengthen the evidence base for LLM applications in healthcare and education.
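
One way to make this kind of reporting concrete is to record the factors from the three categories (technical, session-related, experimental design) in a structured form and check which ones are still missing. The sketch below is a minimal illustration, not a format prescribed by the study or by any of the named frameworks. The class `EvaluationReport`, the helper `unreported`, the field names, and all sample values are hypothetical, and only a subset of the listed factors is included.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvaluationReport:
    # Technical factors
    model_version: Optional[str] = None
    interface: Optional[str] = None
    access_date: Optional[str] = None
    # Session-related factors
    memory_enabled: Optional[bool] = None
    custom_instructions: Optional[str] = None
    new_session_per_question: Optional[bool] = None
    # Experimental design factors
    dataset_source: Optional[str] = None
    prompt_wording: Optional[str] = None
    input_language: Optional[str] = None
    attempts_per_question: Optional[int] = None
    scoring_method: Optional[str] = None


def unreported(report):
    """Return the names of fields left as None, i.e. not yet reported."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Hypothetical, partially completed report (all values are placeholders).
report = EvaluationReport(
    model_version="example-model-1.0",
    access_date="2026-01-15",
    memory_enabled=False,
    input_language="Turkish",
)
print(unreported(report))
```

Running the check before submission lists every factor that has not been documented, which in this sample includes `interface`, `dataset_source`, and `scoring_method`.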