Updated: Jun 28, 2026


Methodological Considerations in Evaluating Large Language Models for Anatomy Education.

Ismail Sivri¹, Furkan Mehmet Ozden¹, Gamze Gul¹

¹Department of Anatomy, Faculty of Medicine, Kocaeli University, İzmit, Kocaeli, Türkiye.

Clinical Anatomy (New York, N.Y.)

June 27, 2026

Summary

This summary is machine-generated.


Evaluating large language models (LLMs) in medical education requires careful consideration of methodological factors. Future studies should enhance transparency for improved reproducibility and fairness in LLM assessments.

Area of Science:

Medical Education
Artificial Intelligence
Anatomy

Background:

Large language models (LLMs) show potential in anatomy education, clinical decision support, and knowledge assessment.
Evaluating LLMs with non-English anatomy questions is crucial because the underlying datasets are dominated by English.
Comparative LLM studies face methodological challenges that affect reproducibility, fairness, and interpretation.

Purpose of the Study:

To highlight critical methodological factors for future large language model evaluation studies.
To emphasize the need for transparency and standardized reporting in LLM research.
To ensure the reliability and comparability of LLM assessments in medical fields.

Keywords:

ChatGPT; artificial intelligence; data leakage; large language models; medical education; memory bias; reporting guidelines; reproducibility

Main Methods:

Categorization of methodological factors into three groups: technical, session-related, and experimental design.
Identification of specific variables within each category (e.g., model version, prompt wording, scoring).
Recommendation for adherence to emerging reporting frameworks such as TRIPOD-LLM, CONSORT-AI, and MI-CLEAR-LLM.

Main Results:

Technical factors include model version, interface, access date, browser, device, OS, connection, and timing.
Session-related factors encompass memory, personalization, chat history, custom instructions, and testing session types.
Experimental design factors involve dataset source, question selection, prompt wording, input language, response format, attempts, order, scoring, expert evaluation, and inter-rater agreement.
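
One way to make the three categories above concrete is a structured record that a study fills in for each testing session and publishes alongside its results. The following is a minimal Python sketch under stated assumptions, not a format taken from the article or from any of the reporting frameworks it names: the class names, field names, and the `missing` helper are all hypothetical, and the fields simply mirror the factors listed above.

```python
from dataclasses import dataclass, field, fields, asdict
from typing import List, Optional
import json


@dataclass
class TechnicalFactors:
    # Hypothetical field names mirroring the listed technical factors.
    model_version: Optional[str] = None
    interface: Optional[str] = None
    access_date: Optional[str] = None
    browser: Optional[str] = None
    device: Optional[str] = None
    operating_system: Optional[str] = None
    connection: Optional[str] = None
    timing: Optional[str] = None


@dataclass
class SessionFactors:
    # Hypothetical field names mirroring the listed session-related factors.
    memory: Optional[str] = None
    personalization: Optional[str] = None
    chat_history: Optional[str] = None
    custom_instructions: Optional[str] = None
    session_type: Optional[str] = None


@dataclass
class DesignFactors:
    # Hypothetical field names mirroring the listed experimental design factors.
    dataset_source: Optional[str] = None
    question_selection: Optional[str] = None
    prompt_wording: Optional[str] = None
    input_language: Optional[str] = None
    response_format: Optional[str] = None
    attempts: Optional[int] = None
    question_order: Optional[str] = None
    scoring: Optional[str] = None
    expert_evaluation: Optional[str] = None
    inter_rater_agreement: Optional[str] = None


@dataclass
class EvaluationReport:
    technical: TechnicalFactors = field(default_factory=TechnicalFactors)
    session: SessionFactors = field(default_factory=SessionFactors)
    design: DesignFactors = field(default_factory=DesignFactors)

    def missing(self) -> List[str]:
        """Return dotted names of every factor that has not been reported."""
        gaps = []
        for group in fields(self):
            section = getattr(self, group.name)
            for item in fields(section):
                if getattr(section, item.name) is None:
                    gaps.append(f"{group.name}.{item.name}")
        return gaps


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    report = EvaluationReport()
    report.technical.model_version = "example-model-v1"
    report.session.chat_history = "new chat for every question"
    report.design.input_language = "Turkish"
    report.design.attempts = 1

    print("Unreported factors:", report.missing())
    print(json.dumps(asdict(report), indent=2))
```

Keeping every field optional and listing the unreported ones, rather than rejecting an incomplete record, is a deliberate choice in this sketch: it lets authors see at a glance which methodological details a manuscript still needs to state.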

Conclusions:

Detailed methodological reporting is essential for future LLM comparison studies in anatomy, education, and clinical fields.
Adopting standardized frameworks will enhance the transparency, reproducibility, and comparability of LLM evaluations.
Addressing these factors will strengthen the evidence base for LLM applications in healthcare and education.