
Benchmarking Large Language Models and Prompt Engineering Strategies in Microsatellite Instability Cancers:

Yuxin Zhang1, Jie Song1, Cheng Bi1

  • 1Department of Medical Oncology, Institutes for Systems Genetics, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, No 2222 Xinchuan Road, Gaoxin District, Chengdu, Sichuan, 610000, China, 86 15995854635, 86 28 61528682.

Journal of Medical Internet Research
|May 21, 2026
Summary
This summary is machine-generated.

Large language models (LLMs) struggle with microsatellite instability (MSI) cancer tasks. Retrieval-augmented generation (RAG) significantly improves accuracy and safety, but requires optimized retrieval and knowledge bases for reliable clinical AI.

Keywords:
LLM, benchmark, cancer, large language model, microsatellite instability, prompt engineering

Area of Science:

  • Artificial Intelligence in Oncology
  • Clinical Decision Support Systems
  • Biomedical Natural Language Processing

Background:

  • The reliability of general-purpose large language models (LLMs) on complex clinical tasks in specialized domains such as microsatellite instability (MSI) cancers has not been characterized.
  • The lack of a domain-specific benchmark for evaluating LLM capabilities in MSI oncology poses risks to patient safety.

Purpose of the Study:

  • To develop and validate the Microsatellite Instability Cancer Benchmark (MSIC-Bench) for evaluating LLMs in MSI oncology.
  • To systematically assess LLM performance across prompting strategies and identify areas for improvement.

Main Methods:

  • Developed MSIC-Bench, a 511-question benchmark from clinical guidelines and curated knowledge.
  • Evaluated three state-of-the-art LLMs (GPT-4o, Gemini 2.5 Pro, Claude Opus 4) using four prompting strategies (vanilla, chain-of-thought, reflection of thoughts, RAG).
  • Assessed performance based on accuracy, safety, error composition, and token usage across multiple-choice and open-ended modalities.
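
The evaluation grid described above (model × prompting strategy × question modality) can be illustrated with a minimal aggregation sketch. This is not the authors' pipeline: the `Result` record, the `summarize` function, and the demo rows are hypothetical names and made-up values, assuming each graded answer has already been reduced to a correct/incorrect flag and a token count.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded answer (hypothetical record layout, not from the study)."""
    model: str      # e.g. "GPT-4o"
    strategy: str   # e.g. "vanilla", "chain-of-thought", "RAG"
    modality: str   # "multiple-choice" or "open-ended"
    correct: bool   # whether a grader judged the answer correct
    tokens: int     # tokens consumed by this query


def summarize(results):
    """Return {(model, strategy, modality): (accuracy, mean_tokens)}."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.model, r.strategy, r.modality)].append(r)

    summary = {}
    for key, rows in groups.items():
        accuracy = sum(r.correct for r in rows) / len(rows)
        mean_tokens = sum(r.tokens for r in rows) / len(rows)
        summary[key] = (accuracy, mean_tokens)
    return summary


# Made-up records for illustration only; these are not data from the study.
demo = [
    Result("model-a", "vanilla", "multiple-choice", True, 100),
    Result("model-a", "vanilla", "multiple-choice", False, 120),
    Result("model-a", "RAG", "open-ended", True, 400),
]
demo_summary = summarize(demo)
```

Keeping modality in the grouping key means multiple-choice and open-ended accuracy are reported separately for the same model and strategy, so the two modalities can be compared directly.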

Main Results:

  • LLMs exhibited a 'scaffolding effect': accuracy was lower on open-ended questions than on multiple-choice questions.
  • Retrieval-augmented generation (RAG) was the most effective intervention, shifting bottlenecks from knowledge deficits to retrieval failures.
  • RAG improved accuracy and safety by reducing fabrications, though it introduced a trade-off with false refusals. Hybrid-RAG showed robust performance.
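
One way to make the shift in error composition concrete is to tally error categories per prompting strategy. The sketch below is illustrative only: the category labels are taken from the terms used in this summary, but the label strings, the `error_composition` function, and the example label lists are assumptions, not the study's annotation scheme or data.

```python
from collections import Counter

# Hypothetical label set, named after the error types mentioned in the summary.
CATEGORIES = (
    "correct",
    "knowledge_deficit",
    "retrieval_failure",
    "fabrication",
    "false_refusal",
)


def error_composition(labels):
    """Return each error category's share of all incorrect answers."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    errors = Counter(label for label in labels if label != "correct")
    total = sum(errors.values())
    if total == 0:
        return {}  # no errors to break down
    return {c: errors[c] / total for c in CATEGORIES if c != "correct"}


# Made-up annotations for illustration only; not results from the study.
vanilla_labels = ["correct", "knowledge_deficit", "knowledge_deficit", "fabrication"]
rag_labels = ["correct", "correct", "retrieval_failure", "false_refusal"]

vanilla_mix = error_composition(vanilla_labels)
rag_mix = error_composition(rag_labels)
```

Comparing the two dictionaries side by side shows whether the dominant failure mode moves from knowledge deficits toward retrieval failures and false refusals once retrieval is added.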

Conclusions:

  • Current LLMs lack specialized knowledge for MSI oncology; RAG is crucial for addressing this gap.
  • Optimizing RAG requires focusing on retrieval precision and high-quality knowledge bases for trustworthy clinical AI.
  • MSIC-Bench provides a framework to guide future development of clinical AI in MSI oncology.
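
Retrieval precision, named in the conclusions above, is commonly measured as precision at k: the fraction of the top k retrieved passages that are actually relevant. The summary does not say which retrieval metric the authors used, so the following is a generic sketch; `precision_at_k` and the document IDs are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant.

    Divides by k even when fewer than k items were retrieved, so a
    retriever that returns too few passages is penalized.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Made-up IDs for illustration only.
retrieved = ["g1", "g7", "g3", "g9"]   # ranked retriever output
relevant = {"g1", "g3"}                # passages judged relevant to the question

p_at_1 = precision_at_k(retrieved, relevant, 1)
p_at_3 = precision_at_k(retrieved, relevant, 3)
```

Here two of the top three retrieved passages are relevant, so precision at 3 is two thirds, while precision at 1 is 1.0 because the top-ranked passage is relevant.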