Related Articles

Articles linked to this work by shared authors, journal, and citation graph.

Same author

Augmenting large language models with clinical knowledge graph for personalized perioperative fluid therapy question answering.

PLOS digital health·2026

Same author

PEPRKD-depression: A knowledge database supporting evidence-based personalized exercise prescription recommendations in depression.

Digital health·2026

Same author

Biobanking for intelligent medicine: assessment and evaluation with the SHARE principle.

Journal of the American Medical Informatics Association : JAMIA·2026

Same author

MSICKB: A Curated Knowledgebase for Exploring Molecular Heterogeneity and Biomarker Prioritization in Microsatellite Instability Cancers.

Computational and structural biotechnology journal·2026

Same author

A Dataset for Evaluating Large Language Models on Chinese National Medical Licensing Examinations.

Scientific data·2026

Same author

From policy to practice: Evaluating the global implications of the FDA's PDURS framework.

Digital health·2026

Same journal

Evaluating the Evidence Base for New Mental Health Tech With APA Labs.

Journal of medical Internet research·2026

Same journal

Radiomics-Based AI for the Diagnosis and Prognosis of Vessels Encapsulating Tumor Clusters in Hepatocellular Carcinoma: Systematic Review and Meta-Analysis.

Journal of medical Internet research·2026

Same journal

Development and Validation of an Explainable Machine Learning Model to Assess the Prevalence Probability of Gastrointestinal Heat Retention Syndrome in Children: Cross-Sectional Study.

Journal of medical Internet research·2026

Same journal

Assessment of a Digital Health Platform Using Web Analytics and User Experience Measurements: Quantitative Study Based on RE-AIM.

Journal of medical Internet research·2026

Same journal

Sensor-Based Monitoring of Knee Osteoarthritis Symptoms in Free-Living Settings: Scoping Review.

Journal of medical Internet research·2026

Same journal

Effects of Immersive Virtual Reality Interventions on Symptom Management in Patients With Gastrointestinal Cancer: Systematic Review and Meta-Analysis of Randomized Controlled Trials.

Journal of medical Internet research·2026

Related Experiment Video

Updated: May 23, 2026

Implementation of In Vitro Drug Resistance Assays: Maximizing the Potential for Uncovering Clinically Relevant Resistance Mechanisms

Published on: December 9, 2015

Benchmarking Large Language Models and Prompt Engineering Strategies in Microsatellite Instability Cancers:

Yuxin Zhang¹, Jie Song¹, Cheng Bi¹

¹Department of Medical Oncology, Institutes for Systems Genetics, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, No 2222 Xinchuan Road, Gaoxin District, Chengdu, Sichuan, 610000, China, 86 15995854635, 86 28 61528682.

Journal of Medical Internet Research

May 21, 2026

Summary

This summary is machine-generated.

Large language models (LLMs) struggle with microsatellite instability (MSI) cancer tasks. Retrieval-augmented generation (RAG) significantly improves accuracy and safety, but requires optimized retrieval and knowledge bases for reliable clinical AI.

Keywords:

LLM benchmark; cancer; large language model; microsatellite instability; prompt engineering

More Related Videos

Integration of Wet and Dry Bench Processes Optimizes Targeted Next-generation Sequencing of Low-quality and Low-quantity Tumor Biopsies

Published on: April 11, 2016

Comparative Lesions Analysis Through a Targeted Sequencing Approach

Published on: November 5, 2019

Area of Science:

Artificial Intelligence in Oncology
Clinical Decision Support Systems
Biomedical Natural Language Processing

Background:

General-purpose large language models (LLMs) have uncharacterized reliability for complex clinical tasks in specialized domains like microsatellite instability (MSI) cancers.
The lack of a domain-specific benchmark for evaluating LLM capabilities in MSI oncology poses risks to patient safety.

Purpose of the Study:

To develop and validate the Microsatellite Instability Cancer Benchmark (MSIC-Bench) for evaluating LLMs in MSI oncology.
To systematically assess LLM performance across prompting strategies and identify areas for improvement.

Main Methods:

Developed MSIC-Bench, a 511-question benchmark built from clinical guidelines and curated knowledge.
Evaluated three state-of-the-art LLMs (GPT-4o, Gemini 2.5 Pro, Claude Opus 4) using four prompting strategies (vanilla, chain-of-thought, reflection of thoughts, RAG).
Assessed performance based on accuracy, safety, error composition, and token usage across multiple-choice and open-ended modalities.

Main Results:

LLMs exhibited a 'scaffolding effect': accuracy was higher when multiple-choice options were provided and decreased in open-ended scenarios.
Retrieval-augmented generation (RAG) was the most effective intervention, shifting bottlenecks from knowledge deficits to retrieval failures.
RAG improved accuracy and safety by reducing fabrications, though it introduced a trade-off with false refusals. Hybrid-RAG showed robust performance.
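
One way to make the error-composition comparison concrete is to tally per-response outcome labels and compare their shares between strategies. The sketch below assumes a hypothetical label set (`correct`, `fabrication`, `false_refusal`, `retrieval_failure`) and a helper named `error_composition`; neither is taken from the study, and the labels in the usage example are toy values, not reported results.

```python
from collections import Counter


def error_composition(labels: list[str]) -> dict[str, float]:
    """Return the share of responses carrying each outcome label."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Toy labels for illustration only; these are not data from the study.
toy_labels = ["correct", "correct", "fabrication", "false_refusal"]
shares = error_composition(toy_labels)
```

Comparing the `fabrication` and `false_refusal` shares with and without retrieval would expose the trade-off described above.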

Conclusions:

Current LLMs lack specialized knowledge for MSI oncology; RAG is crucial for addressing this gap.
Optimizing RAG requires focusing on retrieval precision and high-quality knowledge bases for trustworthy clinical AI.
MSIC-Bench provides a framework to guide future development of clinical AI in MSI oncology.