Updated: Jun 12, 2026

Understanding Transformer-Based Classifications of Medical Text Using a Large Language Model for the Attribution of

Fangwen Zhou1, Ashirbani Saha2, Muhammad Afzal3

  • 1Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Hamilton, ON, L8S 4L8, Canada, 1 905-525-9140 ext 22208.

JMIR Medical Informatics | June 10, 2026
Summary

This summary is machine-generated.

Generative large language models like GPT-4o struggle as standalone explainers for biomedical text classification. Traditional methods like Shapley Additive Explanations (SHAP) and integrated gradients (IG) offer more reliable and efficient explanations for model interpretability.

Area of Science:

  • Biomedical Informatics
  • Artificial Intelligence
  • Natural Language Processing

Background:

  • Deep learning models, particularly transformer architectures, excel in biomedical literature classification but lack interpretability.
  • Explainable AI (XAI) methods like SHAP and integrated gradients (IG) improve transparency but are computationally intensive.
  • Generative large language models (LLMs) present a potential new avenue for creating interpretable, context-aware explanations.

Purpose of the Study:

  • To evaluate GPT-4o as a standalone, end-to-end perturbation-based explainer for BioLinkBERT text classification.
  • To compare GPT-4o's explanation faithfulness and semantic alignment against established SHAP and IG baselines.
  • To assess the computational efficiency and cost-effectiveness of GPT-4o compared to traditional XAI methods.

Keywords:

  • GPT
  • SHAP
  • Shapley Additive Explanations
  • artificial intelligence
  • deep learning
  • explainable artificial intelligence
  • feature attribution
  • integrated gradients
  • natural language processing

Main Methods:

  • A fine-tuned BioLinkBERT model classified 200 studies from McMaster PLUS and Clinical Hedges for methodological rigor.
  • Stratified sampling over-represented low-confidence predictions to rigorously test explainers.
  • GPT-4o, SHAP, and IG generated token-level feature attributions, with GPT-4o using iterative masking under two prompting schemes.
  • Explanation quality was assessed using modified area over the perturbation curve (AOPC) and correlation analyses.
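
The masking and faithfulness steps above can be illustrated with a minimal sketch. It assumes single-token masking for attribution and the standard (unmodified) AOPC definition; the study's modified AOPC and its GPT-4o prompting schemes are not specified here and are not reproduced. The names `MASK`, `toy_predict`, and `WEIGHTS` are hypothetical stand-ins for a real classifier such as the fine-tuned BioLinkBERT model.

```python
from typing import Callable, List, Optional, Sequence

MASK = "[MASK]"  # hypothetical mask token; a real tokenizer defines its own

Predict = Callable[[Sequence[str]], float]


def occlusion_attributions(tokens: Sequence[str], predict: Predict) -> List[float]:
    """Score each token by the probability drop when that token alone is masked."""
    base = predict(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = list(tokens)
        masked[i] = MASK
        scores.append(base - predict(masked))
    return scores


def aopc(tokens: Sequence[str], attributions: Sequence[float],
         predict: Predict, max_k: Optional[int] = None) -> float:
    """Standard AOPC: mean probability drop after cumulatively masking
    the top-1, top-2, ..., top-K tokens ranked by attribution."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    k_max = len(order) if max_k is None else min(max_k, len(order))
    if k_max == 0:
        return 0.0
    base = predict(tokens)
    masked = list(tokens)
    total = 0.0
    for k in range(k_max):
        masked[order[k]] = MASK  # masking accumulates across steps
        total += base - predict(masked)
    return total / k_max


# Toy classifier (assumption): probability grows with a few "rigor" words.
WEIGHTS = {"randomized": 0.4, "blinded": 0.3, "trial": 0.1}


def toy_predict(tokens: Sequence[str]) -> float:
    return min(1.0, 0.1 + sum(WEIGHTS.get(t, 0.0) for t in tokens))


tokens = ["a", "randomized", "blinded", "trial", "of", "aspirin"]
attr = occlusion_attributions(tokens, toy_predict)
score = aopc(tokens, attr, toy_predict, max_k=3)
print(f"AOPC (top-3): {score:.3f}")
```

A higher AOPC means that masking the tokens an explainer ranks highest degrades the prediction more, which is the sense in which an explanation is called faithful. In this toy setup, passing reversed attributions to `aopc` masks only uninformative tokens first and yields a score of zero.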

Main Results:

  • SHAP (AOPC 0.222) and IG (AOPC 0.225) provided consistent and faithful explanations, identifying key tokens related to study rigor.
  • GPT-4o exhibited significantly lower faithfulness (AOPC 0.025-0.029) and produced divergent attributions.
  • Correlation analysis showed moderate alignment between SHAP and IG (r=0.367), but limited correlation with GPT-4o (r≤0.032).
  • GPT-4o was computationally intensive and costly, while IG was the most time-efficient.
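
As a rough illustration of the correlation analysis, the sketch below computes a Pearson coefficient between two token-level attribution vectors for the same text. Whether the reported r values are Pearson coefficients is an assumption, and the attribution values shown are invented for illustration only, not taken from the study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length attribution vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-token attributions from two explainers (illustrative values).
explainer_a = [0.40, 0.30, 0.10, 0.00, 0.05]
explainer_b = [0.35, 0.20, 0.15, 0.05, 0.00]

r = pearson_r(explainer_a, explainer_b)
print(f"r = {r:.3f}")
```

Values near 1 indicate that two explainers rank and weight tokens similarly, while values near 0, as reported for GPT-4o against SHAP and IG, indicate that the attributions are largely unrelated.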

Conclusions:

  • Current generative LLMs are limited as standalone perturbation explainers for biomedical text classification.
  • GPT-4o struggles with accurate feature importance synthesis via iterative masking, lacking the reliability of traditional XAI frameworks.
  • Future research should explore specialized prompt engineering, whole-word strategies, and hybrid approaches for LLM-based explanations.