Disease Risk Prediction Using Structured EHR Data: Can Generalist Large Language Models Match Specialized Clinical Foundation Models? A Comparative Evaluation with Fine-Tuning | JoVE Visualize

Area of Science:

Artificial Intelligence in Healthcare
Clinical Informatics
Biomedical Data Science

Background:

Electronic health records (EHRs) are widely used as input to clinical decision support tools.
Clinical foundation models (CFMs) excel in predictive tasks using structured EHR data.
Generalist large language models (LLMs) are increasingly applied to healthcare, but their efficacy against specialized CFMs for disease prediction is unclear.

Purpose of the Study:

To compare the performance of CFMs against fine-tuned generalist LLMs and LLM-generated embeddings for disease risk prediction.
To evaluate model performance on diverse datasets including multi-site EHR, claims data, and an open-source benchmark.
To determine the most effective approach for applying AI to structured clinical data in predictive tasks.

Main Methods:

Compared specialized CFMs (Med-BERT, CLMBR) with fine-tuned generalist LLMs (Mistral, LLaMA-2/3/3.1) and a clinical LLM (Me-LLaMA).
Evaluated LLM-generated embeddings combined with simple classifiers (logistic regression, MLP) using models like DeepSeek, Qwen3, and GPT-OSS.
Assessed performance on heart failure risk (DHF) and pancreatic cancer diagnosis (PaCa) using AUROC and AUPRC metrics across multiple data sources.
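
The embedding-plus-classifier setup and the two metrics named above can be sketched in a few lines. This is a minimal illustration, not the study's code: the "embeddings" are synthetic random vectors standing in for LLM-generated patient representations, the logistic regression is a plain gradient-descent version, and every function name (make_synthetic_embeddings, fit_logistic, run_demo) is hypothetical.

```python
import math
import random


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve, a common AUPRC estimator.

    Assumes at least one positive label; tied scores are ordered arbitrarily.
    """
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each positive's rank
    return total / hits


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_embeddings(n, dim, rng):
    """ASSUMPTION: toy vectors stand in for LLM embeddings; every 5th patient is a case."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 5 == 0 else 0
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, 0.5) for _ in range(dim)])
        y.append(label)
    return X, y


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Logistic regression trained with full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def run_demo(seed=0):
    rng = random.Random(seed)
    X_train, y_train = make_synthetic_embeddings(200, 4, rng)
    X_test, y_test = make_synthetic_embeddings(100, 4, rng)
    w, b = fit_logistic(X_train, y_train)
    probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X_test]
    return auroc(y_test, probs), average_precision(y_test, probs)


test_auroc, test_auprc = run_demo()
print(f"AUROC={test_auroc:.3f}  AUPRC={test_auprc:.3f}")
```

In practice the synthetic vectors would be replaced by embeddings extracted from a model, and the hand-written classifier and metrics by a library implementation; the toy data here is deliberately easy to separate, so its scores say nothing about real clinical performance.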

Main Results:

Fine-tuned CFMs showed a small but statistically significant advantage (less than 1% AUROC) over fine-tuned LLMs on the larger EHR and claims datasets.
LLM-generated embeddings with lightweight classifiers achieved higher AUROC (>90%) and AUPRC (66%) than both fine-tuned CFMs and fine-tuned LLMs.
On the PaCa cohort, LLMs had higher AUROCs, but CFMs achieved significantly higher AUPRC.
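
One way to see how AUROC and AUPRC can rank models differently on a rare outcome is to write precision in terms of the true positive rate, the false positive rate, and the prevalence. The identity below is Bayes' rule; the prevalence of 1% and the operating points are assumed values chosen for illustration, not figures from the study.

```latex
% Precision at an operating point, with prevalence \pi = P(y = 1)
\mathrm{Precision}
  = \frac{\mathrm{TPR}\,\pi}{\mathrm{TPR}\,\pi + \mathrm{FPR}\,(1 - \pi)}

% Assumed example: \pi = 0.01, \mathrm{TPR} = 0.8
\mathrm{FPR} = 0.05:\quad
  \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.05 \times 0.99}
  = \frac{0.008}{0.0575} \approx 0.139

\mathrm{FPR} = 0.025:\quad
  \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.025 \times 0.99}
  = \frac{0.008}{0.03275} \approx 0.244
```

Under these assumptions, a shift in false positive rate that moves the ROC curve only slightly nearly doubles precision, so a model with a marginally lower AUROC can still post a clearly higher AUPRC when cases are rare.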

Conclusions:

LLM-generated embeddings with simple classifiers represent a highly effective strategy for disease risk prediction, outperforming fine-tuned specialized and generalist models.
While generalist LLMs show potential, their computational cost and variable performance require careful consideration.
The study provides a reproducible framework for evaluating AI models in clinical settings, highlighting the efficacy of embedding-based approaches.