Clinical Risk Computation by Large Language Models Using Validated Risk Scores
Summary
This summary is machine-generated. Large Language Models (LLMs) can reliably compute validated clinical risk scores, offering a transparent way to enhance healthcare workflows. GPT-4o-mini and Gemini 2.5 Flash achieved high accuracy, though complex scores such as the Framingham Risk Score remain challenging for AI.
Area Of Science
- Artificial Intelligence in Medicine
- Clinical Informatics
- Natural Language Processing
Background
- Large Language Models (LLMs) offer advanced natural language understanding for healthcare.
- Direct LLM risk prediction is unreliable due to bias and data complexity.
- Using LLMs to compute established clinical risk scores enhances validity and transparency.
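The transparency argued for above comes from the fact that validated risk scores are fixed arithmetic over extracted patient features, so any computed value can be audited. As a minimal sketch, here is one widely used validated score, CHA2DS2-VASc for stroke risk in atrial fibrillation (an assumed example; the article does not list which five scores were evaluated):

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_tia, vascular_disease):
    """Compute the CHA2DS2-VASc stroke-risk score (range 0-9).

    Each criterion contributes a fixed number of points, so the score
    is fully deterministic given correctly extracted patient data.
    """
    score = 0
    score += 1 if chf else 0                # Congestive heart failure
    score += 1 if hypertension else 0       # Hypertension
    if age >= 75:                           # Age >= 75 scores 2 points
        score += 2
    elif age >= 65:                         # Age 65-74 scores 1 point
        score += 1
    score += 1 if diabetes else 0           # Diabetes mellitus
    score += 2 if stroke_tia else 0         # Prior stroke/TIA
    score += 1 if vascular_disease else 0   # Vascular disease
    score += 1 if female else 0             # Sex category: female
    return score

# 77-year-old woman with hypertension and diabetes: 2 + 1 + 1 + 1 = 5
print(cha2ds2_vasc(77, True, False, True, True, False, False))  # → 5
```

In the workflow the article describes, the LLM's job is to extract these inputs from free-text notes and apply the formula, rather than to estimate risk directly.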
Purpose Of The Study
- To evaluate the accuracy of public LLMs in calculating validated clinical risk scores.
- To compare the performance of GPT-4o-mini, DeepSeek v3, and Google Gemini 2.5 Flash.
- To assess LLM reliability for enhancing clinical workflows through interpretable score calculation.
Main Methods
- Generated 100 diverse patient profiles as natural language clinical notes.
- Utilized LLMs (GPT-4o-mini, DeepSeek v3, Gemini 2.5 Flash) to extract data and compute five clinical risk scores.
- Compared LLM-computed scores against reference scores using accuracy, precision, recall, F1, and Pearson correlation.
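The comparison step above can be sketched as follows, using two of the reported metrics (exact-match accuracy and Pearson correlation) on hypothetical score values; the actual study used 100 generated profiles across five risk scores and also reported precision, recall, and F1:

```python
import numpy as np

# Hypothetical reference vs. LLM-computed scores for 10 patients
# (illustrative values only, not data from the study).
reference = np.array([0, 2, 3, 1, 4, 2, 0, 5, 3, 1])
llm = np.array([0, 2, 3, 1, 4, 2, 1, 5, 3, 1])

# Exact-match accuracy: fraction of patients whose computed score
# equals the reference score.
accuracy = float(np.mean(reference == llm))

# Pearson correlation: linear agreement between score values, which
# also credits near-misses that exact-match accuracy penalizes.
pearson_r = float(np.corrcoef(reference, llm)[0, 1])

print(f"accuracy={accuracy:.2f}, pearson_r={pearson_r:.3f}")
```

Reporting both metrics matters because a model can be highly correlated with the reference while still making systematic off-by-one errors on individual patients.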
Main Results
- GPT-4o-mini and Gemini 2.5 Flash demonstrated near-perfect agreement with reference scores for most clinical risk assessments.
- DeepSeek v3 showed lower performance compared to GPT-4o-mini and Gemini 2.5 Flash.
- All evaluated LLMs encountered difficulties with the complex Framingham Risk Score calculation.
Conclusions
- LLMs can accurately compute established clinical risk scores, offering a trustworthy alternative to direct AI risk prediction.
- GPT-4o-mini and Gemini 2.5 Flash show promise for integrating into clinical workflows for risk score calculation.
- Further development is needed to address LLM challenges with highly complex risk assessment formulas.