Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room
Summary
This summary is machine-generated. Large language models (LLMs) show promise in predicting emergency department admissions, and their performance improves substantially when they are given real-world examples and machine learning probabilities. Further refinement is needed before clinical integration.
Area Of Science
- Medical Informatics
- Artificial Intelligence in Healthcare
- Clinical Decision Support
Background
- Artificial intelligence (AI) and large language models (LLMs) offer potential for enhancing emergency room (ER) operations, particularly in patient admission decision-making.
- Few studies have evaluated LLMs on real-world data and scenarios or compared them with traditional supervised machine learning (ML) models.
- This study evaluates GPT-4's performance in predicting patient admissions from emergency department (ED) visits against traditional ML models.
Purpose Of The Study
- To assess the performance of GPT-4 in predicting emergency department patient admissions.
- To compare GPT-4's predictive capabilities with traditional supervised machine learning models.
- To investigate the impact of few-shot learning and numerical probabilities on LLM performance.
Main Methods
- A retrospective study using electronic health records from seven New York City hospitals.
- Training of Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively.
- Assessment of GPT-4 in zero-shot and few-shot settings, the latter using retrieval-augmented generation (RAG), with and without numerical probabilities from the ML models.
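The prompting setups listed above can be illustrated with a minimal sketch. It is not the study's implementation: the `Case` record, the token-overlap retriever (`jaccard`, `retrieve_examples`), the prompt wording in `build_prompt`, and the toy notes are all hypothetical stand-ins, and no call to GPT-4 is made. The sketch only shows how retrieved past cases and an ML probability might be combined into one prompt, and how the same function falls back to a zero-shot prompt when no examples are supplied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    note: str        # free-text triage note (toy data, not from the study)
    admitted: bool   # known outcome of a past visit


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_examples(query: str, cases: List[Case], k: int = 2) -> List[Case]:
    """Return the k past cases most similar to the query note."""
    return sorted(cases, key=lambda c: jaccard(query, c.note), reverse=True)[:k]


def build_prompt(note: str, examples: List[Case],
                 ml_probability: Optional[float] = None) -> str:
    """Assemble a few-shot prompt; with no examples it reduces to zero-shot."""
    lines = ["Predict whether the patient will be admitted (yes/no)."]
    for ex in examples:
        lines.append(f"Example note: {ex.note}")
        lines.append(f"Admitted: {'yes' if ex.admitted else 'no'}")
    if ml_probability is not None:
        # Hypothetical wording for passing the ML model's output to the LLM.
        lines.append(f"ML model admission probability: {ml_probability:.2f}")
    lines.append(f"Patient note: {note}")
    lines.append("Admitted:")
    return "\n".join(lines)


history = [
    Case("chest pain radiating to left arm, diaphoretic", True),
    Case("twisted ankle playing soccer, mild swelling", False),
    Case("chest pain with shortness of breath, elderly", True),
]
query = "elderly patient with chest pain and nausea"

retrieved = retrieve_examples(query, history, k=2)
few_shot_prompt = build_prompt(query, retrieved, ml_probability=0.81)
zero_shot_prompt = build_prompt(query, [])
print(few_shot_prompt)
```

In practice a retriever would more likely rank past cases by embedding similarity than by token overlap; the overlap measure is used here only to keep the sketch dependency-free.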
Main Results
- The ensemble ML model achieved an AUC of 0.88, AUPRC of 0.72, and 82.9% accuracy.
- Naïve GPT-4 performance (AUC 0.79, AUPRC 0.48, accuracy 77.5%) improved significantly with RAG and ML probabilities (AUC 0.87, AUPRC 0.71, accuracy 83.1%).
- RAG alone boosted GPT-4 performance to an AUC of 0.82, AUPRC of 0.56, and accuracy of 81.3%.
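For readers less familiar with the three metrics reported above, the following sketch shows one common way to compute them from predicted probabilities. The labels and scores are toy values invented for illustration, not data from the study; the 0.5 accuracy threshold is an assumption, and `average_precision` is one standard estimator of AUPRC that may differ from the one the authors used.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (assumes untied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / hits


def accuracy(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Share of cases where the thresholded score matches the label."""
    correct = sum(int(s >= threshold) == y for y, s in zip(y_true, scores))
    return correct / len(y_true)


# Toy labels (1 = admitted) and predicted admission probabilities.
labels = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]

auc = roc_auc(labels, probs)
auprc = average_precision(labels, probs)
acc = accuracy(labels, probs)
print(f"AUC={auc:.3f}  AUPRC={auprc:.3f}  accuracy={acc:.3f}")
```

AUC and AUPRC depend only on how the scores rank the cases, whereas accuracy also depends on the chosen threshold, which is one reason two models can be ordered differently by these metrics. Library routines such as scikit-learn's `roc_auc_score` and `average_precision_score` would normally be used in place of hand-written versions.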
Conclusions
- While naïve LLMs have limited performance, their predictive accuracy for ED admissions substantially increases when augmented with real-world examples (RAG) and/or ML probabilities.
- GPT-4's best configuration came close to the ensemble ML model (AUC 0.87 vs. 0.88; AUPRC 0.71 vs. 0.72) and slightly exceeded it in accuracy (83.1% vs. 82.9%), which is notable given GPT-4's capacity to provide reasoning for its predictions.
- Further development and refinement of LLMs with real-world data are crucial for their successful integration as decision-support tools in clinical settings.