Evaluating Large Language Models on Aerospace Medicine Principles
View abstract on PubMed
Summary
This summary is machine-generated. Large language models (LLMs) show promise for space medicine decision support but require further development. While often accurate, they exhibit knowledge gaps and inconsistencies, so caution is warranted before their use in autonomous medical operations.
Area Of Science
- Aerospace Medicine
- Artificial Intelligence in Healthcare
Background
- Large language models (LLMs) offer potential for clinical decision support in spaceflight.
- Incorrect information generated by LLMs poses risks in Earth-independent medical settings.
Purpose Of The Study
- To evaluate the performance of publicly available LLMs (ChatGPT-4, Gemini Advanced) and a custom Retrieval-Augmented Generation (RAG) LLM.
- To assess factual knowledge, clinical reasoning, and consistency of LLMs using aerospace medicine materials.
Main Methods
- Tested LLMs on 857 free-response and 20 multiple-choice aerospace medicine board questions.
- Evaluated reader scores (Likert scale 1-5) for free-response answers.
- Assessed correct response rates for multiple-choice questions.
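The two metrics described above, mean Likert reader scores for free-response answers and correct response rates for multiple-choice questions, can be sketched as follows. This is a minimal illustration; the function names and example data are hypothetical, and only the metric definitions come from the text.

```python
# Illustrative sketch of the evaluation metrics described above.
# Function names and example data are hypothetical; only the metric
# definitions (mean Likert score, multiple-choice accuracy) are from the study.

def mean_reader_score(likert_scores):
    """Mean of 1-5 Likert ratings assigned by readers to free-response answers."""
    return sum(likert_scores) / len(likert_scores)

def correct_response_rate(answers, answer_key):
    """Fraction of multiple-choice answers that match the answer key."""
    correct = sum(a == k for a, k in zip(answers, answer_key))
    return correct / len(answer_key)

# Hypothetical example data
likert = [5, 4, 5, 3, 4]
answers = ["A", "C", "B", "D"]
key = ["A", "C", "D", "D"]

print(mean_reader_score(likert))            # 4.2
print(correct_response_rate(answers, key))  # 0.75
```

In the study, a reported rate such as 85% on 20 multiple-choice questions corresponds to 17 correct answers under this definition.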
Main Results
- ChatGPT-4, Gemini Advanced, and RAG LLM achieved mean reader scores of 4.23-5.00, 3.30-4.91, and 4.69-5.00, respectively.
- Correct response rates for multiple-choice questions were 70% (ChatGPT-4), 55% (Gemini Advanced), and 85% (RAG LLM).
- All LLMs demonstrated factual knowledge gaps and occasional inconsistencies, and their clinical reasoning may not reliably pass board examinations.
Conclusions
- LLMs show considerable promise for autonomous medical operations in spaceflight.
- Continued advancements in LLM training, data quality, and fine-tuning are anticipated.
- Careful validation and development are crucial before widespread clinical application in aerospace medicine.

