A Multiassessment and Multiprofessional Agents Approach for Medical Chatbot Risk Estimation: Development and Evaluation Study

Area of Science:

Artificial Intelligence in Healthcare
Natural Language Processing
Risk Assessment Methodologies

Background:

Assessing AI chatbot responses in medical, ethical, and legal domains is crucial for safe healthcare applications.
Current large language models (LLMs) lack specialized domain knowledge for accurate risk assessment.
Existing ensemble methods struggle to resolve disagreements, which leads to misclassification and makes risk assessment harder.

Purpose of the Study:

To design, develop, and evaluate a combined multiassessment (MA) and multiprofessional agents (MPA) approach for chatbot risk assessment.
To improve the accuracy and reliability of risk estimation in AI-driven healthcare interactions.
To address limitations of general LLMs and ensemble methods in specialized domains.

Main Methods:

Developed a multiassessment (MA) framework with three stages: an initial assessment (MA1), a verification assessment (MA2) performed by multiprofessional agents (MPA) specialized in the medical, ethical, and legal domains, and a final assessment (MA3).
Evaluated the approach on the MedNLP-CHAT corpus (N=226) using baseline, enhanced prompt, embedding-based search, and retrieval-augmented generation (RAG) methods.
Used macro F1-score and joint accuracy as primary metrics, supported by confidence intervals and paired differences in macro F1-score.
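
As a rough illustration of the two primary metrics, the sketch below computes macro F1-score per domain and joint accuracy on toy labels. It assumes binary risk labels (1 = risk, 0 = no risk) in each domain, and it assumes joint accuracy is the share of items for which all three domain labels are correct. Neither detail is specified in this summary, and the function names and data are hypothetical.

```python
from typing import Dict, List, Sequence

DOMAINS = ("medical", "ethical", "legal")  # domain names taken from the summary
LABELS = (0, 1)  # assumption: binary labels, 1 = risk, 0 = no risk


def macro_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over the fixed label set."""
    scores: List[float] = []
    for cls in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class absent from both gold and pred scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def joint_accuracy(gold: Dict[str, List[int]], pred: Dict[str, List[int]]) -> float:
    """Fraction of items whose labels are correct in every domain (assumed definition)."""
    n = len(gold[DOMAINS[0]])
    hits = sum(all(gold[d][i] == pred[d][i] for d in DOMAINS) for i in range(n))
    return hits / n


# Toy labels for four chatbot responses (illustrative only, not study data).
toy_gold = {"medical": [1, 0, 1, 0], "ethical": [0, 0, 1, 1], "legal": [0, 1, 0, 0]}
toy_pred = {"medical": [1, 0, 0, 0], "ethical": [0, 0, 1, 1], "legal": [0, 1, 0, 1]}

for domain in DOMAINS:
    print(domain, round(macro_f1(toy_gold[domain], toy_pred[domain]), 3))
print("joint accuracy:", joint_accuracy(toy_gold, toy_pred))
```

On the toy data, two of the four items are fully correct, so joint accuracy is lower than any single-domain accuracy. This shows how a system can score well on macro F1-score per domain while its joint accuracy stays modest.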

Main Results:

The MA-MPA framework with RAG achieved a macro F1-score of 0.800 and 60.3% joint accuracy, outperforming existing systems in ethical and legal domains.
The MA approach significantly improved performance, with gains from MA1 to MA2 ranging from +0.176 to +0.214.
Integrating MPA with MA and external knowledge yielded gains, but improvements in joint accuracy were not consistent, and MA alone surpassed RAG in joint accuracy.
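
Assuming the MA1-to-MA2 gains above are paired macro F1-score differences, both stages are scored on the same items and the difference is taken per comparison. This summary does not state how the confidence intervals were obtained. One common option, shown here purely as an assumption, is a paired percentile bootstrap over items. All names and the toy labels are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

LABELS = (0, 1)  # assumption: binary labels, 1 = risk, 0 = no risk


def macro_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over the fixed label set."""
    scores: List[float] = []
    for cls in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_diff(
    gold: Sequence[int],
    pred_a: Sequence[int],
    pred_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, CI lower, CI upper) for macro F1 of B minus A."""
    rng = random.Random(seed)
    n = len(gold)
    observed = macro_f1(gold, pred_b) - macro_f1(gold, pred_a)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample item indices once and apply them to both systems (pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(g, b) - macro_f1(g, a))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return observed, diffs[lo_idx], diffs[hi_idx]


# Toy labels for one domain (illustrative only, not study data).
toy_gold = [1, 0, 1, 0, 1, 0, 1, 0]
toy_stage_a = [1, 0, 0, 0, 1, 0, 1, 1]  # stands in for an earlier assessment stage
toy_stage_b = [1, 0, 1, 0, 1, 0, 1, 0]  # stands in for a later assessment stage

observed, lower, upper = paired_bootstrap_diff(toy_gold, toy_stage_a, toy_stage_b)
print(f"difference = {observed:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling the same indices for both systems keeps the comparison paired, so item difficulty shared by the two stages cancels out of the difference. With only eight toy items the interval is wide. On a corpus the size of the one used in the study (N=226) it would be narrower.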

Conclusions:

The MA-MPA approach demonstrates potential for enhancing chatbot risk estimation, especially when combined with external knowledge.
The framework shows promise for improving balanced overall performance, though the medical domain remains a challenge.
Further improvements in contextually grounded risk estimation may be achieved through more specialized LLMs.