Vision-language models for automated video analysis and documentation in laparoscopic surgery: a proof-of-concept study
Summary
This summary is machine-generated. Vision-Language Models (VLMs) show promise for surgical documentation. GPT-4o and Gemini-1.5-pro reliably detected surgical tools and classified procedures, though grading pathology requires further development.
Area Of Science
- Artificial Intelligence
- Medical Informatics
- Computer Vision
Background
- The healthcare industry faces a critical shortage of medical personnel, necessitating automation in clinical documentation.
- Large Vision-Language Models (VLMs) present a significant opportunity to streamline surgical documentation and enhance intraoperative analysis.
Purpose Of The Study
- To comparatively evaluate the performance of two general-purpose VLMs, GPT-4o and Gemini-1.5-pro, in surgical video analysis.
- To assess the efficacy of in-context learning (ICL) in improving VLM performance for surgical tasks.
Main Methods
- An observational study compared GPT-4o and Gemini-1.5-pro using 30 surgical videos (15 cholecystectomy, 15 appendectomy).
- Tasks included object detection, surgery classification, appendicitis grading, and surgical report generation.
- Performance was evaluated using descriptive accuracy metrics, with and without in-context learning.
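The study does not publish its prompting code, but the in-context learning setup it describes (labeled example frames preceding an unlabeled query frame) can be sketched as a chat-style prompt builder. The message schema, question wording, and frame filenames below are illustrative assumptions, not the authors' implementation:

```python
from typing import Dict, List


def build_icl_prompt(
    task_instruction: str,
    examples: List[Dict[str, str]],  # each: {"frame": image reference, "label": ground truth}
    query_frame: str,
) -> List[Dict]:
    """Assemble a chat-style message list for in-context learning:
    a system instruction, labeled demonstration frames, then the
    unlabeled query frame the model must answer for."""
    question = "Which surgical tools are visible in this frame?"
    messages: List[Dict] = [{"role": "system", "content": task_instruction}]
    for ex in examples:
        # Each demonstration pairs a frame with its known answer.
        messages.append({"role": "user", "content": [
            {"type": "image", "image": ex["frame"]},
            {"type": "text", "text": question},
        ]})
        messages.append({"role": "assistant", "content": ex["label"]})
    # The query frame is sent without a label; the model completes it.
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query_frame},
        {"type": "text", "text": question},
    ]})
    return messages


# Hypothetical usage: two labeled demonstration frames, one query frame.
msgs = build_icl_prompt(
    "You are a surgical video analyst. Identify tools in each frame.",
    [
        {"frame": "frame_001.jpg", "label": "vessel clip, grasper"},
        {"frame": "frame_002.jpg", "label": "gauze"},
    ],
    "frame_query.jpg",
)
print(len(msgs))  # → 6: system + 2 x (user, assistant) + query user
```

The same message list could then be passed to either model's chat endpoint; the zero-shot condition corresponds to calling the builder with an empty examples list.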
Main Results
- Both VLMs achieved 100% accuracy in identifying vessel clips. GPT-4o excelled in retrieval bag and gauze detection, while Gemini-1.5-pro demonstrated superior bleeding detection.
- Gemini-1.5-pro was more accurate in classifying cholecystectomies, whereas both models showed moderate performance in appendectomy classification.
- Appendicitis grading yielded limited accuracy for both models. In-context learning improved tool recognition but had inconsistent effects on other tasks.
Conclusions
- Domain-agnostic VLMs demonstrate reliable performance in surgical object detection and procedure classification.
- Both models remain limited in grading pathology and in describing procedural steps in detail; in-context learning shows potential to improve these tasks.
- Future development of domain-specific VLMs could further improve operating room efficiency and intraoperative support.

