EndoViT: pretraining vision transformers on a large collection of endoscopic images
Summary
This summary is machine-generated. Domain-specific self-supervised pretraining using EndoViT significantly improves automated endoscopy video analysis. This approach enhances performance on complex surgical tasks such as action triplet recognition and semantic segmentation, outperforming general-purpose pretraining.
Area Of Science
- Medical Computer Vision
- Surgical Data Science
Background
- Automated endoscopy video analysis is crucial for surgical assistance but hindered by complex scenes and limited annotated data.
- Large-scale pretraining, successful in NLP and computer vision, offers a solution by reducing reliance on annotated medical data.
Purpose Of The Study
- To investigate the effectiveness of endoscopy domain-specific self-supervised pretraining for Vision Transformers (ViTs).
- To develop and evaluate EndoViT, a ViT pretrained on a large endoscopic image corpus, for surgical downstream tasks.
Main Methods
- Collected Endo700k, the largest public corpus of over 700,000 endoscopic images from nine Minimally Invasive Surgery (MIS) datasets.
- Introduced EndoViT, a Vision Transformer pretrained on the Endo700k corpus.
- Evaluated EndoViT on diverse surgical downstream tasks, including action triplet recognition and semantic segmentation.
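Self-supervised pretraining of Vision Transformers is commonly implemented as masked autoencoding, in which most image patches are hidden and the model learns to reconstruct them. A minimal sketch of the patching and random-masking step is shown below; the 16-pixel patch size and 75% mask ratio are illustrative defaults from the MAE literature, not values confirmed by this summary.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened (N, patch*patch*C) patches."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    rows, cols = H // patch, W // patch
    return (img.reshape(rows, patch, cols, patch, C)
               .transpose(0, 2, 1, 3, 4)          # group pixels by patch
               .reshape(rows * cols, patch * patch * C))

def random_mask(n_patches, mask_ratio=0.75, seed=0):
    """Return indices of visible and masked patches (MAE-style masking)."""
    rng = np.random.default_rng(seed)
    n_keep = int(n_patches * (1 - mask_ratio))
    perm = rng.permutation(n_patches)
    return perm[:n_keep], perm[n_keep:]

# A 224x224 RGB frame yields 14x14 = 196 patches; 49 stay visible at 75% masking.
img = np.random.rand(224, 224, 3)
patches = patchify(img)
visible, masked = random_mask(len(patches))
```

During pretraining, only the visible patches are encoded; a lightweight decoder then reconstructs the masked ones, giving a supervision signal that requires no annotations.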
Main Results
- EndoViT demonstrated notable advantages in complex downstream tasks compared to ImageNet pretraining.
- Achieved superior performance in action triplet recognition.
- Surpassed state-of-the-art (SOTA) performance in semantic segmentation tasks.
Conclusions
- Domain-specific large-scale self-supervised pretraining is highly beneficial for Vision Transformers in medical computer vision.
- EndoViT effectively addresses challenges in automated endoscopy video analysis.
- Code and pretrained models are released to foster further research.

