Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network With Token Migration

Summary

This summary is machine-generated.

We introduce Fast-iTPN, a vision transformer architecture designed to narrow the transfer gap between upstream representation learning and downstream tasks. The efficient variant accelerates inference by up to 70% with negligible performance loss.

Area Of Science

  • Computer Vision
  • Deep Learning
  • Artificial Intelligence

Background

  • Vision Transformer (ViT) models have shown great promise but often face challenges in bridging the gap between representation learning and downstream tasks.
  • Existing methods may incur significant computational overhead and slow inference speeds.

Purpose Of The Study

  • To propose an integrally pre-trained transformer pyramid network (iTPN) that jointly optimizes the network backbone and neck for minimal transfer gap.
  • To introduce Fast-iTPN, an efficient variant that reduces computational memory and accelerates inference.

Main Methods

  • iTPN pre-trains a feature pyramid jointly with the ViT backbone (the first pre-trained feature pyramid on ViT) and applies multi-stage supervision through masked feature modeling (MFM).
  • Fast-iTPN adds token migration and token gathering to reduce computational cost and memory overhead.
  • The model was evaluated on ImageNet-1K, COCO object detection, and ADE20K semantic segmentation benchmarks.
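The masked feature modeling (MFM) objective mentioned above can be sketched as a reconstruction loss computed only at masked token positions. This is a minimal illustrative sketch: the feature shapes, masking ratio, and function names are assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def masked_feature_modeling_loss(pred, target, mask):
    """Mean-squared error restricted to masked token positions.

    pred, target: (num_tokens, dim) arrays -- predicted features and
    the target features to be reconstructed.
    mask: (num_tokens,) boolean array, True where a token was masked.
    Shapes and names here are illustrative assumptions.
    """
    diff = (pred - target) ** 2        # per-element squared error
    per_token = diff.mean(axis=1)      # average over the feature dimension
    return per_token[mask].mean()      # average over masked tokens only

# Toy example: 8 tokens, 4-dim features, half of them masked.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 4))
pred = target.copy()                   # a perfect prediction
mask = np.array([True, False] * 4)
print(masked_feature_modeling_loss(pred, target, mask))  # 0.0
```

In multi-stage supervision, a loss of this form would be applied at several pyramid levels rather than only at the final output.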
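Token gathering, in general, reduces cost by keeping only the most informative tokens while folding the information from dropped tokens into the kept ones. The sketch below illustrates that general idea; the importance scoring and the averaging merge are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def gather_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens; migrate each dropped
    token into its most similar kept token by feature averaging.

    tokens: (n, d) token features; scores: (n,) importance scores.
    Both the scoring rule and the merge are illustrative assumptions.
    """
    order = np.argsort(scores)[::-1]               # highest score first
    kept_idx, dropped_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    for i in dropped_idx:
        # cosine similarity between the dropped token and each kept token
        sims = kept @ tokens[i] / (
            np.linalg.norm(kept, axis=1) * np.linalg.norm(tokens[i]) + 1e-8)
        j = int(np.argmax(sims))
        kept[j] = (kept[j] + tokens[i]) / 2        # merge rather than discard
    return kept

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
scores = rng.random(16)
out = gather_tokens(tokens, scores, keep=4)
print(out.shape)  # (4, 8)
```

Shrinking the token set this way reduces the quadratic attention cost in later layers, which is where the reported inference speedup would come from.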

Main Results

  • Fast-iTPN achieved 88.75%/89.5% top-1 accuracy on ImageNet-1K with its base/large variants.
  • The base/large variants reached 58.4%/58.8% box AP on COCO object detection and 57.5%/58.7% mIoU on ADE20K semantic segmentation.
  • Inference speed was accelerated by up to 70% with negligible performance degradation.

Conclusions

  • Fast-iTPN offers an efficient and effective backbone for various downstream computer vision tasks.
  • The proposed methods significantly improve inference speed without compromising accuracy.
  • This work presents a powerful and practical solution for real-world vision applications.