
Updated: Apr 4, 2026

WeakTr: Exploring Plain Vision Transformer for Weakly-Supervised Semantic Segmentation.

Lianghui Zhu, Yingyue Li, Jiemin Fang

    IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society
    |April 2, 2026
    PubMed
    Summary
    This summary is machine-generated.

    This study uses weakly-supervised semantic segmentation (WSSS) and class activation maps (CAMs) to analyze how plain Vision Transformers (ViTs) work. It proposes adaptively fusing self-attention maps to produce CAMs that cover objects more completely, and it reports state-of-the-art WSSS performance.

    Area of Science:

    • Computer Vision
    • Deep Learning
    • Artificial Intelligence

    Background:

    • Transformers, particularly Vision Transformers (ViTs), excel in computer vision tasks.
    • Understanding ViT mechanisms is crucial for advancing the field.
    • Weakly-supervised semantic segmentation (WSSS) and Class Activation Maps (CAM) are key for analyzing ViTs.

    Purpose of the Study:

    • To analyze the working mechanism of Vision Transformers (ViTs) using WSSS and CAM.
    • To propose a novel method for adaptively fusing self-attention maps from ViTs for improved WSSS and CAM generation.
    • To introduce an efficient and scalable ViT-based gradient clipping decoder for online retraining.

    Main Methods:

    • Utilized a plain ViT pre-trained on ImageNet for analysis.
    • Developed a method to estimate attention head importance and adaptively fuse self-attention maps.
    • Proposed a ViT-based gradient clipping decoder for efficient online retraining.
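
The adaptive fusion step listed above can be pictured as a weighted average over attention heads. The sketch below is only an illustration, not the authors' implementation: the name `fuse_attention_maps`, the softmax normalization of the head scores, and the toy inputs are all assumptions, since this summary does not say how head importance is estimated or applied.

```python
import math
from typing import List

Map = List[List[float]]  # one 2D attention map (rows x cols)


def softmax(scores: List[float]) -> List[float]:
    """Turn raw head-importance scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention_maps(maps: List[Map], importance: List[float]) -> Map:
    """Weighted average of per-head maps (hypothetical fusion rule)."""
    if len(maps) != len(importance):
        raise ValueError("need one importance score per attention map")
    weights = softmax(importance)
    rows, cols = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for weight, attn in zip(weights, maps):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += weight * attn[r][c]
    return fused


# Toy example: two heads over a 2x2 patch grid, given equal importance.
head_a = [[1.0, 0.0], [0.0, 0.0]]
head_b = [[0.0, 0.0], [0.0, 1.0]]
fused = fuse_attention_maps([head_a, head_b], importance=[0.0, 0.0])
```

With equal scores, both heads contribute half of their response, so the fused map highlights both patches. Raising one head's score shifts the fused map toward that head.
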

    Main Results:

    • Multi-layer, multi-head self-attention maps provide rich information for WSSS and CAM.
    • The proposed adaptive fusion method generates higher-quality CAMs with more complete objects.
    • The WeakTr method achieved superior WSSS performance: 78.5% mIoU on PASCAL VOC 2012 and 51.1% mIoU on COCO 2014.
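
The benchmark figures above are given in mean Intersection over Union (mIoU). As a rough guide to the metric, the sketch below computes it for a single flattened label list. The function name `mean_iou` and the toy labels are illustrative assumptions, and benchmark evaluations normally accumulate the counts over a whole validation set rather than a single image.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious: List[float] = []
    for cls in range(num_classes):
        # Intersection: pixels labeled cls in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        # Union: pixels labeled cls in either one.
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```
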

    Conclusions:

    • Vision Transformers' self-attention maps contain valuable information for semantic segmentation and object localization.
    • The proposed adaptive fusion and gradient clipping decoder enhance ViT performance in WSSS.
    • The WeakTr method demonstrates significant improvements on standard WSSS benchmarks.
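
This summary does not describe how the gradient clipping decoder decides what to clip. One generic way to keep noisy pseudo-labels from dominating retraining is to drop loss terms that exceed a threshold, so those pixels contribute no gradient. The sketch below shows that generic idea as a simplified stand-in. The mean-loss threshold and the name `clip_noisy_losses` are assumptions, not the authors' rule.

```python
from typing import List


def clip_noisy_losses(pixel_losses: List[float]) -> List[float]:
    """Zero out loss terms above the mean loss (assumed threshold rule).

    A zeroed term contributes nothing to the total loss, and therefore
    nothing to the gradient computed from it.
    """
    if not pixel_losses:
        return []
    threshold = sum(pixel_losses) / len(pixel_losses)
    return [loss if loss <= threshold else 0.0 for loss in pixel_losses]


# Toy example: one pixel with an unusually high loss, as a wrong
# pseudo-label might produce.
losses = [0.2, 0.1, 3.0, 0.3]
kept = clip_noisy_losses(losses)
```

Here the mean loss is 0.9, so only the 3.0 term is suppressed and the other three pass through unchanged.
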