Related Experiment Video

Updated: May 28, 2026

Combining Eye-tracking Data with an Analysis of Video Content from Free-viewing a Video of a Walk in an Urban Park Environment
08:25

Published on: May 7, 2019

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking.

Zhengbo Zhang, Zhigang Tu, Junsong Yuan

    IEEE Transactions on Pattern Analysis and Machine Intelligence | May 26, 2026
    Summary
    This summary is machine-generated.

    Related Articles

    Articles linked to this work by shared authors, journal, and citation graph.

    Same author:

    • LoRASculpt: Harmonious Low-Rank Adaptation for Multimodal Large Language Models.
      IEEE transactions on pattern analysis and machine intelligence·2026
    • Towards clinical-level interpretation of dental panoramic radiography using an instance-guided vision-language model.
      Nature biomedical engineering·2026
    • Systemic immune-inflammation index predicts post-thrombectomy outcomes and reveals a mediating role in the association between neurocardiac stress and prognosis: a multicenter study.
      Frontiers in neurology·2026
    • Holistic Invariant Retracing for Distortion-Resilient Multi-Modal Learning in Spatial Transcriptomics.
      IEEE transactions on image processing : a publication of the IEEE Signal Processing Society·2026
    • Differentiable Clustering Graph Convolutional Network for Hyperspectral Unmixing: Methodology and Benchmark.
      IEEE transactions on neural networks and learning systems·2026
    • MUP-SAM: Multi-scale vision mamba UNet prompt generation for SAM in multi-organ medical image segmentation.
      Neural networks : the official journal of the International Neural Network Society·2026

    Same journal:

    • Relation DETR+: Exploring Explicit Position Relation Prior for Dense Prediction.
      IEEE transactions on pattern analysis and machine intelligence·2026
    • RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning.
      IEEE transactions on pattern analysis and machine intelligence·2026
    • CAFE: Cross-View Adaptive Fusion and Cluster Center Enhancement for Robust Multi-View Clustering.
      IEEE transactions on pattern analysis and machine intelligence·2026
    • DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving.
      IEEE transactions on pattern analysis and machine intelligence·2026
    • Ethics-Aware Safe Reinforcement Learning for Rare-Event Risk Control in Interactive Urban Driving.
      IEEE transactions on pattern analysis and machine intelligence·2026
    • Learning Shape Anchors for Holistic Indoor Scene Understanding.
      IEEE transactions on pattern analysis and machine intelligence·2026

    This study introduces Diff-Tracking, a new unsupervised visual object tracking method that leverages pretrained text-to-image diffusion models to follow targets in videos accurately without ground-truth annotations.

    Area of Science:

    • Computer Vision
    • Artificial Intelligence
    • Machine Learning

    Background:

    • Unsupervised visual object tracking is challenging, especially when following a target requires detailed semantic and structural understanding.
    • Existing methods often fail in scenarios demanding fine-grained visual analysis.

    Purpose of the Study:

    • To develop an unsupervised visual object tracking method that utilizes the semantic understanding capabilities of text-to-image diffusion models.
    • To address the limitations of current trackers in handling complex visual information.

    Main Methods:

    • Reinterpreting text-to-image diffusion models as a bridge between text and image modalities using cross-attention mechanisms.
    • Developing an initial prompt learner to identify the target object in the first frame.
    • Implementing an online prompt updater that refines the prompt using motion information for consistent tracking.
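    The three methodological steps above can be illustrated with a toy, pure-Python sketch. It is not the authors' Diff-Tracking code and it does not run a diffusion model: the hand-made patch and token vectors, the box-averaging stand-in for the prompt learner (`init_prompt`), the 0.5 peak threshold in `box_from_map`, and the motion-gated moving-average rule in `update_prompt` are all illustrative assumptions. What it does show is the generic mechanism: reading a spatial map off scaled dot-product cross-attention for one text token, turning that map into a box, and adjusting the prompt embedding between frames.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in patch units, inclusive


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_prompt(patches: Sequence[Vector], width: int, box: Box) -> List[float]:
    """Assumed stand-in for the initial prompt learner: instead of learning a
    prompt, average the first-frame patch features inside the given box."""
    x0, y0, x1, y1 = box
    inside = [p for i, p in enumerate(patches)
              if x0 <= i % width <= x1 and y0 <= i // width <= y1]
    return [sum(col) / len(inside) for col in zip(*inside)]


def cross_attention_map(patches: Sequence[Vector], tokens: Sequence[Vector],
                        target: int) -> List[float]:
    """Single-head scaled dot-product cross-attention: image patches are the
    queries, text tokens are the keys. Returns, for every patch, the softmax
    weight (normalized over tokens) that the patch places on `target`."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    weights = []
    for q in patches:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores)[target])
    return weights


def box_from_map(attn: Sequence[float], width: int, ratio: float = 0.5) -> Box:
    """Tightest box around patches whose weight is at least ratio * peak."""
    peak = max(attn)
    cells = [(i % width, i // width) for i, a in enumerate(attn) if a >= ratio * peak]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return min(xs), min(ys), max(xs), max(ys)


def update_prompt(prompt: Vector, observed: Vector, motion_error: float,
                  base_rate: float = 0.2, sigma: float = 2.0) -> List[float]:
    """Hypothetical online update: move the prompt toward the embedding seen in
    the new frame, slowing down when the predicted box disagrees with the
    motion extrapolated from earlier frames (large motion_error)."""
    rate = base_rate * math.exp(-motion_error / sigma)
    return [(1.0 - rate) * p + rate * o for p, o in zip(prompt, observed)]


if __name__ == "__main__":
    # 3x3 grid of 2-D patch features; the target occupies patches 4 and 5.
    frame = [[4.0, 0.0] for _ in range(9)]
    frame[4] = [0.0, 4.0]
    frame[5] = [0.0, 4.0]
    prompt = init_prompt(frame, width=3, box=(1, 1, 2, 1))
    tokens = [[1.0, 0.0], prompt]  # token 0: a hypothetical "background" token
    attn = cross_attention_map(frame, tokens, target=1)
    print(box_from_map(attn, width=3))  # (1, 1, 2, 1)
    print(update_prompt(prompt, [0.5, 3.5], motion_error=0.0))
```

    In a real text-to-image diffusion backbone the queries would typically be positions in the denoising network's feature maps and the keys would come from the text encoder, spread across several heads and layers. The single-head, single-layer version here keeps only the part needed to see why one token's attention weights can act as a localization map.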

    More Related Videos

    A Methodology for Capturing Joint Visual Attention Using Mobile Eye-Trackers
    12:39

    Published on: January 18, 2020

    Main Results:

    • The proposed Diff-Tracking method demonstrates strong performance on six challenging tracking datasets.
    • It achieves competitive results compared to existing state-of-the-art unsupervised trackers.
    • The approach effectively utilizes semantic knowledge from diffusion models for robust tracking.

    Conclusions:

    • Diff-Tracking offers a new perspective on unsupervised object tracking by exploiting pretrained text-to-image diffusion models.
    • The method shows potential for improving the accuracy and robustness of visual object tracking in complex scenarios.
    • This work shows that diffusion models can be adapted beyond image generation to downstream tasks such as tracking.