Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking.

Zhengbo Zhang, Zhigang Tu, Junsong Yuan

IEEE Transactions on Pattern Analysis and Machine Intelligence

May 26, 2026

Summary

This summary is machine-generated.

This study introduces Diff-Tracking, a novel unsupervised visual object tracking method. It leverages text-to-image diffusion models to accurately follow targets in videos without needing ground-truth annotations.

Area of Science:

Computer Vision
Artificial Intelligence
Machine Learning

Background:

Unsupervised visual object tracking is difficult, especially when following the target requires detailed semantic and structural understanding.
Existing methods often fail in scenarios demanding fine-grained visual analysis.

Purpose of the Study:

To develop an unsupervised visual object tracking method that utilizes the semantic understanding capabilities of text-to-image diffusion models.
To address the limitations of current trackers in handling complex visual information.

Main Methods:

Reinterpreting text-to-image diffusion models as a bridge between text and image modalities using cross-attention mechanisms.
Developing an initial prompt learner to identify the target object in the first frame.
Implementing an online prompt updater that refines the prompt using motion information for consistent tracking.
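
The cross-attention idea in the first method above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the two-token prompt (one "target" token, one "background" token), the toy 4x4 feature grid, and the 0.5 threshold are all assumptions chosen for illustration. Each image location acts as a query, each prompt token as a key, and the softmax weight assigned to the target token forms a heat map from which a bounding box can be read.

```python
import math

def cross_attention_map(image_feats, token_embs, target_idx=0):
    """Per-location attention weight for one prompt token (hypothetical helper).

    image_feats: H x W grid of d-dimensional feature vectors (queries).
    token_embs:  list of d-dimensional prompt token embeddings (keys).
    """
    d = len(token_embs[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product attention
    heat = []
    for row in image_feats:
        heat_row = []
        for q in row:
            logits = [scale * sum(a * b for a, b in zip(q, k)) for k in token_embs]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(v - m) for v in logits]
            heat_row.append(exps[target_idx] / sum(exps))  # softmax over tokens
        heat.append(heat_row)
    return heat

def box_from_heatmap(heat, threshold=0.5):
    """Tight (x_min, y_min, x_max, y_max) box around above-threshold locations."""
    pts = [(x, y) for y, row in enumerate(heat)
           for x, v in enumerate(row) if v > threshold]
    if not pts:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))

# Toy data (assumed): a 4x4 feature grid whose central 2x2 block resembles the target.
target_like, background_like = [1.0, 0.0], [0.0, 1.0]
image_feats = [
    [target_like if 1 <= x <= 2 and 1 <= y <= 2 else background_like
     for x in range(4)]
    for y in range(4)
]
# Assumed prompt: index 0 is the learned target token, index 1 a background token.
token_embs = [[4.0, 0.0], [0.0, 4.0]]

heat = cross_attention_map(image_feats, token_embs, target_idx=0)
box = box_from_heatmap(heat)
print(box)  # (1, 1, 2, 2)
```

In this toy setting the prompt tokens are fixed by hand. In the method summarized above, the prompt is learned from the first frame and then refined online with motion information, neither of which this sketch models.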

Main Results:

The proposed Diff-Tracking method demonstrates strong performance on six challenging tracking datasets.
It achieves competitive results compared to existing state-of-the-art unsupervised trackers.
The approach effectively utilizes semantic knowledge from diffusion models for robust tracking.

Conclusions:

Diff-Tracking offers a new perspective on unsupervised object tracking by harnessing the power of pretrained text-to-image diffusion models.
The method shows significant potential for improving the accuracy and robustness of visual object tracking in complex scenarios.
This work highlights the adaptability of diffusion models beyond image generation for downstream tasks.