Causal Prompts for Open-Vocabulary Video Instance Segmentation.

Rongkun Zheng, Lu Qi, Xi Chen

    IEEE Transactions on Pattern Analysis and Machine Intelligence | March 3, 2026
    Summary
    This summary is machine-generated.

    We introduce CPOVIS, a framework that enhances open-vocabulary video instance segmentation with causal prompts propagated from past frames. This improves the detection, segmentation, and tracking of objects from novel categories in videos.

    Area of Science:

    • Computer Vision
    • Artificial Intelligence

    Background:

    • Open-vocabulary video instance segmentation aims to detect, segment, and track objects, including unknown categories.
    • Current methods often fail to utilize temporal information from previous frames, hindering generalization in open-world scenarios.

    Purpose of the Study:

    • To propose CPOVIS, a novel framework that enhances temporal reasoning and semantic consistency for open-vocabulary video instance segmentation.
    • To leverage causal prompts dynamically propagated from historical frames to improve performance on unseen object categories.
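
The idea of a causal prompt, one built only from frames that have already been seen, can be illustrated with a toy sketch. This is not the CPOVIS implementation: the function name `propagate_prompts`, the `momentum` parameter, and the use of an exponential moving average as the memory are all assumptions made for illustration, and each frame is reduced to a single hypothetical feature vector.

```python
from typing import List, Optional


def propagate_prompts(
    frame_features: List[List[float]], momentum: float = 0.5
) -> List[Optional[List[float]]]:
    """Return, for each frame t, a prompt built from frames 0..t-1 only.

    Hypothetical sketch: the memory is an exponential moving average of
    past frame features. Frame 0 has no history, so its prompt is None.
    """
    prompts: List[Optional[List[float]]] = []
    memory: Optional[List[float]] = None
    for feat in frame_features:
        # The prompt for the current frame is read BEFORE the memory is
        # updated, so the current and future frames cannot influence it.
        prompts.append(None if memory is None else list(memory))
        if memory is None:
            memory = list(feat)
        else:
            memory = [
                momentum * m + (1.0 - momentum) * f
                for m, f in zip(memory, feat)
            ]
    return prompts


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for t, prompt in enumerate(propagate_prompts(video)):
        print(t, prompt)
```

Because the prompt is read before the memory update, changing a later frame never alters the prompt of an earlier one, which is the causal property the summary refers to.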

    Main Methods:

    • CPOVIS utilizes a Mask2Former architecture with a CLIP backbone, incorporating PromptCLIP for cross-modal alignment.
    • Key innovations include a Visual Prompt Injector for spatial-temporal coherence and a Taxonomy Prompt Infuser for semantic consistency.
    • A contrastive learning strategy and an adaptation of the Segment Anything Model 2 (SAM2) are employed to boost segmentation and tracking capabilities.
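
As background for the cross-modal alignment mentioned in the methods, CLIP-style models typically assign an open-vocabulary label by comparing an instance embedding with text embeddings of candidate category names. The sketch below shows that general technique with toy vectors. It is not the paper's PromptCLIP module: `classify_open_vocab`, the `temperature` default, and the hand-written embeddings are assumptions, and a real system would obtain the vectors from image and text encoders.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_open_vocab(
    instance_embedding: List[float],
    text_embeddings: Dict[str, List[float]],
    temperature: float = 0.07,  # assumed value, for illustration only
) -> Tuple[str, Dict[str, float]]:
    """Pick the category whose text embedding best matches the instance.

    Returns the best category name and a softmax distribution over all
    candidate names.
    """
    names = list(text_embeddings)
    logits = [
        cosine(instance_embedding, text_embeddings[n]) / temperature
        for n in names
    ]
    # Subtract the maximum logit for a numerically stable softmax.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    best = max(probs, key=lambda n: probs[n])
    return best, probs


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    vocabulary = {"dog": [1.0, 0.0], "skateboard": [0.0, 1.0]}
    label, scores = classify_open_vocab([0.9, 0.1], vocabulary)
    print(label, scores)
```

In this formulation, extending the vocabulary only requires adding another name and its text embedding to the dictionary, which is why such classifiers can score categories that had no labeled training examples.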

    Main Results:

    • CPOVIS achieves state-of-the-art performance on seven challenging open- and closed-vocabulary video segmentation benchmarks.
    • The framework significantly outperforms existing methods in detecting, segmenting, and tracking objects, especially novel categories.
    • Causal prompt propagation is demonstrated to be crucial for advancing video understanding in open-world settings.

    Conclusions:

    • CPOVIS effectively addresses the limitations of existing methods by incorporating causal temporal cues.
    • The proposed framework demonstrates robust open-world generalization capabilities for video instance segmentation.
    • This work highlights the importance of causal prompt propagation for improving video analysis and object recognition in dynamic, open-world environments.