Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval.

Leqi Shen, Tianxiang Hao, Tao He

    IEEE Transactions on Neural Networks and Learning Systems | September 9, 2025
    Summary
    This summary is machine-generated.

    This study introduces Temporal Modeling with Frozen Vision-Language Foundation Models (TFVL), a parameter-efficient method for text-video retrieval. TFVL reuses frozen pretrained encoders for temporal modeling instead of adding heavy trainable modules, and it is reported to outperform current methods while training far fewer parameters.

    Area of Science:

    • Computer Science
    • Artificial Intelligence
    • Machine Learning

    Background:

    • Temporal modeling is crucial for adapting text-image foundation models to text-video retrieval.
    • Existing methods often employ inefficient, heavy trainable modules like transformers or BiLSTMs.
    • There is a need for efficient temporal modeling techniques that leverage pretrained foundation models.

    Purpose of the Study:

    • To propose an efficient temporal modeling method for text-video retrieval using frozen vision-language foundation models.
    • To reduce the number of trainable parameters compared to existing approaches.
    • To enhance the performance of text-video retrieval systems.

    Main Methods:

    • Temporal Modeling with Frozen Vision-Language Foundation Models (TFVL) utilizes fixed encoders to model temporal dynamics.

    • Text encoder temporal modeling (TextTemp) interprets frame representations as 'visual words' using a frozen text encoder.
    • Image encoder temporal modeling (ImageTemp) treats frame tokens as a unified visual entity with a frozen image encoder.
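
The two ideas above can be illustrated with a toy sketch. This is not the authors' implementation: `FrozenEncoder` is a hypothetical stand-in for a pretrained text or image encoder (one attention-style mixing step with fixed random weights and fixed positional embeddings), `Projection` is an assumed small trainable map, and the sizes `D`, `T`, and `N` are arbitrary. The summary does not say how frame features are adapted before they enter the frozen encoders, so that step is a guess.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, N = 16, 4, 3  # embedding width, frames per video, tokens per frame (all hypothetical)


class FrozenEncoder:
    """Toy stand-in for a pretrained encoder; its weights are never updated."""

    def __init__(self, dim, max_len, rng):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.pos = 0.1 * rng.standard_normal((max_len, dim))  # makes token order matter

    def __call__(self, tokens):
        # tokens: (sequence_length, dim) -> pooled vector of shape (dim,)
        x = tokens + self.pos[: len(tokens)]
        scores = x @ x.T / np.sqrt(x.shape[1])
        scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        return (attn @ x @ self.w).mean(axis=0)


class Projection:
    """Assumed lightweight trainable map; the only trainable weights in this sketch."""

    def __init__(self, dim, rng):
        self.w = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))

    def __call__(self, x):
        return x @ self.w


def text_temp(frame_embs, proj, text_encoder):
    # TextTemp idea: each frame embedding is fed to the frozen text encoder
    # as if it were a word, so the frame sequence is read like a sentence.
    return text_encoder(proj(frame_embs))


def image_temp(frame_tokens, image_encoder):
    # ImageTemp idea: tokens from all frames are flattened into one sequence,
    # so the frozen image encoder sees the clip as a single visual entity.
    t, n, d = frame_tokens.shape
    return image_encoder(frame_tokens.reshape(t * n, d))


text_encoder = FrozenEncoder(D, max_len=64, rng=rng)
image_encoder = FrozenEncoder(D, max_len=64, rng=rng)
proj = Projection(D, rng)

frame_embs = rng.standard_normal((T, D))       # one embedding per frame
frame_tokens = rng.standard_normal((T, N, D))  # several tokens per frame

video_from_text_temp = text_temp(frame_embs, proj, text_encoder)
video_from_image_temp = image_temp(frame_tokens, image_encoder)
print(video_from_text_temp.shape, video_from_image_temp.shape)  # (16,) (16,)
```

In this sketch only `proj.w` would receive gradient updates during training. The encoder weights stay fixed, which is the property the summary credits for the small trainable-parameter count.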

    Main Results:

    • TFVL achieves significant performance gains on benchmark datasets (MSR-VTT, DiDeMo, ActivityNet, LSMDC).
    • On MSR-VTT, TFVL showed a 3.25% gain in R@1 with only 0.35% of the parameters compared to full fine-tuning.
    • The method demonstrates superior performance over state-of-the-art methods with substantially fewer trainable parameters.
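
For readers unfamiliar with the R@1 metric cited above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval. It assumes one ground-truth video per text query, with query `i` paired to video `i`. The similarity values are invented for illustration and are not results from the paper, and the helper name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(sim, k):
    """Percentage of queries whose paired video ranks in the top k.

    sim[i, j] is the similarity between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    correct = np.diag(sim)[:, None]
    # Rank of the correct video = number of videos scored strictly higher.
    ranks = (sim > correct).sum(axis=1)
    return 100.0 * float((ranks < k).mean())


# Invented 4-query, 4-video similarity matrix (illustration only).
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.1, 0.2, 0.7, 0.3],
    [0.5, 0.1, 0.2, 0.6],
])

print(recall_at_k(sim, 1))  # 75.0
print(recall_at_k(sim, 2))  # 100.0
```

Query 1 is the only miss at K=1, because video 2 scores 0.8 against its paired video's 0.4. The summary does not state whether the 3.25% figure is an absolute difference in this percentage or a relative one.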

    Conclusions:

    • TFVL offers an effective and parameter-efficient approach for temporal modeling in text-video retrieval.
    • Leveraging frozen foundation models is a viable strategy to avoid heavy trainable modules.
    • The proposed method sets a new standard for efficient and high-performing text-video retrieval systems.