Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval.

Leqi Shen, Tianxiang Hao, Tao He

IEEE Transactions on Neural Networks and Learning Systems

September 9, 2025

Summary

This summary is machine-generated.

This study introduces Temporal Modeling with Frozen Vision-Language Foundation Models (TFVL) for efficient text-video retrieval. By reusing frozen encoders for temporal modeling, TFVL outperforms current methods while training far fewer parameters.

Area of Science:

Computer Science
Artificial Intelligence
Machine Learning

Background:

Temporal modeling is crucial for adapting text-image foundation models to text-video retrieval.
Existing methods often employ inefficient, heavy trainable modules like transformers or BiLSTMs.
There is a need for efficient temporal modeling techniques that leverage pretrained foundation models.

Purpose of the Study:

To propose an efficient temporal modeling method for text-video retrieval using frozen vision-language foundation models.
To reduce the number of trainable parameters compared to existing approaches.
To enhance the performance of text-video retrieval systems.

Main Methods:

Temporal Modeling with Frozen Vision-Language Foundation Models (TFVL) utilizes fixed encoders to model temporal dynamics.

Text encoder temporal modeling (TextTemp) interprets frame representations as 'visual words' using a frozen text encoder.

Image encoder temporal modeling (ImageTemp) treats frame tokens as a unified visual entity with a frozen image encoder.
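
The summary does not describe the architecture in detail, so the following is only a toy sketch of the general idea: per-frame features pass through a small trainable adapter and are then processed as a token sequence by an encoder whose weights stay fixed. Every name here (TrainableProjection, FrozenEncoder, encode_video, trainable_fraction), the dimensions, and the position-weighted mixing are assumptions made for illustration. It is not the authors' TextTemp or ImageTemp implementation.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TrainableProjection:
    """Hypothetical lightweight adapter: the only part that would be trained."""

    def __init__(self, in_dim, out_dim, rng):
        self.weights = make_matrix(out_dim, in_dim, rng)
        self.trainable = True

    def __call__(self, frame_feature):
        return matvec(self.weights, frame_feature)


class FrozenEncoder:
    """Toy stand-in for a pretrained encoder whose weights stay fixed."""

    def __init__(self, dim, rng):
        self.weights = make_matrix(dim, dim, rng)
        self.trainable = False

    def __call__(self, tokens):
        # Position-weighted average so that frame order changes the result,
        # followed by a fixed linear map. A real encoder would use attention.
        n = len(tokens)
        norm = n * (n + 1) / 2
        dim = len(tokens[0])
        mixed = [sum((i + 1) / norm * tok[d] for i, tok in enumerate(tokens))
                 for d in range(dim)]
        return matvec(self.weights, mixed)


def count_params(module):
    """Count the scalar weights held by a module."""
    return sum(len(row) for row in module.weights)


def encode_video(frame_features, projection, encoder):
    """Map each frame to a token, then let the frozen encoder model the sequence."""
    tokens = [projection(f) for f in frame_features]
    return encoder(tokens)


def trainable_fraction(modules):
    """Share of all parameters that belong to trainable modules."""
    total = sum(count_params(m) for m in modules)
    trainable = sum(count_params(m) for m in modules if m.trainable)
    return trainable / total


rng = random.Random(0)
projection = TrainableProjection(in_dim=4, out_dim=16, rng=rng)
encoder = FrozenEncoder(dim=16, rng=rng)
frames = [[rng.random() for _ in range(4)] for _ in range(5)]
video_embedding = encode_video(frames, projection, encoder)
print(len(video_embedding), trainable_fraction([projection, encoder]))
```

In this toy setup only the projection would receive gradient updates, which is why its parameter count is reported separately from the frozen encoder's.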

Main Results:

TFVL achieves significant performance gains on benchmark datasets (MSR-VTT, DiDeMo, ActivityNet, LSMDC).
On MSR-VTT, TFVL improves R@1 by 3.25% while using only 0.35% of the parameters required by full fine-tuning.
The method demonstrates superior performance over state-of-the-art methods with substantially fewer trainable parameters.
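
For readers unfamiliar with the metrics above, here is a minimal sketch of how R@K (recall at K) and a trainable-parameter percentage are commonly computed. It assumes a square text-to-video similarity matrix in which query i matches video i. The function names, the example matrix, and the parameter counts are illustrative and are not taken from the paper.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        target_score = scores[query_index]
        # Rank = number of videos scored strictly higher than the true match.
        rank = sum(1 for s in scores if s > target_score)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def trainable_percentage(trainable_params, full_finetune_params):
    """Trainable parameters as a percentage of a full fine-tuning budget."""
    return 100.0 * trainable_params / full_finetune_params


# Illustrative scores for three text queries against three videos.
example_similarity = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.3, 0.2, 0.8],  # query 1: correct video ranked third
    [0.1, 0.4, 0.7],  # query 2: correct video ranked first
]

print(recall_at_k(example_similarity, 1))
print(recall_at_k(example_similarity, 3))
# Hypothetical counts chosen only so the ratio matches the 0.35% figure.
print(trainable_percentage(35, 10_000))
```

Ranking by the number of strictly higher scores treats ties in favor of the correct video, which is one of several conventions in use.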

Conclusions:

TFVL offers an effective and parameter-efficient approach for temporal modeling in text-video retrieval.
Leveraging frozen foundation models is a viable strategy to avoid heavy trainable modules.
The proposed method sets a new standard for efficient and high-performing text-video retrieval systems.