Text to Image for Multi-Label Image Recognition With Joint Prompt-Adapter Learning.

Chun-Mei Feng, Kai Yu, Xinxing Xu

IEEE Transactions on Pattern Analysis and Machine Intelligence

May 26, 2025

Summary

This summary is machine-generated.

T2I-PAL reduces the modality gap in vision-language models by generating images from text, improving multi-label image recognition performance without manual annotation. This method enhances parameter-efficient fine-tuning (PEFT) for models like CLIP.

Area of Science:

Computer Vision
Machine Learning
Artificial Intelligence

Background:

Vision-language models (VLMs) such as CLIP are pre-trained with image-text contrastive learning, which allows text to be used in place of images (text as images, TaI) for parameter-efficient fine-tuning (PEFT).
The modality gap between text and image features is a significant challenge and limits performance when TaI is used.
Multi-label image recognition (MLR) requires robust feature representation to handle multiple object classes within an image.
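The modality gap mentioned above can be made concrete with a small numerical sketch. The summary does not say how the gap is measured, so the definition below (Euclidean distance between the centroids of L2-normalized image and text embeddings) is an assumption borrowed from common practice, and the embeddings are synthetic stand-ins rather than real CLIP outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, guarding against division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def modality_gap(image_emb, text_emb):
    """Assumed gap measure: distance between the two modality centroids.

    image_emb: (N, D) array, text_emb: (M, D) array.
    """
    img_center = l2_normalize(image_emb).mean(axis=0)
    txt_center = l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(img_center - txt_center))

# Synthetic embeddings (hypothetical data): two clusters shifted apart
# along the first dimension to imitate separated modalities.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 3.0
image_emb = rng.normal(size=(100, dim)) + offset
text_emb = rng.normal(size=(100, dim)) - offset

gap_far = modality_gap(image_emb, text_emb)    # separated clusters
gap_same = modality_gap(image_emb, image_emb)  # identical sets

print(f"gap between shifted clusters: {gap_far:.3f}")
print(f"gap between identical sets:   {gap_same:.3f}")
```

Under this assumed measure, a method that narrows the gap would move the two centroids closer together; the sketch only illustrates the quantity and makes no claim about the values reported in the paper.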

Purpose of the Study:

To address the modality gap in VLMs for MLR using only text captions for PEFT.
To introduce T2I-PAL, a novel method that utilizes text-to-image generation to bridge the modality gap.
To enhance MLR performance and reduce the need for extensive manual annotation of training data.

Main Methods:

Leveraging pre-trained text-to-image models to generate diverse, realistic images from text captions, reducing the text-image modality gap.
Incorporating a class-wise heatmap and learnable prototypes to aggregate local similarities for robust visual feature representation.
Combining prompt tuning and adapter learning for improved parameter-efficient fine-tuning (PEFT) and classification accuracy.
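As a rough illustration of the second and third points, the sketch below aggregates local (patch-level) similarities into class scores through a class-wise heatmap and passes features through a small residual adapter. It is not the authors' implementation: the softmax-weighted pooling, the bottleneck adapter with blend factor `alpha`, the `temperature` value, and all function names are assumptions, and fixed random vectors stand in for the learnable prototypes.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, guarding against division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adapter(features, w_down, w_up, alpha=0.5):
    """Hypothetical residual bottleneck adapter.

    Projects down, applies ReLU, projects back up, then blends the result
    with the original features so that alpha=0 returns the input unchanged.
    """
    hidden = np.maximum(features @ w_down, 0.0)
    adapted = hidden @ w_up
    return alpha * adapted + (1.0 - alpha) * features

def classwise_scores(patch_feats, prototypes, temperature=0.07):
    """Aggregate local similarities into one score per class.

    patch_feats: (P, D) local features; prototypes: (C, D) class vectors.
    Returns the heatmap (P, C) and the class scores (C,).
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(prototypes).T  # (P, C)
    # For each class, a distribution over patches: the class-wise heatmap.
    heatmap = softmax(sim / temperature, axis=0)
    # Heatmap-weighted average of the local similarities.
    scores = (heatmap * sim).sum(axis=0)
    return heatmap, scores

# Toy data (hypothetical): 49 patches, 32-dim features, 3 classes.
rng = np.random.default_rng(0)
P, D, C, H = 49, 32, 3, 8
prototypes = rng.normal(size=(C, D))
patch_feats = rng.normal(size=(P, D))
patch_feats[5] = prototypes[1]  # plant class 1 at patch 5

w_down = rng.normal(size=(D, H)) * 0.1
w_up = rng.normal(size=(H, D)) * 0.1
adapted = adapter(patch_feats, w_down, w_up, alpha=0.2)

heatmap, scores = classwise_scores(patch_feats, prototypes)
adapted_heatmap, adapted_scores = classwise_scores(adapted, prototypes)

print("peak patch for class 1:", int(heatmap[:, 1].argmax()))
print("class scores:", np.round(scores, 3))
```

In this toy setup the heatmap for class 1 peaks at the planted patch, which shows why pooling over local similarities can pick up an object that occupies only a small region of a multi-label image. In a trained system the prototypes, adapter weights, and prompts would be learned rather than sampled at random.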

Main Results:

T2I-PAL significantly reduces the modality gap between text and image representations.
The method enhances the robustness and informativeness of local visual features for MLR.
Experiments on MS-COCO, VOC2007, and NUS-WIDE benchmarks show an average performance boost of 3.47% over state-of-the-art methods.

Conclusions:

T2I-PAL effectively tackles the modality gap in vision-language models for multi-label image recognition.
The approach eliminates the need for fully semantically annotated training images, reducing manual annotation workload.
T2I-PAL preserves the CLIP model's intrinsic mode of operation, so it integrates seamlessly with existing CLIP frameworks while improving recognition performance.