Updated: Jul 26, 2025

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training.

Chong Liu, Yuqi Zhang, Hongsong Wang

IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society

June 20, 2023

Summary

This summary is machine-generated.

This study introduces a unified framework for image-text retrieval, combining coarse- and fine-grained representations. The Token-Guided Dual Transformer (TGDT) architecture improves retrieval accuracy and efficiency.

Area of Science:

Computer Science
Artificial Intelligence
Machine Learning

Background:

Image-text retrieval is crucial for understanding semantic relationships between visual and linguistic data.
Existing methods often focus on either global or local features and neglect their interplay, which leads to suboptimal accuracy and high computational cost.

Purpose of the Study:

To develop a novel framework that integrates coarse- and fine-grained representation learning for enhanced image-text retrieval.
To improve retrieval accuracy and reduce computational complexity in multimodal understanding tasks.

Main Methods:

Proposed the Token-Guided Dual Transformer (TGDT) architecture with two homogeneous branches for image and text processing.
Introduced a Consistent Multimodal Contrastive (CMC) loss to ensure semantic consistency across modalities in a shared embedding space.
Implemented a two-stage inference method utilizing mixed global and local cross-modal similarity.
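The idea of a contrastive loss over a shared embedding space can be illustrated with a minimal sketch. This is a generic symmetric contrastive (InfoNCE-style) loss used as a stand-in, not the paper's CMC loss, whose exact terms are not given in this summary. The function name `symmetric_contrastive_loss` and the `temperature` parameter are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Generic stand-in loss (NOT the paper's CMC loss).

    image_embs[k] and text_embs[k] are assumed to be a matched pair.
    The loss averages image-to-text and text-to-image cross-entropy.
    """
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def cross_entropy(rows):
        # The correct match for row k sits at column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denominator - row[k]
        return total / n

    columns = [list(col) for col in zip(*sims)]  # text-to-image direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(columns))
```

With matched pairs that point in the same direction, the loss is close to zero. Swapping the text embeddings so that each image faces the wrong caption makes it large, which is the behavior a contrastive objective relies on to pull paired images and texts together in the shared space.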

Main Results:

Achieved state-of-the-art retrieval performance on benchmark datasets.
Demonstrated significantly lower inference time than representative existing methods.
The unified framework effectively leverages both coarse- and fine-grained information.
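One way a two-stage inference scheme can lower retrieval time is sketched below, under stated assumptions. A cheap global similarity shortlists candidates, and only the shortlist is re-ranked with a mixed global and local score. The local score used here (each query token's best match, averaged), the `local_weight` mixing factor, and all function names are hypothetical choices for illustration. The summary does not specify how the paper computes or mixes these similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_similarity(query_tokens, cand_tokens):
    """Assumed local score: best candidate match per query token, averaged."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)


def two_stage_retrieve(query, candidates, top_k=2, local_weight=0.5):
    """Return shortlisted candidate indices, best first.

    query and each candidate are dicts with a 'global' vector and a
    'tokens' list of vectors (hypothetical data layout).
    """
    # Stage 1: cheap global similarity against every candidate.
    global_scores = [cosine(query["global"], c["global"]) for c in candidates]
    order = sorted(range(len(candidates)),
                   key=lambda i: global_scores[i], reverse=True)
    shortlist = order[:top_k]

    # Stage 2: the costlier token-level score runs on the shortlist only.
    mixed = {
        i: global_scores[i]
        + local_weight * local_similarity(query["tokens"], candidates[i]["tokens"])
        for i in shortlist
    }
    return sorted(shortlist, key=lambda i: mixed[i], reverse=True)
```

The token-level comparison costs far more per candidate than a single vector comparison, so restricting it to `top_k` candidates keeps the total cost near that of global-only retrieval while still letting fine-grained evidence change the final order.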

Conclusions:

The proposed TGDT architecture offers a more effective and efficient approach to image-text retrieval by unifying multimodal representations.
The CMC loss and two-stage inference method contribute to superior semantic understanding and retrieval accuracy.
This work provides a new perspective on multimodal learning, aligning with human cognitive processes.