From external to internal: Step-wise feature enhancement network for image-text retrieval
Summary
This summary is machine-generated. This study introduces a Step-wise Feature Enhancement (SFE) Network to improve image-text retrieval by capturing richer semantic cues. The SFE network effectively bridges the heterogeneity gap for better cross-modal understanding.
Area Of Science
- Computer Science
- Artificial Intelligence
- Machine Learning
Background
- Image-Text Retrieval (ITR) faces challenges due to the heterogeneity gap between image and text modalities.
- Existing ITR methods struggle to capture comprehensive semantic cues from large-scale image-text corpora beyond individual pairs.
- Bridging the heterogeneity gap requires stronger associations and comprehensive semantic cue capture.
Purpose Of The Study
- To propose a novel two-layer Step-wise Feature Enhancement (SFE) Network to address the limitations of current ITR methods.
- To establish a semantic propagation pathway for progressive information flow from external to internal layers.
- To enhance the association between images and texts by capturing both external and internal semantic cues.
Main Methods
- The proposed SFE network enhances features in two layers: an external layer followed by an internal layer.
- Step 1 captures External Semantic Cues (ESC) from patch-level, instance-level, and neighbor-level co-occurrences to enhance features in the external layer.
- Step 2 fuses propagated semantic information and enhances features in the internal layer by mining Internal Semantic Cues (ISC) through cross-modal context.
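The two steps above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not specify the architecture, so everything here is an assumption made for illustration. Step 1 is approximated by blending each feature with the mean of its nearest neighbors (standing in for neighbor-level co-occurrence only), and Step 2 by a simple softmax attention over the other modality (standing in for cross-modal context). The function names `external_enhance` and `internal_enhance` and the parameters `k` and `alpha` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale to unit length; an all-zero vector is returned unchanged.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def blend(v, context, alpha):
    # Residual-style fusion: keep (1 - alpha) of the original feature,
    # add alpha of the context, then re-normalize.
    return normalize([(1 - alpha) * x + alpha * c for x, c in zip(v, context)])


def external_enhance(features, k=2, alpha=0.3):
    """Step 1 stand-in (assumption): enrich each vector with the mean of its
    k most similar neighbors in the same collection."""
    feats = [normalize(f) for f in features]
    out = []
    for i, f in enumerate(feats):
        others = [j for j in range(len(feats)) if j != i]
        nearest = sorted(others, key=lambda j: dot(f, feats[j]), reverse=True)[:k]
        if not nearest:
            out.append(f)
            continue
        mean = [sum(feats[j][d] for j in nearest) / len(nearest)
                for d in range(len(f))]
        out.append(blend(f, mean, alpha))
    return out


def internal_enhance(queries, contexts, alpha=0.5):
    """Step 2 stand-in (assumption): each query vector attends over the
    vectors of the other modality and is fused with the attended context."""
    out = []
    for q in queries:
        scores = [dot(q, c) for c in contexts]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(weights)
        attended = [sum(w / total * c[d] for w, c in zip(weights, contexts))
                    for d in range(len(q))]
        out.append(blend(q, attended, alpha))
    return out


# Toy data: three image-region features and two word features (made up).
regions = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
words = [normalize(w) for w in [[1.0, 0.2, 0.0], [0.0, 0.0, 1.0]]]

external = external_enhance(regions, k=1)    # external layer
internal = internal_enhance(external, words)  # internal layer uses Step 1 output
```

Feeding the output of `external_enhance` into `internal_enhance` mirrors the external-to-internal propagation described above: the second step operates on features that already carry the first step's context. A real system would learn these fusions rather than use fixed blending weights.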
Main Results
- The SFE network effectively captures external semantic cues through multi-level co-occurrences, including cross-modal instance and neighbor levels.
- Internal semantic cues are mined via cross-modal context, further enhancing visual and textual features within the internal layer.
- Experimental results show the proposed SFE network outperforms state-of-the-art ITR methods.
Conclusions
- The SFE network successfully establishes a semantic propagation pathway, progressively guiding semantic information flow.
- The method effectively bridges the heterogeneity gap by capturing comprehensive semantic cues from large-scale corpora.
- The proposed approach demonstrates superior performance in image-text retrieval tasks.

