Task-aware cross-modal refinement and liquid fusion for text-visual grounding | JoVE Visualize

Area of Science:

Computer Vision
Artificial Intelligence
Machine Learning

Background:

Visual grounding, which is crucial for autonomous driving and human-robot interaction, faces challenges such as semantic gaps between modalities, large parameter counts, and insufficient cross-modal attention.
Existing models often process visual and textual data independently, leading to feature discrepancies and hindering performance on lightweight devices.
Single-level attention mechanisms limit the ability to capture complex interactions between image and text features.

Purpose of the Study:

To propose an efficient and lightweight Task-aware Liquid Cross-modal Network (TLCN) to address the limitations of current visual grounding models.
To reduce the semantic gap between visual and textual features through guided feature extraction.
To reduce the number of model parameters for easier deployment on resource-constrained devices.

Main Methods:

The TLCN utilizes a Feature Extraction Module (FEM) where text guides visual feature extraction, minimizing the semantic gap.
A Liquid Fusion Module (LFM) employing Liquid Neural Networks (LNNs) captures temporal dependencies and reduces the number of model parameters.
A Task-aware Cross-modal Refinement Module (TCRM) uses second-level attention and Conv-Trans Blocks (CTBs) to deepen feature representation and capture cross-modal interactions; it is optimized with a KL divergence loss.
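
To make the idea of a liquid network more concrete, the sketch below implements a single liquid time-constant (LTC) style cell in plain Python. It is an illustration only: the summary does not say which LNN variant the LFM uses, so the LTC dynamics, the fused semi-implicit Euler update, and every name here (`ltc_step`, `run_cell`, `W`, `U`, `A`, `tau`) are assumptions, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, inp, W, U, b, A, tau, dt=0.1):
    """One fused (semi-implicit Euler) step of an LTC-style cell.

    Continuous dynamics: dx/dt = -(1/tau + f) * x + f * A,
    where f = sigmoid(W x + U inp + b) depends on both state and input,
    so the effective time constant changes with the input.
    """
    new_x = []
    for i in range(len(x)):
        pre = b[i]
        pre += sum(W[i][j] * x[j] for j in range(len(x)))
        pre += sum(U[i][k] * inp[k] for k in range(len(inp)))
        f = sigmoid(pre)
        new_x.append((x[i] + dt * f * A[i]) / (1.0 + dt * (1.0 / tau[i] + f)))
    return new_x


def run_cell(sequence, hidden=4, seed=0):
    """Run a randomly initialised cell over a sequence of input vectors."""
    rng = random.Random(seed)
    n_in = len(sequence[0])
    W = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(hidden)]
    U = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b = [0.0] * hidden
    A = [rng.uniform(-1, 1) for _ in range(hidden)]  # per-unit target level
    tau = [1.0] * hidden                             # base time constants
    x = [0.0] * hidden
    for inp in sequence:
        x = ltc_step(x, inp, W, U, b, A, tau)
    return x, A
```

In a fusion setting, each element of `sequence` could be a concatenated text and visual feature vector, although how TLCN actually feeds features into its LNN is not described here. With a sigmoid gate and a zero initial state, each unit stays between 0 and its entry in `A`, which keeps the hidden state bounded.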

Main Results:

The TLCN demonstrated superior performance on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
The model also achieved excellent results on a specialized text localization task.
Experimental validation confirmed the effectiveness of the proposed modules in improving visual grounding accuracy.
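
The summary does not state which metric was used. Results on referring-expression benchmarks such as RefCOCO are commonly reported as the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box exceeds 0.5, so the following sketch assumes that metric. The function names and the `(x1, y1, x2, y2)` box format are illustrative choices.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) > threshold)
    return hits / len(gts)
```

For example, a prediction of `(0, 0, 2, 2)` against a ground truth of `(1, 1, 3, 3)` overlaps in a 1 by 1 region out of a union of 7, giving an IoU of 1/7, which would count as a miss at the 0.5 threshold.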

Conclusions:

The TLCN effectively bridges the semantic gap via text-guided visual feature extraction.
The integration of LNNs significantly reduces the number of model parameters, enabling lightweight deployment.
The proposed architecture successfully captures deep cross-modal interactions, providing a robust solution for visual grounding.