Language-Driven Visual Data Generation for Zero-Shot HOI Detection | JoVE Visualize

Area of Science:

Computer Vision
Artificial Intelligence
Machine Learning

Background:

Zero-shot human-object interaction (HOI) detection aims to identify interactions between humans and objects, including categories not seen during training.
Existing methods struggle with unseen HOI categories because no training data exist for them, so models overfit to seen interactions and generalize poorly.
This limitation hinders the practical application of HOI detection in diverse, real-world scenarios.

Purpose of the Study:

To develop a novel approach for zero-shot HOI detection that effectively generalizes to unseen interaction categories.
To overcome the data scarcity problem for unseen HOIs by leveraging textual information.
To enhance the performance of HOI detection systems on both seen and unseen interaction types.

Main Methods:

Introduced a Language-Driven Visual Data Generation (LD-VDG) approach to create pseudo visual features from textual semantics of unseen HOIs.
Designed a text-to-vision (T-V) adapter, trained on seen HOIs, to align text and visual features.
Utilized a large language model to generate fine-grained textual descriptions for unseen HOIs, which were then transformed into pseudo visual features via the T-V adapter.
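
The three steps above can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not specify the adapter's architecture, loss, or encoders, so the sketch assumes a purely linear adapter fit with a mean-squared-error loss by gradient descent. The hand-written 2-D text vectors and 3-D visual vectors stand in for real encoder outputs (the unseen text vector plays the role of an embedded LLM-generated description), and the names `train_adapter` and `generate_pseudo_features` are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_adapter(text_feats, visual_feats, lr=0.1, epochs=500):
    """Fit a linear map W so that W @ text approximates visual on seen HOIs.

    Assumption: a linear adapter trained with MSE; the real T-V adapter
    may be nonlinear and use a different objective.
    """
    d_in, d_out = len(text_feats[0]), len(visual_feats[0])
    n = len(text_feats)
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(epochs):
        # Accumulate the MSE gradient over all seen (text, visual) pairs.
        grad = [[0.0] * d_in for _ in range(d_out)]
        for t, v in zip(text_feats, visual_feats):
            pred = matvec(W, t)
            for i in range(d_out):
                err = pred[i] - v[i]
                for j in range(d_in):
                    grad[i][j] += 2.0 * err * t[j] / n
        # Plain gradient-descent update.
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]
    return W


def generate_pseudo_features(W, unseen_text_feats):
    """Map text features of unseen HOIs to pseudo visual features."""
    return {name: matvec(W, t) for name, t in unseen_text_feats.items()}


# Toy seen HOIs: paired text and visual features (made-up numbers).
seen_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seen_visual = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]]

adapter = train_adapter(seen_text, seen_visual)

# Toy unseen HOI: only a text feature is available.
pseudo = generate_pseudo_features(adapter, {"ride giraffe": [2.0, 1.0]})
print(pseudo["ride giraffe"])
```

The point of the design is that the adapter only ever sees paired data for seen HOIs, yet it can be applied to any text feature afterwards, which is what makes pseudo features for unseen HOIs possible.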

Main Results:

The generated pseudo visual features, combined with real features from seen HOIs, were used to train a transformer-based HOI detector.
Experiments on standard datasets showed that LD-VDG significantly outperformed previous zero-shot HOI detection methods.
In particular, the method achieved superior performance on unseen HOI categories across various zero-shot settings.
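
The training setup described in the first point of this section, real features for seen HOIs mixed with pseudo features for unseen ones, can be sketched as follows. This is a hedged illustration only: the study trains a transformer-based HOI detector, whereas the sketch swaps in a nearest-centroid classifier as a simple stand-in so the example stays short and runnable. The feature values, labels, and function names (`build_training_set`, `fit_centroids`, `predict`) are all hypothetical.

```python
def build_training_set(real_feats, pseudo_feats):
    """Merge real (seen) and pseudo (unseen) features into one labeled list."""
    samples = []
    for label, feats in real_feats.items():
        samples += [(f, label, "real") for f in feats]
    for label, feats in pseudo_feats.items():
        samples += [(f, label, "pseudo") for f in feats]
    return samples


def fit_centroids(samples):
    """Stand-in for detector training: compute one mean vector per HOI label."""
    sums, counts = {}, {}
    for feat, label, _source in samples:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, feat):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, feat))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


# Real visual features exist only for seen HOIs (made-up numbers).
real = {
    "ride horse": [[1.0, 0.0], [0.9, 0.1]],
    "hold cup": [[0.0, 1.0], [0.1, 0.9]],
}
# The unseen HOI contributes only a generated pseudo feature.
pseudo = {"ride giraffe": [[1.0, 1.0]]}

samples = build_training_set(real, pseudo)
centroids = fit_centroids(samples)
print(predict(centroids, [0.9, 0.95]))
```

Because the unseen category is represented in the training set through its pseudo feature, the classifier has a target for it at test time instead of being forced to pick among seen categories.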

Conclusions:

The proposed LD-VDG approach offers an innovative solution for generalizing to unseen HOIs by generating language-driven visual representations.
This method effectively addresses the challenge of data scarcity for unseen categories in HOI detection.
LD-VDG demonstrates the potential of leveraging textual semantics to enhance visual recognition tasks, particularly in zero-shot learning scenarios.