Point-Based Learnable Query Generator for Human-Object Interaction Detection | JoVE Visualize

Area of Science:

Computer Vision
Artificial Intelligence
Machine Learning

Background:

Transformer-based and interaction point-based methods show promise in human-object interaction (HOI) detection.
Directly integrating the two approaches is difficult because they differ in structure and properties.
Current Transformer HOI methods use separate decoders for instance detection and interaction recognition, limiting feature correlation.

Purpose of the Study:

To propose a novel Transformer-based HOI detection framework that enhances the intrinsic correlation between instance and action features.
To improve the accuracy of HOI detection by developing a more effective query generation mechanism.
To advance the state of the art in human-object interaction detection.

Main Methods:

A novel Transformer-based HOI detection framework is proposed, featuring a decoder with three components: a learnable query generator, an instance decoder, and an interaction classifier.
The learnable query generator is designed to create effective queries, guiding the instance decoder and interaction classifier to learn accurate instance and interaction features.
The query generator incorporates prior bounding boxes, keypoint detection, and spatial relation features, inspired by interaction point-based methods.
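
The geometric priors named above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not say how the spatial relation features are computed or how they enter the learnable query generator. The sketch assumes boxes in (x1, y1, x2, y2) form, takes the interaction point to be the midpoint between the human and object box centers (a convention used in interaction point-based methods), and uses a hypothetical four-value encoding of the spatial relation. The function names are illustrative only.

```python
import math

# Boxes are (x1, y1, x2, y2) tuples; this layout is an assumption of the sketch.

def center(box):
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def interaction_point(human_box, object_box):
    """Midpoint between the human and object centers (assumed convention)."""
    hx, hy = center(human_box)
    ox, oy = center(object_box)
    return ((hx + ox) / 2.0, (hy + oy) / 2.0)

def spatial_relation_features(human_box, object_box):
    """Hypothetical encoding of the object's position and size relative to the human.

    Returns [dx, dy, log width ratio, log height ratio], where the center
    offsets are normalized by the human box width and height.
    """
    hx, hy = center(human_box)
    ox, oy = center(object_box)
    hw, hh = human_box[2] - human_box[0], human_box[3] - human_box[1]
    ow, oh = object_box[2] - object_box[0], object_box[3] - object_box[1]
    return [
        (ox - hx) / hw,
        (oy - hy) / hh,
        math.log(ow / hw),
        math.log(oh / hh),
    ]

# Example prior boxes for one human-object pair.
human = (0.0, 0.0, 2.0, 4.0)
obj = (2.0, 0.0, 4.0, 2.0)
point = interaction_point(human, obj)
features = spatial_relation_features(human, obj)
```

In a full model, values like these would be combined with keypoint information and passed through learned layers to produce decoder queries; that step is omitted here because the summary gives no details about it.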

Main Results:

The proposed framework demonstrates improved performance in human-object interaction detection.
Experimental validation on the HICO-DET and V-COCO datasets shows superior results compared to existing state-of-the-art methods.
The novel learnable query generator effectively enhances the learning of instance and interaction features.
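
For context on how results on these benchmarks are usually scored, the sketch below shows the matching rule commonly used in HOI evaluation: a predicted human-object-action triplet counts as correct when the action and object labels match and both predicted boxes overlap their ground-truth boxes with an intersection over union (IoU) of at least 0.5. The summary does not describe the evaluation protocol the authors followed, so the threshold, the dictionary layout, and the function names here are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """Check one predicted triplet against one ground-truth triplet.

    Both arguments are dicts with the hypothetical keys
    'human', 'object', 'object_class', and 'action'.
    """
    return (
        pred["action"] == gt["action"]
        and pred["object_class"] == gt["object_class"]
        and iou(pred["human"], gt["human"]) >= threshold
        and iou(pred["object"], gt["object"]) >= threshold
    )

# Made-up example values, not data from the study.
gt = {"human": (0, 0, 10, 8), "object": (25, 20, 35, 30),
      "object_class": "bicycle", "action": "ride"}
good = {"human": (0, 0, 10, 10), "object": (25, 20, 35, 31),
        "object_class": "bicycle", "action": "ride"}
bad = dict(good, object=(20, 20, 30, 30))  # object box shifted too far

good_match = is_true_positive(good, gt)
bad_match = is_true_positive(bad, gt)
```

Mean average precision is then computed from such matches across all predictions; that aggregation is left out of the sketch.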

Conclusions:

The proposed Transformer-based HOI detection framework successfully increases the intrinsic correlation between instance and action features.
The method achieves better performance on benchmark datasets, indicating its effectiveness and potential for real-world applications.
The integration of prior bounding boxes, keypoint detection, and spatial relation features in the query generator is a key contribution.