A multimodal dynamical variational autoencoder for audiovisual speech representation learning | JoVE Visualize

Area of Science:

Machine Learning
Artificial Intelligence
Signal Processing

Background:

High-dimensional data such as speech exhibit underlying regularities, which suggests that they can be described by lower-dimensional latent representations.
Deep latent variable generative models, particularly Variational Autoencoders (VAEs), are effective for unsupervised representation learning.
VAEs have been extended to multimodal and sequential data, but models tailored to audiovisual speech are still needed.
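For context, a standard VAE is trained by maximizing the evidence lower bound (ELBO) on the data log-likelihood. The notation below is the conventional one and is an assumption of this note, not taken from the summary: x is an observation, z a latent variable, and θ and φ the decoder and encoder parameters.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction; the second keeps the approximate posterior close to the prior.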

Purpose of the Study:

To develop a novel Multimodal and Dynamical Variational Autoencoder (MDVAE) for unsupervised audiovisual speech representation learning.
To structure the latent space for disentangling static, dynamical, modality-specific, and modality-common factors in audiovisual speech.
To evaluate the MDVAE's effectiveness in combining audio-visual information and its application to emotion recognition.
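The four kinds of factors named above can be pictured as separate arrays attached to one audiovisual sequence. The following is a minimal sketch only: the class name `AVLatents`, the field names, the dimensions, and the helper `swap_static` are hypothetical and chosen for illustration, not taken from the study. It assumes one static vector per sequence and one vector per frame for each dynamical factor.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class AVLatents:
    """Hypothetical container for the latents of one sequence of T frames."""
    w: np.ndarray     # static, modality-common factor, shape (d_w,)
    z_av: np.ndarray  # dynamical, modality-common factor, shape (T, d_av)
    z_a: np.ndarray   # dynamical, audio-specific factor, shape (T, d_a)
    z_v: np.ndarray   # dynamical, visual-specific factor, shape (T, d_v)

    def per_frame(self) -> np.ndarray:
        """Concatenate all factors frame by frame, repeating the static vector."""
        n_frames = self.z_av.shape[0]
        w_tiled = np.tile(self.w, (n_frames, 1))  # (T, d_w)
        return np.concatenate([w_tiled, self.z_av, self.z_a, self.z_v], axis=1)


def swap_static(target: AVLatents, source: AVLatents) -> AVLatents:
    """Return a copy of `target` whose static factor comes from `source`."""
    return replace(target, w=source.w)
```

Keeping the factors in separate fields makes a manipulation such as `swap_static(a, b)` explicit: only the static vector changes, while the per-frame dynamics of `a` are left untouched.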

Main Methods:

A two-stage unsupervised training approach was employed, starting with independent Vector Quantized VAEs (VQ-VAEs) for each modality.
The second stage trained the MDVAE on the VQ-VAEs' intermediate representations, taken before quantization, to disentangle static from dynamical information and shared from modality-specific information.
Experiments included audiovisual speech manipulation, facial image denoising, and emotion recognition using the learned latent representations.
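The quantization step that gives a VQ-VAE its name can be sketched in a few lines: each continuous encoder output is replaced by its nearest codebook vector. This is a generic illustration of vector quantization, not the study's implementation; the function name `quantize`, the toy codebook, and the toy encodings are assumptions. In the sketch, `z_e` stands for the continuous encoder outputs, which is the kind of intermediate representation a second-stage model could be trained on.

```python
import numpy as np


def quantize(z_e: np.ndarray, codebook: np.ndarray):
    """Map each row of z_e (N, d) to its nearest row of codebook (K, d)."""
    # Squared Euclidean distance between every encoding and every code: (N, K)
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)  # index of the closest code for each encoding
    return codebook[idx], idx


# Toy example with K = 3 codes in d = 2 dimensions (values are made up).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
z_q, idx = quantize(z_e, codebook)
print(idx)  # [0 1 2]
```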

Main Results:

The MDVAE successfully learned a combined latent representation of audiovisual speech, effectively integrating audio and visual information.
The disentangled latent space allowed for the separation of static, dynamical, modality-specific, and modality-common factors.
Using only a small amount of labeled data, the static latent representation supported accurate emotion recognition, outperforming unimodal baselines and a supervised audiovisual transformer model.
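To illustrate what classifying from a static representation with few labels can look like, the sketch below fits a nearest-class-mean classifier on five labeled vectors per class. Everything in it is an assumption made for illustration: the summary does not say which classifier the study used, the vectors are synthetic stand-ins for static latents rather than real model outputs, and the class count and dimension are arbitrary. It does not reproduce the reported results.

```python
import numpy as np


def fit_class_means(W: np.ndarray, y: np.ndarray):
    """Compute one mean vector per class from labeled static vectors W (N, d)."""
    classes = np.unique(y)
    means = np.stack([W[y == c].mean(axis=0) for c in classes])
    return classes, means


def predict(W: np.ndarray, classes: np.ndarray, means: np.ndarray) -> np.ndarray:
    """Assign each row of W to the class whose mean is closest."""
    d2 = ((W[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]


# Synthetic stand-in for static latent vectors: 4 classes, 50 sequences each.
rng = np.random.default_rng(0)
n_classes, d_w = 4, 16
centers = rng.normal(scale=5.0, size=(n_classes, d_w))
y_all = np.repeat(np.arange(n_classes), 50)
W_all = centers[y_all] + rng.normal(size=(y_all.size, d_w))

# Low-data regime: keep labels for only 5 sequences per class.
labeled = np.concatenate([np.flatnonzero(y_all == c)[:5] for c in range(n_classes)])
held_out = np.setdiff1d(np.arange(y_all.size), labeled)

classes, means = fit_class_means(W_all[labeled], y_all[labeled])
acc = (predict(W_all[held_out], classes, means) == y_all[held_out]).mean()
print(f"held-out accuracy on synthetic data: {acc:.2f}")
```

A classifier this simple only works when the representation already separates the classes, which is the property a well-structured static latent space is meant to provide.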

Conclusions:

The proposed MDVAE offers a powerful framework for unsupervised audiovisual speech representation learning.
The model's ability to disentangle various latent factors enhances understanding and manipulation of audiovisual speech data.
MDVAE demonstrates significant potential for downstream tasks like emotion recognition, especially in low-data regimes.