A multimodal dynamical variational autoencoder for audiovisual speech representation learning.

Samir Sadok (1), Simon Leglaive (1), Laurent Girin (2)

  • (1) CentraleSupélec, IETR UMR CNRS 6164, France.

Neural Networks: the Official Journal of the International Neural Network Society
January 24, 2024
Summary
This summary is machine-generated.

This study introduces a multimodal and dynamical variational autoencoder (MDVAE) for unsupervised audiovisual speech representation learning. The MDVAE effectively disentangles and combines audio-visual information for improved emotion recognition.

Keywords:
Audiovisual speech processing; Deep generative modeling; Disentangled representation learning; Multimodal and dynamical data; Variational autoencoder

Area of Science:

  • Machine Learning
  • Artificial Intelligence
  • Signal Processing

Background:

  • High-dimensional data like speech have underlying regularities suggesting lower-dimensional latent representations.
  • Deep latent variable generative models, particularly Variational Autoencoders (VAEs), are effective for unsupervised representation learning.
  • Existing VAEs have been extended for multimodal and sequential data, but specialized models for audiovisual speech are needed.
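
For reference, a standard VAE is trained by maximizing the evidence lower bound (ELBO) on the data log-likelihood. The expression below is the generic textbook form, not the MDVAE-specific objective; the symbols are assumptions introduced here: x is an observation, z the latent variable, θ the decoder parameters, and φ the encoder parameters.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction, and the second keeps the approximate posterior close to the prior over the lower-dimensional latent space.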

Purpose of the Study:

  • To develop a novel Multimodal and Dynamical Variational Autoencoder (MDVAE) for unsupervised audiovisual speech representation learning.
  • To structure the latent space for disentangling static, dynamical, modality-specific, and modality-common factors in audiovisual speech.
  • To evaluate the MDVAE's effectiveness in combining audio-visual information and its application to emotion recognition.
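
One way to picture such a structured latent space is as a container with one static vector per sequence plus per-frame dynamical vectors, some shared across modalities and some specific to audio or video. The sketch below is purely illustrative: the class name, field names, dimensions, and the choice to treat the static vector as shared by both modalities are assumptions made here, not details taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class LatentLayout:
    """Hypothetical latent partition for one audiovisual sequence."""
    static_common: Vector      # one vector for the whole sequence
    dyn_common: List[Vector]   # one vector per frame, shared by both modalities
    dyn_audio: List[Vector]    # one vector per frame, audio-specific
    dyn_visual: List[Vector]   # one vector per frame, visual-specific


def sample_layout(num_frames: int, d_static: int = 4, d_common: int = 3,
                  d_audio: int = 2, d_visual: int = 2, seed: int = 0) -> LatentLayout:
    """Draw every factor from a standard normal (placeholder values only)."""
    rng = random.Random(seed)

    def gauss_vec(dim: int) -> Vector:
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    return LatentLayout(
        static_common=gauss_vec(d_static),
        dyn_common=[gauss_vec(d_common) for _ in range(num_frames)],
        dyn_audio=[gauss_vec(d_audio) for _ in range(num_frames)],
        dyn_visual=[gauss_vec(d_visual) for _ in range(num_frames)],
    )


def decoder_input(layout: LatentLayout, modality: str) -> List[Vector]:
    """Concatenate, frame by frame, the factors relevant to one modality."""
    if modality == "audio":
        specific = layout.dyn_audio
    elif modality == "visual":
        specific = layout.dyn_visual
    else:
        raise ValueError(f"unknown modality: {modality!r}")
    # The static vector is repeated at every frame; dynamical parts vary.
    return [layout.static_common + common + spec
            for common, spec in zip(layout.dyn_common, specific)]
```

In this layout, swapping only `dyn_visual` between two sequences would change visual-specific dynamics while leaving the static and shared factors untouched, which is the kind of manipulation a disentangled latent space is meant to allow.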

Main Methods:

  • A two-stage unsupervised training approach was employed, starting with independent Vector Quantized VAEs (VQ-VAEs) for each modality.
  • The second stage involved training the MDVAE on intermediate representations to disentangle static/dynamical and shared/specific information.
  • Experiments included audiovisual speech manipulation, facial image denoising, and emotion recognition using the learned latent representations.
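
The first stage relies on vector quantization, whose core step is replacing each continuous encoder output with its nearest entry in a learned codebook. The following is a minimal sketch of that lookup only, with a made-up two-dimensional codebook; it is not the authors' implementation, and it says nothing about which intermediate representation the MDVAE is actually trained on.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> Tuple[List[int], List[List[float]]]:
    """Map each vector to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for vec in vectors:
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(vec, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy values chosen for illustration; a real codebook is learned during training.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
indices, quantized = quantize(encoder_outputs, codebook)
print(indices)  # [0, 1, 2]
```

Training a VQ-VAE also requires codebook and commitment losses and a gradient estimator for the non-differentiable lookup, all of which are omitted here.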

Main Results:

  • The MDVAE learned a joint latent representation of audiovisual speech that integrates audio and visual information.
  • The disentangled latent space allowed for the separation of static, dynamical, modality-specific, and modality-common factors.
  • The static latent representation achieved high accuracy in emotion recognition with limited labeled data, outperforming unimodal and transformer-based models.
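
The low-data setting in the last result can be pictured as freezing the learned encoder and fitting a very small classifier on a handful of labeled static latent vectors. The sketch below uses a nearest-centroid classifier as a stand-in; the classifier choice, the two-dimensional toy vectors, and the label names are all assumptions for illustration and do not reproduce the paper's setup or results.

```python
from typing import Dict, List, Sequence


def fit_centroids(features: Sequence[Sequence[float]],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the feature vectors of each class to get one centroid per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda label: sum((c - x) ** 2
                                     for c, x in zip(centroids[label], vec)))


# Made-up static latent vectors standing in for encoder outputs.
train_latents = [[1.0, 0.1], [0.8, -0.1], [-1.0, 0.2], [-0.9, 0.0]]
train_labels = ["happy", "happy", "sad", "sad"]
centroids = fit_centroids(train_latents, train_labels)
print(predict(centroids, [0.7, 0.0]))   # happy
print(predict(centroids, [-0.5, 0.3]))  # sad
```

Because the classifier has so few parameters, its accuracy depends almost entirely on how well the frozen representation already separates the classes.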

Conclusions:

  • The proposed MDVAE offers a powerful framework for unsupervised audiovisual speech representation learning.
  • The model's ability to disentangle various latent factors enhances understanding and manipulation of audiovisual speech data.
  • MDVAE demonstrates significant potential for downstream tasks like emotion recognition, especially in low-data regimes.