Search research articles

ABOUT JoVE

Overview Leadership Blog JoVE Help Center

AUTHORS

Publishing Process Editorial Board Scope & Policies Peer Review FAQ Submit

LIBRARIANS

Testimonials Subscriptions Access Resources Library Advisory Board FAQ

RESEARCH

JoVE Journal Methods Collections JoVE Encyclopedia of Experiments Archive

EDUCATION

JoVE Core JoVE Business JoVE Science Education JoVE Lab Manual Faculty Resource Center Faculty Site

Terms & Conditions of Use

Related Concept Videos

Multi-input and Multi-variable systems

Multi-input and Multi-variable systems

Cruise control systems in cars are designed as multi-input systems to maintain a driver's desired speed while compensating for external disturbances such as changes in terrain. The block diagram for a cruise control system typically includes two main inputs: the desired speed set by the driver and any external disturbances, such as the incline of the road. By adjusting the engine throttle, the system maintains the vehicle's speed as close to the desired value as possible.
In the absence of...

Perceiving Loudness, Pitch, and Location

Perceiving Loudness, Pitch, and Location

The human brain perceives pitch through two primary mechanisms reflected in place theory and frequency theory. Each mechanism describes how sound waves are interpreted as specific pitches by the brain, offering insights into the intricate processes of auditory perception.
Place theory, or place coding, suggests that different pitches are heard because various sound waves activate specific locations along the cochlea's basilar membrane. The brain determines the pitch of a sound by...

Linear Approximation in Frequency Domain

Linear Approximation in Frequency Domain

Linear systems are characterized by two main properties: superposition and homogeneity. Superposition allows the response to multiple inputs to be the sum of the responses to each individual input. Homogeneity ensures that scaling an input by a scalar results in the response being scaled by the same scalar.
In contrast, nonlinear systems do not inherently possess these properties. However, for small deviations around an operating point, a nonlinear system can often be approximated as linear....

Sampling Continuous Time Signal

Sampling Continuous Time Signal

In signal processing, a continuous-time signal can be sampled using an impulse-train sampling technique, followed by the zero-order hold method. Impulse-train sampling involves the use of a periodic impulse train, which consists of a series of delta functions spaced at regular intervals determined by the sampling period. When a continuous-time signal is multiplied by this impulse train, it generates impulses with amplitudes corresponding to the signal's values at the sampling points.
In the...

Facial Feedback Hypothesis

Facial Feedback Hypothesis

Charles Darwin proposed that facial expressions are an evolutionary adaptation for communication. He argued that these expressions are not influenced by culture but are universal across species. For example, a snarling expression with exposed teeth signals a threat in many animals, including humans. Darwin also suggested that displaying an emotion can intensify the feeling. Smiling, for example, could enhance one's sense of happiness. This idea laid the foundation for understanding the role...

Sampling Plans

Sampling Plans

Sampling is a crucial step in analytical chemistry, allowing researchers to collect representative data from a large population. Common sampling methods include random, judgmental, systematic, stratified, and cluster sampling.
Random sampling is a method where each member of the population has an equal chance of being selected for the sample. It involves selecting individuals randomly, often using random number generators or lottery-type methods. For example, when analyzing the properties of a...

You might also read

Related Articles

Articles linked to this work by shared authors, journal, and citation graph.

Sort by

Same author

Learning 3-D Ultrasound Segmentation under Extreme Label Deficiency.

Ultrasound in medicine & biology·2026

Same author

Integrating a Large Language Model Into a Socially Assistive Robot in a Hospital Geriatric Unit: Two-Wave Comparative Study on Performance, Engagement, and User Perceptions.

JMIR human factors·2025

Same author

Enhancing surgical object detection in laparoscopic cholecystectomy with explicit positional relationship modeling.

Computational and structural biotechnology journal·2025

Same author

Acceptability and Usability of a Socially Assistive Robot Integrated With a Large Language Model for Enhanced Human-Robot Interaction in a Geriatric Care Institution: Mixed Methods Evaluation.

JMIR human factors·2025

Same author

Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos.

IEEE transactions on pattern analysis and machine intelligence·2024

Same author

A multimodal dynamical variational autoencoder for audiovisual speech representation learning.

Neural networks : the official journal of the International Neural Network Society·2024

Same journal

Relation DETR+: Exploring Explicit Position Relation Prior for Dense Prediction.

IEEE transactions on pattern analysis and machine intelligence·2026

Same journal

RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning.

IEEE transactions on pattern analysis and machine intelligence·2026

Same journal

CAFE: Cross-View Adaptive Fusion and Cluster Center Enhancement for Robust Multi-View Clustering.

IEEE transactions on pattern analysis and machine intelligence·2026

Same journal

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving.

IEEE transactions on pattern analysis and machine intelligence·2026

Same journal

Ethics-Aware Safe Reinforcement Learning for Rare-Event Risk Control in Interactive Urban Driving.

IEEE transactions on pattern analysis and machine intelligence·2026

Same journal

Learning Shape Anchors for Holistic Indoor Scene Understanding.

IEEE transactions on pattern analysis and machine intelligence·2026

See all related articles

Search research articles

Related Experiment Video

Updated: Jan 3, 2026

Author Spotlight: Investigating the Impact of Emotional Prosodies on Voice Recognition and Perception

Author Spotlight: Investigating the Impact of Emotional Prosodies on Voice Recognition and Perception

Published on: August 9, 2024

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers.

Yutong Ban, Xavier Alameda-Pineda, Laurent Girin

IEEE Transactions on Pattern Analysis and Machine Intelligence

|November 22, 2019

Summary

This summary is machine-generated.

This study introduces an audio-visual fusion model for tracking multiple speakers, enhancing accuracy and handling missing data. The generative model accurately estimates speaker trajectories and acoustic status in real-world meetings.

More Related Videos

A Methodology for Capturing Joint Visual Attention Using Mobile Eye-Trackers

A Methodology for Capturing Joint Visual Attention Using Mobile Eye-Trackers

Published on: January 18, 2020

Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language

Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language

Published on: October 13, 2018

Related Experiment Videos

Last Updated: Jan 3, 2026

Author Spotlight: Investigating the Impact of Emotional Prosodies on Voice Recognition and Perception

Author Spotlight: Investigating the Impact of Emotional Prosodies on Voice Recognition and Perception

Published on: August 9, 2024

A Methodology for Capturing Joint Visual Attention Using Mobile Eye-Trackers

A Methodology for Capturing Joint Visual Attention Using Mobile Eye-Trackers

Published on: January 18, 2020

Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language

Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language

Published on: October 13, 2018

Area of Science:

Computer Vision
Machine Learning
Signal Processing

Background:

Accurate tracking of multiple speakers is crucial for applications like meeting analysis and human-computer interaction.
Existing methods often struggle with modality absence or accurately estimating speaker activity.
Integrating visual and auditory cues offers complementary information for robust tracking.

Purpose of the Study:

To develop a novel audio-visual fusion model for multi-speaker tracking.
To accurately estimate speaker trajectories and speaking status.
To robustly handle temporary missing visual or auditory data.

Main Methods:

A generative audio-visual fusion model formulated as a latent-variable temporal graphical model.
Variational inference to approximate intractable posterior distributions.
A closed-form expectation-maximization procedure for parameter estimation.

Main Results:

The proposed model accurately estimates smooth trajectories of multiple speakers.
It effectively handles short periods of missing audio or visual information.
The system successfully estimates the speaking status (speaking/silent) of each individual.
Performance evaluation shows superior results compared to baseline methods in informal meeting scenarios.

Conclusions:

The proposed audio-visual fusion approach provides a robust and accurate solution for multi-speaker tracking.
The model's ability to integrate complementary modalities and handle data absence is a key strength.
This method shows significant promise for real-world applications involving dynamic group interactions.