Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications.

Sanghun Jeon¹, Mun Sang Kim¹

¹Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.

Sensors (Basel, Switzerland)

October 27, 2022

Summary

This summary is machine-generated.

This study introduces a robust speech recognition system for noisy environments by combining audio and visual data. Multimodal input significantly improves recognition accuracy, making systems more reliable in real-world applications.

Keywords:

audiovisual speech recognition; deep learning; edutainment; lipreading; multimodal interaction; virtual aquarium

Area of Science:

Human-Computer Interaction
Artificial Intelligence
Signal Processing

Background:

Speech recognition is crucial for edutainment systems but struggles in noisy real-world environments.
Ambient noise significantly limits the effectiveness of single-mode speech interaction systems.
Existing systems lack robustness, hindering seamless user-system interaction in diverse settings.

Purpose of the Study:

To develop a noise-robust, multimodal interaction system for virtual aquarium environments using speech.
To enhance speech recognition accuracy by integrating audio and visual information.
To improve the reliability of speech-based interaction in real-world applications.

Main Methods:

Proposed a multimodal system combining audio-based speech recognition (using a speech API and pretrained word vectors) with vision-based speech recognition (using a deep neural network).
Concatenated vectors from both audio and visual modalities for classification.
Evaluated the system's signal-to-noise ratio in four noise environments and compared its accuracy and efficiency against single-mode approaches.
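
The concatenation step described above can be illustrated with a minimal sketch. It assumes each modality has already been reduced to a fixed-length feature vector; the function names, the toy vectors, the command labels ("feed", "stop"), and the linear scoring rule are all hypothetical stand-ins, not the authors' implementation or their classifier.

```python
from typing import Dict, List, Sequence


def fuse_features(audio_vec: Sequence[float], visual_vec: Sequence[float]) -> List[float]:
    """Feature-level fusion: append the visual vector to the audio vector."""
    return list(audio_vec) + list(visual_vec)


def classify(fused: Sequence[float], class_weights: Dict[str, Sequence[float]]) -> str:
    """Return the label whose weight vector has the largest dot product
    with the fused vector (a toy linear classifier, assumed for illustration)."""
    scores = {
        label: sum(w * x for w, x in zip(weights, fused))
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical embeddings: 2 audio dimensions, 3 visual dimensions.
audio_vec = [0.9, 0.1]
visual_vec = [0.2, 0.8, 0.0]

fused = fuse_features(audio_vec, visual_vec)

# Hypothetical per-class weights over the 5 fused dimensions.
class_weights = {
    "feed": [1.0, 0.0, 0.0, 1.0, 0.0],
    "stop": [0.0, 1.0, 1.0, 0.0, 0.0],
}

label = classify(fused, class_weights)
print(len(fused), label)
```

Because the classifier sees both vectors side by side, a strong visual cue can still carry the decision when the audio dimensions are degraded by noise, which is the intuition behind combining the two modalities.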

Main Results:

The multimodal system achieved an average recognition rate of 98.12%, an improvement of 6.7 percentage points over speech-only recognition (91.42%).
Demonstrated superior accuracy and efficiency compared to existing single-mode visual feature extraction and audio speech recognition methods.
The system proved robust across various noise environments.
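
The gap between the two reported recognition rates can be expressed in more than one way, and a short calculation makes the distinction explicit. The sketch below uses only the two rates given above; the relative gain and the relative error reduction are derived here for illustration and are not figures reported in the summary.

```python
# Recognition rates reported in the summary (percent).
multimodal_acc = 98.12
speech_only_acc = 91.42

# Absolute difference, in percentage points.
absolute_gain = multimodal_acc - speech_only_acc

# Gain relative to the speech-only baseline, in percent.
relative_gain = absolute_gain / speech_only_acc * 100

# Share of the speech-only errors that the multimodal system removes, in percent.
error_reduction = absolute_gain / (100 - speech_only_acc) * 100

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")
print(f"relative error reduction: {error_reduction:.1f}%")
```

The absolute difference matches the 6.7 figure quoted in the results, which indicates that the figure is a difference in percentage points rather than a relative change.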

Conclusions:

Combining audio and visual information significantly enhances speech recognition robustness and accuracy in noisy conditions.
The proposed multimodal approach offers a viable solution for reliable speech interaction in diverse real-world settings.
This technology has broad applicability in public spaces like cafés, museums, and kiosks.