Binaural SoundNet: Predicting Semantics, Depth and Motion With Binaural Sounds.

Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas

IEEE Transactions on Pattern Analysis and Machine Intelligence

March 3, 2022

Summary

This summary is machine-generated.

This study introduces a novel approach for machine scene understanding using only binaural sounds. The method enables machines to identify object semantics, motion, and depth from audio, advancing auditory perception capabilities.

Area of Science:

Computer Vision
Machine Learning
Acoustics

Background:

Humans excel at scene understanding using visual and auditory cues, but machines primarily rely on visual data.
Developing machine capabilities for sound-based scene understanding remains an underexplored area.

Purpose of the Study:

To develop a machine learning approach for scene understanding using only binaural audio.
To enable machines to predict semantic masks, motion, and depth maps of sound-making objects from audio.
To create a new audio-visual dataset of street scenes for training and evaluation.

Main Methods:

A novel sensor setup with eight binaural microphones and a 360° camera was used to record a new street scene dataset.
A cross-modal distillation framework transferred knowledge from vision 'teacher' models to a sound 'student' model, enabling training without human annotations.
An auxiliary task, Spatial Sound Super-Resolution, was introduced to enhance sound directional resolution.
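
The cross-modal distillation idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the vision teacher outputs per-pixel class scores, that its highest-scoring class is used as a pseudo-label, and that the sound student is trained with per-pixel cross-entropy. The function names, the two-class setup, and all numbers are hypothetical.

```python
import math

def pseudo_labels(teacher_scores):
    """Pick the highest-scoring class per pixel from the vision teacher's output."""
    return [max(range(len(px)), key=px.__getitem__) for px in teacher_scores]

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, labels):
    """Mean per-pixel cross-entropy of the sound student against teacher labels."""
    losses = []
    for logits, label in zip(student_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Hypothetical 3-pixel, 2-class example (class 0 = background, class 1 = car).
teacher_scores = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
labels = pseudo_labels(teacher_scores)

# A student that agrees with the teacher, and one that disagrees everywhere.
good_student = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
bad_student = [[0.0, 4.0], [4.0, 0.0], [4.0, 0.0]]

good_loss = distillation_loss(good_student, labels)
bad_loss = distillation_loss(bad_student, labels)
```

The student that matches the teacher's pseudo-labels gets the lower loss, which is the signal that replaces human annotation in this kind of setup.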

Main Results:

The proposed multi-tasking network achieved good performance across all four tasks: semantic mask prediction, motion estimation, depth mapping, and sound super-resolution.
Jointly training the four tasks proved mutually beneficial, leading to the best overall performance.
Microphone configuration (number and orientation) significantly impacts performance.
Complementary features from standard spectrograms and classic signal processing pipelines enhance auditory perception.
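
The summary does not say which classic signal processing features were used. As an illustrative assumption, the sketch below computes two textbook binaural cues, the interaural level difference (ILD) and the interaural time difference (ITD), from a pair of left and right channels. The sample rate, signals, and function names are hypothetical.

```python
import math

def interaural_level_difference(left, right):
    """ILD in dB: positive when the left channel carries more energy.

    Assumes neither channel is silent.
    """
    e_left = sum(s * s for s in left)
    e_right = sum(s * s for s in right)
    return 10.0 * math.log10(e_left / e_right)

def interaural_time_difference(left, right, sample_rate, max_lag):
    """ITD in seconds from the lag that maximises the cross-correlation.

    A positive value means the sound reached the left channel first.
    """
    def xcorr(lag):
        # Correlate left[n] with right[n + lag], skipping out-of-range samples.
        return sum(
            left[n] * right[n + lag]
            for n in range(len(left))
            if 0 <= n + lag < len(right)
        )
    best_lag = max(range(-max_lag, max_lag + 1), key=xcorr)
    return best_lag / sample_rate

# Hypothetical source on the left: the right channel is a copy of the left,
# delayed by 3 samples and at half the amplitude.
sample_rate = 16000
left = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.25, 0.0, 0.0, 0.0]

ild_db = interaural_level_difference(left, right)
itd_s = interaural_time_difference(left, right, sample_rate, max_lag=5)
```

Cues like these encode direction explicitly, which is one plausible reason hand-crafted features could complement a learned spectrogram representation.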

Conclusions:

The developed approach demonstrates the potential of purely audio-based scene understanding for machines.
Multi-task learning and specialized audio processing techniques like Spatial Sound Super-Resolution are effective for improving auditory perception.
The new dataset and framework facilitate further research in sound-based scene understanding.
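
Joint training of several tasks is commonly expressed as a weighted sum of per-task losses. The summary does not state how the four losses are combined, so the sketch below is an assumption for illustration only; the task names, weights, and loss values are hypothetical.

```python
def joint_loss(task_losses, task_weights):
    """Weighted sum of per-task losses; every task must have a weight."""
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(task_weights[name] * value for name, value in task_losses.items())

# Hypothetical per-task losses for one training batch.
losses = {"semantic": 0.8, "depth": 0.5, "motion": 0.3, "s3r": 0.2}
# Hypothetical weights; here the auxiliary super-resolution task counts half.
weights = {"semantic": 1.0, "depth": 1.0, "motion": 1.0, "s3r": 0.5}

total = joint_loss(losses, weights)
```

Because all tasks share one scalar objective, gradients from each task update the shared audio encoder, which is how one task can benefit the others.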