Binaural SoundNet: Predicting Semantics, Depth and Motion With Binaural Sounds.

Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas

    IEEE Transactions on Pattern Analysis and Machine Intelligence
    |March 3, 2022
    Summary
    This summary is machine-generated.

    This study introduces a novel approach for machine scene understanding using only binaural sounds. The method enables machines to identify object semantics, motion, and depth from audio, advancing auditory perception capabilities.

    Area of Science:

    • Computer Vision
    • Machine Learning
    • Acoustics

    Background:

    • Humans excel at scene understanding using visual and auditory cues, but machines primarily rely on visual data.
    • Developing machine capabilities for sound-based scene understanding remains an underexplored area.

    Purpose of the Study:

    • To develop a machine learning approach for scene understanding using only binaural audio.
    • To enable machines to predict semantic masks, motion, and depth maps of sound-making objects from audio.
    • To create a new audio-visual dataset of street scenes for training and evaluation.

    Main Methods:

    • A novel sensor setup with eight binaural microphones and a 360° camera was used to record a new street scene dataset.
    • A cross-modal distillation framework transferred knowledge from vision 'teacher' models to a sound 'student' model, enabling training without human annotations.
    • An auxiliary task, Spatial Sound Super-Resolution, was introduced to enhance sound directional resolution.
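
    The cross-modal distillation idea above can be illustrated with a minimal sketch. It assumes the vision teacher has already produced one class index per pixel (a pseudo-label) and that the sound student outputs one list of class scores per pixel; the student is then scored with an ordinary cross-entropy against the teacher's labels. The function names, the toy data, and the choice of cross-entropy are illustrative assumptions, not the authors' published training code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_labels):
    """Mean cross-entropy between student predictions and teacher pseudo-labels.

    student_logits: one list of class scores per pixel (hypothetical output
                    of a sound-based student model).
    teacher_labels: one class index per pixel (hypothetical output of a
                    vision teacher model, used in place of human labels).
    """
    if not teacher_labels:
        raise ValueError("need at least one pixel")
    if len(student_logits) != len(teacher_labels):
        raise ValueError("need exactly one teacher label per pixel")
    total = 0.0
    for logits, label in zip(student_logits, teacher_labels):
        probs = softmax(logits)
        total += -math.log(probs[label])
    return total / len(teacher_labels)


# Toy example: 3 pixels, 3 classes (all values are made up for illustration).
teacher_labels = [0, 2, 1]
good_student = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0], [0.0, 4.0, 0.0]]
poor_student = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]

loss_good = distillation_loss(good_student, teacher_labels)
loss_poor = distillation_loss(poor_student, teacher_labels)
print(f"student agreeing with teacher:    {loss_good:.3f}")
print(f"student disagreeing with teacher: {loss_poor:.3f}")
```

    A student that agrees with the teacher gets the lower loss, so minimizing this quantity pulls the sound model toward the vision model's predictions without any human annotation.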

    Main Results:

    • The proposed multi-task network performed well on all four tasks: semantic mask prediction, motion estimation, depth prediction, and spatial sound super-resolution.
    • Jointly training the four tasks proved mutually beneficial, leading to the best overall performance.
    • Microphone configuration (number and orientation) significantly impacts performance.
    • Complementary features from standard spectrograms and classic signal processing pipelines enhance auditory perception.
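
    Joint training of several tasks is commonly implemented by summing the per-task losses into a single objective. The sketch below shows that pattern for the four tasks named above. The task names, loss values, and weights are hypothetical placeholders, and a plain weighted sum is an assumption here rather than the weighting scheme reported in the paper.

```python
def joint_loss(task_losses, weights=None):
    """Weighted sum of per-task losses.

    task_losses: dict mapping a task name to its scalar loss.
    weights:     optional dict mapping the same names to weights;
                 every task defaults to weight 1.0 when omitted.
    """
    if weights is None:
        weights = {name: 1.0 for name in task_losses}
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in task_losses.items())


# Hypothetical per-task losses for one training step.
losses = {
    "semantic": 0.8,
    "motion": 0.5,
    "depth": 1.2,
    "super_resolution": 0.3,
}

total_equal = joint_loss(losses)
total_weighted = joint_loss(
    losses,
    {"semantic": 1.0, "motion": 1.0, "depth": 0.5, "super_resolution": 0.2},
)
print(f"equal weights:  {total_equal:.2f}")
print(f"custom weights: {total_weighted:.2f}")
```

    Because all tasks share one objective, a gradient step on the total updates the shared parts of the network using signal from every task, which is one plausible reason joint training can help each task.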

    Conclusions:

    • The developed approach demonstrates the potential of purely audio-based scene understanding for machines.
    • Multi-task learning and specialized audio processing techniques like Spatial Sound Super-Resolution are effective for improving auditory perception.
    • The new dataset and framework facilitate further research in sound-based scene understanding.