A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos.

Yan Wang¹, Yingchong Wang¹, Xiuqi Zhang¹

  • ¹ School of Mechanical Engineering, Beijing Institute of Technology, Haidian District, Beijing 100081, China.

Sensors (Basel, Switzerland) | March 14, 2026 | PubMed
Summary
This summary is machine-generated.

This study introduces a Vision-based Subtitle Generator (VSG) that converts sound-induced object vibrations into text. This novel approach uses phase-based motion estimation and a Transformer architecture for accurate speech recovery from visual data.

Area of Science:

  • Computer Vision
  • Acoustics
  • Signal Processing

Background:

  • Ambient sound, particularly speech, induces subtle vibrations in everyday objects.
  • These vibrations carry acoustic cues that can potentially be decoded into text.
  • Applications exist in monitoring and security.

Purpose of the Study:

  • To present the Vision-based Subtitle Generator (VSG).
  • To enable direct text recovery from high-speed videos of sound-induced object vibrations using a generative approach.
  • To reduce the dependency on large volumes of video data for training.

Main Methods:

  • Introduced a phase-based motion estimation (PME) technique, treating pixels as "independent microphones" to extract pseudo-acoustic signals.
  • Utilized a pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) as the encoder for the VSG-Transformer architecture.
  • Leveraged a generative approach for vibration-to-text conversion.

Keywords:
phase-based motion estimation (PME); pretrained acoustic model; text reconstruction from vibrations; transformer
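
The idea of treating each pixel as an "independent microphone" can be illustrated with a minimal sketch. The summary does not specify the filter bank, scales, or orientations used in the paper, so everything below is an assumption made for illustration: a single 1-D complex Gabor filter, hypothetical function names (gabor_kernel, local_response, pixel_signal), and a synthetic striped texture standing in for real high-speed video. The sketch only shows the core principle that a small displacement of local texture appears as a shift in local phase over time.

```python
import cmath
import math


def gabor_kernel(wavelength=8.0, sigma=4.0, radius=12):
    """Complex 1-D Gabor kernel: Gaussian envelope times a complex sinusoid."""
    omega = 2.0 * math.pi / wavelength
    return [
        math.exp(-(x * x) / (2.0 * sigma * sigma)) * cmath.exp(1j * omega * x)
        for x in range(-radius, radius + 1)
    ]


def local_response(row, kernel, center):
    """Complex filter response of one image row, evaluated at pixel `center`."""
    radius = len(kernel) // 2
    return sum(
        row[center + k] * kernel[k + radius].conjugate()
        for k in range(-radius, radius + 1)
    )


def pixel_signal(rows, center, wavelength=8.0):
    """Displacement-like time series for one pixel (one 'microphone').

    `rows` holds the same image row taken from consecutive frames.
    The local phase of each frame is compared with the first frame and
    converted to a displacement in pixels.
    """
    kernel = gabor_kernel(wavelength)
    omega = 2.0 * math.pi / wavelength
    reference = local_response(rows[0], kernel, center)
    signal = []
    for row in rows:
        response = local_response(row, kernel, center)
        # Phase difference relative to the first frame, in [-pi, pi].
        phase_shift = cmath.phase(response * reference.conjugate())
        # A shift of d pixels changes the local phase by -omega * d.
        signal.append(-phase_shift / omega)
    return signal


# Synthetic demo (assumed data): a striped texture vibrating by at most 0.05 pixels.
omega = 2.0 * math.pi / 8.0
true_shift = [0.05 * math.sin(2.0 * math.pi * t / 10.0) for t in range(20)]
frames = [[math.cos(omega * (x - d)) for x in range(64)] for d in true_shift]
recovered = pixel_signal(frames, center=32)
```

In this toy setting, recovered tracks true_shift closely even though the motion is far below one pixel. A real pipeline would presumably use 2-D multi-scale filters and combine many pixels before passing the pseudo-acoustic signal to the encoder; those details are not given in the summary.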

Main Results:

  • Achieved character error rates of 13.7% (Base) and 12.5% (Large) for text generation from chip bag vibrations.
  • Demonstrated the effectiveness of the generative approach in vibration-to-text transcription.
  • Maintained performance at lower sampling rates, showing robustness to limited temporal sampling.
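
The summary reports character error rates but does not state how they were computed. A common definition, assumed here, is the character-level Levenshtein (edit) distance between the reference and the generated text, divided by the reference length. The function names below are hypothetical and the sketch is not taken from the paper.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One missing character out of 11 reference characters.
cer = character_error_rate("hello world", "hello word")
```

Under this definition, a character error rate of 12.5% corresponds to roughly one character-level edit for every eight reference characters.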

Conclusions:

  • The VSG-Transformer effectively recovers text from sound-induced object vibrations.
  • The proposed methods significantly reduce the need for extensive video datasets.
  • The system shows promise for real-world applications in diverse acoustic environments.