End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks.

Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma

IEEE Transactions on Cybernetics

April 19, 2022

Summary

This summary is machine-generated.

This study introduces an end-to-end video-to-speech model using generative adversarial networks (GANs). The novel approach directly synthesizes realistic speech waveforms from video, outperforming previous methods on benchmark datasets.

Area of Science:

Artificial Intelligence
Speech Technology
Computer Vision

Background:

Traditional video-to-speech methods use multi-step processes with intermediate representations.
These methods often rely on separate vocoders or waveform reconstruction algorithms, limiting direct audio synthesis.

Purpose of the Study:

To develop a novel, end-to-end video-to-speech model.
To achieve direct waveform audio synthesis from raw video input without intermediate representations.

Main Methods:

An encoder-decoder architecture based on generative adversarial networks (GANs) was employed.
The model utilizes waveform and power critics with adversarial loss for direct audio synthesis.
Three comparative losses ensure correspondence between generated audio and input video.
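
As a rough illustration of how the two critics and the three comparative losses might combine into training objectives, here is a minimal sketch. It assumes a Wasserstein-style formulation in which each critic outputs an unbounded realness score; this summary does not state the exact adversarial objective, and the function names, the weighting scheme, and the example numbers are all hypothetical rather than taken from the paper.

```python
from statistics import fmean
from typing import Sequence


def critic_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Assumed Wasserstein-style critic loss.

    Minimizing this pushes scores for real audio up and scores for
    generated audio down. The same form would apply to the waveform
    critic and to the power critic.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(
    waveform_fake_scores: Sequence[float],
    power_fake_scores: Sequence[float],
    comparative_losses: Sequence[float],
    comparative_weights: Sequence[float],
) -> float:
    """Assumed generator objective: adversarial term plus comparative terms.

    The adversarial term rewards generated audio that both critics score
    highly. The comparative terms (three in the summary above) tie the
    generated audio to the input video; their weights are hypothetical.
    """
    if len(comparative_losses) != len(comparative_weights):
        raise ValueError("one weight is needed per comparative loss")
    adversarial = -(fmean(waveform_fake_scores) + fmean(power_fake_scores))
    comparative = sum(w * l for w, l in zip(comparative_weights, comparative_losses))
    return adversarial + comparative


if __name__ == "__main__":
    # Illustrative numbers only, not results from the study.
    print(critic_loss(real_scores=[1.0, 3.0], fake_scores=[0.0, 1.0]))
    print(generator_loss([0.5, 1.5], [1.0, 1.0], [0.2, 0.4, 0.6], [1.0, 0.5, 2.0]))
```

In a real training loop the scores would come from neural critics evaluated on batches of audio, and the losses would be differentiated with an automatic differentiation framework; plain floats are used here only to keep the arithmetic visible.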

Main Results:

The model reconstructs speech with high realism on constrained datasets such as GRID.
It is the first end-to-end model to generate intelligible speech on the challenging Lip Reading in the Wild (LRW) dataset.
Evaluations on seen and unseen speakers demonstrated superior performance across multiple objective metrics compared to prior work.

Conclusions:

The proposed end-to-end GAN-based video-to-speech model synthesizes waveforms directly from video, without a separate vocoder or waveform reconstruction step.
It achieves state-of-the-art realism and intelligibility in speech reconstruction for both controlled and in-the-wild scenarios.