VPT: Video portraits transformer for realistic talking face generation
Summary
This summary is machine-generated. This study introduces Video Portraits Transformer (VPT), a talking face generation method that produces realistic videos with identity preservation and natural blinks. The framework improves audio-visual synchronization and facial detail for applications such as digital assistants.
Area Of Science
- Computer Vision
- Artificial Intelligence
- Machine Learning
Background
- Existing audio-driven talking face generation methods struggle with photo-realism, identity preservation, and natural facial details like blinks.
- Synchronization between audio and video is a key challenge in current talking face synthesis.
Purpose Of The Study
- To propose a novel talking face generation framework, Video Portraits Transformer (VPT), addressing limitations in realism, identity preservation, and blink synchronization.
- To enhance the synthesis of photo-realistic talking face videos with controllable and natural blink movements.
Main Methods
- The proposed Video Portraits Transformer (VPT) framework employs a two-stage process: audio-to-landmark and landmark-to-face.
- The audio-to-landmark stage uses a transformer encoder to predict facial landmarks from the input audio, with the Eye Aspect Ratio (EAR) encoding the degree of eye openness for blink control.
- The landmark-to-face stage uses a video-to-video (vid-to-vid) network for landmark-to-realistic video synthesis, incorporating a spontaneous blink generation module.
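The EAR mentioned in the audio-to-landmark stage is the standard Eye Aspect Ratio computed from six eye landmarks; the paper does not give its code, but the widely used formulation can be sketched as follows (landmark ordering and the example coordinates are illustrative assumptions):

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye Aspect Ratio (EAR) from six 2-D eye landmarks.

    `eye` is a (6, 2) array ordered p1..p6: p1/p4 are the horizontal
    eye corners, p2/p3 the upper eyelid, p6/p5 the lower eyelid.
    EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|); it falls toward 0 as the
    eye closes, which makes it a convenient scalar blink signal.
    """
    p1, p2, p3, p4, p5, p6 = np.asarray(eye, dtype=float)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# Illustrative landmarks for an open and a nearly closed eye.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
print(eye_aspect_ratio(open_eye))    # ~0.667
print(eye_aspect_ratio(closed_eye))  # ~0.067
```

A per-frame EAR sequence like this can serve as the conditioning signal that the landmark-to-face stage consumes alongside the predicted landmarks.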
Main Results
- The VPT method successfully generates photo-realistic talking face videos with high identity preservation and accurate audiovisual synchronization.
- The spontaneous blink generation module reproduces the duration distribution and frequency of real human blinks, adding naturalness to the generated videos.
- Extensive experiments validate the framework's ability to produce high-quality talking face videos with natural blink movements.
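One way a spontaneous blink generator could match real blink frequency and duration statistics is to sample inter-blink intervals and durations from fitted distributions. The sketch below is illustrative only (the paper does not specify these distributions): it assumes exponentially distributed gaps at roughly 17 blinks/min and Gaussian durations around 250 ms, clipped to a plausible range.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_blinks(total_s, rate_per_min=17.0, mean_dur_s=0.25, dur_sd_s=0.05):
    """Sample spontaneous blink onsets and durations over `total_s` seconds.

    Assumed model (not from the paper): inter-blink gaps are exponential
    with mean 60/rate_per_min seconds, and blink durations are Gaussian
    around `mean_dur_s`, clipped to [0.1, 0.5] s to stay physiological.
    """
    onsets, t = [], 0.0
    mean_gap = 60.0 / rate_per_min
    while True:
        t += rng.exponential(mean_gap)
        if t >= total_s:
            break
        onsets.append(t)
    durations = np.clip(rng.normal(mean_dur_s, dur_sd_s, len(onsets)), 0.1, 0.5)
    return np.array(onsets), durations

onsets, durations = sample_blinks(60.0)
print(len(onsets), durations.min(), durations.max())
```

Each sampled (onset, duration) pair would then be rendered as a brief dip in the EAR signal that conditions the landmark-to-face stage.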
Conclusions
- The Video Portraits Transformer (VPT) framework offers a significant advancement in audio-driven talking face generation, achieving superior realism and identity preservation.
- The integration of controllable blink movements enhances the naturalness and expressiveness of synthesized talking faces.
- This research contributes to more immersive and realistic digital interactions in applications like virtual assistants and video conferencing.

