
Decoupling upper and lower face transformers for binary interactive video generation.

Journal: Neural Networks (Neural Netw)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Published: November 2025


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Current audio-driven binary interaction methods have limitations in capturing the uncertain relationship between a speaker's audio and an interlocutor's facial movements. To address this issue, we propose a video generation pipeline based on a cross-modal Transformer. First, a Transformer decoder partitions facial features into upper and lower regions, capturing lower features that are closely linked to the audio and upper features that remain independent of visual cues. Second, we design a cross-modal attention module that combines alignment bias with causal attention mechanisms to effectively manage subtle motion variations between adjacent frames in facial sequences. To mitigate uncertainties in long-term contexts, we expand the self-attention range of the Transformer encoder and integrate self-supervised pretrained speech representations to alleviate data scarcity. Finally, by optimizing the audio-to-action mapping and incorporating an enhanced neural renderer, we achieve fine control over facial movements while generating high-quality portrait images. Extensive experiments validate the effectiveness and superiority of our approach in interactive video generation.
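The abstract describes an architecture rather than code, but its central mechanism, cross-modal attention that combines an alignment bias with causal masking, can be sketched concretely. The following is a minimal, hypothetical PyTorch illustration; the module name, dimensions, and the linear-distance form of the alignment bias are our own assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalCausalAttention(nn.Module):
    """Sketch: face-motion queries attend to audio keys/values.

    A causal mask keeps frame t from seeing audio beyond step t, and a
    distance-based alignment bias softly steers frame t toward audio step t.
    (Illustrative only; the paper's exact formulation may differ.)
    """

    def __init__(self, dim: int, n_heads: int = 4, bias_scale: float = 0.1):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.bias_scale = bias_scale

    def forward(self, motion: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # motion: (B, T, dim) face-motion queries; audio: (B, T, dim) aligned audio features
        B, T, _ = motion.shape
        q = self.q_proj(motion).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(audio).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(audio).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, T, T)

        idx = torch.arange(T, device=motion.device)
        # Alignment bias: penalize attention far from the audio/video diagonal.
        scores = scores - self.bias_scale * (idx[None, :] - idx[:, None]).abs().float()
        # Causal mask: query frame t attends only to audio steps <= t.
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=motion.device))
        scores = scores.masked_fill(~causal, float("-inf"))

        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)

if __name__ == "__main__":
    layer = CrossModalCausalAttention(dim=256)
    motion = torch.randn(2, 50, 256)  # 50 frames of listener face-motion features
    audio = torch.randn(2, 50, 256)   # time-aligned speaker audio features
    print(layer(motion, audio).shape)  # torch.Size([2, 50, 256])

In the same spirit, the upper/lower decoupling could be realized as two decoder streams, the lower-face stream consuming this audio-conditioned attention and the upper-face stream running on self-attention alone; that split is likewise our reading of the abstract, not a confirmed detail of the paper.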


Source: http://dx.doi.org/10.1016/j.neunet.2025.107714 (DOI Listing)

Publication Analysis

Top Keywords

video generation: 12
upper lower: 8
interactive video: 8
facial movements: 8
decoupling upper: 4
lower face: 4
face transformers: 4
transformers binary: 4
binary interactive: 4
generation current: 4
