
Decoupling upper and lower face transformers for binary interactive video generation.

Journal: Neural Networks (Neural Netw)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Published: November 2025


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Current audio-driven binary interaction methods have limitations in capturing the uncertain relationship between a speaker's audio and an interlocutor's facial movements. To address this issue, we propose a video generation pipeline based on a cross-modal Transformer. First, a Transformer decoder partitions facial features into upper and lower regions, capturing lower features that are closely linked to the audio and upper features that remain independent of visual cues. Second, we design a cross-modal attention module that combines alignment bias with causal attention mechanisms to effectively manage subtle motion variations between adjacent frames in facial sequences. To mitigate uncertainties in long-term contexts, we expand the self-attention range of the Transformer encoder and integrate self-supervised pretrained speech representations to alleviate data scarcity. Finally, by optimizing the audio-to-action mapping and incorporating an enhanced neural renderer, we achieve fine control over facial movements while generating high-quality portrait images. Extensive experiments validate the effectiveness and superiority of our approach in interactive video generation.
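The abstract describes an architecture rather than code, but its central mechanism, cross-modal attention that combines an alignment bias with causal masking, can be sketched concretely. The following is a minimal, hypothetical PyTorch illustration; the module name, dimensions, and the linear-distance form of the alignment bias are our own assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalCausalAttention(nn.Module):
    """Sketch: face-motion queries attend to audio keys/values.

    A causal mask keeps frame t from seeing audio beyond step t, and a
    distance-based alignment bias softly steers frame t toward audio step t.
    (Illustrative only; the paper's exact formulation may differ.)
    """

    def __init__(self, dim: int, n_heads: int = 4, bias_scale: float = 0.1):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.bias_scale = bias_scale

    def forward(self, motion: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # motion: (B, T, dim) face-motion queries; audio: (B, T, dim) aligned audio features
        B, T, _ = motion.shape
        q = self.q_proj(motion).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(audio).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(audio).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, T, T)

        idx = torch.arange(T, device=motion.device)
        # Alignment bias: penalize attention far from the audio/video diagonal.
        scores = scores - self.bias_scale * (idx[None, :] - idx[:, None]).abs().float()
        # Causal mask: query frame t attends only to audio steps <= t.
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=motion.device))
        scores = scores.masked_fill(~causal, float("-inf"))

        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)

if __name__ == "__main__":
    layer = CrossModalCausalAttention(dim=256)
    motion = torch.randn(2, 50, 256)  # 50 frames of listener face-motion features
    audio = torch.randn(2, 50, 256)   # time-aligned speaker audio features
    print(layer(motion, audio).shape)  # torch.Size([2, 50, 256])

In the same spirit, the upper/lower decoupling could be realized as two decoder streams, the lower-face stream consuming this audio-conditioned attention and the upper-face stream running on self-attention alone; that split is likewise our reading of the abstract, not a confirmed detail of the paper.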


Source: http://dx.doi.org/10.1016/j.neunet.2025.107714 (DOI Listing)

Publication Analysis

Top Keywords

video generation: 12
upper lower: 8
interactive video: 8
facial movements: 8
decoupling upper: 4
lower face: 4
face transformers: 4
transformers binary: 4
binary interactive: 4
generation current: 4
