MsDUNE: A multi-scale masked temporal fusion framework for speaker-independent lipreading via Dirichlet uncertainty estimation.

Jinghan Wu , Xingwei An , Yakun Zhang , Changyan Zheng , Xingyu Zhang , Liang Xie , Erwei Yin

Neural Netw

Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, 300072, China; National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing, 100072, China; Tianjin Artificial Intelligence Innovation Center, Tianjin, 300450, China. Electronic addr

Published: November 2025

Lipreading, the task of recognizing speech based on visual cues from lip movements, typically requires a substantial amount of labeled training data to achieve optimal performance. However, this task is highly sensitive to variations among speakers, often resulting in significantly degraded recognition accuracy for unseen speakers. In this work, we introduce a novel framework, multi-scale masked temporal fusion with Dirichlet uncertainty estimation (MsDUNE), designed to mitigate the feature distribution disparities across different speakers. The proposed framework leverages a Dirichlet distribution to parameterize the latent space of a single feature branch, which is then quantitatively assessed through evidence and belief masses. Furthermore, MsDUNE calibrates multi-scale feature distributions by accounting for the mutual influence of feature beliefs between two branches, thereby enhancing the generalization capability of the lipreading model. We validate our approach through extensive experiments conducted on two widely recognized benchmarks, LRW-ID and AV Letters, as well as a self-collected lipreading dataset, CVSR100. The experimental results highlight the state-of-the-art performance of our method, particularly in scenarios involving unseen or overlapping speakers.
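The abstract does not give the exact formulas, so the following is only a minimal sketch of how evidence, belief masses, and uncertainty are commonly derived from a Dirichlet parameterization in the evidential deep learning literature. It assumes one evidence value per class with alpha = evidence + 1. The two-branch fusion shown is the reduced Dempster rule, used here as a stand-in for the paper's calibration step, which may differ. The names `dirichlet_opinion` and `combine_opinions` are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable softplus; keeps evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_opinion(logits):
    """Map raw outputs of one feature branch to Dirichlet parameters,
    per-class belief masses, and a scalar uncertainty mass.

    Assumption: evidence = softplus(logit), alpha = evidence + 1.
    """
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                # Dirichlet strength S
    num_classes = len(logits)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength  # belief masses + uncertainty = 1
    return alpha, belief, uncertainty


def combine_opinions(b1, u1, b2, u2):
    """Fuse two branch opinions with the reduced Dempster rule.

    Assumption: this illustrates one way beliefs from two branches can
    influence each other; it is not confirmed as the MsDUNE method.
    """
    n = len(b1)
    # Conflict: mass the two branches assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    belief = [scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) for k in range(n)]
    uncertainty = scale * u1 * u2
    return belief, uncertainty
```

In this sketch, a branch that produces little evidence for any class ends up with a high uncertainty mass, so it contributes less to the fused belief than a confident branch. That is one way a model could down-weight unreliable features for unseen speakers.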

DOI: http://dx.doi.org/10.1016/j.neunet.2025.107783

Similar Publications

Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations.

IEEE Trans Pattern Anal Mach Intell

September 2025

Cheng Lei , Jie Fan , Xinran Li , Tian-Zhu Xiang , Ao Li

Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we propose an affirmative solution. We analyze the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework.

Automated Kidney Tumor Segmentation in CT Images Using Deep Learning: A Multi-Stage Approach.

Acad Radiol

September 2025

In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan (H.-C.K., S.-J.P.); Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei Medical University, Taipei, Taiwan (S.-J.P.). Electronic address: sjpeng2

Hung-Cheng Kan , Geng-Ming Fan , Ming-Hao Wei , Po-Hung Lin , I-Hung Shao

Rationale and Objectives: Computed tomography (CT) remains the primary modality for assessing renal tumors; however, tumor identification and segmentation rely heavily on manual interpretation by clinicians, which is time-consuming and subject to inter-observer variability. The heterogeneity of tumor appearance and indistinct margins further complicate accurate delineation, impacting histopathological classification, treatment planning, and prognostic assessment. There is a pressing clinical need for an automated segmentation tool to enhance diagnostic workflows and support clinical decision-making with results that are reliable, accurate, and reproducible.

Multi-task deep learning for automatic image segmentation and treatment response assessment in metastatic ovarian cancer.

Int J Comput Assist Radiol Surg

September 2025

Department of Oncology, University of Cambridge, Cambridge, United Kingdom.

Bevis Drury , Inês P Machado , Zeyu Gao , Thomas Buddenkotte , Golnar Mahani

Purpose: High-grade serous ovarian carcinoma (HGSOC) is characterised by significant spatial and temporal heterogeneity, often presenting at an advanced metastatic stage. One of the most common treatment approaches involves neoadjuvant chemotherapy (NACT), followed by surgery. However, the multi-scale complexity of HGSOC poses a major challenge in evaluating response to NACT.

A Motion Segmentation Dynamic SLAM for Indoor GNSS-Denied Environments.

Sensors (Basel)

August 2025

College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China.

Yunhao Wu , Ziyao Zhang , Haifeng Chen , Jian Li

In GNSS-deprived settings, such as indoor and underground environments, research on simultaneous localization and mapping (SLAM) technology remains a focal point. Addressing the influence of dynamic variables on positional precision and constructing a persistent map comprising solely static elements are pivotal objectives in visual SLAM for dynamic scenes. This paper introduces optical flow motion segmentation-based SLAM (OS-SLAM), a dynamic environment SLAM system that incorporates optical flow motion segmentation for enhanced robustness.

Context-Guided SAR Ship Detection with Prototype-Based Model Pretraining and Check-Balance-Based Decision Fusion.

Sensors (Basel)

August 2025

College of Electronics and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China.

Haowen Zhou , Zhe Geng , Minjie Sun , Linyi Wu , He Yan

To address the challenging problem of multi-scale inshore-offshore ship detection in synthetic aperture radar (SAR) remote sensing images, we propose a novel deep learning-based automatic ship detection method within the framework of compositional learning. The proposed method is supported by three pillars: context-guided region proposal, prototype-based model pretraining, and multi-model ensemble learning. To reduce the false alarms induced by discrete ground clutter, prior knowledge of the harbour's layout is exploited to generate land masks for terrain delimitation.
