MsDUNE: A multi-scale masked temporal fusion framework for speaker-independent lipreading via Dirichlet uncertainty estimation.

Neural Netw

Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, 300072, China; National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing, 100072, China; Tianjin Artificial Intelligence Innovation Center, Tianjin, 300450, China. Electronic addr

Published: November 2025


Citations: 20

Article Abstract

Lipreading, the task of recognizing speech based on visual cues from lip movements, typically requires a substantial amount of labeled training data to achieve optimal performance. However, this task is highly sensitive to variations among speakers, often resulting in significantly degraded recognition accuracy for unseen speakers. In this work, we introduce a novel framework, multi-scale masked temporal fusion with Dirichlet uncertainty estimation (MsDUNE), designed to mitigate the feature distribution disparities across different speakers. The proposed framework leverages a Dirichlet distribution to parameterize the latent space of a single feature branch, which is then quantitatively assessed through evidence and belief masses. Furthermore, MsDUNE calibrates multi-scale feature distributions by accounting for the mutual influence of feature beliefs between two branches, thereby enhancing the generalization capability of the lipreading model. We validate our approach through extensive experiments conducted on two widely recognized benchmarks, LRW-ID and AV Letters, as well as a self-collected lipreading dataset, CVSR100. The experimental results highlight the state-of-the-art performance of our method, particularly in scenarios involving unseen or overlapping speakers.
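The abstract names evidence, belief masses, and the mutual influence of beliefs between two branches, but does not give the formulas. The sketch below is not the authors' implementation. It assumes the standard subjective-logic mapping used in evidential deep learning (Dirichlet parameters alpha_k = e_k + 1, belief b_k = e_k / S, uncertainty u = K / S, with S the sum of the alpha_k) and, as a further assumption, fuses two branches with a reduced Dempster combination rule. The function names and the example evidence values are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Assumed mapping: alpha_k = e_k + 1, S = sum(alpha),
    b_k = e_k / S, u = K / S, so that sum(b) + u == 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def combine_opinions(b1, u1, b2, u2):
    """Fuse two opinions over the same classes (reduced Dempster rule).

    This rule is an assumption for illustration; the abstract does not
    state how MsDUNE couples the two branches.
    """
    n = len(b1)
    # Conflict: belief mass the two branches assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 - conflict
    belief = [(b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) / scale for i in range(n)]
    uncertainty = (u1 * u2) / scale
    return belief, uncertainty


# Hypothetical evidence from two feature branches over three classes.
belief_a, u_a = opinion_from_evidence([8.0, 1.0, 1.0])
belief_b, u_b = opinion_from_evidence([5.0, 2.0, 0.0])
fused_belief, fused_u = combine_opinions(belief_a, u_a, belief_b, u_b)
print(fused_belief, fused_u)
```

Under these assumptions, two branches that favour the same class produce a fused opinion with lower uncertainty than either branch alone, while disagreement raises the conflict term and rescales the remaining mass.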

Source
http://dx.doi.org/10.1016/j.neunet.2025.107783

Similar Publications

Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we propose an affirmative solution. We analyze the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework.

Automated Kidney Tumor Segmentation in CT Images Using Deep Learning: A Multi-Stage Approach.

Acad Radiol

September 2025

In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan (H.-C.K., S.-J.P.); Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei Medical University, Taipei, Taiwan (S.-J.P.). Electronic address: sjpeng2

Rationale And Objectives: Computed tomography (CT) remains the primary modality for assessing renal tumors; however, tumor identification and segmentation rely heavily on manual interpretation by clinicians, which is time-consuming and subject to inter-observer variability. The heterogeneity of tumor appearance and indistinct margins further complicate accurate delineation, impacting histopathological classification, treatment planning, and prognostic assessment. There is a pressing clinical need for an automated segmentation tool to enhance diagnostic workflows and support clinical decision-making with results that are reliable, accurate, and reproducible.

Purpose: High-grade serous ovarian carcinoma (HGSOC) is characterised by significant spatial and temporal heterogeneity, often presenting at an advanced metastatic stage. One of the most common treatment approaches involves neoadjuvant chemotherapy (NACT), followed by surgery. However, the multi-scale complexity of HGSOC poses a major challenge in evaluating response to NACT.

A Motion Segmentation Dynamic SLAM for Indoor GNSS-Denied Environments.

Sensors (Basel)

August 2025

College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China.

In GNSS-deprived settings, such as indoor and underground environments, simultaneous localization and mapping (SLAM) remains an active research focus. Two objectives are central to visual SLAM in dynamic scenes: limiting the effect of moving objects on positioning accuracy, and building a persistent map that contains only static elements. This paper introduces optical flow motion segmentation-based SLAM (OS-SLAM), a dynamic environment SLAM system that incorporates optical flow motion segmentation for enhanced robustness.

To address the challenging problem of multi-scale inshore-offshore ship detection in synthetic aperture radar (SAR) remote sensing images, we propose a novel deep learning-based automatic ship detection method within the framework of compositional learning. The proposed method is supported by three pillars: context-guided region proposal, prototype-based model-pretraining, and multi-model ensemble learning. To reduce the false alarms induced by the discrete ground clutters, the prior knowledge of the harbour's layout is exploited to generate land masks for terrain delimitation.