Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Audio-visual event localization (AVEL) aims to recognize events in videos by associating audio and visual information. However, the events involved in existing AVEL tasks are usually coarse-grained. In practice, finer-grained events sometimes need to be distinguished, especially in expert-level applications or rich-content-generation studies, yet this is challenging because such events are harder to detect and distinguish than coarse-grained ones. To better address this problem, we discuss a new setting of fine-grained AVEL, from dataset to method. First, we construct the first fine-grained audio-visual event dataset, called IT-AVE, built on videos of people playing musical instruments and containing 13k video clips with over 52k audio-visual events. All events are labeled by professional music practitioners, and the event categories are all derived from playing techniques, which are fine-grained with little interclass variation. Next, we design a new fine-grained event localization method, the spatial-temporal video event detector (SVED), which targets the challenges that fine-grained events are harder to perceive and more susceptible to interference. Finally, we conduct extensive experiments on the proposed IT-AVE dataset as well as fine-grained versions of two existing related datasets: UnAV-22, derived from UnAV-100, and FineAction-AV, derived from FineAction. Experimental results demonstrate the effectiveness of our method. We hope that this work will contribute to the exploration of an integrated understanding of audio-visual videos.
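For readers unfamiliar with the task, the sketch below shows one way fine-grained audio-visual event annotations could be represented and how predicted segments are typically matched to ground truth with temporal intersection-over-union (IoU), the standard criterion in event-localization benchmarks. The field names, example class label, and IoU threshold are illustrative assumptions; the abstract does not specify IT-AVE's annotation format or evaluation protocol.

```python
# Minimal sketch (not the paper's actual format): representing an
# audio-visual event as a labeled temporal segment and scoring a
# prediction against ground truth with temporal IoU.
from dataclasses import dataclass


@dataclass
class AVEvent:
    clip_id: str   # which video clip the event belongs to
    start: float   # event onset in seconds
    end: float     # event offset in seconds
    label: str     # fine-grained category, e.g. a playing technique (illustrative)


def temporal_iou(a: AVEvent, b: AVEvent) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: AVEvent, gt: AVEvent, iou_thresh: float = 0.5) -> bool:
    """A prediction counts as correct if it is in the same clip, carries the
    same fine-grained label, and overlaps the ground truth enough in time."""
    return (
        pred.clip_id == gt.clip_id
        and pred.label == gt.label
        and temporal_iou(pred, gt) >= iou_thresh
    )


if __name__ == "__main__":
    gt = AVEvent("clip_0001", 3.2, 5.8, "vibrato")
    pred = AVEvent("clip_0001", 3.5, 5.6, "vibrato")
    print(round(temporal_iou(pred, gt), 3), is_correct(pred, gt))
```

Because fine-grained categories differ only subtly (little interclass variation), the label-match condition above is where such benchmarks become markedly harder than coarse-grained AVEL, even when the temporal overlap is easy to achieve.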

Source: http://dx.doi.org/10.1109/TNNLS.2025.3600878 (DOI listing)

Publication Analysis

Top Keywords

audio-visual event (12)
event localization (12)
fine-grained audio-visual (8)
events (8)
audio-visual events (8)
coarse-grained events (8)
fine-grained (7)
event (6)
audio-visual (5)
localization audio-visual (4)

Similar Publications

Sensitivity to rhythmic and prosodic cues in speech has been described as a precursor of language acquisition. Consequently, atypical rhythmic processing during infancy and early childhood has been considered a risk factor for developmental language disorders. Despite many behavioural studies, the neural processing of rhythmic speech has not yet been explored in children with developmental language disorder (DLD).

Obstructive sleep apnea (OSA) is associated with psychophysiological impairments, and recent studies have shown the feasibility of using speech and craniofacial images during wakefulness for severity estimation. However, the inherent limitations of unimodal data constrain the performance of current methods. To address this, we proposed a novel hypergraph-based multimodal fusion framework (HMFusion) that integrates psychophysiological information from audio-visual data.

Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding.

We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused each sound. We identify actions that can be discriminated purely from audio by grouping these free-form descriptions of audio into classes.
