Audio-visual event localization (AVEL) aims to recognize events in videos by associating audio and visual information. However, the events involved in existing AVEL tasks are usually coarse-grained. In practice, finer-grained events sometimes need to be distinguished, especially in expert-level applications or rich-content-generation studies. This is challenging because fine-grained events are harder to detect and distinguish than coarse-grained ones. To better address this problem, we investigate a new fine-grained AVEL setting, from dataset to method. First, we construct the first fine-grained audio-visual event dataset, called IT-AVE, built from videos of people playing musical instruments and containing 13k video clips with over 52k audio-visual events. All events are labeled by professional music practitioners, and the event categories are derived from playing techniques, which are fine-grained with little interclass variation. Next, we design a new fine-grained event localization method, the spatial-temporal video event detector (SVED), which targets the challenges that fine-grained events are harder to perceive and more prone to interference. Finally, we conduct extensive experiments on the proposed IT-AVE dataset as well as on fine-grained versions of two existing related datasets: UnAV-22, derived from UnAV-100, and FineAction-AV, derived from FineAction. Experimental results demonstrate the effectiveness of our method. We hope that this work will contribute to the exploration of an integrated understanding of audio-visual videos.
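The abstract does not state the evaluation protocol, but temporal event localization work of this kind is typically scored by matching predicted event segments to ground-truth segments via temporal intersection-over-union (tIoU). The sketch below only illustrates that generic matching step under stated assumptions; it is not the authors' SVED code, and the segment format, function names, and example labels are hypothetical.

```python
# Illustrative sketch only -- not the authors' implementation.
# Segments are (start, end) pairs in seconds; names are hypothetical.

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments."""
    p_start, p_end = pred
    g_start, g_end = gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predicted events to ground-truth events of the same
    class; a prediction counts as a true positive if its best unmatched
    ground-truth segment overlaps by at least the tIoU threshold."""
    used = set()
    tp = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in used or gt["label"] != pred["label"]:
                continue
            iou = temporal_iou(pred["segment"], gt["segment"])
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_threshold:
            used.add(best_idx)
            tp += 1
    return tp

# Example: one correctly localized fine-grained playing-technique event.
preds = [{"segment": (2.0, 4.5), "label": "vibrato"}]
gts = [{"segment": (2.2, 4.8), "label": "vibrato"}]
print(count_true_positives(preds, gts))  # -> 1
```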
DOI: http://dx.doi.org/10.1109/TNNLS.2025.3600878
Imaging Neurosci (Camb)
December 2024
Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom.
Sensitivity to rhythmic and prosodic cues in speech has been described as a precursor of language acquisition. Consequently, atypical rhythmic processing during infancy and early childhood has been considered a risk factor for developmental language disorders. Despite many behavioural studies, the neural processing of rhythmic speech has not yet been explored in children with developmental language disorder (DLD).
IEEE J Biomed Health Inform
August 2025
Obstructive sleep apnea (OSA) is associated with psychophysiological impairments, and recent studies have shown the feasibility of using speech and craniofacial images during wakefulness for severity estimation. However, the inherent limitations of unimodal data constrain the performance of current methods. To address this, we propose a novel hypergraph-based multimodal fusion framework (HMFusion) that integrates psychophysiological information from audio-visual data.
IEEE Trans Pattern Anal Mach Intell
August 2025
Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding.
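As an illustrative aside (not taken from the paper), one reason TAL, SED, and AVEL can plausibly share a single framework is that all three reduce to predicting labeled temporal segments that differ only in the modality they are grounded in. The record type below is a hypothetical sketch of such a shared annotation schema; the field names and example values are assumptions, not the paper's actual interface.

```python
# Hypothetical shared annotation schema -- not the paper's framework.
from dataclasses import dataclass
from typing import Literal

Modality = Literal["visual", "audio", "audio-visual"]

@dataclass
class LocalizedEvent:
    video_id: str
    label: str          # event class, e.g. "siren"
    start: float        # seconds
    end: float          # seconds
    modality: Modality  # TAL -> visual, SED -> audio, AVEL -> audio-visual

# The same record type covers all three tasks:
tal_event = LocalizedEvent("v001", "high jump", 12.3, 15.8, "visual")
sed_event = LocalizedEvent("v002", "siren", 3.0, 7.5, "audio")
avel_event = LocalizedEvent("v003", "playing violin", 0.0, 10.0, "audio-visual")
```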
IEEE Trans Pattern Anal Mach Intell
July 2025
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused each sound. We identify actions that can be discriminated purely from audio by grouping these free-form descriptions into classes.