Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Audio-visual event localization (AVEL) aims to recognize events in videos by associating audio and visual information. However, the events involved in existing AVEL tasks are usually coarse-grained. In practice, finer-grained events sometimes need to be distinguished, especially in expert-level applications or rich-content-generation studies, yet this is challenging because such events are harder to detect and distinguish than coarse-grained ones. To better address this problem, we discuss a new setting of fine-grained AVEL, from dataset to method. First, we construct the first fine-grained audio-visual event dataset, called IT-AVE, built on videos of people playing musical instruments and containing 13k video clips with over 52k audio-visual events. All events are labeled by professional music practitioners, and the event categories are all derived from playing techniques, which are fine-grained with little interclass variation. Next, we design a new fine-grained event localization method, the spatial-temporal video event detector (SVED), which targets the challenges that fine-grained events are harder to perceive and more susceptible to interference. Finally, we conduct extensive experiments on the proposed IT-AVE dataset as well as fine-grained versions of two existing related datasets: UnAV-22, derived from UnAV-100, and FineAction-AV, derived from FineAction. Experimental results demonstrate the effectiveness of our method. We hope that this work will contribute to the exploration of an integrated understanding of audio-visual videos.
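For readers unfamiliar with the task, the sketch below shows one way fine-grained audio-visual event annotations could be represented and how predicted segments are typically matched to ground truth with temporal intersection-over-union (IoU), the standard criterion in event-localization benchmarks. The field names, example class label, and IoU threshold are illustrative assumptions; the abstract does not specify IT-AVE's annotation format or evaluation protocol.

```python
# Minimal sketch (not the paper's actual format): representing an
# audio-visual event as a labeled temporal segment and scoring a
# prediction against ground truth with temporal IoU.
from dataclasses import dataclass


@dataclass
class AVEvent:
    clip_id: str   # which video clip the event belongs to
    start: float   # event onset in seconds
    end: float     # event offset in seconds
    label: str     # fine-grained category, e.g. a playing technique (illustrative)


def temporal_iou(a: AVEvent, b: AVEvent) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: AVEvent, gt: AVEvent, iou_thresh: float = 0.5) -> bool:
    """A prediction counts as correct if it is in the same clip, carries the
    same fine-grained label, and overlaps the ground truth enough in time."""
    return (
        pred.clip_id == gt.clip_id
        and pred.label == gt.label
        and temporal_iou(pred, gt) >= iou_thresh
    )


if __name__ == "__main__":
    gt = AVEvent("clip_0001", 3.2, 5.8, "vibrato")
    pred = AVEvent("clip_0001", 3.5, 5.6, "vibrato")
    print(round(temporal_iou(pred, gt), 3), is_correct(pred, gt))
```

Because fine-grained categories differ only subtly (little interclass variation), the label-match condition above is where such benchmarks become markedly harder than coarse-grained AVEL, even when the temporal overlap is easy to achieve.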

Source: http://dx.doi.org/10.1109/TNNLS.2025.3600878 (DOI listing)

Publication Analysis

Top Keywords

audio-visual event (12)
event localization (12)
fine-grained audio-visual (8)
events (8)
audio-visual events (8)
coarse-grained events (8)
fine-grained (7)
event (6)
audio-visual (5)
localization audio-visual (4)

Similar Publications

Sensitivity to rhythmic and prosodic cues in speech has been described as a precursor of language acquisition. Consequently, atypical rhythmic processing during infancy and early childhood has been considered a risk factor for developmental language disorders. Despite many behavioural studies, the neural processing of rhythmic speech has not yet been explored in children with developmental language disorder (DLD).

Obstructive sleep apnea (OSA) is associated with psychophysiological impairments, and recent studies have shown the feasibility of using speech and craniofacial images during wakefulness for severity estimation. However, the inherent limitations of unimodal data constrain the performance of current methods. To address this, we proposed a novel hypergraph-based multimodal fusion framework (HMFusion) that integrates psychophysiological information from audio-visual data.

Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding.

We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused each sound. We identify actions that can be discriminated purely from audio by grouping these free-form descriptions of audio into classes.
