Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visual inputs into LLMs have boosted performance on a variety of Audio-Visual Question Answering (AVQA) tasks, two crucial challenges remain: 1) audio-visual ambiguity, and 2) audio-visual hallucination. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects because their responses are ambiguous or hallucinated. To overcome these two issues, we introduce CAT+, which enhances MLLMs to ensure more robust multimodal understanding. We first propose the Sequential Question-guided Module (SQM), which combines tiny transformer layers with cascaded Q-Formers to achieve solid audio-visual grounding. After feature alignment and high-quality instruction tuning, we introduce Ambiguity Scoring Direct Preference Optimization (AS-DPO) to correct CAT+'s bias toward ambiguous descriptions. To explore the hallucinatory deficits of MLLMs in dynamic audio-visual scenes, we build a new Audio-Visual Hallucination Benchmark, named AVHbench. This benchmark measures the extent of MLLM hallucination across three protocols: perceptual object, counting, and holistic description tasks. Extensive experiments across video-based understanding, open-ended, and close-ended AVQA demonstrate the superior performance of our method.
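The abstract does not detail the AS-DPO objective, but it is named as an extension of Direct Preference Optimization. Below is a minimal sketch of the standard DPO loss that AS-DPO builds on; the function name, the `beta` default, and the per-pair formulation are illustrative assumptions, and the ambiguity-scoring weighting that distinguishes AS-DPO is not reproduced here because it is not specified in this abstract.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (sketch).

    logp_* are summed log-probabilities of the preferred/rejected
    responses under the policy being tuned; ref_logp_* are the same
    quantities under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Loss is -log(sigmoid(margin)): it shrinks as the policy prefers
    # the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In AS-DPO, the chosen/rejected pairs would be ranked by an ambiguity score rather than generic human preference, but the underlying contrastive log-ratio objective is the same.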

Source: http://dx.doi.org/10.1109/TPAMI.2025.3582389

Publication Analysis

Top Keywords

audio-visual (8), large language (8), language models (8), audio-visual hallucination (8), cat+ investigating (4), investigating enhancing (4), enhancing audio-visual (4), audio-visual understanding (4), understanding large (4), models multimodal (4)

Similar Publications

Background: Affiliate stigma (AS) is self-stigma in caregivers, comprising three salient components: affective, behavioral, and cognitive. High caregiver AS leads to concealment of mental illness and has negative consequences. Appropriate intervention for AS can offset such consequences.

Recognition memory is typically better for items learned after a free choice (independent of study material) than after a forced choice. However, previous studies presented to-be-remembered items in isolation, whereas everyday learning often occurs alongside distractors. Therefore, this study investigated the effect of free versus forced choice on recognition memory in a learning situation with both relevant (to-be-remembered) and irrelevant (to-be-ignored) items.

Speech and language rehabilitation is essential for people with communication disorders caused by neurological conditions, developmental delays, or physical disabilities. With the advent of deep learning, we introduce an improved multimodal rehabilitation pipeline that integrates audio, video, and text information to provide therapy tailored to, and adaptive for, each patient. The approach uses a cross-attention fusion multimodal hierarchical transformer architecture that jointly models speech acoustics together with facial dynamics, lip articulation, and linguistic context.
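The cross-attention fusion described above can be sketched with the generic scaled dot-product cross-attention primitive, where features from one modality attend over another. This is a single-head NumPy sketch under assumed shapes; the function name and formulation are illustrative, not the paper's implementation.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (sketch).

    queries:     (n_q, d) features of one modality, e.g. audio frames
    keys_values: (n_k, d) features of another modality, e.g. video frames
    Returns (n_q, d): each query re-expressed as a weighted mix of the
    other modality's features.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)          # (n_q, n_k)
    # Row-wise softmax over the attended modality.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values                           # (n_q, d)
```

A hierarchical fusion model would stack such layers, alternating which modality provides the queries versus the keys/values.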

Audio-visual event localization (AVEL) aims to recognize events in videos by associating audio and visual information. However, the events involved in existing AVEL tasks are usually coarse-grained. In practice, finer-grained events sometimes need to be distinguished, especially in certain expert-level applications or rich-content-generation studies.

Objectives: In recent years, there has been a profound increase in the use of remote online communication as a supplement to, and in many cases a replacement for, in-person interactions. While online communication tools hold potential to improve accessibility, previous studies have suggested that increased reliance on remote communication poses additional challenges for people with hearing loss, including those with a cochlear implant (CI). This study aimed to investigate the preferences and speech-reception performance of adults with a CI during online communication.
