Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visual inputs into LLMs have boosted performance on a variety of Audio-Visual Question Answering (AVQA) tasks, two crucial challenges remain: 1) audio-visual ambiguity, and 2) audio-visual hallucination. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects because their responses are ambiguous or hallucinated. To overcome these two issues, we introduce CAT+, which enhances MLLMs to ensure more robust multimodal understanding. We first propose the Sequential Question-guided Module (SQM), which combines tiny transformer layers with cascaded Q-Formers to achieve solid audio-visual grounding. After feature alignment and high-quality instruction tuning, we introduce Ambiguity Scoring Direct Preference Optimization (AS-DPO) to correct CAT+'s bias toward ambiguous descriptions. To explore the hallucinatory deficits of MLLMs in dynamic audio-visual scenes, we build a new Audio-Visual Hallucination Benchmark, named AVHbench. This benchmark measures the extent of MLLM hallucination across three protocols: perceptual object, counting, and holistic description tasks. Extensive experiments across video-based understanding, open-ended, and close-ended AVQA demonstrate the superior performance of our method.
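The abstract does not detail the AS-DPO objective, but it is named as an extension of Direct Preference Optimization. Below is a minimal sketch of the standard DPO loss that AS-DPO builds on; the function name, the `beta` default, and the per-pair formulation are illustrative assumptions, and the ambiguity-scoring weighting that distinguishes AS-DPO is not reproduced here because it is not specified in this abstract.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (sketch).

    logp_* are summed log-probabilities of the preferred/rejected
    responses under the policy being tuned; ref_logp_* are the same
    quantities under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Loss is -log(sigmoid(margin)): it shrinks as the policy prefers
    # the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In AS-DPO, the chosen/rejected pairs would be ranked by an ambiguity score rather than generic human preference, but the underlying contrastive log-ratio objective is the same.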

Source: http://dx.doi.org/10.1109/TPAMI.2025.3582389

Publication Analysis

Top Keywords

audio-visual (8), large language (8), language models (8), audio-visual hallucination (8), cat+ investigating (4), investigating enhancing (4), enhancing audio-visual (4), audio-visual understanding (4), understanding large (4), models multimodal (4)

Similar Publications

Background: Affiliate stigma (AS) is self-stigma in caregivers, comprising three salient components: affective, behavioral, and cognitive. High caregiver AS leads to concealment of mental illness and has negative consequences. Appropriate intervention for AS can offset such consequences.

Recognition memory is typically better for items learned after a free choice (independent of study material) than after a forced choice. However, previous studies presented to-be-remembered items in isolation, whereas everyday learning often occurs alongside distractors. Therefore, this study investigated the effect of free versus forced choice on recognition memory in a learning situation with both relevant (to-be-remembered) and irrelevant (to-be-ignored) items.

Speech and language rehabilitation is essential for people with communication disorders caused by neurological conditions, developmental delays, or physical disabilities. With the advent of deep learning, we introduce an improved multimodal rehabilitation pipeline that integrates audio, video, and text information to provide therapy tailored to, and adaptive for, each patient. The approach uses a cross-attention fusion multimodal hierarchical transformer architecture that jointly models speech acoustics together with facial dynamics, lip articulation, and linguistic context.
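The cross-attention fusion described above can be sketched with the generic scaled dot-product cross-attention primitive, where features from one modality attend over another. This is a single-head NumPy sketch under assumed shapes; the function name and formulation are illustrative, not the paper's implementation.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (sketch).

    queries:     (n_q, d) features of one modality, e.g. audio frames
    keys_values: (n_k, d) features of another modality, e.g. video frames
    Returns (n_q, d): each query re-expressed as a weighted mix of the
    other modality's features.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)          # (n_q, n_k)
    # Row-wise softmax over the attended modality.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values                           # (n_q, d)
```

A hierarchical fusion model would stack such layers, alternating which modality provides the queries versus the keys/values.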

Audio-visual event localization (AVEL) aims to recognize events in videos by associating audio and visual information. However, the events involved in existing AVEL tasks are usually coarse-grained. In practice, finer-grained events sometimes need to be distinguished, especially in certain expert-level applications or rich-content-generation studies.

Objectives: In recent years, there has been a profound increase in the use of remote online communication as a supplement to, and in many cases a replacement for, in-person interactions. While online communication tools hold potential to improve accessibility, previous studies have suggested that increased reliance on remote communication poses additional challenges for people with hearing loss, including those with a cochlear implant (CI). This study aimed to investigate the preferences and speech-reception performance of adults with a CI during online communication.
