Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. The assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently affected when modalities are imbalanced, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between the learned matching representations and the intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, while simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.
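As a rough illustration only (not the authors' released code), the sketch below combines an instance-level cross-modal matching loss with a structure-aware distillation term implemented as relational matching: the pairwise-similarity geometry of the learned matching representations is pulled toward that of fixed intra-modal representations. The choice of InfoNCE for the matching loss, mean-squared error between similarity matrices, the weight `lam`, and all function names are assumptions made for this sketch.

```python
# Hypothetical sketch (PyTorch): cross-modal matching loss + structure-aware
# distillation via relational matching. Names and loss choices are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def matching_loss(img_emb, txt_emb, temperature=0.07):
    """Instance-level cross-modal matching: symmetric InfoNCE over a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                 # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def relational_distillation(match_emb, intra_emb):
    """Structure-aware distillation: align the pairwise-similarity matrix of the
    matching representations with that of the (frozen) intra-modal representations."""
    s_match = F.normalize(match_emb, dim=-1) @ F.normalize(match_emb, dim=-1).t()
    s_intra = F.normalize(intra_emb, dim=-1) @ F.normalize(intra_emb, dim=-1).t()
    return F.mse_loss(s_match, s_intra.detach())                 # intra-modal side acts as teacher

def total_loss(img_match, txt_match, img_intra, txt_intra, lam=1.0):
    """Rebalanced objective: instance-level matching plus structure preservation per modality."""
    return (matching_loss(img_match, txt_match)
            + lam * (relational_distillation(img_match, img_intra)
                     + relational_distillation(txt_match, txt_intra)))

if __name__ == "__main__":
    B, d = 32, 256
    loss = total_loss(torch.randn(B, d), torch.randn(B, d),
                      torch.randn(B, d), torch.randn(B, d))
    print(loss.item())
```

In this reading, the distillation term acts as a regularizer on the geometry of the common space rather than on individual pairs, which is what allows it to counteract the structural distortion caused by imbalanced modalities.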

Source
http://dx.doi.org/10.1109/TIP.2024.3518759

Publication Analysis

Top Keywords

cross-modal matching: 28
structure-aware distillation: 12
matching: 12
matching representations: 12
cross-modal: 10
vision-language retrieval: 8
common space: 8
retrieval performance: 8
cross-modal retrieval: 8
imbalanced modalities: 8

Similar Publications

Distinct Neural Mechanisms of Visual and Sound Adaptation in the Cat Visual Cortex.

Eur J Neurosci

September 2025

The Tampa Human Neurophysiology Lab, Department of Neurosurgery, Brain and Spine, Morsani College of Medicine, University of South Florida, Tampa, Florida, USA.

Sensory areas exhibit modular selectivity to stimuli, but they can also respond to features outside of their basic modality. Several studies have shown cross-modal plastic modifications between visual and auditory cortices; however, the exact mechanisms of these modifications are not yet completely known. To this end, we investigated the effect of 12 min of visual versus sound adaptation (referring to forceful application of an optimal/nonoptimal stimulus to a neuron[s] under observation) on the infragranular and supragranular primary visual neurons (V1) of the cat (Felis catus).

Visual category-selective representations in human ventral occipital temporal cortex (VOTC) seem to emerge early in infancy. Surprisingly, the VOTC of congenitally blind humans features category-selectivity for auditory and haptic objects. Yet it has been unknown whether VOTC would show category-selective visual responses if sight were restored in congenitally blind humans.

Cross-modal adaptive reconstruction of open education resources.

Sci Rep

August 2025

Engineering Education Innovation Research Center, The Open University of Sichuan, Chengdu, 610073, China.

Matching vast online resources to individual learners' needs remains a major challenge, especially for adults with diverse backgrounds. To address this challenge, we propose a Dynamic Knowledge Graph-enhanced Cross-Modal Recommendation model (DKG-CMR). This model utilizes a dynamic knowledge graph (a structure organizing information and relationships) that continuously updates based on learner actions and course objectives.

Accurate diagnosis of many diseases plays an important role in modern medicine. As emerging non-invasive diagnostic technologies, Raman spectroscopy and infrared spectroscopy have shown unique advantages in detecting disease-characteristic markers owing to their high sensitivity and specificity. However, relying on Raman or infrared spectroscopy alone is often insufficient for diagnosing some diseases.

Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), a task made difficult by their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack sufficient accuracy.

Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method.
