Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. The assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently affected when modalities are imbalanced, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between the learned matching representations and the intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, while simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.
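As a rough illustration only (not the authors' released code), the sketch below combines an instance-level cross-modal matching loss with a structure-aware distillation term implemented as relational matching: the pairwise-similarity geometry of the learned matching representations is pulled toward that of fixed intra-modal representations. The choice of InfoNCE for the matching loss, mean-squared error between similarity matrices, the weight `lam`, and all function names are assumptions made for this sketch.

```python
# Hypothetical sketch (PyTorch): cross-modal matching loss + structure-aware
# distillation via relational matching. Names and loss choices are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def matching_loss(img_emb, txt_emb, temperature=0.07):
    """Instance-level cross-modal matching: symmetric InfoNCE over a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                 # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def relational_distillation(match_emb, intra_emb):
    """Structure-aware distillation: align the pairwise-similarity matrix of the
    matching representations with that of the (frozen) intra-modal representations."""
    s_match = F.normalize(match_emb, dim=-1) @ F.normalize(match_emb, dim=-1).t()
    s_intra = F.normalize(intra_emb, dim=-1) @ F.normalize(intra_emb, dim=-1).t()
    return F.mse_loss(s_match, s_intra.detach())                 # intra-modal side acts as teacher

def total_loss(img_match, txt_match, img_intra, txt_intra, lam=1.0):
    """Rebalanced objective: instance-level matching plus structure preservation per modality."""
    return (matching_loss(img_match, txt_match)
            + lam * (relational_distillation(img_match, img_intra)
                     + relational_distillation(txt_match, txt_intra)))

if __name__ == "__main__":
    B, d = 32, 256
    loss = total_loss(torch.randn(B, d), torch.randn(B, d),
                      torch.randn(B, d), torch.randn(B, d))
    print(loss.item())
```

In this reading, the distillation term acts as a regularizer on the geometry of the common space rather than on individual pairs, which is what allows it to counteract the structural distortion caused by imbalanced modalities.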

Source
http://dx.doi.org/10.1109/TIP.2024.3518759

Publication Analysis

Top Keywords

cross-modal matching: 28
structure-aware distillation: 12
matching: 12
matching representations: 12
cross-modal: 10
vision-language retrieval: 8
common space: 8
retrieval performance: 8
cross-modal retrieval: 8
imbalanced modalities: 8

Similar Publications

Distinct Neural Mechanisms of Visual and Sound Adaptation in the Cat Visual Cortex.

Eur J Neurosci

September 2025

The Tampa Human Neurophysiology Lab, Department of Neurosurgery, Brain and Spine, Morsani College of Medicine, University of South Florida, Tampa, Florida, USA.

Sensory areas exhibit modular selectivity to stimuli, but they can also respond to features outside of their basic modality. Several studies have shown cross-modal plastic modifications between visual and auditory cortices; however, the exact mechanisms of these modifications are not yet completely known. To this end, we investigated the effect of 12 min of visual versus sound adaptation (referring to forceful application of an optimal/nonoptimal stimulus to a neuron[s] under observation) on the infragranular and supragranular primary visual neurons (V1) of the cat (Felis catus).

Visual category-selective representations in human ventral occipital temporal cortex (VOTC) seem to emerge early in infancy. Surprisingly, the VOTC of congenitally blind humans features category-selectivity for auditory and haptic objects. Yet it has been unknown whether VOTC would show category-selective visual responses if sight were restored in congenitally blind humans.

Cross-modal adaptive reconstruction of open education resources.

Sci Rep

August 2025

Engineering Education Innovation Research Center, The Open University of Sichuan, Chengdu, 610073, China.

Matching vast online resources to individual learners' needs remains a major challenge, especially for adults with diverse backgrounds. To address this challenge, we propose a Dynamic Knowledge Graph-enhanced Cross-Modal Recommendation model (DKG-CMR). This model utilizes a dynamic knowledge graph (a structure organizing information and relationships) that continuously updates based on learner actions and course objectives.

Accurate diagnosis of many diseases plays an important role in modern medicine. As emerging non-invasive diagnostic technologies, Raman spectroscopy and infrared spectroscopy have shown unique advantages in detecting disease-characteristic markers owing to their high sensitivity and specificity. However, relying on Raman or infrared spectroscopy alone is often insufficient for diagnosing some diseases.

Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), a task made difficult by their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack sufficient accuracy.

Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method.
