Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. The assumption underlying cross-modal matching is modal balance, i.e., that each modality contains sufficient information to represent the others. In practice, however, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon. The impact of such imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist: the structure of instances in the common space is inherently distorted under imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching and propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching scheme that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between the learned matching representations and the intra-modal representations through the developed relational matching. Extensive experiments on different datasets confirm the superior cross-modal retrieval performance of our approach, which simultaneously enhances single-modal retrieval compared to the baseline models.
DOI: http://dx.doi.org/10.1109/TIP.2024.3518759
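The abstract above pairs an instance-level cross-modal matching loss with a structure-aware distillation term that aligns the geometry of the matching space with intra-modal representations. A minimal NumPy sketch of that general idea, assuming an InfoNCE-style matching loss and a relational term comparing pairwise cosine-similarity matrices; all function names, the 0.5 weight, and the dimensions are illustrative, not the paper's implementation:

```python
import numpy as np

def contrastive_matching_loss(img, txt, tau=0.07):
    # Instance-level cross-modal matching (InfoNCE-style): matched
    # image/text pairs sit on the diagonal of the similarity matrix.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()

def structure_distillation_loss(matched, intra):
    # Relational matching: penalize mismatch between the pairwise
    # similarity structure of the matching space and the intra-modal space.
    def sim(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    return np.mean((sim(matched) - sim(intra)) ** 2)

rng = np.random.default_rng(0)
img_z = rng.normal(size=(8, 16))   # matching representations (image side)
txt_z = rng.normal(size=(8, 16))   # matching representations (text side)
img_f = rng.normal(size=(8, 32))   # intra-modal image features (e.g., frozen)

loss = (contrastive_matching_loss(img_z, txt_z)
        + 0.5 * structure_distillation_loss(img_z, img_f))
print(float(loss))
```

The distillation term acts only on pairwise relations, so the matching space can move freely as long as relative instance geometry stays consistent with the intra-modal space.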
Eur J Neurosci
September 2025
The Tampa Human Neurophysiology Lab, Department of Neurosurgery, Brain and Spine, Morsani College of Medicine, University of South Florida, Tampa, Florida, USA.
Sensory areas exhibit modular selectivity to stimuli, but they can also respond to features outside their basic modality. Several studies have shown cross-modal plastic modifications between the visual and auditory cortices; however, the exact mechanisms of these modifications are not yet completely known. To this end, we investigated the effect of 12 min of visual versus sound adaptation (the forceful application of an optimal/nonoptimal stimulus to the neuron[s] under observation) on infragranular and supragranular neurons of the primary visual cortex (V1) of the cat (Felis catus).
Hum Brain Mapp
August 2025
Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany.
Visual category-selective representations in human ventral occipital temporal cortex (VOTC) seem to emerge early in infancy. Surprisingly, the VOTC of congenitally blind humans features category-selectivity for auditory and haptic objects. Yet it has been unknown whether VOTC would show category-selective visual responses if sight were restored in congenitally blind humans.
Sci Rep
August 2025
Engineering Education Innovation Research Center, The Open University of Sichuan, Chengdu, 610073, China.
Matching vast online resources to individual learners' needs remains a major challenge, especially for adults with diverse backgrounds. To address this challenge, we propose a Dynamic Knowledge Graph-enhanced Cross-Modal Recommendation model (DKG-CMR). This model utilizes a dynamic knowledge graph (a structure organizing information and relationships) that continuously updates based on learner actions and course objectives.
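The entry above describes a knowledge graph that is updated from learner actions and then used to match courses to learners. A toy sketch of that update-then-recommend loop; the class, edge scheme, and scoring rule are hypothetical illustrations, not the DKG-CMR architecture:

```python
from collections import defaultdict

class DynamicKG:
    """Toy dynamic knowledge graph: weighted (node, concept) edges."""

    def __init__(self):
        self.edges = defaultdict(float)  # (learner_or_course, concept) -> weight

    def record_action(self, learner, concept, weight=1.0):
        # Continuously update: strengthen learner->concept on each interaction.
        self.edges[(learner, concept)] += weight

    def add_course(self, course, concepts):
        for c in concepts:
            self.edges[(course, c)] = 1.0

    def recommend(self, learner, courses):
        # Score each course by the learner's accumulated weight on the
        # concepts that course covers, and return the best match.
        def score(course):
            return sum(self.edges.get((learner, c), 0.0)
                       for (node, c) in list(self.edges) if node == course)
        return max(courses, key=score)

kg = DynamicKG()
kg.add_course("calculus_101", ["derivatives", "limits"])
kg.add_course("python_intro", ["loops", "functions"])
kg.record_action("ada", "loops")
kg.record_action("ada", "functions")
print(kg.recommend("ada", ["calculus_101", "python_intro"]))  # -> python_intro
```

In a real system the graph would also encode course objectives and cross-modal content features; here the point is only the shape of the feedback loop between actions and recommendations.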
Spectrochim Acta A Mol Biomol Spectrosc
January 2026
College of Software, Xinjiang University, Urumqi 830046, China.
In today's medical diagnostics, accurate diagnosis of many diseases is critical. As emerging non-invasive diagnostic technologies, Raman spectroscopy and infrared spectroscopy have shown unique advantages in detecting disease characteristic markers owing to their high sensitivity and specificity. However, diagnosing some diseases with Raman or infrared spectroscopy alone is relatively insufficient.
IEEE Trans Biomed Eng
August 2025
Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which is difficult due to their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack accuracy.
Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method.