Hybrid DAER Based Cross-Modal Retrieval Exploiting Deep Representation Learning.

Entropy (Basel)

School of Computer Science, Shaanxi Normal University, Xi'an 710119, China.

Published: August 2023


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Information retrieval across multiple modalities has attracted much attention from academics and practitioners. One key challenge of cross-modal retrieval is bridging the heterogeneity gap between different modalities. Most existing methods jointly construct a common subspace, but very little attention has been given to the importance of the different fine-grained regions of each modality, which significantly limits how well the extracted multimodal information is exploited. Therefore, this study proposes a novel text-image cross-modal retrieval approach built on a dual attention network and an enhanced relation network (DAER). More specifically, the dual attention network extracts precise fine-grained weight information from text and images, while the enhanced relation network widens the differences between different categories of data in order to improve the accuracy of the similarity computation. Comprehensive experimental results on three widely used datasets (i.e., Wikipedia, Pascal Sentence, and XMediaNet) show that our proposed approach is effective and superior to existing cross-modal retrieval methods.
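The page gives only the abstract, not the authors' implementation; the following is a minimal PyTorch-style sketch of the two components the abstract names. Every module name, shape, and layer choice (AttentionPool, RelationHead, 512-dimensional features, 36 regions, 20 words) is an illustrative assumption, not the DAER architecture itself.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Soft attention over fine-grained features (image regions or words):
    scores each unit, then returns the attention-weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                        # feats: (batch, units, dim)
        w = torch.softmax(self.score(feats), dim=1)  # per-unit weights
        return (w * feats).sum(dim=1)                # pooled: (batch, dim)

class RelationHead(nn.Module):
    """Learned similarity score for an (image, text) embedding pair,
    in place of a fixed metric such as cosine distance."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim),
                                 nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, img, txt):
        return self.mlp(torch.cat([img, txt], dim=-1)).squeeze(-1)

# usage: one attention branch per modality ("dual"), then pair scoring
img_regions = torch.randn(4, 36, 512)   # hypothetical region features
txt_words = torch.randn(4, 20, 512)     # hypothetical word features
att_img, att_txt = AttentionPool(512), AttentionPool(512)
scores = RelationHead(512)(att_img(img_regions), att_txt(txt_words))
```

Replacing a fixed metric with a learned pair scorer is the usual motivation for relation networks in retrieval, which matches the abstract's claim that the relation head improves the accuracy of the similarity computation.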


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10452985
DOI: http://dx.doi.org/10.3390/e25081216

Publication Analysis

Top Keywords

cross-modal retrieval (16)
dual attention (8)
attention network (8)
enhanced relation (8)
relation network (8)
retrieval (5)
hybrid daer (4)
daer based (4)
cross-modal (4)
based cross-modal (4)

Similar Publications

Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, realizing efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention for not needing external label information. However, in the field of unsupervised cross-modal hashing, there are several pressing issues to address: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data, thereby constructing a more reliable affinity matrix to assist in the learning of hash codes.
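A hedged sketch of the affinity-matrix idea this abstract raises: per-modality cosine affinities are fused into a joint similarity target, and a placeholder sign-of-projection step stands in for the learned hashing functions. All shapes and the random projection are assumptions for illustration, not the paper's method.

```python
import numpy as np

def cosine_affinity(feats):
    """Pairwise cosine-similarity affinity matrix for one modality."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

rng = np.random.default_rng(0)
n, d, bits = 8, 32, 16                      # items, feature dim, code length
img = rng.normal(size=(n, d))               # toy image features
txt = rng.normal(size=(n, d))               # toy text features

# fuse per-modality affinities into a joint similarity target that a
# real method would train its hashing functions to reproduce
S = 0.5 * (cosine_affinity(img) + cosine_affinity(txt))

# placeholder "hashing": random projection + sign (learned in practice)
W = rng.normal(size=(d, bits))
codes = np.sign(img @ W)                    # binary codes in {-1, +1}
hamming = (bits - codes @ codes.T) / 2      # pairwise Hamming distances
```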


Foundation models, including large language models and vision-language models (VLMs), have revolutionized artificial intelligence by enabling efficient, scalable, and multimodal learning across diverse applications. By leveraging advancements in self-supervised and semi-supervised learning, these models integrate computer vision and natural language processing to address complex tasks, such as disease classification, segmentation, cross-modal retrieval, and automated report generation. Their ability to pretrain on vast, uncurated datasets minimizes reliance on annotated data while improving generalization and adaptability for a wide range of downstream tasks.


In recent years, cross-modal retrieval has garnered significant attention due to its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to accurately capture user intent when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs).
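The framework's details are not given on this page; as one plausible, purely illustrative reading of "instruction-aware dynamic query generation", the sketch below gates a base query embedding with an instruction embedding. The module name, gating scheme, and 256-dimensional embeddings are all assumptions.

```python
import torch
import torch.nn as nn

class InstructionAwareQuery(nn.Module):
    """Toy gated fusion: modulate a base query embedding with an
    instruction embedding (e.g. pooled from a frozen LLM encoder)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, instruction):
        g = self.gate(torch.cat([query, instruction], dim=-1))
        return g * query + (1 - g) * self.proj(instruction)

query = torch.randn(4, 256)    # base text-query embedding
instr = torch.randn(4, 256)    # instruction embedding from an LLM
dynamic_query = InstructionAwareQuery(256)(query, instr)
```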


MolPrompt: Improving multi-modal molecular pre-training with knowledge prompts.

Bioinformatics

August 2025

College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China.

Motivation: Molecular pre-training has emerged as a foundational approach in computational drug discovery, enabling the extraction of expressive molecular representations from large-scale unlabeled datasets. However, existing methods largely focus on topological or structural features, often neglecting critical physicochemical attributes embedded in molecular systems.

Result: We present MolPrompt, a knowledge-enhanced multimodal pre-training framework that integrates molecular graphs and textual descriptions via contrastive learning.
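MolPrompt's code is not reproduced here; as a generic sketch of the contrastive-learning step the abstract describes, the following computes a symmetric InfoNCE loss over matched graph/text embedding pairs (batch size, dimensionality, and temperature are assumptions).

```python
import torch
import torch.nn.functional as F

def info_nce(graph_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: the matched graph/text pair in each row is
    the positive; every other pair in the batch is a negative."""
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature        # (batch, batch) similarities
    labels = torch.arange(g.size(0))      # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```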


Text-based person retrieval is the challenging task of searching for images of people using textual queries in natural language. Conventional methods primarily use deep neural networks to model the relationship between visual and textual data, creating a shared feature space for cross-modal matching. Despite notable improvements, such methods often remain unaware of the difference in feature granularity between the two modalities and of the diverse poses and viewing angles across images of the same person, and may therefore overlook significant differences both within and across modalities.
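As a minimal sketch of the shared-space matching such conventional methods rely on: gallery images are ranked by cosine similarity to a text query, assuming both embeddings were already produced by some encoder pair (omitted here); all dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_gallery(text_query, image_gallery):
    """Rank gallery images by cosine similarity to a text query,
    assuming both live in a shared embedding space."""
    q = F.normalize(text_query, dim=-1)       # (dim,)
    g = F.normalize(image_gallery, dim=-1)    # (n, dim)
    return (g @ q).argsort(descending=True)   # best match first

order = rank_gallery(torch.randn(256), torch.randn(100, 256))
```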
