Information retrieval across multiple modalities has attracted much attention from academics and practitioners. One key challenge of cross-modal retrieval is bridging the heterogeneity gap between modalities. Most existing methods jointly construct a common subspace, but very little attention has been given to the importance of the different fine-grained regions of each modality, which significantly limits how well the information extracted from the modalities is used. Therefore, this study proposes a novel text-image cross-modal retrieval approach that constructs a dual attention network and an enhanced relation network (DAER). More specifically, the dual attention network precisely extracts fine-grained weight information from text and images, while the enhanced relation network enlarges the differences between data from different categories in order to improve the accuracy of similarity computation. Comprehensive experimental results on three widely used datasets (i.e., Wikipedia, Pascal Sentence, and XMediaNet) show that the proposed approach is effective and superior to existing cross-modal retrieval methods.
Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10452985
DOI: http://dx.doi.org/10.3390/e25081216
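As a rough illustration of the two components this abstract names (not the paper's released code), the sketch below pairs a small attention module that weights fine-grained parts (image regions or caption words) with a relation head that scores an image-text pair directly instead of using a fixed distance. All module names, dimensions, and the toy inputs are assumptions.

```python
# Hedged sketch of a dual-attention + relation-network scorer (illustrative only).
import torch
import torch.nn as nn


class FineGrainedAttention(nn.Module):
    """Pools a set of part features (regions or words) with learned weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, parts: torch.Tensor) -> torch.Tensor:
        # parts: (batch, num_parts, dim) -> attention weights over the parts
        weights = torch.softmax(self.score(parts), dim=1)
        return (weights * parts).sum(dim=1)            # (batch, dim)


class RelationHead(nn.Module):
    """Predicts a similarity score for a concatenated image-text pair."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([img, txt], dim=-1)).squeeze(-1)


# Toy usage with random features standing in for region and word embeddings.
attend = FineGrainedAttention(dim=256)
relate = RelationHead(dim=256)
img_regions = torch.randn(4, 36, 256)   # e.g. 36 detected regions per image
txt_words = torch.randn(4, 20, 256)     # e.g. 20 word embeddings per caption
scores = relate(attend(img_regions), attend(txt_words))
print(scores.shape)                      # torch.Size([4])
```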
Neural Netw
September 2025
Shanghai Maritime University, Shanghai, 201306, China.
Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, enabling efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention because they do not require external label information. However, the field still faces several pressing issues: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data and thereby construct a more reliable affinity matrix to guide the learning of hash codes.
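To make the affinity-matrix idea concrete, here is a hedged sketch of the generic unsupervised cross-modal hashing recipe this abstract alludes to: build a label-free affinity matrix from pretrained features, then train relaxed (tanh) hash codes whose scaled inner products match that affinity. The dimensions, the relaxation, and the specific loss are illustrative assumptions, not the paper's method.

```python
# Illustrative unsupervised cross-modal hashing loss (not from the cited paper).
import torch
import torch.nn.functional as F

def affinity(img_feats, txt_feats):
    """Label-free affinity: average of the two intra-modal cosine similarities."""
    s_img = F.normalize(img_feats, dim=1) @ F.normalize(img_feats, dim=1).T
    s_txt = F.normalize(txt_feats, dim=1) @ F.normalize(txt_feats, dim=1).T
    return 0.5 * (s_img + s_txt)                         # (n, n), values in [-1, 1]

def hashing_loss(img_codes, txt_codes, S, n_bits):
    """Match scaled code inner products to the affinity matrix."""
    cross = img_codes @ txt_codes.T / n_bits             # relaxed codes in (-1, 1)
    return F.mse_loss(cross, S)

n, n_bits = 8, 64
img_feats, txt_feats = torch.randn(n, 512), torch.randn(n, 300)
S = affinity(img_feats, txt_feats)
# tanh-relaxed "codes"; a real model would produce these from trainable hash layers
img_codes = torch.tanh(torch.randn(n, n_bits))
txt_codes = torch.tanh(torch.randn(n, n_bits))
loss = hashing_loss(img_codes, txt_codes, S, n_bits)
binary = torch.sign(img_codes)                           # codes used at retrieval time
```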
Biomed Eng Lett
September 2025
Department of Precision Medicine, Yonsei University Wonju College of Medicine, Wonju, Korea.
Foundation models, including large language models and vision-language models (VLMs), have revolutionized artificial intelligence by enabling efficient, scalable, and multimodal learning across diverse applications. By leveraging advancements in self-supervised and semi-supervised learning, these models integrate computer vision and natural language processing to address complex tasks, such as disease classification, segmentation, cross-modal retrieval, and automated report generation. Their ability to pretrain on vast, uncurated datasets minimizes reliance on annotated data while improving generalization and adaptability for a wide range of downstream tasks.
Sensors (Basel)
August 2025
National Engineering Research Center for Multimedia Software (NERCMS), Wuhan 430072, China.
In recent years, cross-modal retrieval has garnered significant attention due to its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to accurately capture user intent when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs).
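The framework itself is not reproduced here, but the general pattern the abstract describes can be sketched as follows: an LLM-side step turns a free-form instruction into one or more focused queries, and a dual-encoder retriever then embeds those queries and ranks images by similarity. The `rewrite_with_llm` stub and all embedding shapes below are placeholders, not the authors' API.

```python
# Hedged sketch of instruction-aware query generation feeding a dual-encoder retriever.
from typing import List
import torch
import torch.nn.functional as F

def rewrite_with_llm(instruction: str) -> List[str]:
    # Placeholder for an LLM call that decomposes or clarifies the instruction.
    return [instruction, instruction.replace("photo", "image")]

def rank_images(query_embs: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity ranking; multiple rewritten queries are max-pooled."""
    q = F.normalize(query_embs, dim=-1)        # (num_queries, dim)
    g = F.normalize(image_embs, dim=-1)        # (num_images, dim)
    sims = q @ g.T                             # (num_queries, num_images)
    return sims.max(dim=0).values.argsort(descending=True)

# Toy run with random embeddings standing in for text/image encoder outputs.
queries = rewrite_with_llm("find the photo of a red bicycle near a fountain")
query_embs = torch.randn(len(queries), 512)
image_embs = torch.randn(100, 512)
print(rank_images(query_embs, image_embs)[:5])  # indices of the top-5 images
```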
Bioinformatics
August 2025
College of Computer and Control Engineering, Northeast Forestry University, Harbin, 150040, China.
Motivation: Molecular pre-training has emerged as a foundational approach in computational drug discovery, enabling the extraction of expressive molecular representations from large-scale unlabeled datasets. However, existing methods largely focus on topological or structural features, often neglecting critical physicochemical attributes embedded in molecular systems.
Result: We present MolPrompt, a knowledge-enhanced multimodal pre-training framework that integrates molecular graphs and textual descriptions via contrastive learning.
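The contrastive-learning component mentioned above is typically a symmetric cross-modal objective that pulls a molecule's graph embedding toward the embedding of its textual description and pushes mismatched pairs apart. The InfoNCE form and temperature below are standard assumptions for illustration, not details taken from MolPrompt.

```python
# Hedged sketch of a graph-text contrastive objective (standard InfoNCE form).
import torch
import torch.nn.functional as F

def graph_text_contrastive_loss(graph_emb, text_emb, temperature: float = 0.07):
    """Pull matched (graph, description) pairs together, push mismatches apart."""
    g = F.normalize(graph_emb, dim=-1)           # (batch, dim) from a graph encoder
    t = F.normalize(text_emb, dim=-1)            # (batch, dim) from a text encoder
    logits = g @ t.T / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(g.size(0))            # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = graph_text_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```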
IEEE Trans Image Process
January 2025
Text-based person retrieval is the challenging task of searching for images of people based on textual queries given in natural language. Conventional methods primarily use deep neural networks to model the relationship between visual and textual data, creating a shared feature space for cross-modal matching. Despite notable improvements, a lack of awareness of the differences in feature granularity between the two modalities, combined with the diverse poses and viewing angles of images of the same individual, can lead such methods to overlook significant differences both within and across modalities.
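The granularity issue raised above can be illustrated (independently of this paper's actual method) by scoring an image-caption pair both globally and at the level of part/phrase features, then fusing the two scores; the shapes and the fusion weight below are assumptions.

```python
# Illustrative global + fine-grained similarity fusion for text-to-image person matching.
import torch
import torch.nn.functional as F

def global_local_similarity(img_global, img_parts, txt_global, txt_phrases, alpha=0.5):
    """Combine a coarse whole-image/whole-sentence score with a fine-grained one."""
    g_sim = F.cosine_similarity(img_global, txt_global, dim=-1)        # (batch,)
    parts = F.normalize(img_parts, dim=-1)                              # (batch, P, d)
    phrases = F.normalize(txt_phrases, dim=-1)                          # (batch, K, d)
    local = torch.einsum("bpd,bkd->bpk", parts, phrases)                # (batch, P, K)
    # For each text phrase, take its best-matching image part, then average.
    l_sim = local.max(dim=1).values.mean(dim=1)                         # (batch,)
    return alpha * g_sim + (1 - alpha) * l_sim

score = global_local_similarity(torch.randn(4, 256), torch.randn(4, 6, 256),
                                torch.randn(4, 256), torch.randn(4, 8, 256))
```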