Category Ranking: 98% | Total Visits: 921 | Avg Visit Duration: 2 minutes | Citations: 20

Article Abstract

Visual-textual retrieval, as a link between computer vision and natural language processing, aims at jointly learning visual-semantic relevance to bridge the heterogeneity gap between visual and textual spaces. Existing methods conduct retrieval relying only on the ranking of pairwise similarities, but they cannot self-evaluate the uncertainty of the retrieved results, which leads to unreliable retrieval and hinders interpretability. To address this problem, we propose a novel Trust-Consistent Learning framework (TCL) to endow visual-textual retrieval with uncertainty evaluation for trustworthy retrieval. More specifically, TCL first models the matching evidence according to cross-modal similarity to estimate the uncertainty for cross-modal uncertainty-aware learning. Second, a simple yet effective consistency module is presented to enforce that the subjective opinions of bidirectional learning are consistent, for high reliability and accuracy. Finally, extensive experiments are conducted to demonstrate the superiority and generalizability of TCL on six widely used benchmark datasets, i.e., Flickr30K, MS-COCO, MSVD, MSR-VTT, ActivityNet, and DiDeMo. Furthermore, qualitative experiments are carried out to provide comprehensive and insightful analyses for trustworthy visual-textual retrieval, verifying the reliability and interpretability of TCL. The code is available at https://github.com/QinYang79/TCL.
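The two mechanisms the abstract names, evidence-based uncertainty estimation and bidirectional consistency, follow the general recipe of subjective logic / evidential deep learning. The PyTorch sketch below is a minimal illustration of that recipe only: the softplus evidence function, the Dirichlet parameterization, and the MSE consistency term are assumptions for illustration, not TCL's actual losses, which are defined in the paper.

```python
import torch
import torch.nn.functional as F

def opinion_from_similarity(sim):
    """Map rows of a cross-modal similarity matrix to subjective opinions.

    Evidential-learning recipe (assumed, not TCL's exact formulation):
    non-negative evidence parameterizes a Dirichlet; its total strength
    yields per-candidate belief masses plus one uncertainty mass, with
    belief.sum(-1) + uncertainty == 1 for every query row.
    """
    evidence = F.softplus(sim)                    # e_k >= 0
    alpha = evidence + 1.0                        # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)    # S = sum_k alpha_k
    belief = evidence / strength                  # b_k = e_k / S
    uncertainty = sim.size(-1) / strength         # u = K / S
    return belief, uncertainty

# Toy batch: sim[i, j] = similarity between image i and text j.
sim = torch.randn(4, 4)
b_i2t, u_i2t = opinion_from_similarity(sim)       # image -> text opinions
b_t2i, u_t2i = opinion_from_similarity(sim.t())   # text -> image opinions

# Stand-in for the consistency module: encourage the two directions'
# beliefs about the same (image, text) pair to agree.
consistency = F.mse_loss(b_i2t, b_t2i.t())
print(u_i2t.squeeze(), consistency)
```

The uncertainty mass u lets the retriever flag queries whose evidence is weak across all candidates, which is what makes the retrieval self-evaluating rather than a bare similarity ranking.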


Source: http://dx.doi.org/10.1109/TIP.2025.3587575

Publication Analysis

Top Keywords

visual-textual retrieval (16), trustworthy visual-textual (8), retrieval (7), retrieval visual-textual (4), retrieval link (4), link computer (4), computer vision (4), vision natural (4), natural language (4), language processing (4)

Similar Publications

Hierarchical knowledge-guided reasoning for text-based person re-identification.

Neural Netw

July 2025

Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha, 410073, Hunan, China; National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha, 410073, Hunan, China; School of Information and Art

Masked language modeling (MLM) has expanded the exploration of text-image person re-identification (TIReID) from coarse-grained to fine-grained alignment. However, vanilla MLM picks random tokens for visual-to-token reasoning, which can defeat the intent of semantic visual-textual alignment by focusing indiscriminately on all sub-words. This work leverages the hierarchical scene-graph knowledge inherent in each text to guide token masking and enhance cross-modal representation in TIReID, relieving the pitfall of blind visual-textual alignment.
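As a rough illustration of the guided-masking idea (not this paper's exact policy), the sketch below biases MLM token selection toward scene-graph tokens instead of choosing uniformly. The graph token indices would come from an external scene-graph parser, which is assumed here; the bias ratio is likewise a hypothetical parameter.

```python
import random

MASK = "[MASK]"

def guided_mask(tokens, graph_token_ids, mask_ratio=0.15, graph_bias=0.8):
    """Bias MLM masking toward scene-graph tokens (objects, attributes,
    relations) rather than sampling all positions uniformly.

    graph_token_ids: indices of caption tokens that appear as scene-graph
    nodes; producing them requires a parser and is assumed upstream.
    """
    n_mask = max(1, int(len(tokens) * mask_ratio))
    graph = [i for i in graph_token_ids if i < len(tokens)]
    rest = [i for i in range(len(tokens)) if i not in graph]
    n_graph = min(len(graph), int(round(n_mask * graph_bias)))
    picked = random.sample(graph, n_graph)
    picked += random.sample(rest, min(len(rest), n_mask - n_graph))
    return [MASK if i in picked else t for i, t in enumerate(tokens)]

tokens = "a man in a red jacket walks a small dog".split()
# Hypothetical parse: man(1), red(4), jacket(5), walks(6), dog(9).
print(guided_mask(tokens, graph_token_ids=[1, 4, 5, 6, 9]))
```

Masking semantically loaded nodes forces the model to reconstruct them from the image, rather than from uninformative sub-words such as articles.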


MGTR-MISS: More Ground Truth Retrieving based Multimodal Interaction and Semantic Supervision for video description.

Neural Netw

July 2025

College of Electronic and Information Engineering, Tongji University, Shanghai 201804, PR China; School of Computer Science and Technology, Tongji University, Shanghai 201804, PR China; Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804

Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, strong models have been proposed that generate fluent, semantically rich video descriptions. However, language generally does not participate in encoder training, so the vision and language modalities neither interact effectively nor align accurately.



Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval.
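A minimal sketch of the most common such adaptation: encode sampled frames with the pretrained image encoder and mean-pool them into one video embedding, as in CLIP4Clip-style methods. The encoder itself is assumed upstream and not shown; only the aggregation and ranking steps appear here.

```python
import torch
import torch.nn.functional as F

def video_embedding(frame_feats):
    """Aggregate per-frame CLIP-style features (num_frames, dim) into one
    L2-normalized video vector via temporal mean pooling."""
    return F.normalize(frame_feats.mean(dim=0), dim=-1)

# Toy ranking: 3 videos x 8 frames x 512-d features, 2 text queries.
frames = torch.randn(3, 8, 512)                   # stand-in frame features
videos = torch.stack([video_embedding(f) for f in frames])   # (3, 512)
texts = F.normalize(torch.randn(2, 512), dim=-1)             # (2, 512)
scores = texts @ videos.t()                       # cosine similarities
print(scores.argsort(dim=-1, descending=True))    # ranked videos per query
```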


Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval.

Neural Netw

April 2025

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture, Beijing, China.

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve images of a specific person according to a given textual description. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, but the intricate semantic correspondences between texts and images are still not effectively constructed.
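For context, the unified-embedding-space matching this paragraph describes is typically a dual-encoder trained with a symmetric contrastive loss. The sketch below shows that generic baseline, not this paper's restoration-and-enhancement method; the encoders, embedding size, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned person image/text pairs:
    the i-th image and i-th text are positives, all other pairings are
    negatives. Embeddings come from upstream encoders (assumed)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0))           # diagonal pairs match
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = matching_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss)
```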
