Category Ranking: 98% | Total Visits: 921 | Avg Visit Duration: 2 minutes | Citations: 20

Article Abstract

Visual-textual retrieval, as a link between computer vision and natural language processing, aims at jointly learning visual-semantic relevance to bridge the heterogeneity gap between visual and textual spaces. Existing methods conduct retrieval relying only on the ranking of pairwise similarities, but they cannot self-evaluate the uncertainty of the retrieved results, which leads to unreliable retrieval and hinders interpretability. To address this problem, we propose a novel Trust-Consistent Learning framework (TCL) to endow visual-textual retrieval with uncertainty evaluation for trustworthy retrieval. More specifically, TCL first models the matching evidence according to cross-modal similarity to estimate the uncertainty for cross-modal uncertainty-aware learning. Second, a simple yet effective consistency module is presented to enforce that the subjective opinions of bidirectional learning are consistent, for high reliability and accuracy. Finally, extensive experiments are conducted to demonstrate the superiority and generalizability of TCL on six widely used benchmark datasets, i.e., Flickr30K, MS-COCO, MSVD, MSR-VTT, ActivityNet, and DiDeMo. Furthermore, qualitative experiments are carried out to provide comprehensive and insightful analyses for trustworthy visual-textual retrieval, verifying the reliability and interpretability of TCL. The code is available at https://github.com/QinYang79/TCL.
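The two mechanisms the abstract names, evidence-based uncertainty estimation and bidirectional consistency, follow the general recipe of subjective logic / evidential deep learning. The PyTorch sketch below is a minimal illustration of that recipe only: the softplus evidence function, the Dirichlet parameterization, and the MSE consistency term are assumptions for illustration, not TCL's actual losses, which are defined in the paper.

```python
import torch
import torch.nn.functional as F

def opinion_from_similarity(sim):
    """Map rows of a cross-modal similarity matrix to subjective opinions.

    Evidential-learning recipe (assumed, not TCL's exact formulation):
    non-negative evidence parameterizes a Dirichlet; its total strength
    yields per-candidate belief masses plus one uncertainty mass, with
    belief.sum(-1) + uncertainty == 1 for every query row.
    """
    evidence = F.softplus(sim)                    # e_k >= 0
    alpha = evidence + 1.0                        # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)    # S = sum_k alpha_k
    belief = evidence / strength                  # b_k = e_k / S
    uncertainty = sim.size(-1) / strength         # u = K / S
    return belief, uncertainty

# Toy batch: sim[i, j] = similarity between image i and text j.
sim = torch.randn(4, 4)
b_i2t, u_i2t = opinion_from_similarity(sim)       # image -> text opinions
b_t2i, u_t2i = opinion_from_similarity(sim.t())   # text -> image opinions

# Stand-in for the consistency module: encourage the two directions'
# beliefs about the same (image, text) pair to agree.
consistency = F.mse_loss(b_i2t, b_t2i.t())
print(u_i2t.squeeze(), consistency)
```

The uncertainty mass u lets the retriever flag queries whose evidence is weak across all candidates, which is what makes the retrieval self-evaluating rather than a bare similarity ranking.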


Source: http://dx.doi.org/10.1109/TIP.2025.3587575

Publication Analysis

Top Keywords

visual-textual retrieval (16), trustworthy visual-textual (8), retrieval (7), retrieval visual-textual (4), retrieval link (4), link computer (4), computer vision (4), vision natural (4), natural language (4), language processing (4)

Similar Publications

Hierarchical knowledge-guided reasoning for text-based person re-identification.

Neural Netw

July 2025

Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha, 410073, Hunan, China; National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha, 410073, Hunan, China; School of Information and Art

Masked language modeling (MLM) has expanded the exploration of text-image person re-identification (TIReID) from coarse-grained to fine-grained alignment. However, vanilla MLM picks random tokens for visual-to-token reasoning, which can defeat the intent of semantic visual-textual alignment by focusing indiscriminately on all sub-words. This work leverages the hierarchical scene-graph knowledge inherent in each text to guide token masking and enhance cross-modal representation in TIReID, relieving the pitfall of blind visual-textual alignment.
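As a rough illustration of the guided-masking idea (not this paper's exact policy), the sketch below biases MLM token selection toward scene-graph tokens instead of choosing uniformly. The graph token indices would come from an external scene-graph parser, which is assumed here; the bias ratio is likewise a hypothetical parameter.

```python
import random

MASK = "[MASK]"

def guided_mask(tokens, graph_token_ids, mask_ratio=0.15, graph_bias=0.8):
    """Bias MLM masking toward scene-graph tokens (objects, attributes,
    relations) rather than sampling all positions uniformly.

    graph_token_ids: indices of caption tokens that appear as scene-graph
    nodes; producing them requires a parser and is assumed upstream.
    """
    n_mask = max(1, int(len(tokens) * mask_ratio))
    graph = [i for i in graph_token_ids if i < len(tokens)]
    rest = [i for i in range(len(tokens)) if i not in graph]
    n_graph = min(len(graph), int(round(n_mask * graph_bias)))
    picked = random.sample(graph, n_graph)
    picked += random.sample(rest, min(len(rest), n_mask - n_graph))
    return [MASK if i in picked else t for i, t in enumerate(tokens)]

tokens = "a man in a red jacket walks a small dog".split()
# Hypothetical parse: man(1), red(4), jacket(5), walks(6), dog(9).
print(guided_mask(tokens, graph_token_ids=[1, 4, 5, 6, 9]))
```

Masking semantically loaded nodes forces the model to reconstruct them from the image, rather than from uninformative sub-words such as articles.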


MGTR-MISS: More Ground Truth Retrieving based Multimodal Interaction and Semantic Supervision for video description.

Neural Netw

July 2025

College of Electronic and Information Engineering, Tongji University, Shanghai 201804, PR China; School of Computer Science and Technology, Tongji University, Shanghai 201804, PR China; Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804

Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, strong models have been proposed that generate fluent, semantically rich video descriptions. However, language generally does not participate in encoder training, so the vision and language modalities neither interact effectively nor align accurately.



Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval.
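A minimal sketch of the most common such adaptation: encode sampled frames with the pretrained image encoder and mean-pool them into one video embedding, as in CLIP4Clip-style methods. The encoder itself is assumed upstream and not shown; only the aggregation and ranking steps appear here.

```python
import torch
import torch.nn.functional as F

def video_embedding(frame_feats):
    """Aggregate per-frame CLIP-style features (num_frames, dim) into one
    L2-normalized video vector via temporal mean pooling."""
    return F.normalize(frame_feats.mean(dim=0), dim=-1)

# Toy ranking: 3 videos x 8 frames x 512-d features, 2 text queries.
frames = torch.randn(3, 8, 512)                   # stand-in frame features
videos = torch.stack([video_embedding(f) for f in frames])   # (3, 512)
texts = F.normalize(torch.randn(2, 512), dim=-1)             # (2, 512)
scores = texts @ videos.t()                       # cosine similarities
print(scores.argsort(dim=-1, descending=True))    # ranked videos per query
```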


Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval.

Neural Netw

April 2025

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture, Beijing, China.

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve images of a specific person according to a given textual description. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, but the intricate semantic correspondences between texts and images are still not effectively constructed.
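For context, the unified-embedding-space matching this paragraph describes is typically a dual-encoder trained with a symmetric contrastive loss. The sketch below shows that generic baseline, not this paper's restoration-and-enhancement method; the encoders, embedding size, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned person image/text pairs:
    the i-th image and i-th text are positives, all other pairings are
    negatives. Embeddings come from upstream encoders (assumed)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0))           # diagonal pairs match
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = matching_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss)
```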
