End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval.

Wenxue Shen, Jingkuan Song, Xiaosu Zhu, Gongfu Li, Heng Tao Shen

IEEE Trans Image Process

Published: September 2023

Video-language pre-training and text-video retrieval have recently attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches to video-language pre-training typically make limited use of the hierarchical semantic information in videos, such as frame-level and global video-level semantics. In this work, we present HMMC, an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in that video usually have semantic connections with the text and show higher similarity to it than frames in other videos do. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). A further proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single-modality visual representations when training from scratch. Furthermore, a momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of the representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our method. We release our code at https://github.com/cheetah003/HMMC.
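To make the combination of hierarchical matching and momentum contrast more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the loss functions, so the InfoNCE-style loss, the mean pooling of frames into a video embedding, the shared negative queue, and all names (`info_nce`, `momentum_update`, `hierarchical_loss`, the temperature `0.07`, the momentum `0.999`) are illustrative assumptions. FAM is omitted.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query.

    In momentum contrast, the key encoder is updated this way rather
    than by gradients, so queued keys stay roughly consistent.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE-style loss for one query (assumed form, not from the source).

    Returns -log softmax probability of the positive among all keys.
    """
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


def video_embedding(frame_embeddings):
    """Global video embedding as the mean of frame embeddings (assumption)."""
    dim = len(frame_embeddings[0])
    count = len(frame_embeddings)
    mean = [sum(f[i] for f in frame_embeddings) / count for i in range(dim)]
    return normalize(mean)


def hierarchical_loss(text, frames, negative_queue, temperature=0.07):
    """Sum of a video-level (VTM-like) and a frame-level (FTM-like) term."""
    text = normalize(text)
    frames = [normalize(f) for f in frames]
    negatives = list(negative_queue)
    vtm = info_nce(text, video_embedding(frames), negatives, temperature)
    ftm = sum(info_nce(text, f, negatives, temperature) for f in frames) / len(frames)
    return vtm + ftm


# Toy usage: a fixed-size queue holds visual keys from earlier batches,
# which serve as extra negatives for the current text query.
queue = deque(maxlen=4)
queue.extend([normalize([0.0, 1.0]), normalize([-1.0, 0.0])])

frames = [[1.0, 0.1], [0.9, 0.0]]
matched = hierarchical_loss([1.0, 0.0], frames, queue)
mismatched = hierarchical_loss([0.0, 1.0], frames, queue)
print(matched < mismatched)
```

The queue is what lets the number of negatives exceed the batch size: new keys are appended each step and the oldest are dropped once `maxlen` is reached. Using one queue for both the video-level and frame-level terms is a simplification made here for brevity.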

DOI: http://dx.doi.org/10.1109/TIP.2023.3275071

Top Keywords

momentum contrast, hierarchical matching, text-video retrieval, end-to-end pre-training, matching momentum, video-language pre-training, hierarchical semantic, semantic videos, video semantic, multilevel semantic
