Citations

20

Article Abstract

Recently, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches to video-language pre-training typically make limited use of the hierarchical semantic information in videos, such as frame-level semantic information and global video-level semantic information. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast, named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in this video usually have semantic connections with the text and show higher similarity to it than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). Another proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single-modality visual representations when training from scratch. Furthermore, a momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of the representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our method. We release our code at https://github.com/cheetah003/HMMC.
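The ideas named in the abstract (matching at the video and frame levels, a momentum-updated encoder, and a pool of extra negatives) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the mean pooling of frames, the single shared negative queue, the temperature of 0.07, and the momentum of 0.999 are all assumptions chosen for illustration, and the toy vectors stand in for real encoder outputs.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def video_embedding(frames):
    """Video-level embedding as the mean of frame embeddings (assumed pooling choice)."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return normalize(mean)


def info_nce(query, positive, negatives, temperature=0.07):
    """Contrastive loss: negative log-softmax of the positive pair against the negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_sum - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder (exponential moving average)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def hierarchical_loss(text, frames, negative_queue):
    """Sum a video-level (VTM-style) and a frame-level (FTM-style) contrastive term."""
    text = normalize(text)
    frames = [normalize(f) for f in frames]
    negatives = list(negative_queue)
    video_term = info_nce(text, video_embedding(frames), negatives)
    frame_term = sum(info_nce(text, f, negatives) for f in frames) / len(frames)
    return video_term + frame_term


# Fixed-size queue of negative embeddings: the oldest entries are dropped first.
queue = deque(maxlen=4)
for negative in ([0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    queue.append(normalize(negative))

text = [1.0, 0.1, 0.0]
matching_frames = [[0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]  # frames aligned with the text
other_frames = [[0.0, 0.9, 0.3], [0.1, 1.0, 0.0]]     # frames from an unrelated video

loss_match = hierarchical_loss(text, matching_frames, queue)
loss_other = hierarchical_loss(text, other_frames, queue)
print(loss_match < loss_other)
```

In this toy setup the loss is lower for the video whose frames align with the text, which mirrors the observation that frames of a matching video are more similar to the text than frames of other videos. A real system would keep separate queues per modality and would apply `momentum_update` to the weights of full encoders rather than to a flat list of numbers.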

Source
http://dx.doi.org/10.1109/TIP.2023.3275071

