Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

In this paper, we propose a novel Transformer-based approach, the Cross-modal Contrastive Masked AutoEncoder (C2MAE), for Self-Supervised Learning (SSL) on compressed videos. A unified Transformer encoder is employed to discover relationships among visual tokens from RGB frames, motion vectors, and residuals. A hybrid SSL framework is proposed that combines the complementary advantages of the Masked Image Modeling (MIM) and Contrastive Learning (CL) pretext tasks for powerful representation learning. The MIM branch extends VideoMAE with a new Fine-Grained Motion-aware Masking (FGMM) strategy and a modified Multi-modal Reconstruction (MR) task: FGMM computes motion saliency maps as motion priors to guide the masks so that masking fits the data properties of the compressed domain, while the MR task emphasizes reconstructing raw videos from the joint representations of the corresponding compressed videos in addition to reconstruction within each single modality. The CL branch introduces the Contrastive Cross-modal Learning (CCL) module, which contrasts the features of a compressed video clip with those of its raw-video counterpart rather than the widely used augmented views. Owing to these designs, C2MAE significantly enhances interactions across modalities to compensate for the sparsity of I-frames and the coarse, noisy nature of P-frames, thus delivering much stronger pre-trained models. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks report state-of-the-art results, demonstrating its effectiveness.
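The abstract names two mechanisms concretely enough to illustrate: motion-saliency-guided masking (FGMM) and the compressed-versus-raw contrastive objective (CCL). The following PyTorch-style sketch is purely illustrative and not the authors' implementation; the function names, tensor shapes, and the choice to keep high-saliency tokens visible are assumptions.

```python
# Minimal illustrative sketch (not the paper's code) of two ideas named in the
# abstract: motion-saliency-guided masking and a compressed-vs-raw contrastive
# loss. Shapes, names, and the masking direction are assumptions.
import torch
import torch.nn.functional as F

def motion_guided_mask(saliency, mask_ratio=0.9):
    """saliency: (B, N) motion-saliency score per visual token, e.g. the mean
    motion-vector magnitude inside each token's patch (assumed precomputed).
    Returns a boolean mask of shape (B, N) where True means 'masked'.
    Tokens are sampled for visibility in proportion to their saliency, so the
    encoder tends to see motion-relevant regions; the opposite convention
    (masking the salient tokens) is equally plausible from the abstract alone."""
    B, N = saliency.shape
    num_visible = max(1, int(N * (1 - mask_ratio)))
    probs = torch.softmax(saliency, dim=-1)               # motion prior
    visible = torch.multinomial(probs, num_visible, replacement=False)
    mask = torch.ones(B, N, dtype=torch.bool, device=saliency.device)
    mask.scatter_(1, visible, False)
    return mask

def compressed_raw_nce(z_compressed, z_raw, temperature=0.07):
    """InfoNCE between a compressed clip's embedding and the embedding of its
    raw-video counterpart (the positive pair described in the abstract);
    the other clips in the batch serve as negatives."""
    z_c = F.normalize(z_compressed, dim=-1)                # (B, D)
    z_r = F.normalize(z_raw, dim=-1)                       # (B, D)
    logits = z_c @ z_r.t() / temperature                   # (B, B)
    targets = torch.arange(z_c.size(0), device=z_c.device)
    return F.cross_entropy(logits, targets)
```

In a full pipeline of this kind, the visible tokens from the compressed modalities would feed the shared encoder, the MIM branch would reconstruct the masked patches (including raw-video targets), and a contrastive term like the one above would be added to the reconstruction loss with some weighting.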

Source: http://dx.doi.org/10.1109/TIP.2025.3583168

Publication Analysis

Top Keywords

cross-modal contrastive (8)
contrastive masked (8)
masked autoencoder (8)
compressed video (8)
compressed videos (8)
compressed (5)
autoencoder compressed (4)
video pre-training (4)
pre-training paper (4)
paper propose (4)

Similar Publications

Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, realizing efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention because they do not require external label information. However, in the field of unsupervised cross-modal hashing, two pressing issues remain: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data points, thereby constructing a more reliable affinity matrix to assist the learning of hash codes.

Parkinson's disease (PD) is a challenging neurodegenerative condition often prone to diagnostic errors, where early and accurate diagnosis is critical for effective clinical management. However, existing diagnostic methods often fail to fully exploit multimodal data or systematically incorporate expert domain knowledge. To address these limitations, we propose MKD-Net, a multimodal and knowledge-driven diagnostic framework that integrates imaging and non-imaging clinical data with structured expert insights to enhance diagnostic performance.

Motivation: Due to the intricate etiology of neurological disorders, finding interpretable associations between multiomics features can be challenging using standard approaches.

Results: We propose COMICAL, a contrastive learning approach using multiomics data to generate associations between genetic markers and brain imaging-derived phenotypes. COMICAL jointly learns omics representations utilizing transformer-based encoders with custom tokenizers.
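As a rough illustration of the kind of objective described above, and not COMICAL's actual implementation, a CLIP-style symmetric contrastive loss can pair a genetic-marker encoder with an imaging-phenotype encoder; the encoder modules and feature dimensions below are assumptions.

```python
# Illustrative only: a CLIP-style contrastive pairing of a genetic-marker
# encoder and an imaging-derived-phenotype encoder. Not COMICAL's code;
# encoder modules and feature dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairedOmicsContrastive(nn.Module):
    def __init__(self, marker_encoder: nn.Module, idp_encoder: nn.Module):
        super().__init__()
        self.marker_encoder = marker_encoder   # e.g. transformer over tokenized genetic markers
        self.idp_encoder = idp_encoder         # e.g. transformer over imaging-derived phenotypes
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    def forward(self, marker_tokens, idp_tokens):
        g = F.normalize(self.marker_encoder(marker_tokens), dim=-1)  # (B, D)
        p = F.normalize(self.idp_encoder(idp_tokens), dim=-1)        # (B, D)
        logits = self.logit_scale.exp() * g @ p.t()                  # (B, B)
        targets = torch.arange(g.size(0), device=g.device)
        # Symmetric loss: markers -> phenotypes and phenotypes -> markers.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```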

MMSupcon: An image fusion-based multi-modal supervised contrastive method for brain tumor diagnosis.

Artif Intell Med

August 2025

University of Science and Technology of China, 230000, Hefei, China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230088, China.

The diagnosis of brain tumors is pivotal for effective treatment, with MRI serving as a commonly used non-invasive diagnostic modality in clinical practices. Fundamentally, brain tumor diagnosis is a type of pattern recognition task that requires the integration of information from multi-modal MRI images. However, existing fusion strategies are hindered by the scarcity of multi-modal imaging samples.

The personalization of cancer treatment through drug combinations is critical for improving healthcare outcomes, increasing effectiveness, and reducing side effects. Computational methods have become increasingly important for prioritizing synergistic drug pairs because of the vast search space of possible chemicals. However, existing approaches typically rely solely on global molecular structures, neglecting information exchange between different modality representations and the interactions between whole-molecule and fine-grained fragment representations, leading to a limited understanding of drug synergy mechanisms for personalized treatment.
