Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

In this paper, we propose a novel Transformer-based approach, the Cross-modal Contrastive Masked AutoEncoder (C2MAE), for Self-Supervised Learning (SSL) on compressed videos. A unified Transformer encoder is employed to discover relationships among visual tokens from RGB frames, motion vectors, and residuals. A hybrid SSL framework is proposed that combines the complementary advantages of the Masked Image Modeling (MIM) and Contrastive Learning (CL) pretext tasks for powerful representation learning. The MIM branch extends VideoMAE with a new Fine-Grained Motion-aware Masking (FGMM) strategy and a modified Multi-modal Reconstruction (MR) task: FGMM computes motion saliency maps as motion priors to guide the masks so that masking fits the data properties of the compressed domain, while the MR task emphasizes reconstructing raw videos from the joint representations of the corresponding compressed videos in addition to reconstruction within each single modality. The CL branch introduces the Contrastive Cross-modal Learning (CCL) module, which contrasts the features of a compressed video clip with those of its raw-video counterpart rather than the widely used augmented views. Owing to these designs, C2MAE significantly enhances interactions across modalities to compensate for the sparsity of I-frames and the coarse, noisy nature of P-frames, thus delivering much stronger pre-trained models. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks report state-of-the-art results, demonstrating its effectiveness.
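The abstract names two mechanisms concretely enough to illustrate: motion-saliency-guided masking (FGMM) and the compressed-versus-raw contrastive objective (CCL). The following PyTorch-style sketch is purely illustrative and not the authors' implementation; the function names, tensor shapes, and the choice to keep high-saliency tokens visible are assumptions.

```python
# Minimal illustrative sketch (not the paper's code) of two ideas named in the
# abstract: motion-saliency-guided masking and a compressed-vs-raw contrastive
# loss. Shapes, names, and the masking direction are assumptions.
import torch
import torch.nn.functional as F

def motion_guided_mask(saliency, mask_ratio=0.9):
    """saliency: (B, N) motion-saliency score per visual token, e.g. the mean
    motion-vector magnitude inside each token's patch (assumed precomputed).
    Returns a boolean mask of shape (B, N) where True means 'masked'.
    Tokens are sampled for visibility in proportion to their saliency, so the
    encoder tends to see motion-relevant regions; the opposite convention
    (masking the salient tokens) is equally plausible from the abstract alone."""
    B, N = saliency.shape
    num_visible = max(1, int(N * (1 - mask_ratio)))
    probs = torch.softmax(saliency, dim=-1)               # motion prior
    visible = torch.multinomial(probs, num_visible, replacement=False)
    mask = torch.ones(B, N, dtype=torch.bool, device=saliency.device)
    mask.scatter_(1, visible, False)
    return mask

def compressed_raw_nce(z_compressed, z_raw, temperature=0.07):
    """InfoNCE between a compressed clip's embedding and the embedding of its
    raw-video counterpart (the positive pair described in the abstract);
    the other clips in the batch serve as negatives."""
    z_c = F.normalize(z_compressed, dim=-1)                # (B, D)
    z_r = F.normalize(z_raw, dim=-1)                       # (B, D)
    logits = z_c @ z_r.t() / temperature                   # (B, B)
    targets = torch.arange(z_c.size(0), device=z_c.device)
    return F.cross_entropy(logits, targets)
```

In a full pipeline of this kind, the visible tokens from the compressed modalities would feed the shared encoder, the MIM branch would reconstruct the masked patches (including raw-video targets), and a contrastive term like the one above would be added to the reconstruction loss with some weighting.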

Source: http://dx.doi.org/10.1109/TIP.2025.3583168

Publication Analysis

Top Keywords

cross-modal contrastive (8)
contrastive masked (8)
masked autoencoder (8)
compressed video (8)
compressed videos (8)
compressed (5)
autoencoder compressed (4)
video pre-training (4)
pre-training paper (4)
paper propose (4)

Similar Publications

Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, realizing efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention because they do not require external label information. However, in the field of unsupervised cross-modal hashing, two pressing issues remain: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data points, thereby constructing a more reliable affinity matrix to assist the learning of hash codes.

Parkinson's disease (PD) is a challenging neurodegenerative condition often prone to diagnostic errors, where early and accurate diagnosis is critical for effective clinical management. However, existing diagnostic methods often fail to fully exploit multimodal data or systematically incorporate expert domain knowledge. To address these limitations, we propose MKD-Net, a multimodal and knowledge-driven diagnostic framework that integrates imaging and non-imaging clinical data with structured expert insights to enhance diagnostic performance.

Motivation: Due to the intricate etiology of neurological disorders, finding interpretable associations between multiomics features can be challenging using standard approaches.

Results: We propose COMICAL, a contrastive learning approach using multiomics data to generate associations between genetic markers and brain imaging-derived phenotypes. COMICAL jointly learns omics representations utilizing transformer-based encoders with custom tokenizers.
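As a rough illustration of the kind of objective described above, and not COMICAL's actual implementation, a CLIP-style symmetric contrastive loss can pair a genetic-marker encoder with an imaging-phenotype encoder; the encoder modules and feature dimensions below are assumptions.

```python
# Illustrative only: a CLIP-style contrastive pairing of a genetic-marker
# encoder and an imaging-derived-phenotype encoder. Not COMICAL's code;
# encoder modules and feature dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairedOmicsContrastive(nn.Module):
    def __init__(self, marker_encoder: nn.Module, idp_encoder: nn.Module):
        super().__init__()
        self.marker_encoder = marker_encoder   # e.g. transformer over tokenized genetic markers
        self.idp_encoder = idp_encoder         # e.g. transformer over imaging-derived phenotypes
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    def forward(self, marker_tokens, idp_tokens):
        g = F.normalize(self.marker_encoder(marker_tokens), dim=-1)  # (B, D)
        p = F.normalize(self.idp_encoder(idp_tokens), dim=-1)        # (B, D)
        logits = self.logit_scale.exp() * g @ p.t()                  # (B, B)
        targets = torch.arange(g.size(0), device=g.device)
        # Symmetric loss: markers -> phenotypes and phenotypes -> markers.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```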

MMSupcon: An image fusion-based multi-modal supervised contrastive method for brain tumor diagnosis.

Artif Intell Med

August 2025

University of Science and Technology of China, 230000, Hefei, China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230088, China.

The diagnosis of brain tumors is pivotal for effective treatment, with MRI serving as a commonly used non-invasive diagnostic modality in clinical practices. Fundamentally, brain tumor diagnosis is a type of pattern recognition task that requires the integration of information from multi-modal MRI images. However, existing fusion strategies are hindered by the scarcity of multi-modal imaging samples.

The personalization of cancer treatment through drug combinations is critical for improving healthcare outcomes, increasing effectiveness, and reducing side effects. Computational methods have become increasingly important for prioritizing synergistic drug pairs because of the vast search space of possible chemicals. However, existing approaches typically rely solely on global molecular structures, neglecting information exchange between different modality representations and the interactions between whole-molecule and fine-grained fragment representations, leading to a limited understanding of drug synergy mechanisms for personalized treatment.
