In this paper, we propose a novel Transformer-based approach, namely Cross-modal Contrastive Masked AutoEncoder (C2MAE), for Self-Supervised Learning (SSL) on compressed videos. A unified Transformer encoder is employed to discover relationships among visual tokens from RGB frames, motion vectors and residuals. A hybrid SSL framework is proposed that combines the complementary advantages of the Masked Image Modeling (MIM) and Contrastive Learning (CL) pretext tasks for powerful representation learning. The MIM branch extends VideoMAE with a new Fine-Grained Motion-aware Masking (FGMM) strategy and a modified Multi-modal Reconstruction (MR) task. FGMM computes motion saliency maps as motion priors to guide the masks so that they fit the data properties of the compressed domain. The MR task emphasizes reconstructing raw videos from the joint representations of the corresponding compressed videos, in addition to reconstruction within each single modality. The CL branch introduces the Contrastive Cross-modal Learning (CCL) module, which compares the features of a compressed video clip with those of its raw video counterpart instead of the widely used augmented data. Owing to these designs, C2MAE significantly enhances interactions across modalities to compensate for the sparsity of I-frames and the coarse and noisy nature of P-frames, thus delivering much stronger pre-trained models. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks report state-of-the-art results, demonstrating its effectiveness.
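The cross-modal contrastive idea described above can be illustrated with a small sketch. The abstract does not state the exact loss used by the CCL module, so this example assumes a standard InfoNCE objective: within a batch, the features of a compressed clip and of its raw counterpart form the positive pair, and all other raw clips act as negatives. The function names, the temperature value and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(compressed_feats, raw_feats, temperature=0.1):
    """Mean InfoNCE loss; row i of each batch is assumed to be a positive pair.

    This is an assumed loss form, not necessarily the one used in C2MAE.
    """
    compressed = [l2_normalize(v) for v in compressed_feats]
    raw = [l2_normalize(v) for v in raw_feats]
    total = 0.0
    for i, c in enumerate(compressed):
        # Cosine similarity of compressed clip i to every raw clip in the batch.
        logits = [sum(a * b for a, b in zip(c, r)) / temperature for r in raw]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching raw clip as the target.
        total += log_denominator - logits[i]
    return total / len(compressed)


# Toy features: each compressed clip matches the raw clip at the same index.
compressed_batch = [[1.0, 0.0], [0.0, 1.0]]
raw_batch_aligned = [[1.0, 0.0], [0.0, 1.0]]
raw_batch_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(compressed_batch, raw_batch_aligned)
loss_swapped = info_nce(compressed_batch, raw_batch_swapped)
```

In this toy setting the aligned batch yields a much smaller loss than the swapped one, which is the behavior a contrastive objective of this kind is meant to reward.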
| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1109/TIP.2025.3583168 | DOI Listing |
Neural Netw
September 2025
Shanghai Maritime University, Shanghai, 201306, China.
Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, realizing efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention for not needing external label information. However, in the field of unsupervised cross-modal hashing, there are several pressing issues to address: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data, thereby constructing a more reliable affinity matrix to assist in the learning of hash codes.
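As a rough illustration of why hash codes make cross-modal retrieval efficient, the sketch below binarizes embeddings by sign and ranks database items by Hamming distance. It assumes the image and text embeddings have already been projected into a shared space by some learned hashing function; that learning step, which is the subject of the abstract, is not shown. All names and numeric values here are hypothetical.

```python
def to_hash_code(embedding):
    """Binarize a real-valued embedding by sign: non-negative -> 1, negative -> 0."""
    return tuple(1 if x >= 0 else 0 for x in embedding)


def hamming_distance(code_a, code_b):
    """Count the bit positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def retrieve(query_code, database_codes, top_k=1):
    """Return indices of the top_k database codes closest to the query."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
    return ranked[:top_k]


# Hypothetical embeddings that already live in a shared 4-dimensional space.
image_embeddings = [
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.6, 0.2, -0.1],
    [0.1, 0.4, 0.3, -0.9],
]
text_query_embedding = [0.9, -0.2, 0.4, -0.7]

image_codes = [to_hash_code(e) for e in image_embeddings]
query_code = to_hash_code(text_query_embedding)
nearest_images = retrieve(query_code, image_codes, top_k=2)
```

Because comparison reduces to counting differing bits, a text query can be matched against image codes without any floating-point similarity computation.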
J Xray Sci Technol
September 2025
Center for Medical Artificial Intelligence, Shandong University of Traditional Chinese Medicine, Qingdao, China.
Parkinson's disease (PD) is a challenging neurodegenerative condition often prone to diagnostic errors, where early and accurate diagnosis is critical for effective clinical management. However, existing diagnostic methods often fail to fully exploit multimodal data or systematically incorporate expert domain knowledge. To address these limitations, we propose MKD-Net, a multimodal and knowledge-driven diagnostic framework that integrates imaging and non-imaging clinical data with structured expert insights to enhance diagnostic performance.
Bioinform Adv
August 2025
IBM Research, Yorktown Heights, NY, 10598, United States.
Motivation: Due to the intricate etiology of neurological disorders, finding interpretable associations between multiomics features can be challenging using standard approaches.
Results: We propose COMICAL, a contrastive learning approach using multiomics data to generate associations between genetic markers and brain imaging-derived phenotypes. COMICAL jointly learns omics representations utilizing transformer-based encoders with custom tokenizers.
Artif Intell Med
August 2025
University of Science and Technology of China, 230000, Hefei, China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230088, China.
The diagnosis of brain tumors is pivotal for effective treatment, with MRI serving as a commonly used non-invasive diagnostic modality in clinical practices. Fundamentally, brain tumor diagnosis is a type of pattern recognition task that requires the integration of information from multi-modal MRI images. However, existing fusion strategies are hindered by the scarcity of multi-modal imaging samples.
IEEE J Biomed Health Inform
September 2025
The personalization of cancer treatment through drug combinations is critical for improving healthcare outcomes, increasing effectiveness, and reducing side effects. Computational methods have become increasingly important to prioritize synergistic drug pairs because of the vast search space of possible chemicals. However, existing approaches typically rely solely on global molecular structures, neglecting information exchange between different modality representations and interactions between molecular and fine-grained fragments, leading to limited understanding of drug synergy mechanisms for personalized treatment.