MGTR-MISS: More Ground Truth Retrieving Based Multimodal Interaction and Semantic Supervision for Video Description

Neural Netw

College of Electronic and Information Engineering, Tongji University, Shanghai 201804, PR China; School of Computer Science and Technology, Tongji University, Shanghai 201804, PR China; Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804, PR China

Published: July 2025


Article Abstract

Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, excellent models have been proposed to generate smooth and semantically rich video descriptions. However, language generally does not participate in encoder training, and different modalities, including vision and language, cannot interact effectively or be aligned accurately. In this work, a novel model named MGTR-MISS, which consists of multimodal interaction and semantic supervision, is proposed to generate more accurate and semantically rich video descriptions with the help of more ground truth. In detail, more external language knowledge is first retrieved from the ground-truth corpus of the training set to capture richer linguistic semantics for the video. Then the visual features and the retrieved linguistic features are fed into a multimodal interaction module to achieve effective interaction and accurate alignment between modalities. The resulting multimodal representation is then fed to a caption generator for language decoding with visual-textual attention and semantic supervision mechanisms. Experimental results on the popular MSVD, MSR-VTT and VATEX datasets show that the proposed MGTR-MISS outperforms not only the baseline model but also recent state-of-the-art methods. In particular, CIDEr scores reach 111.1 on MSVD and 55.0 on MSR-VTT.
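
For readers who want a concrete picture of the pipeline the abstract outlines (retrieval from the ground-truth corpus, multimodal interaction, attention-based decoding), the following is a minimal sketch, assuming PyTorch. The module names, dimensions, cosine-similarity retrieval and GRU decoder are illustrative assumptions, not the authors' implementation, and the semantic supervision loss is omitted.

```python
# Minimal sketch of the described pipeline, assuming PyTorch. All names,
# dimensions and the retrieval strategy are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def retrieve_ground_truth(video_feat, corpus_feats, k=3):
    """Retrieve the k most similar ground-truth sentence embeddings from the
    training corpus (cosine similarity over a mean-pooled video feature)."""
    sims = F.cosine_similarity(video_feat.mean(dim=0, keepdim=True), corpus_feats)
    return corpus_feats[sims.topk(k).indices]      # (k, d)


class MultimodalInteraction(nn.Module):
    """Cross-attention between visual features and retrieved linguistic features."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.vis2txt = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, vis, txt):                   # vis: (1, T, d), txt: (1, k, d)
        fused, _ = self.vis2txt(query=vis, key=txt, value=txt)
        return self.norm(vis + fused)              # residual multimodal representation


class CaptionDecoder(nn.Module):
    """GRU decoder with visual-textual attention over the multimodal memory."""
    def __init__(self, vocab_size, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.attn = nn.MultiheadAttention(d, 8, batch_first=True)
        self.gru = nn.GRU(2 * d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, tokens, memory):             # tokens: (1, L), memory: (1, T, d)
        emb = self.embed(tokens)
        ctx, _ = self.attn(query=emb, key=memory, value=memory)
        h, _ = self.gru(torch.cat([emb, ctx], dim=-1))
        return self.out(h)                         # (1, L, vocab_size) logits


# Toy forward pass with random features standing in for CNN outputs and for
# sentence embeddings of the training-set ground-truth corpus.
vis = torch.randn(1, 20, 512)                      # 20 video segments
corpus = torch.randn(1000, 512)                    # embedded ground-truth sentences
retrieved = retrieve_ground_truth(vis[0], corpus).unsqueeze(0)

memory = MultimodalInteraction()(vis, retrieved)
logits = CaptionDecoder(vocab_size=10000)(torch.tensor([[1, 5, 7]]), memory)
print(logits.shape)                                # torch.Size([1, 3, 10000])
```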

Source
http://dx.doi.org/10.1016/j.neunet.2025.107817

Publication Analysis

Top Keywords

ground truth (12)
multimodal interaction (12)
semantic supervision (12)
interaction semantic (8)
proposed generate (8)
semantically rich (8)
rich video (8)
video descriptions (8)
msvd msr-vtt (8)
video (5)

Similar Publications

Large language models (LLMs) have been successfully used for data extraction from free-text radiology reports. Most current studies were conducted with LLMs accessed via an application programming interface (API). We evaluated the feasibility of using open-source LLMs deployed on limited local hardware resources for data extraction from free-text mammography reports, using a common data element (CDE)-based structure.
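
As a rough illustration of CDE-based structured extraction from free-text reports, the sketch below builds a JSON-constrained prompt for a locally deployed open-source LLM and parses the expected response. The schema fields, prompt wording and example report are assumptions, and the actual model call is deliberately not shown.

```python
# Rough illustration of CDE-schema-based extraction from a free-text report with a
# locally hosted LLM. Schema fields and prompt wording are hypothetical; only the
# prompt construction and response parsing are shown here.
import json

CDE_SCHEMA = {                      # hypothetical common data elements
    "breast_density": "ACR a-d or null",
    "birads_category": "0-6 or null",
    "lesion_present": "true/false",
}

def build_prompt(report_text: str) -> str:
    return (
        "Extract the following fields from the mammography report and answer "
        "with JSON only, using null when a field is not mentioned.\n"
        f"Fields: {json.dumps(CDE_SCHEMA, indent=2)}\n"
        f"Report:\n{report_text}\n"
    )

report = "Scattered fibroglandular densities. No suspicious mass. BI-RADS 2."
prompt = build_prompt(report)

# The actual call would go to a locally deployed open-source model; here we parse
# a hand-written example response instead of querying one.
example_response = '{"breast_density": "b", "birads_category": "2", "lesion_present": false}'
extracted = json.loads(example_response)
print(extracted["birads_category"])   # "2"
```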

Relevancy Aware Cascaded Generative Adversarial Network for LSO-transmission Image Denoising in CT-less PET.

Biomed Phys Eng Express

September 2025

Siemens Healthineers AG, 810 Innovation Dr, Knoxville, Tennessee, 37932-2562, UNITED STATES.

Achieving high-quality PET imaging while minimizing scan time and patient radiation dose presents significant challenges, particularly in the absence of CT-based attenuation maps. Joint reconstruction algorithms, such as MLAA and MLACF, partially address these challenges but often result in noisy and less reliable images. Denoising these images is critical for enhancing diagnostic accuracy.

Data-free knowledge distillation via text-noise fusion and dynamic adversarial temperature.

Neural Netw

September 2025

School of Computer Science, South China Normal University, Guangzhou, 510631, Guangdong, China; School of Artificial Intelligence, South China Normal University, Foshan, 528225, Guangdong, China. Electronic address:

Data-Free Knowledge Distillation (DFKD) has achieved significant breakthroughs, enabling the effective transfer of knowledge from teacher neural networks to student neural networks without reliance on the original data. However, a significant challenge faced by existing methods that attempt to generate samples from random noise is that the noise lacks meaningful information, such as class-specific semantic information. Consequently, this absence of meaningful information makes it difficult for the generator to map the noise to the ground-truth data distribution, resulting in the generation of low-quality training samples.
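
For context, the following is a minimal sketch of a conventional data-free distillation loop, assuming PyTorch: a generator maps noise to synthetic samples and the student is trained to match the teacher on them. The architectures, losses and hyperparameters are illustrative assumptions and do not include the paper's text-noise fusion or dynamic adversarial temperature.

```python
# Minimal sketch of a standard data-free knowledge distillation loop (illustrative
# assumptions only; not the method described in the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

teacher = mlp([32, 64, 10]).eval()      # pretrained teacher (random weights here)
student = mlp([32, 32, 10])
generator = mlp([16, 64, 32])           # maps noise to synthetic "samples"

opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(100):
    z = torch.randn(64, 16)             # uninformative random noise: the issue the
                                        # paper addresses by fusing in text semantics
    # 1) Generator tries to produce samples where student and teacher disagree.
    fake = generator(z)
    loss_g = -F.kl_div(F.log_softmax(student(fake), -1),
                       F.softmax(teacher(fake), -1), reduction="batchmean")
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # 2) Student tries to match the teacher on the generated samples.
    fake = generator(z).detach()
    loss_s = F.kl_div(F.log_softmax(student(fake), -1),
                      F.softmax(teacher(fake), -1), reduction="batchmean")
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```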

Leveraging GPT-4o for Automated Extraction and Categorization of CAD-RADS Features From Free-Text Coronary CT Angiography Reports: Diagnostic Study.

JMIR Med Inform

September 2025

Departments of Radiology, The Third Affiliated Hospital, Sun Yat-Sen University, 600 Tianhe Road, Guangzhou, Guangdong, 510630, China, 86 18922109279, 86 20852523108.

Background: Despite the Coronary Artery Disease Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives.

Objective: To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories.

Unifying DNA methylation-based cell-type deconvolution with .

Bioinform Adv

September 2025

Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, 85354, Germany.

Summary: Cell-type deconvolution is widely applied to gene expression and DNA methylation data, but access to methods for the latter remains limited. We introduce , a new R package that simplifies access to DNA methylation-based deconvolution methods predominantly for blood data, and we additionally compare their estimates to those from gene expression and experimental ground truth data using a unique matched blood dataset.

Availability And Implementation: is available at https://github.
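
As background on what cell-type deconvolution computes, the following toy sketch frames reference-based deconvolution as non-negative least squares in Python; the simulated reference matrix and proportions are assumptions and are unrelated to the R package introduced here.

```python
# Toy sketch of reference-based cell-type deconvolution as non-negative least
# squares, assuming NumPy/SciPy; all data are simulated for illustration.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_sites, n_cell_types = 200, 5

reference = rng.uniform(0, 1, (n_sites, n_cell_types))        # per-cell-type methylation profiles
true_props = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
bulk = reference @ true_props + rng.normal(0, 0.01, n_sites)   # mixed "bulk" signal

props, _ = nnls(reference, bulk)      # non-negative coefficients
props = props / props.sum()           # renormalize to proportions summing to 1
print(np.round(props, 2))             # close to the simulated true proportions
```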
