Describing a video with accurate words and appropriate sentence patterns is both interesting and challenging. Recently, excellent models have been proposed that generate smooth and semantically rich video descriptions. However, language generally does not participate in encoder training, so different modalities, including vision and language, cannot interact effectively or be aligned accurately. In this work, a novel model named MGTR-MISS, which combines multimodal interaction and semantic supervision, is proposed to generate more accurate and semantically rich video descriptions with the help of additional ground truth. In detail, external language knowledge is first retrieved from the ground-truth corpus of the training set to capture richer linguistic semantics for the video. The visual features and the retrieved linguistic features are then fed into a multimodal interaction module to achieve effective interaction and accurate alignment between modalities. The resulting multimodal representation is passed to a caption generator for language decoding with visual-textual attention and semantic supervision mechanisms. Experimental results on the popular MSVD, MSR-VTT and VATEX datasets show that the proposed MGTR-MISS outperforms not only the baseline model but also recent state-of-the-art methods. In particular, the CIDEr scores reach 111.1 and 55.0 on MSVD and MSR-VTT, respectively.
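The multimodal interaction step described above can be sketched as cross-attention from visual tokens to the retrieved linguistic tokens. The projection sizes, weight initialization, and function names below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, linguistic, d_k=64, seed=0):
    """Let visual tokens attend to retrieved linguistic tokens.

    visual:     (n_frames, d_v) frame-level features
    linguistic: (n_words, d_l)  retrieved sentence features
    Returns a (n_frames, d_k) language-aware visual representation.
    """
    rng = np.random.default_rng(seed)
    # Randomly initialized projections stand in for learned weights.
    W_q = rng.standard_normal((visual.shape[1], d_k)) / np.sqrt(visual.shape[1])
    W_k = rng.standard_normal((linguistic.shape[1], d_k)) / np.sqrt(linguistic.shape[1])
    W_v = rng.standard_normal((linguistic.shape[1], d_k)) / np.sqrt(linguistic.shape[1])
    Q, K, V = visual @ W_q, linguistic @ W_k, linguistic @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n_frames, n_words) alignment
    return attn @ V                          # fuse language into visual tokens

fused = cross_modal_attention(np.random.rand(8, 512), np.random.rand(20, 300))
print(fused.shape)  # (8, 64)
```

Each visual token ends up as a weighted mixture of the retrieved linguistic features, which is the kind of cross-modal alignment the abstract describes.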
DOI: http://dx.doi.org/10.1016/j.neunet.2025.107817
J Imaging Inform Med
September 2025
Department of Diagnostic, Interventional and Pediatric Radiology (DIPR), Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland.
Large language models (LLMs) have been used successfully to extract data from free-text radiology reports. Most studies to date accessed LLMs via an application programming interface (API). We evaluated the feasibility of using open-source LLMs, deployed on limited local hardware, to extract data from free-text mammography reports using a common data element (CDE)-based structure.
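One piece of a CDE-based extraction pipeline can be sketched as validating a local LLM's JSON answer against a fixed element schema. The CDE names, allowed values, and response format below are hypothetical, not the study's actual data elements or prompt:

```python
import json

# Hypothetical CDEs for a mammography report; the study's real CDE set
# is not reproduced here.
MAMMO_CDES = {"breast_density": {"A", "B", "C", "D"},
              "birads_category": {"0", "1", "2", "3", "4", "5", "6"}}

def parse_cde_response(raw: str) -> dict:
    """Validate a local LLM's JSON answer against the expected CDEs.

    Unknown keys are dropped; missing or out-of-range values become None,
    so downstream analysis always sees the same clean schema.
    """
    data = json.loads(raw)
    return {cde: (str(data.get(cde)) if str(data.get(cde)) in allowed else None)
            for cde, allowed in MAMMO_CDES.items()}

reply = '{"breast_density": "C", "birads_category": "4", "extra": "ignored"}'
print(parse_cde_response(reply))  # {'breast_density': 'C', 'birads_category': '4'}
```

Constraining the output to a closed vocabulary per element is what makes free-text extraction auditable at scale.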
Biomed Phys Eng Express
September 2025
Siemens Healthineers AG, 810 Innovation Dr, Knoxville, Tennessee, 37932-2562, United States.
Achieving high-quality PET imaging while minimizing scan time and patient radiation dose presents significant challenges, particularly in the absence of CT-based attenuation maps. Joint reconstruction algorithms, such as MLAA and MLACF, partially address these challenges but often result in noisy and less reliable images. Denoising these images is critical for enhancing diagnostic accuracy.
Neural Netw
September 2025
School of Computer Science, South China Normal University, Guangzhou, 510631, Guangdong, China; School of Artificial Intelligence, South China Normal University, Foshan, 528225, Guangdong, China. Electronic address:
Data-Free Knowledge Distillation (DFKD) has achieved significant breakthroughs, enabling effective transfer of knowledge from teacher to student neural networks without reliance on the original data. However, existing methods that attempt to generate samples from random noise face a significant challenge: the noise lacks meaningful information, such as class-specific semantics. This absence makes it difficult for the generator to map the noise onto the ground-truth data distribution, resulting in low-quality training samples.
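The noise-to-samples-to-student pipeline the abstract criticizes can be illustrated with a toy sketch: a frozen random "generator" maps class-agnostic noise to synthetic samples, and a linear "student" is trained to match a linear "teacher" on those samples. All names and dimensions are invented for illustration; this is not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
d_noise, d_in, d_out = 4, 8, 3

# A fixed "teacher" and a frozen random "generator" stand in for trained networks.
teacher_W = rng.standard_normal((d_in, d_out))
gen_W = rng.standard_normal((d_noise, d_in))
student_W = np.zeros((d_in, d_out))

lr = 0.1
for _ in range(1000):
    z = rng.standard_normal((32, d_noise))   # class-agnostic random noise
    x = np.tanh(z @ gen_W)                   # synthetic "data" from the generator
    err = x @ student_W - x @ teacher_W      # student vs. teacher logits
    student_W -= lr * x.T @ err / len(x)     # gradient step on 0.5 * MSE

# The student now mimics the teacher, but only on generator outputs.
z = rng.standard_normal((256, d_noise))
x = np.tanh(z @ gen_W)
gap = np.abs(x @ student_W - x @ teacher_W).mean()
print(f"mean logit gap on generated data: {gap:.4f}")
```

The sketch also exposes the failure mode the abstract points at: the student only matches the teacher on whatever distribution the generator happens to produce, so uninformative noise yields uninformative supervision.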
JMIR Med Inform
September 2025
Departments of Radiology, The Third Affiliated Hospital, Sun Yat-Sen University, 600 Tianhe Road, Guangzhou, Guangdong, 510630, China.
Background: Despite the Coronary Artery Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives.
Objective: To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories.
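As a minimal illustration of turning report text into the structured categories named above, the sketch below pulls CAD-RADS and P categories out with regular expressions. The patterns and the assumption that the categories appear in this textual form are hypothetical; this is not the study's actual GPT-4o pipeline:

```python
import re

# Illustrative regexes over report or model-output text.
CADRADS_RE = re.compile(r"CAD-RADS\s*([0-5])", re.IGNORECASE)
P_RE = re.compile(r"\bP([1-4])\b")

def extract_categories(text: str) -> dict:
    """Return CAD-RADS stenosis category and plaque-burden (P) category, if present."""
    cad = CADRADS_RE.search(text)
    p = P_RE.search(text)
    return {"cad_rads": cad.group(1) if cad else None,
            "p_category": p.group(1) if p else None}

out = extract_categories("Impression: CAD-RADS 3 / P2, moderate stenosis of the LAD.")
print(out)  # {'cad_rads': '3', 'p_category': '2'}
```

Rule-based extraction like this works only when reports already state the categories explicitly; the point of using an LLM is to infer them when free-text reports do not.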
Bioinform Adv
September 2025
Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, 85354, Germany.
Summary: Cell-type deconvolution is widely applied to gene expression and DNA methylation data, but access to methods for the latter remains limited. We introduce , a new R package that simplifies access to DNA methylation-based deconvolution methods predominantly for blood data, and we additionally compare their estimates to those from gene expression and experimental ground truth data using a unique matched blood dataset.
Availability And Implementation: is available at https://github.
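Reference-based deconvolution of the kind compared above can be sketched as solving bulk ≈ reference × proportions. The least-squares-plus-clipping approach below is a deliberately simple stand-in (real methylation deconvolution methods use constrained optimization), and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cpgs, n_celltypes = 200, 4

# Simulated per-cell-type methylation profiles and a noisy bulk mixture.
reference = rng.uniform(0, 1, (n_cpgs, n_celltypes))
true_props = np.array([0.5, 0.3, 0.15, 0.05])
bulk = reference @ true_props + rng.normal(0, 0.01, n_cpgs)

# Unconstrained least squares, then project back onto the simplex
# (clip negatives, renormalize to sum to 1).
est, *_ = np.linalg.lstsq(reference, bulk, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum()
print(np.round(est, 2))
```

Comparing such estimates against experimental ground truth, as the matched blood dataset above allows, is what distinguishes a benchmark from a plausibility check.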