Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning.

Peipei Song , Dan Guo , Jinxing Zhou , Mingliang Xu , Meng Wang

IEEE Trans Cybern

Published: July 2023

Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

Most works of image captioning are implemented under the full supervision of paired image-caption data. Limited to expensive cost of data collection, the task of unpaired image captioning has attracted researchers' attention. In this article, we propose a novel memorial GAN (MemGAN) with the joint semantic optimization for unpaired image captioning. The core idea is to explore implicit semantic correlation between disjointed images and sentences through building a multimodal semantic-aware space (SAS). Concretely, each modality is mapped into a unified multimodal SAS, where SAS includes the semantic vectors of image I , visual concepts O , unpaired sentence S , and the generated caption C . We adopt the memory unit based on multihead attention and relational gate as a backbone to preserve and transit crucial multimodal semantics in the SAS for image caption generation and sentence reconstruction. Then, the memory unit is embedded into a GAN framework to exploit the semantic similarity and relevance in SAS, that is, imposing a joint semantic-aware optimization on SAS without supervision clues. To summarize, the proposed MemGAN learns the latent semantic relevance of SAS's multimodalities in an adversarial manner. Extensive experiments and qualitative results demonstrate the effectiveness of MemGAN, achieving improvements over state of the arts on unpaired image captioning benchmarks.

Download full-text PDF	Source
http://dx.doi.org/10.1109/TCYB.2022.3175012	DOI Listing

Publication Analysis

Top Keywords

image captioning

unpaired image

memorial gan

joint semantic

semantic optimization

optimization unpaired

memory unit

image

semantic

sas

Similar Publications

Ensemble deep learning with image captioning for visual pollution detection, classification, and reporting.

Sci Rep

August 2025

Computer Science Department, College of Computing and Informatics, Saudi Electronic University, 11673, Riyadh, Saudi Arabia.

Haya Almalki , Nahlah Algethami

With the rapid urban development and initiatives such as Saudi Vision 2030, efforts have been directed toward improving services and quality of life in Saudi cities. As a result, multiple environmental challenges have emerged, including visual pollution (VP), which significantly impacts the quality of life. Current approaches to these challenges rely on reporting through an online application managed by the Ministry of Municipalities and Housing, which is prone to errors due to manual data entry.

View Article and Find Full Text PDF

Similar Publications

Captioner: Improving change captioning by leveraging momentum cross-view and cross-modality contrastive learning.

Neural Netw

August 2025

National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, 610064, PR China; College of Computer Science, Sichuan University, Chengdu, 610065, PR China. Electronic address:

Lin Deng , Borui Kang , Yuzhong Zhong , Maoning Wang , Jianwei Zhang

The primary goal of change captioning is to identify subtle visual differences between two similar images and express them in natural language. Existing research has been significantly influenced by the task of vision change detection and has mainly concentrated on the identification and description of visual changes. However, we contend that an effective change captioner should go beyond mere detection and description of what has changed.

View Article and Find Full Text PDF

Similar Publications

High-level visual representations in the human brain are aligned with large language models.

Nat Mach Intell

August 2025

cerebrUM, Département de Psychologie, Université de Montréal, Montreal, Quebec Canada.

Adrien Doerig , Tim C Kietzmann , Emily Allen , Yihan Wu , Thomas Naselaris

The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes.

View Article and Find Full Text PDF

Similar Publications

Improving Video Summarization by Exploring the Coherence Between Corresponding Captions.

IEEE Trans Image Process

January 2025

Cheng Ye , Weidong Chen , Bo Hu , Lei Zhang , Yongdong Zhang

Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary, which lacks holistic considerations. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing.

View Article and Find Full Text PDF

Similar Publications

MinD-3D++: Advancing fMRI-Based 3D Reconstruction With High-Quality Textured Mesh Generation and a Comprehensive Dataset.

IEEE Trans Pattern Anal Mach Intell

August 2025

Jianxiong Gao , Yanwei Fu , Yuqian Fu , Yun Wang , Xuelin Qian

Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4,768 3D objects. The dataset consists of two components: fMRI-Shape, previously introduced and available at https://huggingface.

View Article and Find Full Text PDF

Similar Publications