Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. To address this gap, this study introduces a new Urdu Image Captioning Dataset (UCID), named UC-23-RY, comprising 159,816 Urdu captions derived from the Flickr30k dataset. It also proposes deep learning architectures tailored to Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. In evaluations assessing each model's impact on caption quality, the NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively. The study thus contributes a valuable dataset and demonstrates the suitability of advanced deep learning models for automatic Urdu image captioning.
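As a rough illustration of the encoder-decoder pairing the abstract names, the sketch below wires a frozen ResNet-50 image encoder to an LSTM caption decoder in Keras. It follows the common merge-style captioning recipe rather than the paper's exact architecture; the vocabulary size, caption length, and layer widths are assumed placeholders, not values taken from the study.

```python
# Minimal merge-style captioning sketch: frozen ResNet-50 encoder + LSTM decoder.
# VOCAB_SIZE, MAX_LEN, EMBED_DIM and LSTM_UNITS are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000   # assumed Urdu vocabulary size
MAX_LEN    = 30      # assumed maximum caption length (tokens)
EMBED_DIM  = 256
LSTM_UNITS = 512

# Encoder: frozen ResNet-50 mapping a 224x224 RGB image to a 2048-d feature vector.
cnn = ResNet50(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False

def extract_features(image_batch):
    """image_batch: float32 tensor of shape (N, 224, 224, 3)."""
    return cnn(preprocess_input(image_batch))

# Decoder: image features and the partial caption are merged to predict the next token.
img_input = layers.Input(shape=(2048,), name="image_features")
img_dense = layers.Dense(EMBED_DIM, activation="relu")(img_input)

seq_input = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
seq_embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_input)
seq_state = layers.LSTM(LSTM_UNITS)(seq_embed)

merged = layers.add([layers.Dense(LSTM_UNITS)(img_dense), seq_state])
output = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

captioner = Model(inputs=[img_input, seq_input], outputs=output)
captioner.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
captioner.summary()
```

Swapping ResNet50 for tf.keras.applications.NASNetLarge (whose pooled feature size is 4032 rather than 2048) would give the second variant mentioned in the abstract, and BLEU-1 scores like those reported can be computed with, for example, NLTK's corpus_bleu using weights (1, 0, 0, 0).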
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12129323 | PMC |
| http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0320701 | PLOS |
IEEE Trans Pattern Anal Mach Intell
September 2025
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we propose an affirmative solution. We analyze the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework.
Sci Rep
August 2025
Computer Science Department, College of Computing and Informatics, Saudi Electronic University, 11673, Riyadh, Saudi Arabia.
With the rapid urban development and initiatives such as Saudi Vision 2030, efforts have been directed toward improving services and quality of life in Saudi cities. As a result, multiple environmental challenges have emerged, including visual pollution (VP), which significantly impacts the quality of life. Current approaches to these challenges rely on reporting through an online application managed by the Ministry of Municipalities and Housing, which is prone to errors due to manual data entry.
Neural Netw
August 2025
National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, 610064, PR China; College of Computer Science, Sichuan University, Chengdu, 610065, PR China. Electronic address:
The primary goal of change captioning is to identify subtle visual differences between two similar images and express them in natural language. Existing research has been significantly influenced by the task of vision change detection and has mainly concentrated on the identification and description of visual changes. However, we contend that an effective change captioner should go beyond mere detection and description of what has changed.
Nat Mach Intell
August 2025
cerebrUM, Département de Psychologie, Université de Montréal, Montreal, Quebec, Canada.
The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes.
IEEE Trans Image Process
January 2025
Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches focus only on recognizing key video segments to generate the summary, without a holistic view of the whole video. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing.