Instruction-Guided Scene Text Recognition.

Yongkun Du , Zhineng Chen , Yuchen Su , Caiyan Jia , Yu-Gang Jiang

IEEE Trans Pattern Anal Mach Intell

Published: April 2025

Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module and a multi-task answer head, which guides nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs from current methods considerably. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins, while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previous challenges.

Download full-text PDF	Source
http://dx.doi.org/10.1109/TPAMI.2025.3525526	DOI Listing

Publication Analysis

Top Keywords

scene text

text recognition

instruction-guided scene

text images

character attributes

text

recognition

igtr

recognition multi-modal

multi-modal models

Similar Publications

Detection and Recognition of Bilingual Urdu and English Text in Natural Scene Images Using a Convolutional Neural Network-Recurrent Neural Network Combination with a Connectionist Temporal Classification Decoder.

Sensors (Basel)

August 2025

Department of Electronic Engineering, Yeungnam University, Gyeongsan-si 38541, Republic of Korea.

Khadija Tul Kubra , Muhammad Umair , Muhammad Zubair , Muhammad Tahir Naseem , Chan-Su Lee

Urdu and English are widely used for visual text communications worldwide in public spaces such as signboards and navigation boards. Text in such natural scenes contains useful information for modern-era applications such as language translation for foreign visitors, robot navigation, and autonomous vehicles, highlighting the importance of extracting these texts. Previous studies focused on Urdu alone or printed text pasted manually on images and lacked sufficiently large datasets for effective model training.

View Article and Find Full Text PDF

Similar Publications

Language-Driven Cross-Attention for Visible-Infrared Image Fusion Using CLIP.

Sensors (Basel)

August 2025

Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China.

Xue Wang , Jiatong Wu , Pengfei Zhang , Zhongjun Yu

Language-guided multimodal fusion, which integrates information from both visible and infrared images, has shown strong performance in image fusion tasks. In low-light or complex environments, a single modality often fails to fully capture scene features, whereas fused images enable robots to obtain multidimensional scene understanding for navigation, localization, and environmental perception. This capability is particularly important in applications such as autonomous driving, intelligent surveillance, and search-and-rescue operations, where accurate recognition and efficient decision-making are critical.

View Article and Find Full Text PDF

Similar Publications

OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation.

IEEE Trans Pattern Anal Mach Intell

August 2025

Bohan Li , Xin Jin , Jianan Wang , Yukai Shi , Yasheng Sun

Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, acting as a data augmenter to generate synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect.

View Article and Find Full Text PDF

Similar Publications

Counting with ease: Class-agnostic counting via one-shot detection across diverse domains.

Neural Netw

August 2025

School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China. Electronic address:

Zhongxing Peng , Bohui Guo , Shugong Xu

Class-agnostic counting is increasingly prevalent in industrial and agricultural applications. However, most deployable methods rely on density maps, which (1) struggle with background interference in complex scenes, and (2) fail to provide precise object locations, limiting downstream usability. The advancement of class-agnostic counting is hindered by suboptimal model designs and the lack of datasets with bounding box annotations.

View Article and Find Full Text PDF

Similar Publications

Screening positively for PTSD: Examining the role of avoidance for public safety personnel.

Psychol Trauma

August 2025

Department of Psychology, Psychological Trauma and Stress Systems Lab, Faculty of Arts, University of Regina.

Robyn E Shields , Terence M Keane , Blake A E Boehme , Gordon J G Asmundson , R Nicholas Carleton

Objective: Public safety personnel (PSP) frequently screen positive for posttraumatic stress disorder (PTSD) based on the PTSD Checklist for (PCL-5). Approximately 30% of Canadian paramedics who might otherwise screen positively for PTSD using the PCL-5 may not because avoidance items are not endorsed, arguably as a function of their service requirements (e.g.

View Article and Find Full Text PDF

Similar Publications