Multi-modal models have shown impressive performance in visual recognition tasks, as free-form text-guided training elicits the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes such as character frequency and position. IGTR first devises instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs considerably from current methods. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which have long been challenging.
DOI: http://dx.doi.org/10.1109/TPAMI.2025.3525526
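To make the instruction-learning formulation concrete, here is a minimal, self-contained sketch of how attribute-style instruction triplets could be sampled from a ground-truth label. The triplet schema, question phrasing, and function name are illustrative assumptions, not IGTR's actual interface.

```python
# Hypothetical sketch of IGTR-style instruction triplet sampling.
# Each triplet pairs a question about a character attribute (frequency
# or position) with its ground-truth answer derived from the label.
import random
from collections import Counter

def sample_instruction_triplets(label: str, k: int = 3):
    """Build (condition, question, answer) triplets for a text label."""
    triplets = []
    freq = Counter(label)
    for _ in range(k):
        kind = random.choice(["frequency", "position"])
        if kind == "frequency":
            ch = random.choice(label)
            triplets.append((label, f"How many times does '{ch}' occur?", freq[ch]))
        else:
            i = random.randrange(len(label))
            triplets.append((label, f"Which character is at position {i}?", label[i]))
    return triplets

print(sample_instruction_triplets("street"))
```

Varying the mix of attribute questions is, per the abstract, what lets a single model realize different recognition pipelines and rebalance training toward rare or confusable characters.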
Sensors (Basel)
August 2025
Department of Electronic Engineering, Yeungnam University, Gyeongsan-si 38541, Republic of Korea.
Urdu and English are widely used for visual text communication in public spaces worldwide, such as on signboards and navigation boards. Text in such natural scenes carries useful information for modern applications, including language translation for foreign visitors, robot navigation, and autonomous vehicles, which highlights the importance of extracting it. Previous studies focused on Urdu alone or on printed text pasted manually onto images, and lacked sufficiently large datasets for effective model training.
Sensors (Basel)
August 2025
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China.
Language-guided multimodal fusion, which integrates information from both visible and infrared images, has shown strong performance in image fusion tasks. In low-light or complex environments, a single modality often fails to fully capture scene features, whereas fused images enable robots to obtain multidimensional scene understanding for navigation, localization, and environmental perception. This capability is particularly important in applications such as autonomous driving, intelligent surveillance, and search-and-rescue operations, where accurate recognition and efficient decision-making are critical.
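As a toy illustration of language-guided fusion (not this paper's architecture), the sketch below uses a text embedding to predict per-channel gates that mix visible and infrared feature maps; the dimensions and module name are assumptions.

```python
# Hypothetical text-gated fusion: a text embedding predicts per-channel
# gates in [0, 1] that form a convex combination of the two modalities.
import torch
import torch.nn as nn

class TextGatedFusion(nn.Module):
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        # Map the text embedding to one mixing gate per feature channel.
        self.gate = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())

    def forward(self, vis, ir, text_emb):
        # vis, ir: (B, C, H, W); text_emb: (B, text_dim)
        g = self.gate(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return g * vis + (1.0 - g) * ir  # language decides the modality mix

fusion = TextGatedFusion(channels=64, text_dim=512)
fused = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32),
               torch.randn(2, 512))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

The gating design makes the fallback behavior explicit: in low-light scenes the text condition can push the gates toward the infrared branch, matching the motivation given in the abstract.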
IEEE Trans Pattern Anal Mach Intell
August 2025
Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate the two processes, using generation merely as a data augmenter that supplies synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect.
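A minimal, self-contained sketch of the mutual-learning idea, with toy stand-ins for the diffusion and perception branches (none of this is OccScene's actual code): the denoised output of the generative branch feeds the perception branch, so a single backward pass couples the two objectives.

```python
# Toy joint-training step: generation and perception share gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

gen = nn.Conv2d(3, 3, 3, padding=1)                          # toy noise predictor
per = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # toy perception head
opt = torch.optim.Adam(list(gen.parameters()) + list(per.parameters()), lr=1e-3)

noisy = torch.randn(4, 3, 8, 8)       # noised scene input
noise = torch.randn(4, 3, 8, 8)       # target noise for the diffusion loss
labels = torch.randint(0, 10, (4,))   # ground-truth semantic labels

pred_noise = gen(noisy)
gen_loss = F.mse_loss(pred_noise, noise)           # generative objective
denoised = noisy - pred_noise                      # crude one-step denoise
per_loss = F.cross_entropy(per(denoised), labels)  # perception on generated scene
(gen_loss + per_loss).backward()                   # gradients couple both tasks
opt.step()
```

The key contrast with the data-augmenter setup is that the perception loss here backpropagates into the generator, rather than the generator producing frozen synthetic data offline.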
Neural Netw
August 2025
School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China.
Class-agnostic counting is increasingly prevalent in industrial and agricultural applications. However, most deployable methods rely on density maps, which (1) struggle with background interference in complex scenes, and (2) fail to provide precise object locations, limiting downstream usability. The advancement of class-agnostic counting is hindered by suboptimal model designs and the lack of datasets with bounding box annotations.
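For context on the density-map paradigm this abstract critiques, a small NumPy sketch: the count estimate is simply the integral (sum) of the predicted map, which is exactly why density-based methods return no object coordinates. The blob placement below is synthetic.

```python
# Density-map counting: the count is the sum over the map, with no boxes.
import numpy as np

density = np.zeros((64, 64), dtype=np.float32)
# Deposit three unit-mass Gaussian blobs as toy stand-ins for objects.
for cy, cx in [(10, 12), (30, 40), (50, 20)]:
    y, x = np.ogrid[:64, :64]
    blob = np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * 2.0 ** 2))
    density += blob / blob.sum()       # normalize so each blob sums to 1

count = density.sum()                  # estimated count = integral of the map
print(round(float(count)))             # -> 3, but no object locations survive
```

Background clutter that leaks mass into the map inflates this sum directly, which is the interference failure mode the paragraph points out.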
Psychol Trauma
August 2025
Department of Psychology, Psychological Trauma and Stress Systems Lab, Faculty of Arts, University of Regina.
Objective: Public safety personnel (PSP) frequently screen positive for posttraumatic stress disorder (PTSD) based on the PTSD Checklist for DSM-5 (PCL-5). Approximately 30% of Canadian paramedics who might otherwise screen positive for PTSD using the PCL-5 may not because avoidance items are not endorsed, arguably as a function of their service requirements.
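To illustrate why unendorsed avoidance items can block a positive screen, here is a minimal sketch of the standard PCL-5 DSM-5 cluster rule (at least one B item, one C item, two D items, and two E items rated 2 or higher); treat the thresholds as an assumption to verify against the official scoring guide.

```python
# Sketch of PCL-5 cluster-based screening (thresholds assumed; verify
# against the official scoring guide). Items are 0-indexed: B = 1-5,
# C (avoidance) = 6-7, D = 8-14, E = 15-20 on the instrument.
CLUSTERS = {"B": range(0, 5), "C": range(5, 7), "D": range(7, 14), "E": range(14, 20)}
REQUIRED = {"B": 1, "C": 1, "D": 2, "E": 2}

def pcl5_cluster_screen(items):
    """items: 20 ratings, each 0-4. True when every cluster is met."""
    assert len(items) == 20
    return all(sum(items[i] >= 2 for i in idx) >= REQUIRED[c]
               for c, idx in CLUSTERS.items())

# High symptom load everywhere except the two avoidance items:
ratings = [3] * 5 + [0, 0] + [3] * 13
print(sum(ratings), pcl5_cluster_screen(ratings))  # 54 total, yet False
```

The example mirrors the abstract's point: a respondent can carry a high overall symptom load yet screen negative solely because neither cluster C item is endorsed.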