Existing publicly available datasets with pixel-level labels contain limited categories, making it difficult to generalize to the real world, which contains thousands of categories. In this paper, we propose an approach to automatically generate object masks with detailed pixel-level structures/boundaries, enabling semantic image segmentation of thousands of targets in the real world without manual labelling. A Guided Filter Network (GFN) is first developed to learn segmentation knowledge from an existing dataset, and this GFN then transfers the learned segmentation knowledge to generate initial coarse object masks for the target images. These coarse object masks are treated as pseudo labels to iteratively self-optimize the GFN on the target images. Our experiments on six image sets demonstrate that our proposed approach can generate object masks with detailed pixel-level structures/boundaries whose quality is comparable to that of manually-labelled masks. Our proposed approach also achieves better performance on semantic image segmentation than most existing weakly-supervised, semi-supervised, and domain-adaptation approaches under the same experimental conditions.
DOI: http://dx.doi.org/10.1109/TIP.2022.3160399
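The iterative pseudo-label scheme described in the abstract (coarse masks produced by the GFN on target images, then used as labels to re-train it) can be illustrated with a short training loop. This is only a minimal sketch, not the authors' implementation: `gfn`, `target_loader`, the optimizer, and the confidence-filtering heuristic are all assumptions added here for illustration.

```python
# Minimal sketch of iterative pseudo-label self-optimization (assumptions:
# `gfn` is any PyTorch segmentation network producing per-class logits,
# `target_loader` yields batches of unlabelled target images).
import torch
import torch.nn.functional as F

def self_optimize(gfn, target_loader, optimizer, rounds=3, conf_thresh=0.9):
    for _ in range(rounds):
        # 1) Generate coarse object masks on the target images; keep only confident pixels.
        pseudo = []
        gfn.eval()
        with torch.no_grad():
            for imgs in target_loader:
                probs = torch.softmax(gfn(imgs), dim=1)   # (B, C, H, W) class probabilities
                conf, labels = probs.max(dim=1)           # per-pixel confidence and label
                labels[conf < conf_thresh] = 255          # mark uncertain pixels as "ignore"
                pseudo.append((imgs, labels))
        # 2) Treat the masks as pseudo labels and re-train the network on them.
        gfn.train()
        for imgs, labels in pseudo:
            loss = F.cross_entropy(gfn(imgs), labels, ignore_index=255)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return gfn
```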
IEEE Trans Pattern Anal Mach Intell
September 2025
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding-box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and to streamline the overall process.
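A joint-training objective of the kind this abstract argues for can be sketched as a single step that supervises a box head and a mask head together; the model interface, loss choices, and weighting below are purely illustrative assumptions, not the paper's design.

```python
# Illustrative joint GREC/GRES training step (hypothetical `model` returning
# box predictions and mask logits from an image-text pair).
import torch.nn.functional as F

def joint_step(model, images, text, gt_boxes, gt_masks, optimizer, lam=1.0):
    pred_boxes, mask_logits = model(images, text)
    box_loss = F.l1_loss(pred_boxes, gt_boxes)                              # coarse, box-level (GREC) term
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)   # pixel-level (GRES) term
    loss = box_loss + lam * mask_loss                                       # one shared objective -> consistent predictions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```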
Sensors (Basel)
August 2025
School of Transportation, Southeast University, Nanjing 211189, China.
Foreign Object Debris (FOD) on airport pavements poses a serious threat to aviation safety, making accurate detection and interpretable scene understanding crucial for operational risk management. This paper presents an integrated multi-modal framework that combines an enhanced YOLOv7-X detector, a cascaded YOLO-SAM segmentation module, and a structured prompt engineering mechanism to generate detailed semantic descriptions of detected FOD. Detection performance is improved through the integration of Coordinate Attention, Spatial-Depth Conversion (SPD-Conv), and a Gaussian Similarity IoU (GSIoU) loss, leading to a 3.
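The cascaded detect-then-segment idea (detector boxes used as prompts for a segmentation model) reduces to a short loop over detections. The helpers below are hypothetical stand-ins for the enhanced YOLOv7-X detector and the SAM module, so this sketch shows only the data flow, not the paper's components.

```python
# Sketch of the detection -> prompted-segmentation cascade. The detector and
# segmenter are passed in as callables; their internals are not shown here.
import numpy as np

def cascade_fod(image: np.ndarray, detect_fod, segment_with_box_prompt):
    """detect_fod(image) -> iterable of (box, score, label);
    segment_with_box_prompt(image, box) -> binary mask for that box."""
    results = []
    for box, score, label in detect_fod(image):          # coarse bounding boxes from the detector
        mask = segment_with_box_prompt(image, box)        # box prompt refined into a pixel-level mask
        results.append({"label": label, "score": score, "box": box, "mask": mask})
    return results
```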
Sensors (Basel)
August 2025
College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China.
In GNSS-deprived settings, such as indoor and underground environments, research on simultaneous localization and mapping (SLAM) technology remains a focal point. Mitigating the influence of dynamic objects on positional precision and constructing a persistent map comprising solely static elements are pivotal objectives in visual SLAM for dynamic scenes. This paper introduces optical-flow motion-segmentation-based SLAM (OS-SLAM), a dynamic-environment SLAM system that incorporates optical flow motion segmentation for enhanced robustness.
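As a rough illustration of flow-based motion segmentation in a SLAM front end, the snippet below flags pixels whose optical-flow magnitude deviates strongly from the frame-wide statistics; using the median as a stand-in for ego-motion compensation is a simplification assumed here, not the OS-SLAM method.

```python
# Toy flow-based motion mask using OpenCV's Farneback optical flow.
import cv2
import numpy as np

def dynamic_mask(prev_gray: np.ndarray, curr_gray: np.ndarray, k: float = 3.0) -> np.ndarray:
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)          # per-pixel flow magnitude
    baseline = np.median(mag)                   # crude stand-in for camera ego-motion
    return mag > baseline + k * mag.std()       # True where motion deviates strongly (likely dynamic)
```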
Sensors (Basel)
August 2025
College of Electronics and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China.
To address the challenging problem of multi-scale inshore/offshore ship detection in synthetic aperture radar (SAR) remote sensing images, we propose a novel deep-learning-based automatic ship detection method within the framework of compositional learning. The proposed method rests on three pillars: context-guided region proposal, prototype-based model pretraining, and multi-model ensemble learning. To reduce false alarms induced by discrete ground clutter, prior knowledge of the harbour's layout is exploited to generate land masks for terrain delimitation.
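Terrain delimitation with a prior land mask amounts to discarding detections that fall on land. The snippet below assumes a hypothetical detection format (x1, y1, x2, y2, score) and a binary land mask, and only illustrates the idea of suppressing clutter-induced false alarms.

```python
# Suppress detections whose centre lies on land, given a binary land mask
# (1 = land, 0 = water); detection tuples are an assumed format.
import numpy as np

def suppress_land_alarms(detections, land_mask: np.ndarray):
    kept = []
    for x1, y1, x2, y2, score in detections:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        if land_mask[cy, cx] == 0:               # keep only detections centred on water
            kept.append((x1, y1, x2, y2, score))
    return kept
```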
Entropy (Basel)
July 2025
School of Artificial Intelligence, Xidian University, Xi'an 710071, China.
Leveraging the ability of Vision Transformers (ViTs) to model contextual information across spatial patches, Masked Image Modeling (MIM) has emerged as a successful pre-training paradigm for visual representation learning by masking parts of the input and reconstructing the original image. However, this characteristic of ViTs has led many existing MIM methods to focus primarily on spatial patch reconstruction, overlooking the importance of semantic continuity in the channel dimension. Therefore, we propose a novel Masked Channel Modeling (MCM) pre-training paradigm, which reconstructs masked channel features using the contextual information from unmasked channels, thereby enhancing the model's understanding of images from the perspective of channel semantic continuity.
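The channel-masking objective can be illustrated with a toy loss: mask a random subset of feature channels, reconstruct the full channel stack from the visible ones, and penalize the error only on the masked channels. The encoder/decoder interface and the mask ratio below are assumptions for illustration, not the paper's MCM architecture.

```python
# Toy masked-channel-modeling loss (hypothetical encoder/decoder that map a
# feature map back to its full channel stack).
import torch
import torch.nn.functional as F

def mcm_loss(encoder, decoder, feats: torch.Tensor, mask_ratio: float = 0.5):
    B, C, H, W = feats.shape
    keep = torch.rand(C, device=feats.device) > mask_ratio     # channels left visible
    visible = feats * keep.view(1, C, 1, 1).to(feats.dtype)    # zero out the masked channels
    recon = decoder(encoder(visible))                          # reconstruct all channels from the visible ones
    masked = ~keep
    return F.mse_loss(recon[:, masked], feats[:, masked])      # supervise only the masked channels
```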