In hashing-based long-tailed image retrieval, the dominance of data-rich head classes often hinders the learning of effective hash codes for data-poor tail classes due to the inherent long-tailed bias. Interestingly, this bias also contains valuable prior knowledge by revealing inter-class dependencies, which can benefit hash learning. However, previous methods have not thoroughly analyzed these tangled negative and positive effects of long-tailed bias from a causal inference perspective.
IEEE Trans Image Process
January 2025
Edge sensor devices generate vast amounts of user data, but centralized processing poses privacy risks. Federated Learning addresses this by decentralizing training. However, applying Federated Learning directly to skeleton videos fails to preserve motion dynamics and suffers from client heterogeneity bias.
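As background for the decentralized setting described above, the standard FedAvg aggregation step can be sketched as follows. This is a minimal sketch of generic federated averaging, not the method of this article; model parameters are represented as plain Python lists for illustration, and the client data are hypothetical.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]], client_sizes: Sequence[int]) -> List[float]:
    """Weighted average of client parameter vectors (FedAvg aggregation step).

    Each client trains locally and uploads only its parameters; the server
    never receives the raw user data (e.g. skeleton videos).
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    global_params = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        weight = n / total  # clients with more local samples count more
        for i in range(dim):
            global_params[i] += weight * params[i]
    return global_params


# Two hypothetical clients with unequal amounts of local data.
print(fedavg([[1.0, 2.0], [3.0, 6.0]], [1, 3]))  # [2.5, 5.0]
```

When client data distributions differ, this plain weighted average can be pulled toward the data-rich clients, which is one way client heterogeneity can bias the global model.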
Front Bioeng Biotechnol
March 2025
Mannitol is a valuable sugar alcohol used extensively across various industries. Cyanobacteria show potential as future platforms for mannitol production, utilizing CO₂ and solar energy directly. The proof of concept has been demonstrated by introducing a two-step pathway in cyanobacteria that converts fructose-6-phosphate to mannitol-1-phosphate and then to mannitol.
IEEE Trans Pattern Anal Mach Intell
June 2025
Multi-modal learning aims to enhance performance by unifying models from various modalities, but on real data it often faces the "modality imbalance" problem: training becomes biased toward dominant modalities and neglects the others, which limits its overall effectiveness. To address this challenge, the core idea is to balance the optimization of each modality to achieve a joint optimum. Existing approaches often employ a modal-level control mechanism that adjusts the update of each modality's parameters.
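A modal-level control mechanism of the kind described above can be sketched as one scalar coefficient per modality that damps the updates of modalities already ahead of the weakest one. This is a minimal sketch: the per-modality scores, the `1 - tanh(alpha * (ratio - 1))` damping rule, and the function names are assumptions for illustration, not the article's method.

```python
import math
from typing import Dict, List


def modality_coefficients(scores: Dict[str, float], alpha: float = 1.0) -> Dict[str, float]:
    """Per-modality gradient scaling coefficients (modal-level control sketch).

    scores[m] is a hypothetical measure of how well modality m alone fits the
    batch (e.g. mean confidence on the true class). Modalities ahead of the
    weakest one get damped updates; the weakest keeps coefficient 1.0.
    """
    weakest = min(scores.values())
    if weakest <= 0:
        raise ValueError("scores must be positive")
    coeffs = {}
    for m, s in scores.items():
        ratio = s / weakest  # >= 1 by construction
        coeffs[m] = 1.0 - math.tanh(alpha * (ratio - 1.0))
    return coeffs


def scale_gradients(grads: Dict[str, List[float]], coeffs: Dict[str, float]) -> Dict[str, List[float]]:
    """Multiply every gradient entry of modality m by its coefficient."""
    return {m: [coeffs[m] * g for g in gs] for m, gs in grads.items()}


coeffs = modality_coefficients({"audio": 0.8, "visual": 0.4})
print(coeffs)  # audio damped to 1 - tanh(1), visual unchanged at 1.0
```

Every parameter of a modality shares one coefficient here, which is exactly the modal-level granularity the abstract attributes to existing approaches.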
IEEE Trans Pattern Anal Mach Intell
July 2025
Effectively exploring spatial and temporal information is important for video deblurring. In contrast to existing methods that directly align adjacent frames without discrimination, we develop a deep discriminative spatial and temporal network to facilitate spatial and temporal feature exploration for better video deblurring. We first develop a channel-wise gated dynamic network to adaptively explore the spatial information.
IEEE Trans Pattern Anal Mach Intell
July 2025
Recent years have witnessed significant advances in image deraining due to the progress of effective image priors and deep learning models. As each deraining approach has individual settings (e.g.
Effective visual representation is crucial for the image captioning task. Among existing methods, grid-based visual encoding methods take fragmented features extracted from the entire image as input and therefore lack fine-grained semantic information focused on salient objects. To address this issue, we propose an effective method, namely the Multi-Level Semantic-Aware Transformer (MLSAT), for image captioning, which simultaneously focuses on contextual details and high-level semantic information centered on salient objects.
In this paper, we propose a novel visual relation detection task, named Group Visual Relation Detection (GVRD), for detecting visual relations whose subjects and/or objects are groups (GVRs), inspired by the observation that groups are common in image semantic representation. GVRD can be regarded as an extension of the existing visual relation detection task, which restricts both the subjects and objects of visual relations to individuals. We propose a Simultaneous Group Relation Prediction (SGRP) method that simultaneously predicts groups and predicates to address GVRD.
IEEE Trans Med Imaging
May 2025
Medical image segmentation demands the aggregation of global and local feature representations, which challenges current methods in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexity, excelling at long-range feature interactions with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens, and they are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information.
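Why direct flattening hurts short-range locality can be seen with a toy calculation: in a row-major (raster) scan of an H x W token grid, a token's horizontal neighbour sits next to it in the 1-D sequence, but its vertical neighbour is W positions away. This is a minimal sketch assuming a plain raster scan; it does not model any particular ViM scanning scheme.

```python
from typing import List, Tuple


def raster_flatten(height: int, width: int) -> List[Tuple[int, int]]:
    """Row-major (raster) scan order of an H x W token grid, as produced when
    spatial tokens are flattened into a 1-D sequence for a sequence model."""
    return [(r, c) for r in range(height) for c in range(width)]


def sequence_gap(height: int, width: int, a: Tuple[int, int], b: Tuple[int, int]) -> int:
    """Distance in the flattened sequence between two grid positions."""
    order = {pos: i for i, pos in enumerate(raster_flatten(height, width))}
    return abs(order[a] - order[b])


H, W = 14, 14  # e.g. a 224x224 image split into 16x16 patches
print(sequence_gap(H, W, (5, 5), (5, 6)))  # horizontal neighbour: 1
print(sequence_gap(H, W, (5, 5), (6, 5)))  # vertical neighbour: 14 (= W)
```

Spatially adjacent tokens can therefore be far apart in the scanned sequence, which is the loss of short-range dependencies the abstract points to.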
IEEE Trans Pattern Anal Mach Intell
December 2024
RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between RGB and Thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities, thereby leading to suboptimal performance in complex scenarios.
IEEE Trans Image Process
December 2024
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Cross-modal matching rests on the assumption of modal balance, where each modality contains sufficient information to represent the others.
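Once both modalities are embedded in a latent common space, retrieval reduces to ranking items of one modality by their similarity to a query from the other. This is a minimal sketch, assuming the encoders already exist and produce same-dimensional vectors; the embeddings below are made-up toy values, and cosine similarity is one common choice of matching score.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors in the common space."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def retrieve(query_emb: Sequence[float], gallery_embs: Sequence[Sequence[float]], k: int = 1) -> List[int]:
    """Indices of the top-k gallery items (other modality) ranked by cosine similarity."""
    scores = [cosine(query_emb, g) for g in gallery_embs]
    return sorted(range(len(gallery_embs)), key=lambda i: scores[i], reverse=True)[:k]


text_query = [0.9, 0.1, 0.0]        # hypothetical text embedding
image_gallery = [[0.0, 1.0, 0.0],   # hypothetical image embeddings
                 [1.0, 0.2, 0.0],
                 [0.0, 0.0, 1.0]]
print(retrieve(text_query, image_gallery, k=2))  # [1, 0]
```

If one modality carries less information than the other, no choice of score can recover what its embedding never encoded, which is why the modal-balance assumption matters.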
Concrete is the most widely used and highest-volume basic material in the world today. Enhancing its toughness, including tensile strength and deformation resistance, can boost structural load-bearing capacity, minimize cracking, and reduce the amount of concrete and steel required in engineering projects. These advancements are crucial for the safety, durability, energy efficiency, and emission reduction of structural engineering.
IEEE Trans Pattern Anal Mach Intell
February 2025
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely-studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single modality representations and a decoder for multimodal conditional text generation.
IEEE Trans Image Process
August 2024
Fine-grained visual classification aims to classify similar sub-categories with the challenges of large variations within the same sub-category and high visual similarities between different sub-categories. Recently, methods that extract semantic parts of the discriminative regions have attracted increasing attention. However, most existing methods extract the part features via rectangular bounding boxes by object detection module or attention mechanism, which makes it difficult to capture the rich shape information of objects.
Biological materials relying on hierarchically ordered architectures inspire advanced composites that combine normally mutually exclusive mechanical properties, but efficient topology optimization and large-scale manufacturing remain challenging. Herein, this work proposes a scalable bottom-up approach to fabricate a novel nacre-like cement-resin composite with a gradient brick-and-mortar (BM) structure, and demonstrates a machine learning-assisted method to optimize the gradient structure. The fabricated gradient composite exhibits an extraordinary combination of high flexural strength, toughness, and impact resistance.
IEEE Trans Neural Netw Learn Syst
April 2025
Knowledge distillation-based anomaly detection (KDAD) methods rely on the teacher-student paradigm to detect and segment anomalous regions by contrasting the unique features extracted by both networks. However, existing KDAD methods suffer from two main limitations: 1) the student network can effortlessly replicate the teacher network's representations and 2) the features of the teacher network serve solely as a "reference standard" and are not fully leveraged. Toward this end, we depart from the established paradigm and instead propose an innovative approach called asymmetric distillation postsegmentation (ADPS).
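In the conventional teacher-student paradigm that ADPS departs from, the anomaly map is typically computed from the per-location discrepancy between teacher and student features. This is a minimal sketch of that conventional scoring, assuming `1 - cosine similarity` as the discrepancy; feature maps are nested lists of shape [H][W][C] with toy values, and none of this is the ADPS method itself.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # [H][W][C]


def anomaly_map(teacher: FeatureMap, student: FeatureMap) -> List[List[float]]:
    """Per-location anomaly score 1 - cos(teacher, student).

    On normal data the student has learned to mimic the teacher, so the score
    stays near 0; where the student fails to mimic the teacher it rises.
    """
    out = []
    for t_row, s_row in zip(teacher, student):
        row = []
        for t, s in zip(t_row, s_row):
            dot = sum(a * b for a, b in zip(t, s))
            norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in s))
            row.append(1.0 - dot / norm if norm > 0 else 1.0)
        out.append(row)
    return out


teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[1.0, 0.0], [1.0, 0.0]]]  # second location deviates from the teacher
print(anomaly_map(teacher, student))  # [[0.0, 1.0]]
```

This scoring also shows limitation 1) from the abstract: if the student replicates the teacher even on anomalous regions, the map stays near 0 everywhere and the defect goes undetected.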
IEEE Trans Pattern Anal Mach Intell
August 2024
Effectively exploring the colors of exemplars and propagating them to colorize each frame is vital for exemplar-based video colorization. In this article, we present BiSTNet, which explores the colors of exemplars and uses them for video colorization through bidirectional temporal feature fusion guided by a semantic image prior. We first establish the semantic correspondence between each frame and the exemplars in deep feature space to explore color information from exemplars.
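The correspondence step can be illustrated with a generic soft-matching scheme: each frame location compares its deep feature with every exemplar feature and takes a softmax-weighted average of the exemplar colours. This is a minimal sketch, assuming cosine similarity, a hypothetical temperature `tau`, and chrominance values such as Lab (a, b); it omits BiSTNet's bidirectional temporal fusion and is not the article's implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _unit(v: Vector) -> List[float]:
    """L2-normalise a feature vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def propagate_colors(frame_feats: Sequence[Vector], exemplar_feats: Sequence[Vector],
                     exemplar_colors: Sequence[Vector], tau: float = 0.01) -> List[List[float]]:
    """Soft semantic correspondence: each frame location takes a softmax-weighted
    average of exemplar colours, weighted by cosine similarity of deep features."""
    ex = [_unit(e) for e in exemplar_feats]
    out = []
    for f in frame_feats:
        fu = _unit(f)
        sims = [sum(a * b for a, b in zip(fu, e)) / tau for e in ex]
        m = max(sims)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in sims]
        z = sum(w)
        out.append([sum(wi * c[k] for wi, c in zip(w, exemplar_colors)) / z
                    for k in range(len(exemplar_colors[0]))])
    return out


# Hypothetical 2-D features and (a, b) chrominance values for two exemplar regions.
exemplar_feats = [[1.0, 0.0], [0.0, 1.0]]
exemplar_colors = [[10.0, 0.0], [0.0, 10.0]]
# First location borrows (close to) colour 0, second borrows (close to) colour 1.
print(propagate_colors([[1.0, 0.05], [0.02, 1.0]], exemplar_feats, exemplar_colors))
```

A small `tau` makes the matching nearly hard (close to nearest-neighbour copying); a larger one blends colours from several exemplar regions.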
IEEE Trans Image Process
February 2024
The image-level label has prevailed in weakly supervised semantic segmentation tasks because it is easy to obtain. Since image-level labels indicate only the presence or absence of specific object categories, visualization-based techniques have been widely adopted to provide object location clues. Because class activation maps (CAMs) can locate only the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more complete object localization.
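For reference, a standard class activation map for class c is the classifier-weighted sum of the last convolutional feature maps, M_c(x, y) = sum_k w_{c,k} f_k(x, y). The sketch below computes it on toy values and thresholds the normalised map; the feature values and the 0.5 threshold are assumptions for illustration. It shows why CAMs cover only the most discriminative part: a weakly activated object region falls below the threshold.

```python
from typing import List, Sequence


def class_activation_map(features: Sequence[Sequence[Sequence[float]]],
                         class_weights: Sequence[float]) -> List[List[float]]:
    """CAM: M_c(x, y) = sum_k w_{c,k} * f_k(x, y).

    features: K feature maps, each H x W (output of the last conv layer).
    class_weights: the K classifier weights of one class c.
    """
    K = len(features)
    H, W = len(features[0]), len(features[0][0])
    return [[sum(class_weights[k] * features[k][y][x] for k in range(K))
             for x in range(W)] for y in range(H)]


def localize(cam: List[List[float]], threshold: float = 0.5) -> List[List[bool]]:
    """Min-max normalise the CAM and keep positions at or above the threshold."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span >= threshold for v in row] for row in cam]


feats = [[[0.0, 1.0, 4.0]],  # channel 0: strong on the most discriminative part
         [[0.0, 1.0, 1.0]]]  # channel 1: weak on the rest of the object
cam = class_activation_map(feats, [1.0, 0.5])
print(cam)            # [[0.0, 1.5, 4.5]]
print(localize(cam))  # [[False, False, True]]: the middle object region is missed
```

Expansion strategies aim to recover regions like the middle position here, whose activation is real but small relative to the peak.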
IEEE Trans Image Process
December 2023
Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-objects" pairs between the training and testing sets. Previous works for CAR usually introduce extra information (e.
IEEE Trans Pattern Anal Mach Intell
May 2024
Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for the targets that have the same category as others. However, most previous methods underestimate such information.
IEEE Trans Pattern Anal Mach Intell
April 2024
Stereo matching is a fundamental building block for many vision and robotics applications. An informative and concise cost volume representation is vital for stereo matching of high accuracy and efficiency. In this article, we present a novel cost volume construction method, named attention concatenation volume (ACV), which generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume.
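The idea of using correlation clues to filter a concatenation volume can be sketched on a single image row. This is a minimal sketch under stated assumptions: it uses plain (non-grouped) correlation and a fixed sigmoid of the correlation as the attention weight, whereas ACV learns the attention weights with a network; function names are hypothetical.

```python
import math
from typing import List

Row = List[List[float]]  # [W][C] features along one image row


def correlation_volume(left: Row, right: Row, max_disp: int) -> List[List[float]]:
    """corr[d][x] = mean over channels of left[x] * right[x - d] (0 where x - d < 0)."""
    W, C = len(left), len(left[0])
    return [[sum(a * b for a, b in zip(left[x], right[x - d])) / C if x - d >= 0 else 0.0
             for x in range(W)] for d in range(max_disp)]


def concat_volume(left: Row, right: Row, max_disp: int) -> List[List[List[float]]]:
    """concat[d][x] = left[x] followed by right[x - d] (zeros where out of range)."""
    W, C = len(left), len(left[0])
    return [[left[x] + (right[x - d] if x - d >= 0 else [0.0] * C)
             for x in range(W)] for d in range(max_disp)]


def attention_concat_volume(left: Row, right: Row, max_disp: int,
                            scale: float = 4.0) -> List[List[List[float]]]:
    """Weight each concatenation entry by an attention value derived from the
    correlation volume (a fixed sigmoid stands in for ACV's learned weights)."""
    corr = correlation_volume(left, right, max_disp)
    concat = concat_volume(left, right, max_disp)
    acv = []
    for d in range(max_disp):
        plane = []
        for x in range(len(left)):
            weight = 1.0 / (1.0 + math.exp(-scale * corr[d][x]))  # attention in (0, 1)
            plane.append([weight * v for v in concat[d][x]])
        acv.append(plane)
    return acv


# Toy row where the right image is the left image shifted by one pixel (disparity 1).
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
right = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
acv = attention_concat_volume(left, right, max_disp=2)
print(acv[1][2][0] > acv[0][2][0])  # True: the correct disparity is emphasised
```

The concatenation volume keeps full feature content while the correlation-derived weights suppress entries at disparities that do not match, which is the division of labour the abstract describes.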
Materials (Basel)
November 2023
Alite dissolution plays a crucial role in cement hydration. However, quantitative investigations into alite powder dissolution are limited, especially regarding the influence of chemical admixtures. This study investigates the impact of particle size, temperature, saturation level, and mixing speed on alite powder dissolution rate, considering the real-time evolution of specific surface area during the alite powder dissolution process.
IEEE Trans Neural Netw Learn Syst
January 2025
This article proposes a new hashing framework named relational consistency induced self-supervised hashing (RCSH) for large-scale image retrieval. To capture the potential semantic structure of data, RCSH explores the relational consistency between data samples in different spaces, which learns reliable data relationships in the latent feature space and then preserves the learned relationships in the Hamming space. The data relationships are uncovered by learning a set of prototypes that group similar data samples in the latent feature space.
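The relationship-then-preserve idea can be illustrated in a few lines: samples assigned to the same prototype in the latent space are treated as related, and their binary codes should then be close in Hamming distance. This is a minimal sketch, assuming fixed prototypes, nearest-prototype assignment, and the sign of the latent vector as the hash code; RCSH learns prototypes and a hashing network, and all names and values here are hypothetical.

```python
from typing import List, Sequence


def nearest_prototype(x: Sequence[float], prototypes: Sequence[Sequence[float]]) -> int:
    """Index of the closest prototype under squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in prototypes]
    return dists.index(min(dists))


def hash_code(x: Sequence[float]) -> List[int]:
    """Binarise a real-valued vector into a {-1, +1} code via sign."""
    return [1 if v >= 0 else -1 for v in x]


def hamming(b1: Sequence[int], b2: Sequence[int]) -> int:
    """Number of differing bits between two codes."""
    return sum(1 for a, b in zip(b1, b2) if a != b)


def relation_matrix(latents: Sequence[Sequence[float]],
                    prototypes: Sequence[Sequence[float]]) -> List[List[int]]:
    """S[i][j] = 1 if samples i and j fall in the same prototype group, else 0."""
    groups = [nearest_prototype(z, prototypes) for z in latents]
    n = len(latents)
    return [[int(groups[i] == groups[j]) for j in range(n)] for i in range(n)]


prototypes = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
latents = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [-0.7, -0.9, -0.8], [-0.8, -0.6, -0.9]]
S = relation_matrix(latents, prototypes)
codes = [hash_code(z) for z in latents]
print(S[0][1], hamming(codes[0], codes[1]))  # related pair: 1 0
print(S[0][2], hamming(codes[0], codes[2]))  # unrelated pair: 0 3
```

Relational consistency holds in this toy case: pairs related in the latent space are close in Hamming space and unrelated pairs are far apart.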
IEEE Trans Image Process
November 2023
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning but lack multi-modal correspondence information. Vision-language pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address this limitation.
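CLIP obtains its multi-modal correspondence from a symmetric contrastive objective over a batch of matched image-text pairs: the i-th image and i-th text are positives, and every other pairing in the batch is a negative. This is a minimal pure-Python sketch of that loss on toy embeddings; the fixed temperature of 0.07 is an assumption (CLIP learns it), and nothing here is specific to TIReID or to the article's method.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        total += log_z - row[i]
    return total / len(logits)


def clip_loss(image_embs: Sequence[Sequence[float]], text_embs: Sequence[Sequence[float]],
              temperature: float = 0.07) -> float:
    """Symmetric contrastive (InfoNCE) loss: average of image-to-text and
    text-to-image cross-entropies over the batch similarity matrix."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts] for im in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.1], [0.1, 1.0]]
texts_swapped = [[0.1, 1.0], [1.0, 0.1]]
print(clip_loss(images, texts_matched) < clip_loss(images, texts_swapped))  # True
```

Minimising this loss pulls matched image-text pairs together in the shared space, which is the correspondence prior that single-modality pre-training lacks.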