Publications by authors named "Jinhui Tang"

In hashing-based long-tailed image retrieval, the dominance of data-rich head classes often hinders the learning of effective hash codes for data-poor tail classes due to inherent long-tailed bias. Interestingly, this bias also contains valuable prior knowledge, as it reveals inter-class dependencies that can benefit hash learning. However, previous methods have not thoroughly analyzed these tangled negative and positive effects of long-tailed bias from a causal inference perspective.


Edge sensor devices generate vast amounts of user data, but centralized processing poses privacy risks. Federated Learning addresses this by decentralizing training. However, applying Federated Learning directly to skeleton videos fails to preserve motion dynamics and suffers from client heterogeneity bias.


Mannitol is a valuable sugar alcohol, extensively used across various industries. Cyanobacteria show potential as future platforms for mannitol production, utilizing CO2 and solar energy directly. The proof-of-concept has been demonstrated by introducing a two-step pathway in cyanobacteria that converts fructose-6-phosphate to mannitol-1-phosphate and subsequently to mannitol.


Multi-modal learning aims to enhance performance by unifying models from various modalities, but it often faces the "modality imbalance" problem in real data: a bias towards dominant modalities that neglects the others and limits overall effectiveness. To address this challenge, the core idea is to balance the optimization of each modality to achieve a joint optimum. Existing approaches often employ a modal-level control mechanism to adjust the parameter updates of each modality.
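As an illustration of such a modal-level control mechanism, the sketch below scales each modality's gradient step by how under-optimized it is relative to the others. The loss-ratio heuristic, the clipping bounds, and all function names are assumptions for illustration, not the paper's algorithm:

```python
import numpy as np

def modality_coefficients(losses):
    """Per-modality update coefficients: modalities with lower loss
    (dominant ones) get their updates damped, weaker ones boosted."""
    losses = np.asarray(losses, dtype=float)
    ratios = losses / losses.mean()      # >1 means under-optimized
    return np.clip(ratios, 0.5, 2.0)     # bound the modulation

def modulated_step(params, grads, losses, lr=0.1):
    """One gradient step per modality, scaled by its coefficient."""
    coeffs = modality_coefficients(losses)
    return [p - lr * c * g for p, g, c in zip(params, grads, coeffs)]
```

With two modalities at losses 1.0 and 3.0, the first (dominant) modality's step is halved while the lagging one is boosted, nudging both towards a joint optimum.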


How to effectively explore spatial and temporal information is important for video deblurring. In contrast to existing methods that directly align adjacent frames without discrimination, we develop a deep discriminative spatial and temporal network to facilitate the spatial and temporal feature exploration for better video deblurring. We first develop a channel-wise gated dynamic network to adaptively explore the spatial information.
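A channel-wise gate of the kind mentioned above can be sketched as follows. This is a minimal NumPy illustration; the squeeze-then-gate design and all names are assumptions, not the paper's actual network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(feat, w, b):
    """Channel-wise gating: squeeze the spatial dims, predict a
    per-channel gate in (0, 1), and rescale the feature map.
    feat: (C, H, W); w: (C, C); b: (C,)."""
    squeeze = feat.mean(axis=(1, 2))      # (C,) global average pool
    gate = sigmoid(w @ squeeze + b)       # (C,) per-channel weights
    return feat * gate[:, None, None]     # broadcast over H, W
```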


Recent years have witnessed significant advances in image deraining due to the progress of effective image priors and deep learning models. As each deraining approach has individual settings (e.g.


Effective visual representation is crucial for image captioning task. Among the existing methods, the grid-based visual encoding methods take fragmented features extracted from the entire image as input, lacking the fine-grained semantic information focused on salient objects. To address this issue, we propose an effective method, namely Multi-Level Semantic-Aware Transformer (MLSAT) for image captioning, to simultaneously focus on contextual details and high-level semantic information centered on salient objects.


In this paper, we propose a novel visual relation detection task, named Group Visual Relation Detection (GVRD), for detecting visual relations whose subjects and/or objects are groups (GVRs), inspired by the observation that groups are common in image semantic representation. GVRD can be deemed an evolution of the existing visual relation detection task, which restricts both subjects and objects of visual relations to individuals. To address GVRD, we propose a Simultaneous Group Relation Prediction (SGRP) method that can simultaneously predict groups and predicates.


Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling in long-range feature iterations with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information.
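The fixed scanning patterns referred to above simply define different orders for flattening a 2-D token grid into a 1-D sequence. A toy enumeration (illustrative only, not the ViM variants used in the paper):

```python
import numpy as np

def scan_orders(h, w):
    """Enumerate several 1-D scanning orders over an h x w token
    grid, as used to flatten spatial tokens for a sequence model."""
    idx = np.arange(h * w).reshape(h, w)
    return {
        "row_major": idx.reshape(-1),
        "row_major_rev": idx.reshape(-1)[::-1],
        "col_major": idx.T.reshape(-1),
        "col_major_rev": idx.T.reshape(-1)[::-1],
    }
```

Note how row-major scanning places horizontally adjacent tokens next to each other in the sequence but separates vertical neighbours by `w` positions, which is exactly why a single fixed pattern loses short-range local dependencies.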


RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between RGB and Thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities, thereby leading to suboptimal performance in complex scenarios.


Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others.


Concrete is the most widely used and highest-volume basic material in the world today. Enhancing its toughness, including tensile strength and deformation resistance, can boost structural load-bearing capacity, minimize cracking, and decrease the amount of concrete and steel required in engineering projects. These advancements are crucial for the safety, durability, energy efficiency, and emission reduction of structural engineering.


In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely-studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single modality representations and a decoder for multimodal conditional text generation.


Fine-grained visual classification aims to classify similar sub-categories with the challenges of large variations within the same sub-category and high visual similarities between different sub-categories. Recently, methods that extract semantic parts of the discriminative regions have attracted increasing attention. However, most existing methods extract the part features via rectangular bounding boxes by object detection module or attention mechanism, which makes it difficult to capture the rich shape information of objects.


Biological materials built on hierarchically ordered architectures inspire advanced composites that combine mutually exclusive mechanical properties, but efficient topology optimization and large-scale manufacturing remain challenging. Herein, this work proposes a scalable bottom-up approach to fabricate a novel nacre-like cement-resin composite with a gradient brick-and-mortar (BM) structure, and demonstrates a machine learning-assisted method to optimize the gradient structure. The fabricated gradient composite exhibits an extraordinary combination of high flexural strength, toughness, and impact resistance.


Knowledge distillation-based anomaly detection (KDAD) methods rely on the teacher-student paradigm to detect and segment anomalous regions by contrasting the unique features extracted by both networks. However, existing KDAD methods suffer from two main limitations: 1) the student network can effortlessly replicate the teacher network's representations and 2) the features of the teacher network serve solely as a "reference standard" and are not fully leveraged. Toward this end, we depart from the established paradigm and instead propose an innovative approach called asymmetric distillation postsegmentation (ADPS).
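The teacher-student contrast underlying KDAD methods can be illustrated by scoring each spatial location with the discrepancy between teacher and student features: where the student fails to replicate the teacher, the region is likely anomalous. This is a generic sketch of the paradigm, not ADPS itself:

```python
import numpy as np

def anomaly_map(teacher_feat, student_feat):
    """Per-location anomaly score as the cosine distance between
    teacher and student feature vectors. feats: (C, H, W)."""
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=0, keepdims=True) + 1e-8)
    s = student_feat / (np.linalg.norm(student_feat, axis=0, keepdims=True) + 1e-8)
    return 1.0 - (t * s).sum(axis=0)   # (H, W); higher = more anomalous
```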

Article Synopsis
  • A rare case of diffuse large B-cell lymphoma (DLBCL) involved both the peripheral and central nervous systems, confirmed through pathology with early symptoms of dyspnoea and hyperventilation.
  • The patient experienced fatigue, limb pain, and worsening breathlessness, leading to ventilator support after ineffective initial treatment for what was suspected to be Guillain-Barré syndrome.
  • Diagnosis was complicated by non-specific early signs; after chemotherapy the patient improved briefly but later died of pneumonia, underscoring the poor prognosis associated with nervous system involvement in DLBCL.

How to effectively explore the colors of exemplars and propagate them to colorize each frame is vital for exemplar-based video colorization. In this article, we present BiSTNet, which explores the colors of exemplars and utilizes them for video colorization through bidirectional temporal feature fusion guided by a semantic image prior. We first establish the semantic correspondence between each frame and the exemplars in deep feature space to explore color information from exemplars.
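The semantic correspondence step can be sketched as a nearest-neighbour search in deep feature space: each frame location is matched to the most similar exemplar location, whose color can then be propagated. This is illustrative only; BiSTNet's actual matching is more elaborate:

```python
import numpy as np

def semantic_correspondence(frame_feat, exemplar_feat):
    """For every frame position, find the exemplar position with the
    highest cosine similarity in deep feature space.
    feats: (C, N) with N flattened spatial positions."""
    f = frame_feat / (np.linalg.norm(frame_feat, axis=0, keepdims=True) + 1e-8)
    e = exemplar_feat / (np.linalg.norm(exemplar_feat, axis=0, keepdims=True) + 1e-8)
    sim = f.T @ e                # (N_frame, N_exemplar)
    return sim.argmax(axis=1)    # best-matching exemplar index per position
```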


The image-level label has prevailed in weakly supervised semantic segmentation tasks due to its easy availability. Since image-level labels can only indicate the existence or absence of specific categories of objects, visualization-based techniques have been widely adopted to provide object location clues. Considering class activation maps (CAMs) can only locate the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more integral object localization.
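The class activation maps mentioned above are computed by weighting the final convolutional feature maps with the classifier weights of the target class, which is why they highlight only the most discriminative part. A minimal sketch (the normalization and names are illustrative assumptions):

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Vanilla CAM: weight the final conv feature maps (C, H, W) by
    the classifier weights (num_classes, C) of the target class,
    then ReLU and normalize to [0, 1]."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```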


Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-objects" pairs between the training and testing sets. Previous works for CAR usually introduce extra information (e.


Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for the targets that have the same category as others. However, most previous methods underestimate such information.


Stereo matching is a fundamental building block for many vision and robotics applications. An informative and concise cost volume representation is vital for stereo matching of high accuracy and efficiency. In this article, we present a novel cost volume construction method, named attention concatenation volume (ACV), which generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume.
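A toy 1-D version of the idea behind ACV, in which correlation clues between left and shifted right features produce attention weights that modulate a concatenation volume. All details here are simplifications for illustration, not the paper's construction:

```python
import numpy as np

def attention_concat_volume(left, right, max_disp):
    """Toy attention concatenation volume for 1-D stereo features.
    left, right: (C, W). For each disparity d, the correlation
    between left and shifted right features yields an attention
    weight that rescales the concatenated feature pair."""
    C, W = left.shape
    volume = np.zeros((max_disp, 2 * C, W))
    for d in range(max_disp):
        shifted = np.zeros_like(right)
        shifted[:, d:] = right[:, : W - d]
        corr = (left * shifted).sum(axis=0) / np.sqrt(C)  # correlation clue
        attn = 1.0 / (1.0 + np.exp(-corr))                # sigmoid attention
        volume[d] = np.concatenate([left, shifted], axis=0) * attn
    return volume
```

Positions where the correlation is high keep strong concatenated features, while uncorrelated (likely mismatched) positions are suppressed, mirroring how attention weights filter redundant information.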


Alite dissolution plays a crucial role in cement hydration. However, quantitative investigations into alite powder dissolution are limited, especially regarding the influence of chemical admixtures. This study investigates the impact of particle size, temperature, saturation level, and mixing speed on the dissolution rate of alite powder, accounting for the real-time evolution of specific surface area during dissolution.


This article proposes a new hashing framework named relational consistency induced self-supervised hashing (RCSH) for large-scale image retrieval. To capture the potential semantic structure of data, RCSH explores the relational consistency between data samples in different spaces, which learns reliable data relationships in the latent feature space and then preserves the learned relationships in the Hamming space. The data relationships are uncovered by learning a set of prototypes that group similar data samples in the latent feature space.
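The relational consistency idea can be sketched as matching pairwise cosine similarities between the latent feature space and (relaxed) hash codes in [-1, 1]. This is a generic illustration of similarity preservation; RCSH's prototype learning is not reproduced here:

```python
import numpy as np

def relation_matrix(x):
    """Pairwise cosine similarities of row vectors."""
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    return x @ x.T

def consistency_loss(features, codes):
    """Mean squared gap between feature-space relations and the
    relations of relaxed hash codes; driving this to zero preserves
    the learned relationships in the Hamming space."""
    return float(np.mean((relation_matrix(features) - relation_matrix(codes)) ** 2))
```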


Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address this limitation.
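Once text and images are embedded in a CLIP-style shared space, retrieval reduces to ranking candidates by cosine similarity. A minimal sketch with precomputed embeddings (the encoder calls are omitted and all names are assumptions):

```python
import numpy as np

def retrieve(text_emb, image_embs, top_k=1):
    """Rank candidate images by cosine similarity with the text
    query embedding and return the indices of the top-k matches.
    text_emb: (D,); image_embs: (N, D)."""
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    imgs = image_embs / (np.linalg.norm(image_embs, axis=1, keepdims=True) + 1e-8)
    scores = imgs @ t
    return np.argsort(-scores)[:top_k]
```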
