Publications by authors named "Lizhuang Ma"

Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet, both RGB and 3D data are crucial for anomaly detection, and datasets are seldom completely clean in practical scenarios. To address these challenges, this paper delves into RGB-3D multi-modal noisy anomaly detection and proposes a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP.
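As a rough illustration of the kind of feature matching such anomaly detectors build on (this is a generic memory-bank nearest-neighbor sketch, not the M3DM-NR algorithm itself; all names are hypothetical):

```python
import numpy as np

def anomaly_scores(bank: np.ndarray, patches: np.ndarray) -> np.ndarray:
    """Score each test patch feature by its distance to the nearest
    feature in a memory bank of normal patches (higher = more anomalous)."""
    # Pairwise Euclidean distances: shape (n_test, n_bank)
    d = np.linalg.norm(patches[:, None, :] - bank[None, :, :], axis=-1)
    return d.min(axis=1)

# Toy example: a bank of "normal" features near the origin,
# one test patch close to the bank and one far away.
rng = np.random.default_rng(0)
bank = rng.normal(0.0, 0.1, size=(100, 8))
patches = np.stack([np.zeros(8), np.full(8, 5.0)])
scores = anomaly_scores(bank, patches)
assert scores[1] > scores[0]  # the distant patch scores as more anomalous
```

In multi-modal settings the same scoring would run over fused RGB and 3D features rather than a single feature space.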

Generalization under distribution shifts has been a great challenge in computer vision. The prevailing practice of directly employing one-hot labels as the training targets in domain generalization (DG) can lead to gradient conflicts, making it insufficient for capturing the intrinsic class characteristics and hard to increase intra-class variation. Besides, existing DG methods mostly overlook the distinct contributions of the source (seen) domains, resulting in uneven learning from these domains.

Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution.

Real-world image super-resolution (RISR) has received increased focus for improving the quality of SR images under unknown complex degradation. Existing methods rely on the heavy SR models to enhance low-resolution (LR) images of different degradation levels, which significantly restricts their practical deployments on resource-limited devices. In this paper, we propose a novel Dynamic Channel Splitting scheme for efficient Real-world Image Super-Resolution, termed DCS-RISR.

Night-time scene parsing aims to extract pixel-level semantic information in night images, aiding downstream tasks in understanding scene object distribution. Due to the limited availability of labeled night-image datasets, unsupervised domain adaptation (UDA) has become the predominant method for studying night scenes. UDA typically relies on day-night image pairs to guide adaptation, but this approach hampers dataset construction and restricts generalization across night scenes in different datasets.

Due to the unsatisfactory performance of supervised methods on unpaired real-world scans, point cloud completion via cross-domain adaptation has recently drawn growing attention. Nevertheless, previous approaches focus only on alleviating the distribution shift through domain alignment, resulting in massive information loss of real-world domain data. To tackle this issue, we propose a dual mixup-induced consistency regularization that integrates both the source and target domains to improve robustness and generalization capability.
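The mixup-induced consistency idea can be sketched generically: predictions on a mixed input are encouraged to match the mix of the individual predictions. This is an illustrative toy, not the paper's dual-mixup formulation:

```python
import numpy as np

def mixup_points(p: np.ndarray, q: np.ndarray, lam: float) -> np.ndarray:
    """Linearly interpolate two point clouds of equal size (N, 3)."""
    return lam * p + (1.0 - lam) * q

def consistency_loss(f, p, q, lam):
    """Penalise the model f when its prediction on the mixed input
    differs from the mix of its predictions on the two inputs."""
    mixed_pred = f(mixup_points(p, q, lam))
    target = lam * f(p) + (1.0 - lam) * f(q)
    return float(np.mean((mixed_pred - target) ** 2))

# A linear "model" satisfies the consistency exactly, so the loss is ~0;
# for a real network the loss is nonzero and acts as a regularizer.
W = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 2.0], [1.0, 0.0, 1.0]])
f = lambda x: x @ W
rng = np.random.default_rng(1)
p, q = rng.normal(size=(64, 3)), rng.normal(size=(64, 3))
assert consistency_loss(f, p, q, 0.3) < 1e-12
```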

Information Bottleneck (IB) provides an information-theoretic principle for multi-view learning by revealing the various components contained in each view. This highlights the necessity of capturing their distinct roles to achieve view-invariant and predictive representations, which remains under-explored due to the technical intractability of modeling and organizing innumerable mutual information (MI) terms. Recent studies show that sufficiency and consistency play such key roles in multi-view representation learning, and could be preserved via a variational distillation framework.
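For reference, one common form of the classical IB objective trades compression of the input X against prediction of the target Y through the representation Z:

```latex
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(Z; X) \;-\; \beta \, I(Z; Y)
```

Multi-view extensions introduce one such trade-off per view, which is why the number of MI terms grows quickly and motivates the distillation treatment described above.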

In contrast to the traditional avatar creation pipeline which is a costly process, contemporary generative approaches directly learn the data distribution from photographs. While plenty of works extend unconditional generative models and achieve some levels of controllability, it is still challenging to ensure multi-view consistency, especially in large poses. In this work, we propose a network that generates 3D-aware portraits while being controllable according to semantic parameters regarding pose, identity, expression and illumination.

Hidden features in neural networks usually fail to learn informative representations for 3D segmentation, as supervision is given only on the output prediction; this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to 3D segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record the categories within the receptive fields of hidden units in the encoder. The target RFCCs then supervise the decoder to gradually infer the RFCCs in a coarse-to-fine category reasoning manner, and finally obtain the semantic labels.
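A toy sketch of how a target RFCC could be assembled — a multi-hot code marking every category present inside a unit's receptive field — on a 1-D label sequence. The reduction to 1-D is ours for illustration; the paper operates on 3D point clouds:

```python
import numpy as np

def target_rfcc(labels: np.ndarray, window: int, n_classes: int) -> np.ndarray:
    """For a 1-D toy 'point sequence', build a multi-hot code per unit
    marking every category present inside its receptive field."""
    n = labels.shape[0]
    codes = np.zeros((n, n_classes), dtype=np.int64)
    for i in range(n):
        # Receptive field = a symmetric window of radius `window`
        lo, hi = max(0, i - window), min(n, i + window + 1)
        codes[i, np.unique(labels[lo:hi])] = 1
    return codes

labels = np.array([0, 0, 1, 1, 2])
codes = target_rfcc(labels, window=1, n_classes=3)
# Unit 2 sees labels {0, 1} inside its radius-1 field.
assert codes[2].tolist() == [1, 1, 0]
```

Coarser layers would use larger windows, so the codes naturally go from many-hot (coarse) to one-hot (fine), matching the coarse-to-fine reasoning described above.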

Night-Time Scene Parsing (NTSP) is essential to many vision applications, especially for autonomous driving. Most of the existing methods are proposed for day-time scene parsing. They rely on modeling pixel intensity-based spatial contextual cues under even illumination.

Domain adaptation aims to bridge the domain shifts between the source and the target domain. These shifts may span different dimensions such as fog, rainfall, etc. However, recent methods typically do not consider explicit prior knowledge about the domain shifts on a specific dimension, thus leading to less desired adaptation performance.

U-Nets have achieved tremendous success in medical image segmentation. Nevertheless, they may have limitations in global (long-range) contextual interactions and edge-detail preservation. In contrast, the Transformer module has an excellent ability to capture long-range dependencies by leveraging the self-attention mechanism in the encoder.
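A minimal single-head scaled dot-product self-attention, the mechanism by which a Transformer encoder lets every token attend to every other token. This is a generic sketch, not the specific module of any particular segmentation architecture:

```python
import numpy as np

def self_attention(x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a sequence
    of token embeddings x with shape (seq_len, d_model)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (seq, seq) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # rows are softmax weights
    return attn @ v                               # each token mixes all tokens

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 32))                     # 16 tokens, 32-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
assert out.shape == (16, 32)
```

Because every output token is a weighted sum over all input tokens, the receptive field is global from the first layer, which is exactly the long-range property the convolutional U-Net lacks.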

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection, while demonstrating performance as good as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures.

Coronavirus Disease 2019 (Covid-19) swept the world in early 2020, placing global health under threat. Automated lung infection detection using chest X-ray images has great potential for enhancing the traditional Covid-19 treatment strategy. However, there are several challenges in detecting infected regions from chest X-ray images, including significant variance in infected features with similar spatial characteristics, and multi-scale variations in the texture, shape, and size of infected regions.

Mirror detection is challenging because the visual appearance of mirrors changes depending on that of their surroundings. As existing mirror detection methods are mainly based on extracting contextual contrast and relational similarity between mirror and non-mirror regions, they may fail to identify a mirror region if these assumptions are violated. Inspired by a recent study applying a CNN to help distinguish whether an image is flipped or not based on the visual chirality property, in this paper we rethink this image-level visual chirality property and reformulate it as a learnable pixel-level cue for mirror detection.

Point cloud analysis without pose priors is very challenging in real applications, as the orientations of point clouds are often unknown. In this paper, we propose a brand-new point-set learning framework, PRIN (Point-wise Rotation Invariant Network), focusing on rotation-invariant feature extraction in point cloud analysis. We construct spherical signals by Density Aware Adaptive Sampling to deal with distorted point distributions in spherical space.

Brain functional connectivity (FC) derived from resting-state functional magnetic resonance imaging (rs-fMRI) has been widely employed to study neuropsychiatric disorders such as autism spectrum disorder (ASD). Existing studies usually suffer from (1) significant data heterogeneity caused by different scanners or studied populations in multiple sites, (2) curse of dimensionality caused by millions of voxels in each fMRI scan and a very limited number (tens or hundreds) of training samples, and (3) poor interpretability, which hinders the identification of reproducible disease biomarkers. To this end, we propose a Multi-site Clustering and Nested Feature Extraction (MC-NFE) method for fMRI-based ASD detection.

Although huge progress has been made on scene analysis in recent years, most existing works assume the input images to be in day-time with good lighting conditions. In this work, we aim to address the night-time scene parsing (NTSP) problem, which has two main challenges: 1) labeled night-time data are scarce, and 2) over- and under-exposures may co-occur in the input night-time images and are not explicitly modeled in existing pipelines. To tackle the scarcity of night-time data, we collect a novel labeled dataset, named NightCity, of 4,297 real night-time images with ground truth pixel-level semantic annotations.

Facial expression transfer between two unpaired images is a challenging problem, as fine-grained expression is typically tangled with other facial attributes. Most existing methods treat expression transfer as an application of expression manipulation, and use predicted global expression, landmarks or action units (AUs) as a guidance. However, the prediction may be inaccurate, which limits the performance of transferring fine-grained expression.

Pixel-level 2D object semantic understanding is an important topic in computer vision and could help machines deeply understand objects (e.g., functionality and affordance) in our daily life.

We propose a robust normal estimation method for both point clouds and meshes using a low rank matrix approximation algorithm. First, we compute a local isotropic structure for each point and find its similar, non-local structures that we organize into a matrix. We then show that a low rank matrix approximation algorithm can robustly estimate normals for both point clouds and meshes.
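The classical PCA step such pipelines refine — taking the normal as the least-variance direction of a local neighborhood — can be sketched as follows. The non-local, low-rank aggregation that makes the paper's method robust is not reproduced here:

```python
import numpy as np

def pca_normal(neighbors: np.ndarray) -> np.ndarray:
    """Estimate the normal of a local patch as the eigenvector of the
    neighborhood covariance with the smallest eigenvalue."""
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, 0]                     # direction of least variance

# Points sampled (noisily) from the plane z = 0: the normal should be ~±z.
rng = np.random.default_rng(3)
pts = np.column_stack([rng.normal(size=(200, 2)),
                       rng.normal(0.0, 1e-3, size=200)])
n = pca_normal(pts)
assert abs(n[2]) > 0.99
```

This per-patch estimate degrades near sharp edges and under heavy noise, which is the failure mode the low-rank, non-local formulation is designed to fix.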

In this article, we propose a multiview self-representation model for nonlinear subspace clustering. By assuming that heterogeneous features lie within the union of multiple linear subspaces, recent multiview subspace learning methods aim to capture the complementary and consensus information from multiple views to boost performance. However, in real-world applications, data features usually reside in multiple nonlinear subspaces, leading to undesirable results.
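The self-representation principle these methods build on — each sample reconstructed as a combination of the others, with the coefficient matrix serving as a clustering affinity — can be sketched with a ridge-regularized least-squares solve. This is a single-view, linear toy, not the paper's multiview nonlinear model:

```python
import numpy as np

def self_representation(X: np.ndarray, lam: float = 0.1) -> np.ndarray:
    """Solve min_C ||X - XC||^2 + lam ||C||^2 in closed form, so each
    column (sample) of X is rebuilt from the others.
    X has shape (d, n): columns are samples."""
    n = X.shape[1]
    G = X.T @ X
    return np.linalg.solve(G + lam * np.eye(n), G)

# Two orthogonal 1-D subspaces in R^3: within-cluster coefficients
# dominate the affinity matrix |C| + |C|^T, revealing the clusters.
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
X = np.column_stack([u * t for t in (1, 2, 3)] + [v * t for t in (1, 2, 3)])
C = self_representation(X)
A = np.abs(C) + np.abs(C).T
assert A[:3, :3].sum() > A[:3, 3:].sum()
```

In practice the affinity A is fed to spectral clustering; kernelized or multiview variants replace the Gram matrix G to handle nonlinear subspaces and multiple feature types.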

Noonan syndrome (NS) is a common autosomal dominant/recessive disorder. No large-scale study of NS has been conducted in China, the world's most populous country. Next-generation sequencing (NGS) was used to identify pathogenic variants in patients who exhibited NS-related phenotypes.

Background: Human-computer interaction (HCI) is an important feature of augmented reality (AR) technology. Naturalness is the inevitable trend of HCI. Apart from language, gesture is the most natural and frequently used auxiliary interaction mode in daily interactions.

To let doctors carry out coronary artery diagnosis and preoperative planning in a more intuitive and natural way, and to improve the training effect for interns, an augmented reality system for coronary artery diagnosis planning and training (ARS-CADPT) is designed and realized in this paper. First, a 3D reconstruction algorithm based on computed tomographic (CT) images is proposed to model the coronary artery vessels (CAV). Second, algorithms for static gesture recognition and for dynamic gesture spotting and recognition are presented to realize the real-time and friendly human-computer interaction (HCI) that characterizes ARS-CADPT.
