Publications by authors named "Xiaochun Cao"

Image restoration aims to recover the latent clean image from a degraded counterpart. In general, the prevailing state-of-the-art image restoration methods concentrate on solving only a specific degradation type according to the task, e.g.

Medical phrase grounding is crucial for identifying relevant regions in medical images based on phrase queries, facilitating accurate image analysis and diagnosis. However, current methods rely on manual extraction of key phrases from medical reports, reducing efficiency and increasing the workload for clinicians. Additionally, the lack of model confidence estimation limits clinical trust and usability.

The transferability of adversarial examples is vital for black-box attacks, as it enables the adversary to deceive the target model without knowing its internals. Despite numerous methods focusing on transferability, they still struggle with transferring across models with distinct architectural components (e.g.

Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visuals into LLMs have resulted in boosts for a variety of Audio-Visual Question Answering (AVQA) tasks, they still face two crucial challenges: 1) audio-visual ambiguity, and 2) audio-visual hallucination. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects due to the ambiguity or hallucination of responses.

Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features.

Face Anti-Spoofing (FAS) is essential for securing face recognition systems against presentation attacks. Recent advances in sensor technology and multimodal learning have enabled the development of multimodal FAS systems. However, existing methods often struggle to generalize to unseen attacks and diverse environments due to two key challenges: (1) Modality unreliability, where sensors such as depth and infrared suffer from severe domain shifts, impairing the reliability of cross-modal fusion; and (2) Modality imbalance, where over-reliance on a dominant modality weakens the model's robustness against attacks that affect other modalities.

Transformer-based trackers have achieved promising success and become the dominant tracking paradigm because of their accuracy and efficiency. Despite the substantial progress, most of the existing approaches handle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been largely overlooked, which hampers trackers' ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer-based tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference.

The objective of few-shot object detection (FSOD) is to detect novel objects with few training samples. The key challenge is constructing a generalized feature space for novel categories with limited data, leveraging the base category space to adapt the detection model. Most fine-tuning methods address this by pre-training on base categories and fine-tuning on novel ones.

Existing localization methods commonly employ vision to perceive the scene and achieve localization in GNSS-denied areas, yet they often struggle in environments with complex lighting conditions, dynamic objects, or privacy-preserving areas. Humans can describe various scenes in natural language and effectively infer their location from the rich semantic information in these descriptions. Harnessing language therefore presents a potential solution for robust localization.

Label propagation (LP) is a popular semi-supervised learning technique that propagates labels from a training dataset to a test dataset using a similarity graph, assuming that nearby samples should have similar labels. However, in the cross-domain problem, the training set (source domain) and the test set (target domain) follow different distributions, which can degrade the performance of LP because the similarity weights connecting the two domains are small. To address this problem, we propose optimal graph learning-based label propagation (OGL2P), which optimizes one cross-domain graph and two intra-domain graphs to connect the two domains and preserve domain-specific structures, respectively.
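
As a rough illustration of the classic label propagation idea mentioned above (not the OGL2P method itself, whose details are not given here), the following minimal sketch builds a Gaussian similarity graph and iteratively spreads labels from labeled to unlabeled points. The function name `label_propagation`, the `sigma` bandwidth, and the iteration count are assumptions made for this example.

```python
import math


def label_propagation(points, labels, sigma=1.0, iters=200):
    """Classic iterative label propagation on a Gaussian similarity graph.

    points: list of feature tuples.
    labels: class index per point, or None for an unlabeled point.
    Returns a predicted class index for every point.
    Illustrative sketch only; names and defaults are hypothetical.
    """
    n = len(points)
    classes = sorted({y for y in labels if y is not None})
    k = len(classes)

    # Gaussian (RBF) similarity between every pair of points.
    w = [[math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
         for p in points]
    # Row-normalize so each row is a distribution over neighbors.
    trans = [[v / sum(row) for v in row] for row in w]

    def initial_scores(y):
        # One-hot scores for labeled points, uniform for unlabeled ones.
        if y is None:
            return [1.0 / k] * k
        return [1.0 if c == y else 0.0 for c in classes]

    f = [initial_scores(y) for y in labels]
    for _ in range(iters):
        # Each point takes the similarity-weighted average of its neighbors' scores.
        spread = [[sum(trans[i][j] * f[j][c] for j in range(n)) for c in range(k)]
                  for i in range(n)]
        # Clamp labeled points back to their known labels.
        f = [initial_scores(y) if y is not None else spread[i]
             for i, y in enumerate(labels)]

    return [classes[max(range(k), key=lambda c: row[c])] for row in f]


# Two well-separated 1-D clusters with one labeled point each.
points = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,)]
labels = [0, None, None, 1, None, None]
predicted = label_propagation(points, labels)
```

When the two clusters are far apart, the weights between them are close to zero, so labels barely flow across the gap. That is the same effect the abstract describes for source and target domains with different distributions.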

Current studies of provable robustness for deep neural networks (DNNs) usually assume that the class distribution is balanced overall. However, in real-world applications, especially safety-sensitive systems, the class distribution often exhibits a long-tailed property. It is well known that the Area Under the ROC Curve (AUC) is a more appropriate metric for long-tailed learning problems.
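
To make the point about AUC and long-tailed data concrete, here is a minimal sketch that computes AUC from its pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `pairwise_auc` and the toy data are assumptions for illustration and are not taken from the paper.

```python
def pairwise_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    scores: predicted scores, higher means more likely positive.
    labels: 1 for positive, 0 for negative.
    Illustrative O(P * N) sketch; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Long-tailed toy set: 1 positive, 9 negatives. A model that gives every
# sample the same score and always predicts "negative" is 90% accurate,
# yet it ranks no better than chance.
constant_scores = [0.1] * 10
imbalanced_labels = [1] + [0] * 9
chance_level = pairwise_auc(constant_scores, imbalanced_labels)

# A small balanced example with one mis-ranked pair out of four.
mixed = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because the metric depends only on how positives rank against negatives, it does not reward a model for simply predicting the majority class.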

Currently, deep neural networks (DNNs) are widely adopted in different applications. Despite their commercial value, training a well-performing DNN is resource-intensive. Accordingly, a well-trained model is valuable intellectual property for its owner.

Domain adaptation aims to transfer abundant label information from a source domain to an unlabeled target domain when the two domains follow different distributions. Existing methods usually rely on a classifier to generate high-quality pseudo-labels for the target domain, facilitating the learning of discriminative features. Label propagation (LP), as an effective classifier, propagates labels from the source domain to the target domain by designing a smooth function over a similarity graph, which represents structural relationships among data points in feature space.

Adversarial patches are an important form of adversarial attack in the physical world. To improve the naturalness and aggressiveness of existing adversarial patches, location-aware patches have been proposed, in which the patch's location on the target object is integrated into the optimization process. Although this approach is effective, efficiently finding the optimal location for placing the patches is challenging, especially under black-box attack settings.

The accurate measurement of perceptual color differences (CDs) between two images plays an important role in modern smartphone photography. Although traditional CD metrics provide numerical scores to quantify color variations, they often lack the ability to offer intuitive insights or explanations that reflect the factors behind these differences in a way that aligns with human perception and reasoning. Here, we present CD-Reasoning, an innovative method designed not merely to compute numerical CD scores but also to provide a detailed rationale for the observed CDs between images.

Deep neural network (DNN) models are widely used in various fields, such as pattern recognition and natural language processing, and provide considerable commercial value to their owners. Embedding a digital watermark in the model allows the legitimate owner to detect unauthorized use of the model. However, the existing DNN watermarking methods are vulnerable to model extraction attacks since the watermark task and the original model task are independent.

Many Transformer-based pre-trained models for code have been developed and applied to code-related tasks. In this paper, we analyze 519 papers published on this topic during 2017-2023, examine the suitability of model architectures for different tasks, summarize their resource consumption, and look at the generalization ability of models on different datasets. We examine three representative pre-trained models for code: CodeBERT, CodeGPT, and CodeT5, and conduct experiments on the four topmost targeted software engineering tasks from the literature: Bug Fixing, Bug Detection, Code Summarization, and Code Search.

Scene text editing aims to replace the source text with the target text while preserving the original background. Its practical applications span various domains, such as data generation and privacy protection, highlighting its increasing importance in recent years. In this study, we propose a novel Scene Text Editing network with Explicitly-decoupled text transfer and Minimized background reconstruction, called STEEM.

Camouflaged object detection (COD) aims to identify the objects that seamlessly blend into the surrounding backgrounds. Due to the intrinsic similarity between the camouflaged objects and the background region, it is extremely challenging to precisely distinguish the camouflaged objects by existing approaches. In this paper, we propose a hierarchical graph interaction network termed HGINet for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.

Object detection methods have achieved remarkable performance when the training and testing data satisfy the assumption of i.i.d.

Inertial measurement units (IMU) in the capturing device can record the motion information of the device, with gyroscopes measuring angular velocity and accelerometers measuring acceleration. However, conventional deblurring methods seldom incorporate IMU data, and existing approaches that utilize IMU information often face challenges in fully leveraging this valuable data, resulting in noise issues from the sensors. To address these issues, in this paper, we propose a multi-stage deblurring network named INformer, which combines inertial information with the Transformer architecture.

Image restoration aims to reconstruct a high-quality image from its corrupted version, playing essential roles in many scenarios. Recent years have witnessed a paradigm shift in image restoration from convolutional neural networks (CNNs) to Transformer-based models due to their powerful ability to model long-range pixel interactions. In this paper, we explore the potential of CNNs for image restoration and show that the proposed simple convolutional network architecture, termed ConvIR, can perform on par with or better than the Transformer counterparts.

Rank aggregation with pairwise comparisons is widely encountered in sociology, politics, economics, psychology, sports, etc. Given the enormous social impact and the consequent incentives, a potential adversary has a strong motivation to manipulate the ranking list. However, existing methods are impractical because they assume an ideal attack opportunity and excessive adversarial capability.

Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests.

A long-standing topic in artificial intelligence is the effective recognition of patterns from noisy images. In this regard, the recent data-driven paradigm considers 1) improving the representation robustness by adding noisy samples in the training phase (i.e.
