IEEE Trans Image Process
August 2025
Due to the limited output categories, semi-supervised salient object detection faces challenges in adapting conventional semi-supervised strategies. To address this limitation, we propose a multi-branch architecture that extracts complementary features from labeled data. Specifically, we introduce TripleNet, a three-branch network architecture designed for contour, content, and holistic saliency prediction.
IEEE Trans Pattern Anal Mach Intell
August 2025
Given a piece of text, a video clip, and reference audio, the movie dubbing (also known as Visual Voice Cloning, V2C) task aims to generate speech that clones the reference voice and aligns well with the video in both emotion and lip movement, which is more challenging than conventional text-to-speech synthesis. To align the generated speech with the inherent lip motion of the given silent video, most existing works use each video frame to query textual phonemes. However, such an attention operation usually leads to mumbled speech because different phonemes are fused for video frames corresponding to one phoneme (video frames are finer-grained than phonemes).
IEEE Trans Pattern Anal Mach Intell
July 2025
Deep learning methods have demonstrated state-of-the-art performance in image restoration, especially when trained on large-scale paired datasets. However, acquiring paired data in real-world scenarios poses a significant challenge. Unsupervised restoration approaches based on generative adversarial networks (GANs) offer a promising solution without requiring paired datasets.
IEEE Trans Pattern Anal Mach Intell
August 2025
This paper highlights a problem with the evaluation metrics adopted in open-vocabulary segmentation. The evaluation process relies heavily on closed-set metrics on zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground-truth categories. To tackle this issue, we first survey eleven similarity measurements between two categorical words, using WordNet linguistic statistics, text embeddings, or language models, through comprehensive quantitative analysis and a user study.
IEEE Trans Neural Netw Learn Syst
September 2025
Weakly supervised video anomaly detection aims to learn a detection model using only video-level labeled data. Prior studies ignore the complexity or duration of anomalies present in abnormal videos during temporal modeling. Moreover, existing works usually detect the most abnormal segments, potentially overlooking the completeness of anomalies.
IEEE Trans Pattern Anal Mach Intell
April 2025
In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences. The main idea behind our approach is leveraging the domain-neutral capabilities of CLIP as a bridging mechanism, while utilizing a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. Fusing these abstract semantics with target-specific semantics results in a transformed embedding within the CLIP space.
IEEE Trans Pattern Anal Mach Intell
April 2025
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
IEEE Trans Pattern Anal Mach Intell
January 2025
In this paper, we study the application of Vehicle-to-Everything (V2X) communication to improve the perception performance of autonomous vehicles. We present V2X-ViTs, a robust cooperative perception framework with V2X communication using novel vision Transformer models. First, we present V2X-ViTv1 containing holistic attention modules that can effectively fuse information across on-road agents (i.
IEEE Trans Pattern Anal Mach Intell
December 2024
Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer.
IEEE Trans Pattern Anal Mach Intell
December 2024
Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) overfitting suppresses novel class objects and 2) dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks.
IEEE Trans Pattern Anal Mach Intell
December 2024
A desirable objective in self-supervised learning (SSL) is to avoid feature collapse. Whitening loss guarantees collapse avoidance by minimizing the distance between embeddings of positive pairs under the condition that the embeddings from different views are whitened. In this paper, we propose a framework with an informative indicator to analyze whitening loss, which provides a clue to demystify several interesting phenomena and a pivoting point connecting to other SSL methods.
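The idea of a whitening loss can be illustrated with a minimal sketch: each view's batch of embeddings is whitened (zero mean, identity covariance), and the loss is the mean squared distance between whitened positive pairs. The 2-D embeddings, the Cholesky-based whitening transform, and the function names `whiten` and `whitening_loss` are assumptions made for illustration; the article's framework and indicator are not reproduced here.

```python
import math


def whiten(batch):
    """Whiten a batch of 2-D embeddings to zero mean and identity covariance.

    Uses a Cholesky factor of the 2x2 covariance, one of several valid
    whitening transforms; the choice is an assumption for illustration.
    """
    n = len(batch)
    mean = [sum(x[k] for x in batch) / n for k in range(2)]
    centered = [[x[0] - mean[0], x[1] - mean[1]] for x in batch]
    # Covariance entries (biased estimate).
    a = sum(c[0] * c[0] for c in centered) / n
    b = sum(c[0] * c[1] for c in centered) / n
    d = sum(c[1] * c[1] for c in centered) / n
    # Cholesky factor L = [[l11, 0], [l21, l22]] with L L^T = covariance.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    # Solve L z = c by forward substitution, so cov(z) is the identity.
    return [[c[0] / l11, (c[1] - l21 * c[0] / l11) / l22] for c in centered]


def whitening_loss(view_a, view_b):
    """Mean squared distance between whitened embeddings of positive pairs."""
    za, zb = whiten(view_a), whiten(view_b)
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
               for p, q in zip(za, zb)) / len(za)


view_a = [[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [4.0, 2.0]]
# A rescaled and shifted copy of view_a: whitening removes the difference.
view_b = [[3.0 * x + 1.0, 3.0 * y - 2.0] for x, y in view_a]
aligned = whitening_loss(view_a, view_b)
# Pairing each embedding with the wrong partner gives a clearly larger loss.
mismatched = whitening_loss(view_a, view_a[::-1])
```

In this sketch a fully collapsed batch (all embeddings identical) has a singular covariance and cannot be whitened at all, which gives some intuition for why the whitening constraint rules out collapse.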
IEEE Trans Pattern Anal Mach Intell
December 2024
IEEE Trans Med Imaging
September 2024
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices.
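To make the quadratic growth concrete, the following sketch counts tokens and attention-matrix entries for a volumetric input. The volume sizes and the cubic patch size are illustrative assumptions, not values from the article.

```python
def attention_cost(depth, height, width, patch=4):
    """Return (num_tokens, num_attention_entries) for one self-attention layer.

    Assumes the volume is split into non-overlapping cubic patches of side
    `patch` (a hypothetical tokenization chosen for illustration).
    """
    tokens = (depth // patch) * (height // patch) * (width // patch)
    # Full self-attention compares every token with every other token.
    return tokens, tokens * tokens


small = attention_cost(64, 64, 64)
large = attention_cost(128, 128, 128)
print(small)  # (4096, 16777216)
print(large)  # (32768, 1073741824)
```

Under these assumptions, doubling each side of the volume multiplies the token count by 8 and the number of attention entries by 64.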
IEEE Trans Pattern Anal Mach Intell
April 2024
Crowd localization aims to predict the positions of humans in images of crowded scenes. While existing methods have made significant progress, two primary challenges remain: (i) a fixed number of evenly distributed anchors can cause excessive or insufficient predictions across regions in an image with varying crowd densities, and (ii) ranking inconsistency of predictions between the testing and training phases leads to the model being sub-optimal in inference. To address these issues, we propose a Consistency-Aware Anchor Pyramid Network (CAAPN) comprising two key components: an Adaptive Anchor Generator (AAG) and a Localizer with Augmented Matching (LAM).
IEEE Trans Pattern Anal Mach Intell
August 2024
Optical aberration is a ubiquitous degeneration in realistic lens-based imaging systems. Optical aberrations are caused by the differences in the optical path length when light travels through different regions of the camera lens with different incident angles. The blur and chromatic aberrations manifest significant discrepancies when the optical system changes.
IEEE Trans Pattern Anal Mach Intell
April 2024
This article targets the task of novel category discovery (NCD), which aims to discover unknown categories when a certain number of classes are already known. The NCD task is challenging due to its closeness to real-world scenarios, where we have only encountered some partial classes and corresponding images. Unlike previous approaches to NCD, we propose a novel adaptive prototype learning method that leverages prototypes to emphasize category discrimination and alleviate the issue of missing annotations for novel classes.
IEEE Trans Pattern Anal Mach Intell
February 2024
Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions.
Facial Attribute Manipulation (FAM) aims to aesthetically modify a given face image to render desired attributes, which has received significant attention due to its broad practical applications ranging from digital entertainment to biometric forensics. In the last decade, with the remarkable success of Generative Adversarial Networks (GANs) in synthesizing realistic images, numerous GAN-based models have been proposed to solve FAM with various problem formulation approaches and guiding information representations. This paper presents a comprehensive survey of GAN-based FAM methods with a focus on summarizing their principal motivations and technical details.
IEEE Trans Pattern Anal Mach Intell
August 2023
We present compact and effective deep convolutional neural networks (CNNs) by exploring properties of videos for video deblurring. Motivated by the non-uniform blur property that not all the pixels of the frames are blurry, we develop a CNN to integrate a temporal sharpness prior (TSP) for removing blur in videos. The TSP exploits sharp pixels from adjacent frames to facilitate the CNN for better frame restoration.
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model so that the image can be faithfully reconstructed from the inverted code by the generator. As an emerging technique to bridge the real and fake image domains, GAN inversion plays an essential role in enabling pretrained GAN models, such as StyleGAN and BigGAN, for applications of real image editing. Moreover, GAN inversion interprets GAN's latent space and examines how realistic images can be generated.
IEEE Trans Neural Netw Learn Syst
July 2024
While deep-learning-based tracking methods have achieved substantial progress, they entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised (SS) learning for visual tracking. In this work, we develop the crop-transform-paste operation, which is able to synthesize sufficient training data by simulating various appearance variations during tracking, including appearance variations of objects and background interference.
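A crop-transform-paste step can be sketched as follows. The grayscale list-of-lists images, the single horizontal-flip transform, and the function name `crop_transform_paste` are assumptions for illustration; the article's operation simulates a wider range of appearance variations and background interference.

```python
import random


def crop_transform_paste(source, box, background, rng):
    """Synthesize one training sample and its bounding-box label.

    source, background: 2D lists of pixel values (grayscale for brevity).
    box: (top, left, height, width) of the target region in `source`.
    The transform here is only a horizontal flip; this is an assumption
    standing in for richer appearance changes.
    """
    top, left, h, w = box
    # Crop: copy the target region out of the source frame.
    patch = [row[left:left + w] for row in source[top:top + h]]
    # Transform: flip horizontally to simulate an appearance change.
    patch = [row[::-1] for row in patch]
    # Paste: pick a random location that keeps the patch inside the background.
    bg_h, bg_w = len(background), len(background[0])
    new_top = rng.randrange(bg_h - h + 1)
    new_left = rng.randrange(bg_w - w + 1)
    image = [row[:] for row in background]
    for i in range(h):
        image[new_top + i][new_left:new_left + w] = patch[i]
    # The pasted location is known, so the label needs no manual annotation.
    return image, (new_top, new_left, h, w)


source = [[10 * r + c for c in range(6)] for r in range(6)]
background = [[0] * 8 for _ in range(8)]
image, label = crop_transform_paste(source, (1, 2, 2, 3), background, random.Random(0))
```

Because the paste location is chosen by the code itself, every synthesized image comes with an exact box label, which is what removes the need for manual annotation in this sketch.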
IEEE Trans Pattern Anal Mach Intell
July 2023
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available. To facilitate GAN training, current methods propose to use data-specific augmentation techniques. Despite their effectiveness, it is difficult for these methods to scale to practical applications.
Empowered by large datasets, e.g., ImageNet and MS COCO, unsupervised learning on large-scale data has enabled significant advances for classification tasks.
IEEE Trans Pattern Anal Mach Intell
February 2023
Given a degraded input image, image restoration aims to recover the missing high-quality image content. Numerous applications demand effective image restoration, e.g.
IEEE Trans Pattern Anal Mach Intell
February 2023
The softmax cross-entropy loss function has been widely used to train deep models for various tasks. In this work, we propose a Gaussian mixture (GM) loss function for deep neural networks for visual classification. Unlike the softmax cross-entropy loss, our method explicitly shapes the deep feature space towards a Gaussian Mixture distribution.
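One way to picture a loss that shapes features toward a Gaussian mixture is the simplified sketch below. It assumes identity covariance for every class, equal class priors, and a hypothetical weight `lam` on a likelihood term; none of these choices are stated in the excerpt, and the article's actual formulation may differ.

```python
import math


def gm_loss(feature, label, means, lam=0.1):
    """Simplified Gaussian-mixture-style loss for one sample.

    Assumptions (not from the article): identity covariance for every
    class, equal class priors, and a weight `lam` on the likelihood term.
    """
    # Squared Euclidean distance from the feature to each class mean.
    sq_dists = [sum((f - m) ** 2 for f, m in zip(feature, mean)) for mean in means]
    # Class posterior under the assumed mixture: softmax over -0.5 * distance.
    logits = [-0.5 * d for d in sq_dists]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    classification = log_norm - logits[label]
    # Likelihood term: pulls the feature toward its own class mean.
    likelihood = 0.5 * sq_dists[label]
    return classification + lam * likelihood


means = [[0.0, 0.0], [4.0, 0.0]]
near = gm_loss([0.5, 0.0], 0, means)  # feature close to its class mean
far = gm_loss([3.0, 0.0], 0, means)   # feature closer to the other class
```

In this sketch the loss is small when a feature sits near the mean of its own class and grows as it drifts toward another class, so minimizing it clusters features around per-class centers.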