Publications by authors named "Ming-Hsuan Yang"

TripleNet: Exploiting Complementary Features and Pseudo-Labels for Semi-Supervised Salient Object Detection.

Liyuan Chen, Ming-Hsuan Yang, Jian Pu, Zhonglong Zheng

IEEE Trans Image Process

August 2025

Due to the limited output categories, semi-supervised salient object detection faces challenges in adapting conventional semi-supervised strategies. To address this limitation, we propose a multi-branch architecture that extracts complementary features from labeled data. Specifically, we introduce TripleNet, a three-branch network architecture designed for contour, content, and holistic saliency prediction.

Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising.

Liang Li, Gaoxiang Cong, Yuankai Qi, Zheng-Jun Zha, Qi Wu, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

August 2025

Given a piece of text, a video clip, and reference audio, the movie dubbing (also known as Visual Voice Cloning, V2C) task aims to generate speech that clones the reference voice and aligns well with the video in both emotion and lip movement, which is more challenging than conventional text-to-speech synthesis tasks. To align the generated speech with the inherent lip motion of the given silent video, most existing works utilize each video frame to query textual phonemes. However, such an attention operation usually leads to mumbled speech because different phonemes are fused for video frames corresponding to one phoneme (video frames are finer-grained than phonemes).

Re-Boosting Self-Collaboration Parallel Prompt GAN for Unsupervised Image Restoration.

Xin Lin, Yuyan Zhou, Jingtong Yue, Chao Ren, Kelvin C K Chan, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

July 2025

Deep learning methods have demonstrated state-of-the-art performance in image restoration, especially when trained on large-scale paired datasets. However, acquiring paired data in real-world scenarios poses a significant challenge. Unsupervised restoration approaches based on generative adversarial networks (GANs) offer a promising solution without requiring paired datasets.

Rethinking Evaluation Metrics of Open-Vocabulary Segmentation.

Hao Zhou, Lu Qi, Tiancheng Shen, Hai Huang, Xu Yang, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

August 2025

This paper highlights a problem with the evaluation metrics adopted in open-vocabulary segmentation. The evaluation process relies heavily on closed-set metrics on zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground truth categories. To tackle this issue, we first survey eleven similarity measurements between two categorical words, based on WordNet linguistic statistics, text embeddings, or language models, through comprehensive quantitative analysis and a user study.

Dynamic Erasing Network With Adaptive Temporal Modeling for Weakly Supervised Video Anomaly Detection.

Chen Zhang, Guorong Li, Yuankai Qi, Hanhua Ye, Laiyun Qing, Ming-Hsuan Yang

IEEE Trans Neural Netw Learn Syst

September 2025

Weakly supervised video anomaly detection aims to learn a detection model using only video-level labeled data. Prior studies ignore the complexity or duration of anomalies present in abnormal videos during temporal modeling. Moreover, existing works usually detect the most abnormal segments, potentially overlooking the completeness of anomalies.

One-for-All: Towards Universal Domain Translation With a Single StyleGAN.

Yong Du, Jiahui Zhan, Xinzhe Li, Junyu Dong, Sheng Chen, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

April 2025

In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences. The main idea behind our approach is leveraging the domain-neutral capabilities of CLIP as a bridging mechanism, while utilizing a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. Fusing these abstract semantics with target-specific semantics results in a transformed embedding within the CLIP space.

Foundation Models Defining a New Era in Vision: A Survey and Outlook.

Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

April 2025

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.

V2X-ViTv2: Improved Vision Transformers for Vehicle-to-Everything Cooperative Perception.

Runsheng Xu, Chia-Ju Chen, Zhengzhong Tu, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

January 2025

In this paper, we study the application of Vehicle-to-Everything (V2X) communication to improve the perception performance of autonomous vehicles. We present V2X-ViTs, a robust cooperative perception framework with V2X communication using novel vision Transformer models. First, we present V2X-ViTv1 containing holistic attention modules that can effectively fuse information across on-road agents (i.

Panoptic-PartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation.

Xiangtai Li, Shilin Xu, Yibo Yang, Haobo Yuan, Guangliang Cheng, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

December 2024

Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer.

Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation.

Yue Han, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Yong Liu, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

December 2024

Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) overfitting suppresses novel class objects and 2) dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks.

Understanding Whitening Loss in Self-Supervised Learning.

Lei Huang, Yunhao Ni, Xi Weng, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

December 2024

A desirable objective in self-supervised learning (SSL) is to avoid feature collapse. Whitening loss guarantees collapse avoidance by minimizing the distance between embeddings of positive pairs under the condition that the embeddings from different views are whitened. In this paper, we propose a framework with an informative indicator to analyze whitening loss, which provides a clue to demystify several interesting phenomena and a pivoting point connecting to other SSL methods.
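
The idea of a whitening loss described in this abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the ZCA whitening transform, the function names `whiten` and `whitening_loss`, and the synthetic two-view data are all assumptions chosen for illustration.

```python
import numpy as np


def whiten(z, eps=1e-8):
    """ZCA-whiten a batch of embeddings of shape (n, d) (illustrative choice)."""
    centered = z - z.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(z) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # W = U diag(1/sqrt(lambda)) U^T maps the batch covariance to the identity.
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ w


def whitening_loss(z1, z2):
    """Mean squared distance between whitened embeddings of two views."""
    diff = whiten(z1) - whiten(z2)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Hypothetical data: correlated embeddings and a slightly perturbed second view.
rng = np.random.default_rng(0)
mix = np.eye(8) + 0.3 * np.triu(np.ones((8, 8)), 1)
view1 = rng.normal(size=(512, 8)) @ mix
view2 = view1 + 0.1 * rng.normal(size=(512, 8))

whitened = whiten(view1)
cov_after = whitened.T @ whitened / (len(whitened) - 1)
loss = whitening_loss(view1, view2)
```

Because every whitened batch has identity covariance, the embeddings cannot all shrink to a single point, which is the collapse-avoidance property the abstract refers to.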

Learning Disentangled Representation for One-Shot Progressive Face Swapping.

Qi Li, Weining Wang, Chengzhong Xu, Zhenan Sun, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

December 2024

Article Synopsis

Face swapping is a complex challenge that often relies on extensive data samples, but existing methods neglect semantic information, resulting in fixed identity representation and ineffective swaps.
The paper introduces FaceSwapper, an efficient one-shot face swapping technique utilizing Generative Adversarial Networks (GANs), featuring a disentangled representation module for identity and attribute separation.
FaceSwapper enhances the swapping process by incorporating semantic information through a semantic-guided fusion module, leading to superior accuracy in pose and expression while achieving state-of-the-art results with fewer training samples.

UNETR++: Delving Into Efficient and Accurate 3D Medical Image Segmentation.

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang

IEEE Trans Med Imaging

September 2024

Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices.
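
To make the quadratic-complexity remark concrete, the following back-of-the-envelope sketch counts attention-matrix entries for a 3D volume. The cubic, non-overlapping patching and the helper name `attention_cost` are assumptions for illustration and are not taken from the paper.

```python
def attention_cost(depth, height, width, patch=4):
    """Return (token count, attention-matrix entries) for one 3D volume.

    Assumes the volume is split into non-overlapping cubic patches and that
    full self-attention compares every token with every other token.
    """
    tokens = (depth // patch) * (height // patch) * (width // patch)
    return tokens, tokens * tokens


small_tokens, small_entries = attention_cost(64, 64, 64)
large_tokens, large_entries = attention_cost(128, 128, 128)
print(small_tokens, small_entries)   # 4096 16777216
print(large_tokens, large_entries)   # 32768 1073741824
```

Under these assumptions, doubling each side of the volume multiplies the token count by 8 and the attention matrix by 64, which is why volumetric inputs make full self-attention a bottleneck.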

Consistency-Aware Anchor Pyramid Network for Crowd Localization.

Xinyan Liu, Guorong Li, Yuankai Qi, Zhenjun Han, Anton van den Hengel, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

April 2024

Crowd localization aims to predict the positions of humans in images of crowded scenes. While existing methods have made significant progress, two primary challenges remain: (i) a fixed number of evenly distributed anchors can cause excessive or insufficient predictions across regions in an image with varying crowd densities, and (ii) ranking inconsistency of predictions between the testing and training phases leads to the model being sub-optimal in inference. To address these issues, we propose a Consistency-Aware Anchor Pyramid Network (CAAPN) comprising two key components: an Adaptive Anchor Generator (AAG) and a Localizer with Augmented Matching (LAM).

Correcting Optical Aberration via Depth-Aware Point Spread Functions.

Jun Luo, Yunfeng Nie, Wenqi Ren, Xiaochun Cao, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

August 2024

Optical aberration is a ubiquitous degradation in realistic lens-based imaging systems. Optical aberrations are caused by the differences in the optical path length when light travels through different regions of the camera lens with different incident angles. The blur and chromatic aberrations manifest significant discrepancies when the optical system changes.

Automatically Discovering Novel Visual Categories With Adaptive Prototype Learning.

Lu Zhang, Lu Qi, Xu Yang, Hong Qiao, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

April 2024

This article targets the task of novel category discovery (NCD), which aims to discover unknown categories when a certain number of classes are already known. The NCD task is challenging due to its closeness to real-world scenarios, where we have only encountered some partial classes and corresponding images. Unlike previous approaches to NCD, we propose a novel adaptive prototype learning method that leverages prototypes to emphasize category discrimination and alleviate the issue of missing annotations for novel classes.

Learning Hierarchical Modular Networks for Video Captioning.

Guorong Li, Hanhua Ye, Yuankai Qi, Shuhui Wang, Laiyun Qing, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

February 2024

Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions.

GAN-Based Facial Attribute Manipulation.

Yunfan Liu, Qi Li, Qiyao Deng, Zhenan Sun, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

December 2023

Facial Attribute Manipulation (FAM) aims to aesthetically modify a given face image to render desired attributes, which has received significant attention due to its broad practical applications ranging from digital entertainment to biometric forensics. In the last decade, with the remarkable success of Generative Adversarial Networks (GANs) in synthesizing realistic images, numerous GAN-based models have been proposed to solve FAM with various problem formulation approaches and guiding information representations. This paper presents a comprehensive survey of GAN-based FAM methods with a focus on summarizing their principal motivations and technical details.

Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-Local Spatial-Temporal Similarity.

Jinshan Pan, Boming Xu, Haoran Bai, Jinhui Tang, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

August 2023

We present compact and effective deep convolutional neural networks (CNNs) by exploring properties of videos for video deblurring. Motivated by the non-uniform blur property that not all the pixels of the frames are blurry, we develop a CNN to integrate a temporal sharpness prior (TSP) for removing blur in videos. The TSP exploits sharp pixels from adjacent frames to facilitate the CNN for better frame restoration.

GAN Inversion: A Survey.

Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

March 2023

GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model so that the image can be faithfully reconstructed from the inverted code by the generator. As an emerging technique to bridge the real and fake image domains, GAN inversion plays an essential role in enabling pretrained GAN models, such as StyleGAN and BigGAN, for applications of real image editing. Moreover, GAN inversion interprets GAN's latent space and examines how realistic images can be generated.
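
Optimization-based inversion, one common way to obtain such a latent code, can be sketched on a toy problem. Everything here is a stand-in: a fixed random linear map `generator_weights` plays the role of the pretrained generator, and the names, sizes, and step count are hypothetical. A real GAN generator is nonlinear and would need an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained generator": G(z) = A z, mapping a 4-D latent to a 16-D "image".
generator_weights = rng.normal(size=(16, 4))


def generate(z):
    return generator_weights @ z


z_true = rng.normal(size=4)
target_image = generate(z_true)   # the image we want to invert

# Gradient descent on the reconstruction error ||G(z) - x||^2 with respect to z.
step = 1.0 / (2.0 * np.linalg.norm(generator_weights, 2) ** 2)
z_hat = np.zeros(4)
for _ in range(2000):
    residual = generate(z_hat) - target_image
    z_hat -= step * 2.0 * generator_weights.T @ residual

reconstruction_error = float(np.linalg.norm(generate(z_hat) - target_image))
```

In this linear toy case the recovered code matches `z_true`; with a real generator the inverted code is only an approximation, and editing is then performed by moving that code in the latent space.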

Self-Supervised Tracking via Target-Aware Data Synthesis.

Xin Li, Wenjie Pei, Yaowei Wang, Zhenyu He, Huchuan Lu, Ming-Hsuan Yang

IEEE Trans Neural Netw Learn Syst

July 2024

While deep-learning-based tracking methods have achieved substantial progress, they entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised (SS) learning for visual tracking. In this work, we develop the crop-transform-paste operation, which is able to synthesize sufficient training data by simulating various appearance variations during tracking, including appearance variations of objects and background interference.
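
A crop-transform-paste step of the kind named in this abstract might look like the NumPy sketch below. The specific transform (horizontal flip plus a brightness gain), the `(row, col, height, width)` box convention, and the function name are assumptions made for illustration, not details from the paper.

```python
import numpy as np


def crop_transform_paste(frame, box, background, top_left, gain=1.5):
    """Synthesize a training image and its box label.

    box and the returned box are (row, col, height, width).
    """
    r, c, h, w = box
    target = frame[r:r + h, c:c + w].astype(np.float32)      # crop the target
    target = np.clip(target[:, ::-1] * gain, 0, 255)          # transform: flip + brightness
    out = background.copy()
    pr, pc = top_left
    out[pr:pr + h, pc:pc + w] = target.astype(out.dtype)      # paste at a new position
    return out, (pr, pc, h, w)


frame = np.zeros((32, 32), dtype=np.uint8)
frame[8:16, 8:16] = 100                       # a bright square stands in for the target
background = np.full((32, 32), 10, dtype=np.uint8)
image, label = crop_transform_paste(frame, (8, 8, 8, 8), background, (20, 4))
```

The pasted location is known by construction, so the returned box serves as a free annotation for the synthesized image.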

ScoreMix: A Scalable Augmentation Strategy for Training GANs With Limited Data.

Jie Cao, Mandi Luo, Junchi Yu, Ming-Hsuan Yang, Ran He

IEEE Trans Pattern Anal Mach Intell

July 2023

Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available. To facilitate GAN training, current methods propose to use data-specific augmentation techniques. Despite their effectiveness, it is difficult for these methods to scale to practical applications.

Large-Scale Unsupervised Semantic Segmentation.

Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han

IEEE Trans Pattern Anal Mach Intell

June 2023

Empowered by large datasets, e.g., ImageNet and MS COCO, unsupervised learning on large-scale data has enabled significant advances for classification tasks.

Learning Enriched Features for Fast Image Restoration and Enhancement.

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

February 2023

Given a degraded input image, image restoration aims to recover the missing high-quality image content. Numerous applications demand effective image restoration, e.g.

Shaping Deep Feature Space Towards Gaussian Mixture for Visual Classification.

Weitao Wan, Cheng Yu, Jiansheng Chen, Tong Wu, Yuanyi Zhong, Ming-Hsuan Yang

IEEE Trans Pattern Anal Mach Intell

February 2023

The softmax cross-entropy loss function has been widely used to train deep models for various tasks. In this work, we propose a Gaussian mixture (GM) loss function for deep neural networks for visual classification. Unlike the softmax cross-entropy loss, our method explicitly shapes the deep feature space towards a Gaussian Mixture distribution.
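
One way to picture a loss that shapes features towards a Gaussian mixture is sketched below. The abstract does not give the formula, so this is a simplified illustration under stated assumptions: one Gaussian per class with identity covariance and equal priors, class posteriors computed from squared distances to the class means, and a likelihood term weighted by a hypothetical `reg_weight`.

```python
import numpy as np


def gm_loss(features, labels, means, reg_weight=0.1):
    """Simplified Gaussian-mixture-style loss (identity covariance, equal priors)."""
    # Squared distance from every feature to every class mean: shape (n, classes).
    sq_dist = np.sum((features[:, None, :] - means[None, :, :]) ** 2, axis=2)
    logits = -0.5 * sq_dist
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(labels))
    classification = -log_probs[idx, labels].mean()             # posterior cross-entropy
    likelihood = 0.5 * sq_dist[idx, labels].mean()              # pull features to their mean
    return float(classification + reg_weight * likelihood)


means = np.array([[0.0, 0.0], [4.0, 0.0]])
labels = np.array([0])
near_own_mean = gm_loss(np.array([[0.0, 0.0]]), labels, means)
near_other_mean = gm_loss(np.array([[4.0, 0.0]]), labels, means)
```

In this sketch a feature sitting on its own class mean incurs almost no loss, while one sitting on another class's mean is penalized by both terms, which is the sense in which the feature space is pushed towards class-wise Gaussian clusters.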
