Dense Pixel-Level Interpretation of Dynamic Scenes With Video Panoptic Segmentation.

Dahun Kim , Sanghyun Woo , Joon-Young Lee , In So Kweon

IEEE Trans Image Process

Published: August 2022

A holistic understanding of dynamic scenes is of fundamental importance in real-world computer vision problems such as autonomous driving, augmented reality, and spatio-temporal reasoning. In this paper, we propose a new computer vision benchmark: Video Panoptic Segmentation (VPS). To study this problem, we present two datasets, Cityscapes-VPS and VIPER, together with a new evaluation metric, video panoptic quality (VPQ). We also propose VPSNet++, an advanced video panoptic segmentation network that simultaneously performs classification, detection, segmentation, and tracking of all identities in videos. Specifically, VPSNet++ builds upon a top-down panoptic segmentation network by adding a pixel-level feature fusion head and an object-level association head. The former temporally augments the pixel features, while the latter performs object tracking. Furthermore, we propose panoptic boundary learning as an auxiliary task, and instance discrimination learning, which learns spatio-temporally clustered pixel embeddings for individual thing or stuff regions, i.e., exactly the objective of the video panoptic segmentation problem. Our VPSNet++ significantly outperforms the default VPSNet, i.e., the FuseTrack baseline, and achieves state-of-the-art results on both the Cityscapes-VPS and VIPER datasets. The datasets, metric, and models are publicly available at https://github.com/mcahny/vps.
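The abstract names the VPQ metric without defining it. The following is a minimal sketch, assuming VPQ follows the standard panoptic quality form (sum of IoUs over matched segments, divided by TP + 0.5·FP + 0.5·FN) applied to spatio-temporal "tubes", that is, the pixels of one identity collected across the frames of a clip. The function name `tube_pq`, the tube representation, and the 0.5 IoU threshold are assumptions made for illustration and are not taken from the paper or its repository; any averaging over temporal window sizes is also left out.

```python
from collections import defaultdict


def tube_pq(gt_tubes, pred_tubes, iou_threshold=0.5):
    """PQ-style score computed over tubes instead of single-frame segments.

    Both arguments map a tube id to (class_id, pixels), where pixels is a
    set of (frame, y, x) tuples. This layout is hypothetical, chosen only
    to keep the example self-contained.
    """
    stats = defaultdict(lambda: {"iou": 0.0, "tp": 0, "fp": 0, "fn": 0})
    matched_pred = set()

    for _, (g_cls, g_pix) in gt_tubes.items():
        hit = False
        for p_id, (p_cls, p_pix) in pred_tubes.items():
            if p_cls != g_cls or p_id in matched_pred:
                continue
            union = len(g_pix | p_pix)
            iou = len(g_pix & p_pix) / union if union else 0.0
            # With non-overlapping tubes, IoU > 0.5 can hold for at most
            # one prediction per ground-truth tube, so greedy matching
            # is sufficient here.
            if iou > iou_threshold:
                stats[g_cls]["iou"] += iou
                stats[g_cls]["tp"] += 1
                matched_pred.add(p_id)
                hit = True
                break
        if not hit:
            stats[g_cls]["fn"] += 1  # ground-truth tube with no match

    for p_id, (p_cls, _) in pred_tubes.items():
        if p_id not in matched_pred:
            stats[p_cls]["fp"] += 1  # predicted tube with no match

    # Every class in stats has at least one TP, FP, or FN, so denom > 0.
    per_class = [
        s["iou"] / (s["tp"] + 0.5 * s["fp"] + 0.5 * s["fn"])
        for s in stats.values()
    ]
    return sum(per_class) / len(per_class) if per_class else 0.0
```

Because a tube spans several frames, a prediction that segments an object well but switches its identity between frames overlaps the ground-truth tube less, so under this assumed form the score penalizes tracking errors as well as segmentation errors.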

DOI: http://dx.doi.org/10.1109/TIP.2022.3183440

Top Keywords: video panoptic, panoptic segmentation, dynamic scenes, computer vision, cityscapes-vps viper, segmentation network, panoptic, segmentation, video, dense pixel-level

Similar Publications

Universal Image Segmentation with Efficiency.

IEEE Trans Pattern Anal Mach Intell

June 2025

Jie Hu , Liujuan Cao , Xiaofeng Jin , Shengchuan Zhang , Rongrong Ji

In this paper, we present UISE, a unified image segmentation framework that achieves efficient performance across various segmentation tasks, eliminating the need for multiple specialized pipelines. UISE employs dynamic convolutions between universal segmentation kernels and image feature maps, enabling a single pipeline for different tasks such as panoptic, instance, semantic, and video instance segmentation. To address computational requirements, we introduce a feature pyramid aggregator for image feature extraction and a separable dynamic decoder for generating segmentation kernels.

DVIS++: Improved Decoupled Framework for Universal Video Segmentation.

IEEE Trans Pattern Anal Mach Intell

July 2025

Tao Zhang , Xingye Tian , Yikang Zhou , Shunping Ji , Xuebo Wang

We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos.

WinDB: HMD-free and Distortion-free Panoptic Video Fixation Learning.

IEEE Trans Pattern Anal Mach Intell

December 2024

Guotao Wang , Chenglizhao Chen , Aimin Hao , Hong Qin , Deng-Ping Fan

To date, the widely adopted way to perform fixation collection in panoptic video is based on a head-mounted display (HMD), where users' fixations are collected while wearing an HMD to explore the given panoptic scene freely. However, this widely used data collection method is insufficient for training deep models to accurately predict which regions in a given panoptic are most important when it contains intermittent salient events. The main reason is that there always exist "blind zooms" when using an HMD to collect fixations, since the users cannot keep spinning their heads to explore the entire panoptic scene all the time.

Collaborative Joint Perception and Prediction for Autonomous Driving.

Sensors (Basel)

September 2024

Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai 200240, China.

Shunli Ren , Siheng Chen , Wenjun Zhang

Collaboration among road agents, such as connected autonomous vehicles and roadside units, enhances driving performance by enabling the exchange of valuable information. However, existing collaboration methods predominantly focus on perception tasks and rely on single-frame static information sharing, which limits the effective exchange of temporal data and hinders broader applications of collaboration. To address this challenge, we propose CoPnP, a novel collaborative joint perception and prediction system, whose core innovation is to realize multi-frame spatial-temporal information sharing.

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future.

IEEE Trans Pattern Anal Mach Intell

December 2024

Chaoyang Zhu , Long Chen

As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e.