98%
921
2 minutes
20
A holistic understanding of dynamic scenes is of fundamental importance in real-world computer vision problems such as autonomous driving, augmented reality and spatio-temporal reasoning. In this paper, we propose a new computer vision benchmark: Video Panoptic Segmentation (VPS). To study this important problem, we present two datasets, Cityscapes-VPS and VIPER together with a new evaluation metric, video panoptic quality (VPQ). We also propose VPSNet++, an advanced video panoptic segmentation network, which simultaneously performs classification, detection, segmentation, and tracking of all identities in videos. Specifically, VPSNet++ builds upon a top-down panoptic segmentation network by adding pixel-level feature fusion head and object-level association head. The former temporally augments the pixel features while the latter performs object tracking. Furthermore, we propose panoptic boundary learning as an auxiliary task, and instance discrimination learning which learns spatio-temporally clustered pixel embedding for individual thing or stuff regions, i.e., exactly the objective of the video panoptic segmentation problem. Our VPSNet++ significantly outperforms the default VPSNet, i.e., FuseTrack baseline, and achieves state-of-the-art results on both Cityscapes-VPS and VIPER datasets. The datasets, metric, and models are publicly available at https://github.com/mcahny/vps.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1109/TIP.2022.3183440 | DOI Listing |
In this paper, we present UISE, a unified image segmentation framework that achieves efficient performance across various segmentation tasks, eliminating the need for multiple specialized pipelines. UISE employs dynamic convolutions between universal segmentation kernels and image feature maps, enabling a single pipeline for different tasks such as panoptic, instance, semantic, and video instance segmentation. To address computational requirements, we introduce a feature pyramid aggregator for image feature extraction and a separable dynamic decoder for generating segmentation kernels.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
July 2025
We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
December 2024
To date, the widely adopted way to perform fixation collection in panoptic video is based on a head-mounted display (HMD), where users' fixations are collected while wearing a HMD to explore the given panoptic scene freely. However, this widely-used data collection method is insufficient for training deep models to accurately predict which regions in a given panoptic are most important when it contains intermittent salient events. The main reason is that there always exist "blind zooms" when using HMD to collect fixations since the users cannot keep spinning their heads to explore the entire panoptic scene all the time.
View Article and Find Full Text PDFSensors (Basel)
September 2024
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai 200240, China.
Collaboration among road agents, such as connected autonomous vehicles and roadside units, enhances driving performance by enabling the exchange of valuable information. However, existing collaboration methods predominantly focus on perception tasks and rely on single-frame static information sharing, which limits the effective exchange of temporal data and hinders broader applications of collaboration. To address this challenge, we propose CoPnP, a novel collaborative joint perception and prediction system, whose core innovation is to realize multi-frame spatial-temporal information sharing.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
December 2024
As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e.
View Article and Find Full Text PDF