MsSVT++: Mixed-Scale Sparse Voxel Transformer With Center Voting for 3D Object Detection.

Jianan Li, Shaocong Dong, Lihe Ding, Tingfa Xu

IEEE Trans Pattern Anal Mach Intell

Published: May 2024

Accurate 3D object detection in large-scale outdoor scenes, characterized by considerable variations in object scales, necessitates features rich in both long-range and fine-grained information. While recent detectors have utilized window-based transformers to model long-range dependencies, they tend to overlook fine-grained details. To bridge this gap, we propose MsSVT++, an innovative Mixed-scale Sparse Voxel Transformer that simultaneously captures both types of information through a divide-and-conquer approach. This approach involves explicitly dividing attention heads into multiple groups, each responsible for attending to information within a specific range. The outputs of these groups are subsequently merged to obtain final mixed-scale features. To mitigate the computational complexity associated with applying a window-based transformer in 3D voxel space, we introduce a novel Chessboard Sampling strategy and implement voxel sampling and gathering operations sparsely using a hash map. Moreover, an important challenge stems from the observation that non-empty voxels are primarily located on the surface of objects, which impedes the accurate estimation of bounding boxes. To overcome this challenge, we introduce a Center Voting module that integrates newly voted voxels enriched with mixed-scale contextual information towards the centers of the objects, thereby improving precise object localization. Extensive experiments demonstrate that our single-stage detector, built upon the foundation of MsSVT++, consistently delivers exceptional performance across diverse datasets.
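The abstract names three mechanisms: head groups that attend over different ranges, sampled queries, and hash-map-based sparse gathering. The sketch below is one illustrative way to combine them in plain Python, not the authors' implementation. All function names are hypothetical, the parity rule in `chessboard_queries` is a simplified stand-in for the paper's Chessboard Sampling, and the cubic window radii are assumed values chosen only for the example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def build_voxel_hash(coords: List[Coord]) -> Dict[Coord, int]:
    """Map each non-empty voxel coordinate to its row index in a feature table."""
    return {c: i for i, c in enumerate(coords)}


def gather_window(table: Dict[Coord, int], center: Coord, radius: int) -> List[int]:
    """Return indices of non-empty voxels inside a cubic window around `center`.

    Only occupied cells are stored in the hash map, so empty space costs
    a failed lookup instead of a dense memory read.
    """
    cx, cy, cz = center
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                idx = table.get((cx + dx, cy + dy, cz + dz))
                if idx is not None:
                    found.append(idx)
    return found


def chessboard_queries(coords: List[Coord], phase: int) -> List[Coord]:
    """Assumed simplification: keep voxels whose (x + y) parity equals `phase`,
    so each pass uses only a subset of voxels as queries."""
    return [c for c in coords if (c[0] + c[1]) % 2 == phase]


def mixed_scale_keys(table: Dict[Coord, int], query: Coord,
                     radii: Tuple[int, ...]) -> List[List[int]]:
    """Gather one key set per head group, each group with its own window radius."""
    return [gather_window(table, query, r) for r in radii]


# Toy scene with five non-empty voxels (hypothetical data).
coords = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (0, 2, 0), (5, 5, 0)]
table = build_voxel_hash(coords)
queries = chessboard_queries(coords, phase=0)
local_keys, long_range_keys = mixed_scale_keys(table, (0, 0, 0), radii=(1, 3))
print(queries)
print(sorted(local_keys), sorted(long_range_keys))
```

In a full model, each key set would feed its own group of attention heads and the group outputs would then be merged, as the abstract describes. The sketch stops at key gathering.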

DOI: http://dx.doi.org/10.1109/TPAMI.2023.3345880

Similar Publications

Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution.

IEEE Trans Image Process

January 2025

Zhangkai Ni, Yang Zhang, Wenhan Yang, Hanli Wang, Shiqi Wang

Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding paradigm show promise in improving performance while effectively maintaining model compactness through sophisticated module design.

Flare Removal Model Based on Sparse-UFormer Networks.

Entropy (Basel)

July 2024

School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 100044, China.

Siqi Wu, Fei Liu, Yu Bai, Houzeng Han, Jian Wang

When a camera lens is directly faced with a strong light source, image flare commonly occurs, significantly reducing the clarity and texture of the photo and interfering with image processing tasks that rely on visual sensors, such as image segmentation and feature extraction. A novel flare removal network, the Sparse-UFormer neural network, has been developed. The network integrates two core components onto the UFormer architecture: the mixed-scale feed-forward network (MSFN) and top-k sparse attention (TKSA), creating the sparse-transformer module.

DLSIA: Deep Learning for Scientific Image Analysis.

J Appl Crystallogr

April 2024

Center for Advanced Mathematics for Energy Research Applications, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.

Eric J Roberts, Tanny Chavez, Alexander Hexemer, Petrus H Zwart

DLSIA (Deep Learning for Scientific Image Analysis) is a Python-based machine learning library that empowers scientists and researchers across diverse scientific domains with a range of customizable convolutional neural network (CNN) architectures for a wide variety of tasks in image analysis to be used in downstream data processing. DLSIA features easy-to-use architectures, such as autoencoders, tunable U-Nets and parameter-lean mixed-scale dense networks (MSDNets). Additionally, this article introduces sparse mixed-scale networks (SMSNets), generated using random graphs, sparse connections and dilated convolutions connecting different length scales.

Towards understanding residual and dilated dense neural networks via convolutional sparse coding.

Natl Sci Rev

March 2021

National Center for Mathematics and Interdisciplinary Sciences, Center for Excellence in Mathematical Science, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.

Zhiyang Zhang, Shihua Zhang

Convolutional neural network (CNN) and its variants have led to many state-of-the-art results in various fields. However, a clear theoretical understanding of such networks is still lacking. Recently, a multilayer convolutional sparse coding (ML-CSC) model has been proposed and proved to equal such simply stacked networks (plain networks).
