Efficient High-Order Spatial Interactions for Visual Perception.

Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Recent progress in vision Transformers has shown great success on various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be implemented efficiently with a convolution-based framework. We present the Recursive Gated Convolution (gnConv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various convolution variants and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the proposed operation, we construct a new family of generic vision backbones for various visual modalities and tasks, including HorNet and HorFPN for image recognition, Hor3D for point cloud analysis, and HorCLIP for vision-language modeling. For image recognition, we propose HorNet as a stronger visual encoder and conduct extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations, and it also shows favorable scalability to more training data and larger model sizes. Apart from image encoders, we also show that gnConv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. For point cloud analysis, we design Hor3D, demonstrating the efficacy of high-order interactions for unstructured point cloud data through experiments on the challenging S3DIS and ScanNet V2 3D semantic segmentation benchmarks. In vision-language modeling, our proposed HorCLIP surpasses mainstream Vision Transformer and ConvNeXt architectures with shorter training schedules on ImageNet zero-shot classification and shows remarkably higher performance on vision-language dense representation tasks on the COCO Panoptic dataset. Our results demonstrate that gnConv with high-order spatial interactions can be a new basic operation for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
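The abstract describes gnConv only at a high level: gated convolutions composed recursively so that each step raises the interaction order, with a depth-wise convolution supplying long-range spatial mixing instead of dot-product attention. The PyTorch sketch below illustrates that idea; the class name, channel-splitting scheme, kernel size, and default order are assumptions made for illustration, and the authoritative implementation lives in the linked HorNet repository.

import torch
import torch.nn as nn


class RecursiveGatedConv(nn.Module):
    """Minimal gnConv-style block: high-order spatial interactions via
    gated convolutions and a recursive design (illustrative sketch only)."""

    def __init__(self, dim: int, order: int = 3):
        super().__init__()
        self.order = order
        # Channel widths grow with recursion depth (smallest branch first).
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        # Depth-wise convolution provides the (long-range) spatial mixing.
        total = sum(self.dims)
        self.dwconv = nn.Conv2d(total, total, kernel_size=7, padding=3, groups=total)
        # 1x1 convolutions lift features between recursion steps.
        self.pws = nn.ModuleList(
            [nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
             for i in range(order - 1)]
        )
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.proj_in(x)
        # Split into the first gating branch and the stacked context branches.
        gate, context = torch.split(fused, (self.dims[0], sum(self.dims)), dim=1)
        context = torch.split(self.dwconv(context), self.dims, dim=1)
        # Each recursion step gates the running features with a spatially
        # mixed branch, raising the interaction order by one per step.
        out = gate * context[0]
        for i in range(self.order - 1):
            out = self.pws[i](out) * context[i + 1]
        return self.proj_out(out)


if __name__ == "__main__":
    block = RecursiveGatedConv(dim=64, order=3)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])

Because the block keeps the input and output channel counts equal, it can be dropped into an existing convolutional or Transformer-style stage as a spatial-mixing layer, which is the plug-and-play use the abstract describes.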

Source: http://dx.doi.org/10.1109/TPAMI.2025.3603181

Publication Analysis

Top Keywords

high-order spatial: 16
spatial interactions: 16
vision transformers: 16
point cloud: 12
image recognition: 8
cloud analysis: 8
vision-language modeling: 8
semantic segmentation: 8
interactions: 6
vision: 6

Similar Publications