Efficient High-Order Spatial Interactions for Visual Perception.

Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Recent progress in vision Transformers has shown great success on various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be implemented efficiently with a convolution-based framework. We present the Recursive Gated Convolution (gnConv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various convolution variants and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the proposed operation, we construct a new family of generic vision backbones for various visual modalities and tasks, including HorNet and HorFPN for image recognition, Hor3D for point cloud analysis, and HorCLIP for vision-language modeling. For image recognition, we propose HorNet as a stronger visual encoder and conduct extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations, and it also shows favorable scalability to more training data and larger model sizes. Apart from image encoders, we also show that gnConv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. For point cloud analysis, we design Hor3D, demonstrating the efficacy of high-order interactions for unstructured point cloud data through experiments on the challenging S3DIS and ScanNet V2 3D semantic segmentation benchmarks. In vision-language modeling, our proposed HorCLIP surpasses mainstream Vision Transformer and ConvNeXt architectures with shorter training schedules on ImageNet zero-shot classification and shows remarkably higher performance on vision-language dense representation tasks on the COCO Panoptic dataset. Our results demonstrate that gnConv with high-order spatial interactions can be a new basic operation for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
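The abstract describes gnConv only at a high level: gated convolutions composed recursively so that each step raises the interaction order, with a depth-wise convolution supplying long-range spatial mixing instead of dot-product attention. The PyTorch sketch below illustrates that idea; the class name, channel-splitting scheme, kernel size, and default order are assumptions made for illustration, and the authoritative implementation lives in the linked HorNet repository.

import torch
import torch.nn as nn


class RecursiveGatedConv(nn.Module):
    """Minimal gnConv-style block: high-order spatial interactions via
    gated convolutions and a recursive design (illustrative sketch only)."""

    def __init__(self, dim: int, order: int = 3):
        super().__init__()
        self.order = order
        # Channel widths grow with recursion depth (smallest branch first).
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        # Depth-wise convolution provides the (long-range) spatial mixing.
        total = sum(self.dims)
        self.dwconv = nn.Conv2d(total, total, kernel_size=7, padding=3, groups=total)
        # 1x1 convolutions lift features between recursion steps.
        self.pws = nn.ModuleList(
            [nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
             for i in range(order - 1)]
        )
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.proj_in(x)
        # Split into the first gating branch and the stacked context branches.
        gate, context = torch.split(fused, (self.dims[0], sum(self.dims)), dim=1)
        context = torch.split(self.dwconv(context), self.dims, dim=1)
        # Each recursion step gates the running features with a spatially
        # mixed branch, raising the interaction order by one per step.
        out = gate * context[0]
        for i in range(self.order - 1):
            out = self.pws[i](out) * context[i + 1]
        return self.proj_out(out)


if __name__ == "__main__":
    block = RecursiveGatedConv(dim=64, order=3)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])

Because the block keeps the input and output channel counts equal, it can be dropped into an existing convolutional or Transformer-style stage as a spatial-mixing layer, which is the plug-and-play use the abstract describes.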

Source: http://dx.doi.org/10.1109/TPAMI.2025.3603181

Publication Analysis

Top Keywords

high-order spatial: 16
spatial interactions: 16
vision transformers: 16
point cloud: 12
image recognition: 8
cloud analysis: 8
vision-language modeling: 8
semantic segmentation: 8
interactions: 6
vision: 6

Similar Publications