Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (g^nConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. g^nConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the proposed operation, we construct a new family of generic vision backbones for various visual modalities and tasks, including HorNet and HorFPN for image recognition, Hor3D for point cloud analysis, and HorCLIP for vision-language modeling. For image recognition, we propose HorNet as a stronger visual encoder, where we conduct extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from image encoders, we also show g^nConv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. For point cloud analysis, we design Hor3D, demonstrating the efficacy of high-order interactions for unstructured point cloud data through experiments on challenging 3D semantic segmentation tasks in S3DIS and ScanNet V2. In vision-language modeling, our proposed HorCLIP surpasses mainstream Vision Transformer and ConvNeXt architectures with shorter training schedules on ImageNet zero-shot classification and shows remarkably higher performance on vision-language dense representation tasks on COCO Panoptic datasets. Our results demonstrate that g^nConv with high-order spatial interactions can be a new basic operation for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
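The abstract describes g^nConv only at a high level: channel groups of increasing width are gated (element-wise multiplied) order by order after a shared depth-wise convolution. The snippet below is a minimal PyTorch sketch in that spirit, not the authors' implementation (which is in the linked repository); the class name `RecursiveGatedConv`, the channel-halving split, the default order and kernel size, and the placement of the point-wise projections are assumptions made for illustration, and the paper's normalization and scaling details are omitted.

```python
# Minimal, illustrative sketch of a recursive gated convolution in the spirit of
# g^nConv. Hyperparameters and the channel-split scheme are assumptions; refer to
# https://github.com/raoyongming/HorNet for the official implementation.
import torch
import torch.nn as nn


class RecursiveGatedConv(nn.Module):
    def __init__(self, dim: int, order: int = 3, kernel_size: int = 7):
        super().__init__()
        # Channel widths grow with the interaction order (assumed halving scheme).
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        # Shared depth-wise convolution providing the spatial context for all orders.
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), kernel_size,
                                padding=kernel_size // 2, groups=sum(self.dims))
        # Point-wise projections that lift the gated features to the next order's width.
        self.pws = nn.ModuleList(
            nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
            for i in range(order - 1)
        )
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, feats = self.proj_in(x).split([self.dims[0], sum(self.dims)], dim=1)
        feats = list(self.dwconv(feats).split(self.dims, dim=1))
        y = gate * feats[0]                      # first-order gated interaction
        for pw, f in zip(self.pws, feats[1:]):   # recursively raise the interaction order
            y = pw(y) * f
        return self.proj_out(y)


if __name__ == "__main__":
    block = RecursiveGatedConv(dim=64, order=3)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```

Because the block keeps the input and output channel counts equal, it can be dropped into a Transformer-style block in place of the self-attention token mixer, which is how the abstract frames its plug-and-play use.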
| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1109/TPAMI.2025.3603181 | DOI Listing |