Learnable Feature Augmentation Framework for Temporal Action Localization.

Yepeng Tang, Weining Wang, Chunjie Zhang, Jing Liu, Yao Zhao

IEEE Trans Image Process

Published: June 2024


Temporal action localization (TAL) has drawn much attention in recent years; however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To address this issue, we propose to improve the utilization of existing data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of the video features to obtain different views of them. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of the original video features, 2) it generates masked features that retain indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we feed the masked features and the original features separately into shared action detectors, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning from more and better views of videos. In the testing stage, the MFAM can be removed, so it adds no extra computational cost. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves state-of-the-art performance.
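The random-masking step and the shared-detector training described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation and does not model the learnable MFAM. It assumes that features form a (T, C) array, that "semantic contents" can be approximated by feature channels, and that the two views are combined by summing their losses. The names `random_feature_mask`, `two_view_loss`, `inference`, and the `mask_ratio` parameter are hypothetical.

```python
import numpy as np


def random_feature_mask(features, mask_ratio=0.3, rng=None):
    """Zero out a random subset of feature channels (assumed masking scheme).

    features: array of shape (T, C), T temporal steps and C channels.
    Returns the masked copy and a boolean keep-mask of shape (C,).
    """
    rng = np.random.default_rng() if rng is None else rng
    num_channels = features.shape[1]
    num_masked = int(round(mask_ratio * num_channels))
    masked_idx = rng.choice(num_channels, size=num_masked, replace=False)
    keep = np.ones(num_channels, dtype=bool)
    keep[masked_idx] = False
    return features * keep[None, :], keep


def two_view_loss(detector, loss_fn, features, targets, mask_ratio=0.3, rng=None):
    """Sum the loss over the original view and one masked view.

    The same `detector` callable processes both views, which mirrors the
    idea of a shared action detector; the plain sum is an assumption.
    """
    masked, _ = random_feature_mask(features, mask_ratio, rng)
    return loss_fn(detector(features), targets) + loss_fn(detector(masked), targets)


def inference(detector, features):
    """At test time only the original features are used; no masking is applied."""
    return detector(features)
```

In this sketch `detector` and `loss_fn` are arbitrary callables supplied by the caller, so any existing TAL head could be substituted. Because masking appears only in `two_view_loss`, the inference path is unchanged, which matches the abstract's statement that the augmentation module adds no cost at test time.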

DOI: http://dx.doi.org/10.1109/TIP.2024.3413599

Top Keywords: feature augmentation, video features, learnable feature, augmentation framework, temporal action, action localization, better views, views videos, masked features, action detectors
