Regularizing transformers with deep probabilistic layers.

Neural Netw

Swiss Data Science Institute (ETHZ/EPFL), Universitatstrasse 25, 8006 Zurich, Switzerland.

Published: April 2023



Citations: 20

Article Abstract

Language models (LMs) have grown non-stop over the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been deeply studied in these architectures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizing layer. We study how its benefits depend on the depth at which it is placed and demonstrate its effectiveness in several scenarios. Experimental results demonstrate that including deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R yields more versatile models that generalize better, achieve improved scores on tasks such as SST-2 and TREC, and can even impute missing/noisy words with richer text.
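The idea described in the abstract can be illustrated with a rough sketch: a variational bottleneck with mixture-component means that encodes a transformer layer's hidden states, reconstructs them, and contributes a KL-style penalty as a regularizer. This is a minimal illustration under assumptions of ours, not the paper's actual GMVAE (all function names, shapes, and the simplified nearest-component KL surrogate are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def gmvae_bottleneck(h, w_enc, w_dec, comp_means):
    """Sketch of a GMVAE-style regularizing layer over hidden states.

    h: (batch, seq, hidden) hidden states from a transformer layer.
    Returns a reconstruction of h and a scalar regularization penalty.
    """
    mu, logvar = np.split(h @ w_enc, 2, axis=-1)      # encode q(z|h)
    z = mu + rng.standard_normal(mu.shape) * np.exp(0.5 * logvar)  # reparameterize
    recon = z @ w_dec                                  # decode back to hidden size
    # Crude KL surrogate: squared distance to the nearest mixture component,
    # plus the usual unit-variance penalty (assumption, not the paper's loss).
    dist = ((mu[..., None, :] - comp_means) ** 2).sum(axis=-1)
    kl = 0.5 * (dist.min(axis=-1) + (np.exp(logvar) - 1.0 - logvar).sum(axis=-1))
    return recon, kl.mean()

hidden, latent, comps = 16, 4, 3
h = rng.standard_normal((2, 5, hidden))                # fake hidden states
w_enc = rng.standard_normal((hidden, 2 * latent)) * 0.1
w_dec = rng.standard_normal((latent, hidden)) * 0.1
comp_means = rng.standard_normal((comps, latent))
recon, kl = gmvae_bottleneck(h, w_enc, w_dec, comp_means)
print(recon.shape, kl >= 0.0)
```

In training, the reconstruction would replace (or be mixed with) the layer's hidden states, and the penalty would be added to the task loss, so deeper placements regularize more abstract representations.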


Source
http://dx.doi.org/10.1016/j.neunet.2023.01.032

Publication Analysis

Top Keywords

regularizing transformers (4); transformers deep (4); deep probabilistic (4); probabilistic layers (4); layers language (4); language models (4); models grown (4); grown non-stop (4); non-stop decade (4); decade sequence-to-sequence (4)

Similar Publications

Human beings have the ability to continuously analyze a video and immediately extract the motion components. We want to adopt this paradigm to provide a coherent and stable motion segmentation over the video sequence. In this perspective, we propose a novel long-term spatio-temporal model operating in a totally unsupervised way.


Transformer-based ECG classification for early detection of cardiac arrhythmias.

Front Med (Lausanne)

August 2025

Universidad Internacional Iberoamericana, Arecibo, PR, United States.

Electrocardiogram (ECG) classification plays a critical role in the early detection and monitoring of cardiovascular diseases. This study presents a Transformer-based deep learning framework for automated ECG classification, integrating advanced preprocessing, feature selection, and dimensionality reduction techniques to improve model performance. The pipeline begins with signal preprocessing, where raw ECG data are denoised, normalized, and relabeled for compatibility with attention-based architectures.
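The denoise-then-normalize preprocessing step mentioned above can be sketched minimally as follows. This is an illustrative assumption of ours (a moving-average filter plus z-score normalization), not the study's actual pipeline:

```python
import numpy as np

def preprocess_ecg(signal, window=5):
    """Toy preprocessing sketch: smooth high-frequency noise with a
    moving-average filter, then z-score normalize the signal."""
    kernel = np.ones(window) / window
    denoised = np.convolve(signal, kernel, mode="same")   # simple denoising
    mu, sigma = denoised.mean(), denoised.std()
    return (denoised - mu) / (sigma + 1e-8)               # zero mean, unit variance

rng = np.random.default_rng(1)
# Synthetic "ECG": a sinusoid with additive Gaussian noise.
raw = np.sin(np.linspace(0, 8 * np.pi, 500)) + 0.3 * rng.standard_normal(500)
clean = preprocess_ecg(raw)
print(clean.shape)
```

A normalized, fixed-scale signal like this is what attention-based models typically expect before windowing into input tokens.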


Multivariate time series anomaly detection has shown potential in various fields, such as finance, aerospace, and security. The fuzzy definition of data anomalies, the complexity of data patterns, and the scarcity of abnormal data samples pose significant challenges to anomaly detection. Researchers have extensively employed autoencoders (AEs) and generative adversarial networks (GANs) in studying time series anomaly detection methods.


Arterial Spin Labeling (ASL) perfusion MRI is the only non-invasive technique for quantifying regional cerebral blood flow (CBF) visualization, which is an important physiological variable. ASL MRI has a relatively low signal-to-noise-ratio (SNR), making it challenging to achieve high quality CBF images using limited data. Promising ASL CBF denoising results have been shown in recent convolutional neural network (CNN)-based methods.


Transformer-based approaches have recently made significant advancements in 3D human pose estimation from 2D inputs. Existing methods typically either consider the entire 2D skeleton for global features extraction or break it into independent parts for local features learning. However, capturing the spatial dependencies of the entire 2D skeleton does not effectively facilitate learning local spatial features, while partitioning the skeleton into independent segments disrupts the relevance of individual joints to the whole.
