Protein language models (pLMs) have been widely adopted for various protein- and peptide-related downstream tasks and have demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Clusters (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model, PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters), were pretrained from scratch on four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on eight of the nine datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and the value-added utilization of food processing by-products. The datasets, source code, pretrained models, and usage tutorials for PepBERT are available at https://github.com/dzjxzyd/PepBERT.
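For context, the sketch below shows one common way to extract fixed-length peptide embeddings with the ESM-2 benchmark model referenced in the abstract (available on the Hugging Face Hub as facebook/esm2_t6_8M_UR50D). This is not the paper's own code; PepBERT's checkpoints and loading interface are documented in the linked GitHub repository, and the peptide sequences used here are hypothetical examples.

```python
# Minimal sketch (assumption: transformers + torch installed): mean-pooled
# embeddings from the ~7.5M-parameter ESM-2 benchmark model. PepBERT itself
# is loaded via the code in https://github.com/dzjxzyd/PepBERT, not shown here.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "facebook/esm2_t6_8M_UR50D"  # ESM-2 baseline used for comparison
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

peptides = ["GPRP", "YPFPGPI"]  # hypothetical short peptide sequences

with torch.no_grad():
    batch = tokenizer(peptides, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state            # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1)          # mask out padding
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

print(embeddings.shape)  # one fixed-length representation per peptide
```

Such pooled representations are what downstream peptide property predictors typically consume, which is the use case PepBERT targets with far fewer parameters.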
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12236832 | PMC |
| http://dx.doi.org/10.1101/2025.04.08.647838 | DOI Listing |
bioRxiv
July 2025
Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA.