Citations

20

Article Abstract

Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pretraining objectives, so they struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address both types of tasks simultaneously through an innovative pretraining framework. Our key technical contribution is an exploration of the compatibility of the two types of objectives and their potential for joint optimization, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal two findings. (1) xTrimoPGLM substantially outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced three-dimensional structural prediction model that surpasses existing language model-based tools. (2) xTrimoPGLM can not only generate de novo protein sequences following the principles of natural ones, but can also perform programmable generation after supervised fine-tuning on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science. Trained weights for the xTrimoPGLM model and downstream datasets are available at https://huggingface.co/biomap-research.


Source
http://dx.doi.org/10.1038/s41592-025-02636-z

Publication Analysis

Top Keywords

protein sequences: 12
protein: 9
protein language: 8
protein understanding: 8
xtrimopglm: 7
xtrimopglm unified: 4
unified 100-billion-parameter: 4
100-billion-parameter pretrained: 4
pretrained transformer: 4
transformer deciphering: 4

Similar Publications

sp. nov., a novel halotolerant, flexirubin-type pigment-producing bacterium of the family .

Int J Syst Evol Microbiol

September 2025

Second Institute of Oceanography, Key Laboratory of Marine Ecosystem Dynamics, Ministry of Natural Resources, Hangzhou 310018, PR China.

A Gram-staining-negative, non-motile, aerobic, rod-shaped bacterium, designated 14752, was isolated from a saline lake in Xinjiang Uygur Autonomous Region, China. The strain was subjected to a taxonomic study using a polyphasic approach. Strain 14752 was able to grow at 4-40 ℃ (optimum 28 ℃), pH 6.


SPACE: STRING proteins as complementary embeddings.

Bioinformatics

September 2025

Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark.

Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.

Results: We leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings.


Escherichia coli strain O55 contains two cryptic plasmids that depend on each other to replicate.

Arch Microbiol

September 2025

División de Ciencias Naturales y Exactas, Departamento de Biología, Universidad de Guanajuato, Zip Code 36050, Guanajuato, Mexico.

Plasmids are fundamental to molecular biology and biotechnology, playing a crucial role in bacterial evolution. Some plasmids are linked to complex cellular dynamics, including pathogenicity islands, antibiotic resistance, and gene mobilization. This study reports the isolation and sequencing of two cryptic plasmids with different electrophoretic mobilities from the Escherichia coli clinical isolate O55.


PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modeling and machine learning.

Acta Crystallogr F Struct Biol Commun

October 2025

Science and Technology Facilities Council, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom.

Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated data sets. Being able to easily access and utilize these is crucial to allow researchers to make optimal use of their research effort.


Machine learning-based analysis of the impact of 5' untranslated region on protein expression.

Nucleic Acids Res

September 2025

School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, No. 100 Waihuanxi Road, Guangzhou 510006, China.

The 5' untranslated region (5'UTR) plays a crucial regulatory role in messenger RNA (mRNA), with modified 5'UTRs extensively utilized in vaccine production, gene therapy, etc. Nevertheless, manually optimizing 5'UTRs may encounter difficulties in balancing the effects of various cis-elements. Consequently, multiple 5'UTR libraries have been created, and machine learning models have been employed to analyze and predict translation efficiency (TE) and protein expression, providing insights into critical regulatory features.
