Improving functional protein generation via foundation model-derived latent space likelihood optimization.

bioRxiv

Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.

Published: January 2025


Article Abstract

A variety of deep generative models have been adopted for functional protein generation. Compared with 3D protein design, sequence-based methods, which aim to generate amino acid sequences with desired functions, remain a major approach to functional protein generation because protein sequence data are abundant and of high quality and because such models are comparatively simple to train. Although these models are typically trained to match the protein sequences in the training data, exact matching of every amino acid is not always essential: certain amino acid changes (e.g., mismatches, insertions, and deletions) do not necessarily lead to functional changes. This suggests that maximizing the training data likelihood beyond the amino acid sequence space could yield better generative models. Pre-trained protein large language models (PLMs) such as ESM2 can encode protein sequences into a latent space, potentially serving as functional validators. We propose training functional protein sequence generative models by simultaneously optimizing the likelihood of the training data in both the amino acid sequence space and the latent space derived from a PLM. This training scheme can also be viewed as a knowledge distillation approach that dynamically re-weights samples during training. We applied our method to train GPT-like models (i.e., autoregressive transformers) for antimicrobial peptide (AMP) and malate dehydrogenase (MDH) generation tasks. Computational experiments confirmed that our method outperformed various deep generative models (e.g., a generative adversarial network, a variational autoencoder, and a GPT model trained without the proposed strategy) on these tasks, demonstrating the effectiveness of our multi-likelihood optimization strategy.
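
The abstract does not state the exact form of the latent-space likelihood, so the following is only a minimal sketch of the general idea of a two-term objective, not the authors' implementation. It assumes (1) a toy per-position categorical model in place of a GPT-like generator, (2) a fixed random embedding table with mean pooling as a hypothetical stand-in for a frozen PLM encoder such as ESM2, and (3) an isotropic Gaussian likelihood in the latent space, which reduces to a scaled squared distance. The names `peaked`, `embed_probs`, `latent_nll`, `combined_loss`, and the weight `lam` are all illustrative.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
LATENT_DIM = 8

# ASSUMPTION: a fixed random embedding table stands in for a frozen PLM
# encoder (e.g., ESM2). A real encoder is contextual; this one is not.
_rng = random.Random(0)
EMBED_TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
               for _ in AMINO_ACIDS]


def peaked(seq, confidence):
    """Toy model output: one distribution per position, with `confidence`
    on the residue of `seq` and the remainder spread uniformly."""
    rest = (1.0 - confidence) / (len(AMINO_ACIDS) - 1)
    return [[confidence if aa == s else rest for aa in AMINO_ACIDS]
            for s in seq]


def sequence_nll(probs, target):
    """Mean per-residue negative log-likelihood in amino acid space."""
    total = sum(math.log(p[AA_INDEX[aa]]) for p, aa in zip(probs, target))
    return -total / len(target)


def embed_probs(probs):
    """Mean-pooled expected embedding under per-position distributions."""
    pooled = [0.0] * LATENT_DIM
    for p in probs:
        for weight, row in zip(p, EMBED_TABLE):
            for d in range(LATENT_DIM):
                pooled[d] += weight * row[d]
    return [x / len(probs) for x in pooled]


def latent_nll(probs, target, sigma=1.0):
    """ASSUMPTION: Gaussian latent likelihood, up to an additive constant.
    Equals the squared latent distance divided by 2 * sigma^2."""
    z_model = embed_probs(probs)
    z_target = embed_probs(peaked(target, 1.0))  # one-hot target sequence
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_model, z_target))
    return sq_dist / (2.0 * sigma ** 2)


def combined_loss(probs, target, lam=0.5):
    """Sequence-space NLL plus a weighted latent-space NLL."""
    return sequence_nll(probs, target) + lam * latent_nll(probs, target)


# Usage: compare a prediction that matches the target with one that
# differs at the last residue.
target = "ACDK"
print(combined_loss(peaked("ACDK", 0.9), target))
print(combined_loss(peaked("ACDW", 0.9), target))
```

In this sketch, setting `lam=0.0` recovers ordinary sequence-space training, while a positive `lam` adds a penalty that depends on how far the prediction lies from the target in the stand-in latent space rather than on residue-by-residue identity.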

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11741333
DOI: http://dx.doi.org/10.1101/2025.01.07.631724
