Leveraging protein language models for accurate multiple sequence alignments.

Claire D McWhite, Isabel Armour-Garb, Mona Singh

Genome Res

Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA

Published: July 2023

Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher-accuracy alignments for structurally similar proteins with low amino acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
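The idea of "clustering and ordering amino acid contextual embeddings" can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified toy that assumes per-residue embeddings are already available, groups residues greedily by cosine similarity (at most one residue per sequence in each group), and then orders the groups with a topological sort so that every sequence reads left to right. The function names, the similarity threshold of 0.8, and the hand-built toy embeddings are all hypothetical stand-ins; in practice the vectors would come from a protein language model.

```python
from graphlib import TopologicalSorter

import numpy as np


def cluster_residues(embeddings, threshold=0.8):
    """Greedily group residues by cosine similarity to cluster centroids.

    embeddings: list of (length_i, dim) arrays, one per sequence.
    Returns a list of clusters, each a dict {sequence_index: position}.
    A cluster holds at most one residue from any given sequence.
    """
    clusters, centroids = [], []
    for s, emb in enumerate(embeddings):
        unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        for p, vec in enumerate(unit):
            best, best_sim = None, threshold
            for c, members in enumerate(clusters):
                if s in members:  # this column already has a residue from s
                    continue
                centroid = centroids[c] / np.linalg.norm(centroids[c])
                sim = float(vec @ centroid)
                if sim > best_sim:
                    best, best_sim = c, sim
            if best is None:  # nothing similar enough: start a new column
                clusters.append({s: p})
                centroids.append(vec.copy())
            else:
                clusters[best][s] = p
                centroids[best] += vec
    return clusters


def order_clusters(clusters, lengths):
    """Order clusters so consecutive residues of each sequence stay in order.

    Raises graphlib.CycleError if the clusters imply contradictory orders;
    this sketch makes no attempt to repair such cases.
    """
    where = {(s, p): c for c, m in enumerate(clusters) for s, p in m.items()}
    sorter = TopologicalSorter()
    for c in range(len(clusters)):
        sorter.add(c)
    for s, n in enumerate(lengths):
        for p in range(1, n):
            sorter.add(where[(s, p)], where[(s, p - 1)])
    return list(sorter.static_order())


def align(seqs, embeddings, threshold=0.8):
    """Build gapped rows: one column per cluster, '-' where a sequence is absent."""
    clusters = cluster_residues(embeddings, threshold)
    order = order_clusters(clusters, [len(s) for s in seqs])
    return [
        "".join(seq[clusters[c][s]] if s in clusters[c] else "-" for c in order)
        for s, seq in enumerate(seqs)
    ]


# Toy data (fabricated for illustration): four "true" sites with orthogonal
# vectors; each sequence covers a subset of sites, plus small noise.
rng = np.random.default_rng(0)
sites = np.eye(4, 8)
layout = [[0, 1, 2, 3], [0, 2, 3], [0, 1, 3]]
seqs = ["MKVL", "MIL", "MRL"]
embeddings = [sites[idx] + 0.02 * rng.normal(size=(len(idx), 8)) for idx in layout]

alignment = align(seqs, embeddings)
print("\n".join(alignment))
```

Note that the sketch uses no guide tree, pairwise alignment, gap penalty, or substitution matrix: gaps simply appear wherever a sequence contributes no residue to a cluster. The letters themselves never influence the grouping here, only the vectors do, which is the sense in which embeddings can carry information that residue identity alone does not.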

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538487
DOI: http://dx.doi.org/10.1101/gr.277675.123
