Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
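As a rough illustration of the idea of aligning by clustering and ordering per-residue embeddings, the toy sketch below groups residues whose vectors are similar into alignment columns and then orders the columns. It is not the authors' implementation: the hand-made 3-dimensional vectors stand in for real language-model embeddings, and the function name `align_by_embedding_clusters`, the cosine threshold, the greedy clustering, and the mean-position ordering are all assumptions chosen for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def align_by_embedding_clusters(seqs, embeddings, threshold=0.9):
    """Toy alignment (hypothetical helper): cluster residues, then order clusters.

    Each cluster becomes one alignment column and holds at most one
    residue per sequence.
    """
    clusters = []  # each: {"members": {seq_idx: pos}, "vectors": [...], "centroid": [...]}
    for s, embs in enumerate(embeddings):
        for pos, emb in enumerate(embs):
            best, best_sim = None, threshold
            for c in clusters:
                if s in c["members"]:
                    continue  # a column may not contain two residues of one sequence
                sim = cosine(emb, c["centroid"])
                if sim >= best_sim:
                    best, best_sim = c, sim
            if best is None:
                # No sufficiently similar column yet: open a new one.
                clusters.append({"members": {s: pos}, "vectors": [emb], "centroid": list(emb)})
            else:
                best["members"][s] = pos
                best["vectors"].append(emb)
                n = len(best["vectors"])
                best["centroid"] = [sum(x) / n for x in zip(*best["vectors"])]

    # Assumed simplification: order columns by the mean position of their members.
    clusters.sort(key=lambda c: sum(c["members"].values()) / len(c["members"]))

    # Emit one row per sequence, with '-' where a sequence has no residue in a column.
    return [
        "".join(seq[c["members"][s]] if s in c["members"] else "-" for c in clusters)
        for s, seq in enumerate(seqs)
    ]

# Made-up toy data: two short sequences with hand-written "embeddings".
seqs = ["MKV", "MV"]
embeddings = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # M, K, V
    [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]],                    # M, V
]
rows = align_by_embedding_clusters(seqs, embeddings)
print(rows)  # ['MKV', 'M-V']
```

Note that no guide tree, pairwise alignment, gap penalty, or substitution matrix appears in the sketch; gaps simply fall out wherever a sequence has no residue in a column. Ordering columns by mean position is only safe for toy inputs, since it does not guarantee that every sequence's residues stay in their original order.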
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538487 | PMC |
| http://dx.doi.org/10.1101/gr.277675.123 | DOI Listing |