Motivation: The effective representation of proteins is a crucial task that directly affects performance on many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified Molecular Input Line Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. It is compared with two other protein representation methods that utilize protein sequence, Basic Local Alignment Search Tool (BLAST) and ProtVec, and with two compound fingerprint-based protein representation methods.
Results: We showed that ligand-based protein representation, which uses only the SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence- or structure-based representation of proteins, and that this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation.
Availability And Implementation: https://github.com/hkmztrk/SMILESVecProteinRepresentation.
Supplementary Information: Supplementary data are available at Bioinformatics online.
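The ligand-based representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the repository linked above for that). The abstract does not say how SMILES strings are split into words, how word vectors are combined, or which similarity measure is used, so the overlapping character windows, the averaging, and the cosine similarity below are all assumptions. The function `toy_word_vector` is a hypothetical stand-in for a trained word-embedding lookup, and the proteins and their ligand assignments are made up for illustration.

```python
import hashlib
import math


def smiles_words(smiles, k=4):
    """Split a SMILES string into overlapping character windows of length k.

    Assumption: the abstract does not specify the tokenization scheme.
    """
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def toy_word_vector(word, dim=16):
    """Hypothetical stand-in for a trained embedding lookup.

    Maps a word to a deterministic pseudo-random vector in [-1, 1]^dim.
    A real pipeline would look the word up in a learned embedding table.
    """
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 for this toy hash")
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [(digest[i] / 255.0) * 2.0 - 1.0 for i in range(dim)]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ligand_vector(smiles, k=4, dim=16):
    """Represent a ligand as the mean of its SMILES word vectors (assumed)."""
    return mean_vector([toy_word_vector(w, dim) for w in smiles_words(smiles, k)])


def protein_vector(ligand_smiles, k=4, dim=16):
    """Represent a protein as the mean of its ligands' vectors (assumed)."""
    return mean_vector([ligand_vector(s, k, dim) for s in ligand_smiles])


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up ligand sets for two hypothetical proteins.
    protein_a = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CC(=O)NC1=CC=C(C=C1)O"]
    protein_b = ["CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
    similarity = cosine_similarity(protein_vector(protein_a), protein_vector(protein_b))
    print(f"ligand-based similarity: {similarity:.3f}")
```

A pairwise similarity matrix built this way could then be passed to a clustering algorithm. The hash-based vectors carry no chemical meaning, so the printed score only demonstrates the data flow, not a meaningful result.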
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022674 | PMC |
http://dx.doi.org/10.1093/bioinformatics/bty287 | DOI Listing |
Bioinformatics
September 2025
Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark.
Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.
Results: We leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings.
IEEE Trans Comput Biol Bioinform
September 2025
Deciphering the three-dimensional structure of proteins remains a grand challenge in biology and medicine, as it holds the key to understanding their biological functions and facilitating drug discovery. In this paper, we introduce DECIPHER (Deep Encoding of Cellular Interactions and Protein HiErarchical Representation), a novel deep graph learning framework for protein structure prediction. By representing proteins as graphs, where residues and atoms serve as nodes and their interactions form edges, we capture the intricate spatial relationships within these complex biomolecules.
Bioinformatics
September 2025
Centre National de Recherche en Génomique Humaine, Institut François Jacob CEA Université Paris-Saclay.
Motivation: Graph Neural Network (GNN) models have emerged in many fields, notably for biological networks made up of genes or proteins and their interactions. Most enrichment analysis methods apply over-representation analysis and gene/protein set scores based on the overlap between pathways. Such methods neglect the knowledge contained in the interactions between gene/protein sets.
Brief Bioinform
August 2025
School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang Road, Shushan District, Hefei, Anhui 230036, China.
Protein-nucleic acid binding sites play a crucial role in biological processes such as gene expression, signal transduction, replication, and transcription. In recent years, with the development of artificial intelligence, protein language models, graph neural networks, and transformer architectures have been adopted to develop both structure-based and sequence-based predictive models. Structure-based methods benefit from the spatial relationship between residues and have shown promising performance.
J Pharm Anal
August 2025
Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.
Current experimental and computational methods have limitations in accurately and efficiently classifying ion channels within vast protein spaces. Here we have developed a deep learning algorithm, GPT2 Ion Channel Classifier (GPT2-ICC), which effectively distinguishes ion channels from a test set containing approximately 239 times more non-ion-channel proteins. GPT2-ICC integrates representation learning with a large language model (LLM)-based classifier, enabling highly accurate identification of potential ion channels.