A novel methodology on distributed representations of proteins using their interacting ligands.

Hakime Öztürk, Elif Ozkirimli, Arzucan Özgür

Bioinformatics

Department of Computer Engineering, Bogazici University, Istanbul, Turkey.

Published: July 2018

Motivation: The effective representation of proteins is a crucial task that directly affects performance on many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified Molecular-Input Line-Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined using the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. The proposed method is compared with two protein representation methods that utilize protein sequence, Basic Local Alignment Search Tool (BLAST) and ProtVec, and with two compound fingerprint-based protein representation methods.
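As a rough illustration of the ligand-based description outlined above, the following Python sketch builds a protein vector from the SMILES strings of its ligands and compares two proteins by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the overlapping 8-character words, the averaging steps, the cosine measure, the example ligand sets, and all function names are assumptions chosen for illustration, and `toy_embedding` is a hash-based stand-in for trained word embeddings that carries no chemical meaning.

```python
import hashlib
import math


def smiles_words(smiles, k=8):
    """Split a SMILES string into overlapping k-character 'words'.

    The word length k=8 is an assumption for illustration.
    """
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def toy_embedding(word, dim=16):
    """Hypothetical stand-in for a trained word embedding.

    Maps a word to a deterministic pseudo-random vector in [-1, 1]^dim.
    A real pipeline would look the word up in a trained embedding model.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()  # 32 bytes
    return [(b / 255.0) * 2.0 - 1.0 for b in digest[:dim]]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def ligand_vector(smiles, k=8, dim=16):
    """Represent a ligand as the mean embedding of its SMILES words."""
    return mean_vector([toy_embedding(w, dim) for w in smiles_words(smiles, k)])


def protein_vector(ligand_smiles, k=8, dim=16):
    """Represent a protein as the mean of its ligands' vectors."""
    return mean_vector([ligand_vector(s, k, dim) for s in ligand_smiles])


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical ligand sets for two unnamed proteins (illustrative only).
protein_a = protein_vector([
    "CC(=O)OC1=CC=CC=C1C(=O)O",       # aspirin
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",  # ibuprofen
])
protein_b = protein_vector([
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",   # caffeine
])

similarity = cosine(protein_a, protein_b)
print(f"protein similarity: {similarity:.3f}")
```

With trained embeddings in place of `toy_embedding`, pairwise similarities computed this way could serve as input to a clustering algorithm.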

Results: We showed that ligand-based protein representation, which uses only the SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence- or structure-based representation of proteins, and that this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation.

Availability And Implementation: https://github.com/hkmztrk/SMILESVecProteinRepresentation.

Supplementary Information: Supplementary data are available at Bioinformatics online.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022674
DOI: http://dx.doi.org/10.1093/bioinformatics/bty287

Similar Publications

SPACE: STRING proteins as complementary embeddings.

Bioinformatics

September 2025

Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark.

Dewei Hu, Damian Szklarczyk, Christian von Mering, Lars Juhl Jensen

Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.

Results: We leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings.

Deciphering the Structural Code of Proteins with Deep Graph Learning.

IEEE Trans Comput Biol Bioinform

September 2025

Xiaoyi Yin, Yue Zhao, Xin Liu, Zhen Cui, Tong Zhang

Deciphering the three-dimensional structure of proteins remains a grand challenge in biology and medicine, as it holds the key to understanding their biological functions and facilitating drug discovery. In this paper, we introduce DECIPHER (Deep Encoding of Cellular Interactions and Protein HiErarchical Representation), a novel deep graph learning framework for protein structure prediction. By representing proteins as graphs, where residues and atoms serve as nodes and their interactions form edges, we capture the intricate spatial relationships within these complex biomolecules.

GNNenrich: a novel method for pathway enrichment analysis based on Graph Neural Network.

Bioinformatics

September 2025

Centre National de Recherche en Génomique Humaine, Institut François Jacob CEA Université Paris-Saclay.

Mallek Mziou-Sallami, Pierrick Roger, Arnaud Gloaguen, Claire Dandine-Roulland, Thierry Jiogho Ngaho

Motivation: Graph Neural Network (GNN) models have emerged in many fields, notably for biological networks composed of genes or proteins and their interactions. The majority of enrichment analysis methods apply over-representation analysis and gene/protein set scores according to the existing overlap between pathways. Such methods neglect the knowledge that comes from the interactions between the gene/protein sets.

Predicting nucleic acid binding sites by attention map-guided graph convolutional network with protein language embeddings and physicochemical information.

Brief Bioinform

August 2025

School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang Road, Shushan District, Hefei, Anhui 230036, China.

Xiang Li, Wei Peng, Xiaolei Zhu

Protein-nucleic acid binding sites play a crucial role in biological processes such as gene expression, signal transduction, replication, and transcription. In recent years, with the development of artificial intelligence, protein language models, graph neural networks, and transformer architectures have been adopted to develop both structure-based and sequence-based predictive models. Structure-based methods benefit from the spatial relationship between residues and have shown promising performance.

GPT2-ICC: A data-driven approach for accurate ion channel identification using pre-trained large language models.

J Pharm Anal

August 2025

Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.

Zihan Zhou, Yang Yu, Chengji Yang, Leyan Cao, Shaoying Zhang

Current experimental and computational methods have limitations in accurately and efficiently classifying ion channels within vast protein spaces. Here we have developed a deep learning algorithm, GPT2 Ion Channel Classifier (GPT2-ICC), which effectively distinguishes ion channels from a test set containing approximately 239 times more non-ion-channel proteins. GPT2-ICC integrates representation learning with a large language model (LLM)-based classifier, enabling highly accurate identification of potential ion channels.
