Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties.

Asu Busra Temizer, Gökçe Uludoğan, Rıza Özçelik, Taha Koulani, Elif Ozkirimli, Kutlu O Ulgen, Nilgun Karali, Arzucan Özgür

Mol Inform

Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey.

Published: March 2024

Machine learning models have found numerous successful applications in computational drug discovery. Many of these models represent molecules as sequences, since molecular sequences are readily available, simple, and informative. Sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques to tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, the chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats the high-affinity ligands of each protein target as a document and selects the key chemical words making up those ligands based on tf-idf weighting. Experiments on multiple protein-ligand affinity datasets show that, despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words are similar. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates the chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
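The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the BPE learner below is a toy character-level version (so multi-character atoms such as Cl or Br are not treated specially), the tf-idf formula is one common variant chosen here as an assumption, and the target names and ligand groupings are hypothetical placeholders rather than real affinity data. All function names are introduced for illustration only.

```python
import math
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(smiles_list, num_merges):
    """Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(s) for s in smiles_list]  # assumption: start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def tokenize(smiles, merges):
    """Segment a SMILES string into chemical words using the learned merges."""
    symbols = list(smiles)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def key_words_by_tfidf(docs, top_k=3):
    """Rank words per document; `docs` maps a target to its list of chemical words.

    Assumed weighting: tf = relative frequency in the document,
    idf = log(number of documents / number of documents containing the word).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    ranked = {}
    for target, words in docs.items():
        tf = Counter(words)
        scores = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in tf.items()
        }
        ranked[target] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranked


# Hypothetical example: the ligand-to-target grouping is made up for illustration.
toy_ligands = {
    "target_A": ["CC(=O)Nc1ccccc1", "CC(=O)NC"],
    "target_B": ["OC(=O)c1ccccc1", "OC(=O)CC"],
}
all_smiles = [s for ligands in toy_ligands.values() for s in ligands]
merges = learn_bpe_merges(all_smiles, num_merges=10)

# Each target's "document" is the concatenation of its ligands' chemical words.
documents = {
    target: [w for s in ligands for w in tokenize(s, merges)]
    for target, ligands in toy_ligands.items()
}
for target, words in key_words_by_tfidf(documents).items():
    print(target, words)
```

Under this weighting, a chemical word that occurs in every target's document receives an idf of zero, so only words concentrated in a subset of targets can rank as key words, which matches the abstract's observation that key chemical words are target-specific.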

DOI: http://dx.doi.org/10.1002/minf.202300249
