Comparison study of dominant molecular sequence representation based on diffusion model.

Yongrui Cui, Dongjing Shan, Qiheng Lu, Beijia Zou, Huali Zhang, Jin Li, Jiashun Mao

J Comput Aided Mol Des

School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan Province, 646000, China.

Published: July 2025


In recent years, the emergence of large language models (LLMs), particularly ChatGPT, has made natural-language, sequence-based representation learning and generative modeling the dominant research paradigm in AI for science. In drug discovery and computational chemistry, compound representation learning and molecular generation are two of the most significant tasks. The predominant sequence representations currently used for molecular characterization and generation are SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencIng Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of the information they encode varies significantly. Selecting an appropriate molecular representation as the input format for model training is therefore crucial, yet this issue has not been thoroughly explored. Furthermore, diffusion models are currently the state-of-the-art models for molecular generation and optimization. This study therefore investigates the characteristics of the four mainstream molecular representation languages when each is used to train the same diffusion model for molecular generation. First, a single molecule is represented in four different ways, and a denoising diffusion model is then trained on each representation using identical parameters. Subsequently, thirty thousand molecules are generated for evaluation and analysis.
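To make the point about granularity concrete, the following is a minimal sketch, not the authors' pipeline. It encodes ethanol by hand in the four representations and counts tokens under a simple tokenizer. The strings, the regular expression, the character-level treatment of the IUPAC name, and the names `ETHANOL`, `TOKEN_PATTERN`, and `tokenize` are all assumptions made for illustration; the abstract does not say how the sequences were tokenized.

```python
import re

# Hand-written encodings of ethanol (illustrative; not taken from the paper's data).
ETHANOL = {
    "SMILES": "CCO",
    "SELFIES": "[C][C][O]",
    "SMARTS": "[#6]-[#6]-[#8]",
    "IUPAC": "ethanol",
}

# Assumed tokenizer: a bracketed unit first, then two-letter halogens,
# then any remaining single character (atoms, bonds, digits, parentheses).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(text, scheme):
    """Split a molecular string into tokens (hypothetical scheme)."""
    if scheme == "IUPAC":
        # Assumption: names are split character by character.
        return list(text)
    return TOKEN_PATTERN.findall(text)


token_counts = {name: len(tokenize(s, name)) for name, s in ETHANOL.items()}

for name, s in ETHANOL.items():
    print(f"{name:8s} {s:16s} {tokenize(s, name)}")
```

Even for this three-heavy-atom molecule, the sequence length a model sees depends on the representation and on the tokenization choice, which is one reason the input format may matter for training.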
The results indicate that the four molecular representation languages exhibit both similarities and differences in attribute distribution and spatial distribution: SELFIES and SMARTS are highly similar, whereas IUPAC and SMILES differ substantially. IUPAC's primary advantage lies in the novelty and diversity of the generated molecules, SMILES excels on the QEPPI and SAscore metrics, and SELFIES and SMARTS perform best on the QED metric. These findings offer guidance on selecting molecular representations for AI drug design tasks and may thereby help improve the efficiency of drug development.
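The abstract does not define its evaluation metrics. As a hedged illustration, the sketch below computes validity, uniqueness, and novelty using definitions that are common in molecular-generation benchmarks; the paper's own definitions may differ. The function name `generation_metrics`, the toy data, and the placeholder validity check (non-empty string) are assumptions. In practice, validity would be checked with a cheminformatics toolkit and strings would be canonicalized before comparison. Diversity, QED, QEPPI, and SAscore are not covered here because they require molecular fingerprints or property models.

```python
def generation_metrics(generated, training, is_valid=lambda s: bool(s)):
    """Return validity, uniqueness, and novelty for a list of generated strings.

    Assumes every string is already in one canonical form, so that string
    equality means molecular identity (an assumption of this sketch).
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        # Fraction of generated strings that pass the validity check.
        "validity": len(valid) / len(generated) if generated else 0.0,
        # Fraction of valid strings that are distinct.
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # Fraction of distinct valid strings absent from the training set.
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy example with made-up data (not results from the paper).
generated = ["CCO", "CCO", "CCN", "c1ccccc1", ""]
training = ["CCO", "CCC"]
metrics = generation_metrics(generated, training)
print(metrics)
```

Applying the same function to the output of each representation's model would give numbers that are comparable across representations, provided all outputs are first converted to one canonical form.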

DOI: http://dx.doi.org/10.1007/s10822-025-00614-3

