Comparison study of dominant molecular sequence representation based on diffusion model.

J Comput Aided Mol Des

School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan Province, 646000, China.

Published: July 2025



Article Abstract

In recent years, the emergence of large language models (LLMs), particularly ChatGPT, has made sequence-based representation learning and generative modeling the dominant research paradigm in AI for science. In drug discovery and computational chemistry, compound representation learning and molecular generation are two of the most important tasks. The predominant sequence representations currently used for molecular characterization and generation are SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of the information they encode varies significantly. Selecting an appropriate molecular representation as the input format for model training is therefore crucial, yet this issue has not been thoroughly explored. Diffusion models are currently the state of the art for molecular generation and optimization. This study therefore compares the four mainstream molecular representation languages within the same diffusion model, trained to generate molecular sets. First, the same molecule is represented in each of the four notations, and a denoising diffusion model is then trained with identical parameters. Subsequently, thirty thousand molecules are generated for evaluation and analysis.
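To make the difference in granularity concrete, the sketch below writes one molecule (ethanol) in the four notations and splits each string into tokens. The example strings are hand-written illustrations, the SMARTS pattern is only one of several valid ways to describe ethanol, and the `tokenize` helper with its regular expression is a hypothetical simplification; the abstract does not state which tokenization the study uses.

```python
import re

# Ethanol written by hand in the four notations (illustrative, not taken from the study).
ETHANOL = {
    "SMILES": "CCO",
    "SELFIES": "[C][C][O]",
    "SMARTS": "[#6]-[#6]-[#8]",
    "IUPAC": "ethanol",
}

# Assumed token pattern: bracketed symbols first, then two-letter halogens,
# then any remaining single character (atoms, bonds, ring digits, branches).
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(text, notation):
    """Split a molecular string into tokens (simplified, hypothetical scheme)."""
    if notation == "IUPAC":
        # Character-level split; a real pipeline might use chemical word pieces.
        return list(text)
    return _TOKEN.findall(text)


for notation, text in ETHANOL.items():
    tokens = tokenize(text, notation)
    print(f"{notation:8s}{len(tokens):2d} tokens: {tokens}")
```

Under this scheme the same three heavy atoms produce sequences of different lengths and vocabularies, which is the kind of difference a model trained on each notation has to absorb.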
The results indicate that the four molecular representation languages show both similarities and differences in attribute distribution and spatial distribution: SELFIES and SMARTS are highly similar, while IUPAC and SMILES differ substantially. IUPAC's main advantage lies in the novelty and diversity of the generated molecules, SMILES excels on the QEPPI and SAscore metrics, and SELFIES and SMARTS perform best on the QED metric. These findings provide guidance for selecting molecular representations in AI drug design tasks and may thereby improve the efficiency of drug development.
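The novelty and diversity mentioned above can be computed in several ways, and the abstract does not give the study's exact definitions. The sketch below follows one common convention: novelty is the share of distinct generated molecules absent from the training set, and internal diversity is one minus the mean pairwise Tanimoto similarity. All function names are hypothetical, the strings are assumed to be canonicalized already, and the fingerprints are toy sets of "on" bits standing in for fingerprints from a cheminformatics toolkit.

```python
from itertools import combinations


def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings that do not occur in the training set."""
    unique = set(generated)
    return len(unique - set(training)) / len(unique)


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def internal_diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity (needs at least two items)."""
    pairs = list(combinations(fingerprints, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy data, invented for illustration only.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO", "CCC"]
toy_fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]

print("uniqueness:", uniqueness(generated))
print("novelty:", novelty(generated, training))
print("internal diversity:", internal_diversity(toy_fingerprints))
```

Comparing raw strings is only meaningful within a single notation; to compare across SMILES, SELFIES, SMARTS, and IUPAC, the generated strings would first need to be converted to a common canonical form.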


Source
http://dx.doi.org/10.1007/s10822-025-00614-3


Similar Publications

Aberrant DNA methylation has been described in nearly all human cancers, yet its interplay with genomic alterations during tumor evolution is poorly understood. To explore this, we performed reduced representation bisulfite sequencing on 217 tumor and matched normal regions from 59 patients with non-small cell lung cancer from the TRACERx study to deconvolve tumor methylation. We developed two metrics for integrative evolutionary analysis with DNA and RNA sequencing data.


Neoadjuvant immunochemotherapy (nICT) has demonstrated significant potential in improving pathological response rates and survival outcomes for patients with locally advanced esophageal squamous cell carcinoma (ESCC). However, substantial interindividual variability in therapeutic outcomes highlights the urgent need for more precise predictive tools to guide clinical decision-making. Traditional biomarkers remain limited in both predictive performance and clinical feasibility.


The calculation of the highest occupied molecular orbital-lowest unoccupied molecular orbital (HOMO-LUMO) gap for chemical molecules is computationally intensive using quantum mechanics (QM) methods, while experimental determination is often costly and time-consuming. Machine Learning (ML) offers a cost-effective and rapid alternative, enabling efficient predictions of HOMO-LUMO gap values across large data sets without the need for extensive QM computations or experiments. ML models facilitate the screening of diverse molecules, providing valuable insights into complex chemical spaces and integrating seamlessly into high-throughput workflows to prioritize candidates for experimental validation.


Hamiltonian Grid-Based QM/MM Method with Mean-Field Embedding for Simulating Arbitrary Slab Geometries.

J Chem Theory Comput

September 2025

Materials DX Research Center, National Institute of Advanced Industrial Science and Technology, Tsukuba Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan.

The quantum mechanics/molecular mechanics (QM/MM) method is a powerful approach for investigating solid surfaces in contact with various types of media, since it allows for flexible modeling of complex interfaces while maintaining an all-atom representation. The mean-field QM/MM method is an average reaction field model within the QM/MM framework. The method addresses the challenges associated with the statistical sampling of interfacial atomic configurations of a medium and enables efficient calculation of free energies.


Oral bioavailability property prediction based on task similarity transfer learning.

Mol Divers

September 2025

Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, Nanjing, 211198, China.

Drug absorption significantly influences pharmacokinetics. Accurately predicting human oral bioavailability (HOB) is essential for optimizing drug candidates and improving clinical success rates. The traditional method based on experiment is a common way to obtain HOB, but the experimental method is time-consuming and costly.
