In recent years, the emergence of large language models (LLMs), particularly ChatGPT, has made sequence-based representation learning and generative modeling the dominant research paradigm in AI for science. In drug discovery and computational chemistry, compound representation learning and molecular generation are two of the most significant tasks. The predominant sequence representations currently used for molecular characterization and generation are SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of the information they encode varies significantly. Selecting an appropriate molecular representation as the input format for model training is therefore crucial, yet this issue has not been thoroughly explored. Diffusion models are currently the state of the art for molecular generation and optimization. This study therefore compares the four mainstream molecular representation languages within the same diffusion model for generating molecular sets. First, each molecule is encoded in the four representations, and a denoising diffusion model is then trained on each representation using identical parameters. Subsequently, thirty thousand molecules are generated for evaluation and analysis.
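To make the difference in granularity concrete, the sketch below writes one molecule (ethanol) in the four representations and counts tokens under a simple regex tokenizer. The tokenizer, the names `ETHANOL`, `TOKEN_PATTERN`, and `tokenize`, and the choice to split the IUPAC name character by character are illustrative assumptions, not the tokenization used in the study.

```python
import re

# Ethanol written in the four molecular languages discussed above.
ETHANOL = {
    "SMILES": "CCO",
    "SELFIES": "[C][C][O]",
    "SMARTS": "[#6]-[#6]-[#8]",
    "IUPAC": "ethanol",
}

# Hypothetical tokenizer: a bracketed group is one token, the two-letter
# halogens Br and Cl are one token each, anything else is a single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(text):
    """Split a molecular string into tokens using TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


# Sequence length seen by a model for the same molecule in each language.
token_counts = {name: len(tokenize(s)) for name, s in ETHANOL.items()}

for name, string in ETHANOL.items():
    print(f"{name:8s} {string:16s} {tokenize(string)}")
```

Under these assumptions the same molecule yields sequences of different lengths and vocabularies, which is one way the input format can influence what a sequence model learns.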
The results indicate that the four molecular representation languages show both similarities and differences in their attribute and spatial distributions; notably, SELFIES and SMARTS are highly similar, while IUPAC and SMILES differ substantially. IUPAC's primary advantage lies in the novelty and diversity of the generated molecules, whereas SMILES excels on the QEPPI and SAscore metrics, and SELFIES and SMARTS perform best on the QED metric. These findings offer guidance for selecting molecular representations in AI drug design tasks and may thereby improve the efficiency of drug development.
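Novelty and diversity of a generated set can be illustrated with a small string-level sketch. The abstract does not give the exact metric definitions, so everything here is an assumption: uniqueness and novelty are computed on raw strings (which presumes canonical strings), and diversity uses Tanimoto similarity over character bigrams as a stand-in for the fingerprint-based similarity usually computed with a cheminformatics toolkit. QED, SAscore, and QEPPI need such a toolkit and are not shown.

```python
from itertools import combinations


def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    return len(set(generated)) / len(generated) if generated else 0.0


def novelty(generated, training):
    """Fraction of distinct generated strings absent from the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


def ngram_set(text, n=2):
    """Set of character n-grams; an assumed stand-in for a fingerprint."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def internal_diversity(generated, n=2):
    """One minus the mean pairwise similarity over distinct strings."""
    unique = sorted(set(generated))
    pairs = list(combinations(unique, 2))
    if not pairs:
        return 0.0
    sims = [tanimoto(ngram_set(x, n), ngram_set(y, n)) for x, y in pairs]
    return 1.0 - sum(sims) / len(sims)


# Toy data, invented purely for illustration.
training = ["CCO", "CCN"]
generated = ["CCO", "CCC", "CCC", "c1ccccc1"]

print("uniqueness:", uniqueness(generated))
print("novelty:", novelty(generated, training))
print("diversity:", internal_diversity(generated))
```

Comparing raw strings only makes sense within one representation; comparing across the four languages would require converting every generated string back to a common molecular form first.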
DOI: http://dx.doi.org/10.1007/s10822-025-00614-3
Nat Genet
September 2025
Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London, UK.
Aberrant DNA methylation has been described in nearly all human cancers, yet its interplay with genomic alterations during tumor evolution is poorly understood. To explore this, we performed reduced representation bisulfite sequencing on 217 tumor and matched normal regions from 59 patients with non-small cell lung cancer from the TRACERx study to deconvolve tumor methylation. We developed two metrics for integrative evolutionary analysis with DNA and RNA sequencing data.
J Immunother Cancer
September 2025
CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Neoadjuvant immunochemotherapy (nICT) has demonstrated significant potential in improving pathological response rates and survival outcomes for patients with locally advanced esophageal squamous cell carcinoma (ESCC). However, substantial interindividual variability in therapeutic outcomes highlights the urgent need for more precise predictive tools to guide clinical decision-making. Traditional biomarkers remain limited in both predictive performance and clinical feasibility.
J Chem Inf Model
September 2025
Department of Chemistry, Delaware State University, Dover, Delaware 19901, United States.
The calculation of the highest occupied molecular orbital-lowest unoccupied molecular orbital (HOMO-LUMO) gap for chemical molecules is computationally intensive using quantum mechanics (QM) methods, while experimental determination is often costly and time-consuming. Machine Learning (ML) offers a cost-effective and rapid alternative, enabling efficient predictions of HOMO-LUMO gap values across large data sets without the need for extensive QM computations or experiments. ML models facilitate the screening of diverse molecules, providing valuable insights into complex chemical spaces and integrating seamlessly into high-throughput workflows to prioritize candidates for experimental validation.
J Chem Theory Comput
September 2025
Materials DX Research Center, National Institute of Advanced Industrial Science and Technology, Tsukuba Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan.
The quantum mechanics/molecular mechanics (QM/MM) method is a powerful approach for investigating solid surfaces in contact with various types of media, since it allows for flexible modeling of complex interfaces while maintaining an all-atom representation. The mean-field QM/MM method is an average reaction field model within the QM/MM framework. The method addresses the challenges associated with the statistical sampling of interfacial atomic configurations of a medium and enables efficient calculation of free energies.
Mol Divers
September 2025
Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, Nanjing, 211198, China.
Drug absorption significantly influences pharmacokinetics. Accurately predicting human oral bioavailability (HOB) is essential for optimizing drug candidates and improving clinical success rates. HOB is commonly determined experimentally, but experimental measurement is time-consuming and costly.