NoiseMol: A noise-robusted data augmentation via perturbing noise for molecular property prediction.

J Mol Graph Model

School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.

Published: June 2023


Article Abstract

Simplified Molecular-Input Line-Entry System (SMILES) is a widely used molecular representation for molecular property prediction. We conjecture that all the characters in the SMILES string of a molecule are essential for making up the molecule, but that most of them contribute little to determining a particular property of the molecule. We verified this conjecture in a preliminary experiment. Motivated by the result, we propose injecting suitable noise into SMILES strings to augment the training data by increasing the diversity of the labeled molecules. To this end, we explore injecting perturbing noise into the original labeled SMILES strings to construct augmented data, which alleviates the scarcity of labeled compound data and helps the model extract more useful molecular representations for property prediction. Specifically, we apply mask, swap, deletion, and fusion operations directly to SMILES strings, randomly masking, swapping, and deleting atoms. The augmented data is then used in one of two strategies: the original and perturbed molecules are fed alternately either epoch by epoch or batch by batch. We conduct experiments on both Transformer and BiGRU models, using widely used datasets from MoleculeNet and ZINC, to validate the effectiveness of the approach. Experimental results demonstrate that the proposed method outperforms strong baselines on all the datasets. Compared with state-of-the-art methods, NoiseMol obtains the best performance on BBBP and FDA, and it also achieves the best accuracy on LogP. Injecting perturbing noise into labeled SMILES strings is therefore an effective and efficient method that improves the prediction performance, generalization, and robustness of deep learning models.
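The atom-level perturbations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regex tokenizer, the `<mask>` token, the noise rates, and all function names are assumptions made for illustration, and the fusion operation is omitted because the abstract does not define it. The helper `source_for_step` reflects one reading of the two feeding strategies, in which the step index is either an epoch index or a batch index.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption, not the paper's): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
# Tokens treated as atoms: bracket atoms, organic-subset and aromatic atoms.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
MASK = "<mask>"  # hypothetical mask token


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def atom_positions(tokens):
    """Indices of tokens that represent atoms."""
    return [i for i, t in enumerate(tokens) if ATOM_RE.fullmatch(t)]


def mask_atoms(tokens, rate, rng):
    """Replace each atom token with MASK with probability `rate`."""
    out = list(tokens)
    for i in atom_positions(tokens):
        if rng.random() < rate:
            out[i] = MASK
    return out


def swap_atoms(tokens, rng):
    """Exchange two randomly chosen atom tokens."""
    out = list(tokens)
    pos = atom_positions(tokens)
    if len(pos) >= 2:
        i, j = rng.sample(pos, 2)
        out[i], out[j] = out[j], out[i]
    return out


def delete_atoms(tokens, rate, rng):
    """Drop each atom token with probability `rate`."""
    atoms = set(atom_positions(tokens))
    return [t for i, t in enumerate(tokens)
            if not (i in atoms and rng.random() < rate)]


def source_for_step(step):
    """Alternate data sources; `step` is an epoch index or a batch index."""
    return "original" if step % 2 == 0 else "noisy"


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    print("".join(mask_atoms(tokens, 0.15, rng)))
    print("".join(swap_atoms(tokens, rng)))
    print("".join(delete_atoms(tokens, 0.15, rng)))
```

Strings perturbed this way are generally not guaranteed to be valid SMILES; in this sketch they serve only as noisy model inputs that keep the label of the original molecule.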

Source
http://dx.doi.org/10.1016/j.jmgm.2023.108454

Similar Publications

Going beyond SMILES enumeration for data augmentation in generative drug discovery.

Digit Discov

August 2025

Institute for Complex Molecular Systems (ICMS), Eindhoven AI Systems Institute (EAISI), Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands.

Data augmentation can alleviate the limitations of small molecular datasets for generative deep learning by 'artificially inflating' the number of instances available for training. SMILES enumeration - wherein multiple valid SMILES strings are used to represent the same molecules - has become particularly beneficial to improve the quality of molecule design. Herein, we investigated whether rethinking SMILES augmentation techniques could further enhance the quality of design.

Evaluation of chirality descriptors derived from SMILES heteroencoders.

J Cheminform

August 2025

LAQV and REQUIMTE, Chemistry Department, NOVA School of Science and Technology, Universidade Nova de Lisboa, 2829-516, Caparica, Portugal.

Molecular representations of chirality, derived from latent space vectors (LSVs) of SMILES heteroencoders, were explored to train machine learning models to predict chiral properties, and were compared to conventional circular fingerprints. Latent space arithmetic was applied to enhance the representation of chirality, by calculating differences between the original descriptor of a molecule and the descriptor of its enantiomer, or the difference between the original descriptor and the descriptor obtained with the stereochemistry-depleted SMILES string. Machine learning was performed with the Random Forest algorithm applied to a dataset of 3858 molecules extracted from the literature (1929 pairs of enantiomers) to predict the elution order observed on the Chiralpak® AD-H column, as well as intrinsic structural chirality labels (R/S or canonical SMILES @/@@).

Systematic benchmarking of 13 AI methods for predicting cyclic peptide membrane permeability.

J Cheminform

August 2025

Bioinformatics Institute, Agency for Science, Technology and Research, 30 Biopolis Street, Singapore, 138671, Singapore.

Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein-protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening.

Diabetes remains one of the critical health issues worldwide, and its prevalence is rising due to factors such as obesity and a sedentary lifestyle. Traditional herbal medications and natural products, particularly inhibitors of enzymes such as alpha-glucosidase, serve as promising alternatives. This study attempted to identify potent alpha-glucosidase inhibitors by including data augmentation in deep-learning modeling.

Protein-ligand binding affinity measures the strength of interactions between proteins and ligands. Accurately predicting this value is crucial for drug discovery and estimating enzyme kinetic parameters. In recent years, various computational models based on deep learning algorithms have been developed for predicting protein-ligand binding affinity.
