Article Abstract

String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencing embedded strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation called selfies. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints, and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript. Our library, selfies, is available at GitHub (https://github.com/aspuru-guzik-group/selfies).
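The syntactic fragility of SMILES mentioned in the abstract can be illustrated with a small, self-contained Python sketch. It does not use the selfies library or any real SMILES parser. The function name `smiles_syntax_problems` is hypothetical, and it checks only two failure modes (unbalanced branch parentheses and unpaired ring-closure digits), which is far less than a full validity check.

```python
def smiles_syntax_problems(smiles):
    """Return a list of simple syntax problems found in a SMILES string.

    Illustrative assumption: only branch parentheses and single-digit
    ring closures are checked. Valence, aromaticity, and %nn ring
    closures are not handled specifically.
    """
    problems = []
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits opened but not yet closed
    in_bracket = False   # True while inside a bracket atom such as [13CH4]

    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            # Digits inside [...] are isotopes, charges, or H counts.
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without matching open")
                depth = 0
        elif ch.isdigit():
            # First occurrence opens a ring bond, second occurrence closes it.
            open_rings ^= {ch}

    if depth > 0:
        problems.append("unclosed branch parenthesis")
    if open_rings:
        problems.append(
            "unpaired ring-closure digit(s): " + ", ".join(sorted(open_rings))
        )
    return problems


if __name__ == "__main__":
    for s in ["c1ccccc1", "c1ccccc", "CC(C", "CC(=O)O"]:
        print(s, smiles_syntax_problems(s))
```

Dropping a single character from benzene (`c1ccccc1` to `c1ccccc`) is enough to leave a ring bond unpaired, which is the kind of error a generative model can easily produce. SELFIES is designed so that every string of its symbols decodes to a molecule, so this class of check is unnecessary there.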

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10408573
DOI: http://dx.doi.org/10.1039/d3dd00044c

Publication Analysis

Top Keywords

self-referencing embedded: 8
embedded strings: 8
strings selfies: 8
selfies: 7
advances self-referencing: 4
selfies library: 4
library string-based: 4
string-based molecular: 4
molecular representations: 4
representations play: 4

Similar Publications

The development of responsive materials for monitoring atmospheric toxic emissions is a growing area of interest. Luminescent metal-organic frameworks (LMOFs), particularly those based on lanthanides (LnMOFs), have emerged as promising candidates due to their sharp emission bands, long luminescence lifetimes, and structural versatility. Despite their potential, the integration of LnMOFs into robust solid-state sensing platforms remains limited.


Comparison study of dominant molecular sequence representation based on diffusion model.

J Comput Aided Mol Des

July 2025

School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan Province, 646000, China.

In recent years, the emergence of large language models (LLMs), particularly the advent of ChatGPT, has positioned natural language sequence-based representation learning and generative models as the dominant research paradigm in AI for science. Within the domains of drug discovery and computational chemistry, compound representation learning and molecular generation stand out as two of the most significant tasks. Currently, the predominant molecular representation sequences used for molecular characterization and generation include SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature.


Accurate molecular property prediction requires input representations that preserve substructural details and maintain syntactic consistency. SMILES (Simplified Molecular Input Line Entry System), while widely used, does not guarantee validity and allows multiple representations for the same compound. SELFIES (Self-Referencing Embedded Strings) addresses these limitations through a robust grammar that ensures structural validity.


RPSubAlign: a novel sequence-based molecular representation method for retrosynthesis prediction with improved validity and robustness.

Brief Bioinform

May 2025

Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, 3663 North Zhongshan Road, Putuo District, Shanghai 200062, China.

Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone in drug discovery and organic synthesis. Sequence-based models have become a predominant approach in retrosynthetic route planning, yet their validity and robustness remain limited by the challenges in molecular representation methods. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships crucial for accurate synthesis predictions.


Fuzz Testing Molecular Representation Using Deep Variational Anomaly Generation.

J Chem Inf Model

February 2025

Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94158

Researchers are developing increasingly robust molecular representations, motivating the need for thorough methods to stress-test and validate them. Here, we use a variational auto-encoder (VAE), an unsupervised deep learning model, to generate anomalous examples of SELF-referencIng Embedded Strings (SELFIES), a popular molecular string format. These anomalies defy the assertion that all SELFIES convert into valid SMILES strings.
