Recent advances in the self-referencing embedded strings (SELFIES) library.

Alston Lo, Robert Pollice, AkshatKumar Nigam, Andrew D White, Mario Krenn, Alán Aspuru-Guzik

Digit Discov

Department of Computer Science, University of Toronto, Canada.

Published: August 2023

String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencing embedded strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation called selfies. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints, and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript. Our library, selfies, is available at GitHub (https://github.com/aspuru-guzik-group/selfies).
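
To make the abstract's point about syntactic errors concrete, the sketch below checks a SMILES string for two common failure modes of generated strings: unbalanced branch parentheses and ring-closure labels that are opened but never closed. It is an illustration only. The function name `find_syntax_errors` is hypothetical, the check is not part of the selfies library, and it covers only a small fragment of SMILES syntax (no valence or aromaticity checks).

```python
DIGITS = "0123456789"


def find_syntax_errors(smiles: str) -> list[str]:
    """Return basic syntactic problems found in a SMILES string.

    Hypothetical helper for illustration: it checks only branch
    parentheses and ring-closure labels, not full SMILES validity.
    """
    errors: list[str] = []
    depth = 0                      # current nesting depth of '(' branches
    open_rings: set[int] = set()   # ring labels opened but not yet closed
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms such as [13CH3]; digits inside them are
            # isotopes, hydrogen counts, or charges, not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                errors.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                errors.append("unmatched ')'")
            else:
                depth -= 1
        elif ch == "%":
            # Two-digit ring label, e.g. %12
            label = smiles[i + 1 : i + 3]
            if len(label) == 2 and all(c in DIGITS for c in label):
                open_rings ^= {int(label)}   # toggle: open, then close
                i += 3
                continue
            errors.append("malformed ring label after '%'")
        elif ch in DIGITS:
            open_rings ^= {int(ch)}          # toggle: open, then close
        i += 1
    if depth > 0:
        errors.append("unclosed '('")
    if open_rings:
        labels = ", ".join(str(n) for n in sorted(open_rings))
        errors.append("unclosed ring bond: " + labels)
    return errors


if __name__ == "__main__":
    # Benzene, benzene with a missing ring closure, and an unclosed branch.
    for s in ["c1ccccc1", "c1ccccc", "CC(C"]:
        print(s, find_syntax_errors(s))
```

A generative model that emits SMILES character by character can produce any of these errors. According to the abstract, SELFIES is designed so that such errors cannot occur in the first place.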

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10408573
DOI: http://dx.doi.org/10.1039/d3dd00044c

Similar Publications

Solid-State Luminescent Responsive Bilayer Architecture Based on 2D Lanthanide MOFs for Ratiometric NO and SO Monitoring.

Chemistry

August 2025

IMDEA Nanociencia, C/ Faraday, 9, Madrid, 28049, Spain.

Jorge Sangrador-Pérez, Patricia Jiménez-Hernández, Esther Resines-Urien, E Carolina Sañudo, Roberta Poloni

The development of responsive materials for monitoring atmospheric toxic emissions is a growing area of interest. Luminescent metal-organic frameworks (LMOFs), particularly those based on lanthanides (LnMOFs), have emerged as promising candidates due to their sharp emission bands, long luminescence lifetimes, and structural versatility. Despite their potential, the integration of LnMOFs into robust solid-state sensing platforms remains limited.

Comparison study of dominant molecular sequence representation based on diffusion model.

J Comput Aided Mol Des

July 2025

School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan Province, 646000, China.

Yongrui Cui, Dongjing Shan, Qiheng Lu, Beijia Zou, Huali Zhang

In recent years, the emergence of large language models (LLMs), particularly the advent of ChatGPT, has positioned natural language sequence-based representation learning and generative models as the dominant research paradigm in AI for science. Within the domains of drug discovery and computational chemistry, compound representation learning and molecular generation stand out as two of the most significant tasks. Currently, the predominant molecular representation sequences used for molecular characterization and generation include SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature.

Domain adaptation of a SMILES chemical transformer to SELFIES with limited computational resources.

Sci Rep

July 2025

Department of Chemical & Petroleum Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, United Arab Emirates.

Obaid Khaleifah Alhmoudi, Mahmoud Aboushanab, Muhammed Thameem, Ali Elkamel, Ali A AlHammadi

Accurate molecular property prediction requires input representations that preserve substructural details and maintain syntactic consistency. SMILES (Simplified Molecular Input Line Entry System), while widely used, does not guarantee validity and allows multiple representations for the same compound. SELFIES (Self-Referencing Embedded Strings) addresses these limitations through a robust grammar that ensures structural validity.

RPSubAlign: a novel sequence-based molecular representation method for retrosynthesis prediction with improved validity and robustness.

Brief Bioinform

May 2025

Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, 3663 North Zhongshan Road, Putuo District, Shanghai 200062, China.

Yuting Hu, Feng Hu, Hongwen Zhang, Hongling Xu, Jixiang Gao

Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone in drug discovery and organic synthesis. Sequence-based models have become a predominant approach in retrosynthetic route planning, yet their validity and robustness remain limited by the challenges in molecular representation methods. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships crucial for accurate synthesis predictions.

Fuzz Testing Molecular Representation Using Deep Variational Anomaly Generation.

J Chem Inf Model

February 2025

Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94158

Victor H R Nogueira, Rishabh Sharma, Rafael V C Guido, Michael J Keiser

Researchers are developing increasingly robust molecular representations, motivating the need for thorough methods to stress-test and validate them. Here, we use a variational auto-encoder (VAE), an unsupervised deep learning model, to generate anomalous examples of SELF-referencIng Embedded Strings (SELFIES), a popular molecular string format. These anomalies defy the assertion that all SELFIES convert into valid SMILES strings.
