String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencing embedded strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation called selfies. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints, and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript. Our library, selfies, is available at GitHub (https://github.com/aspuru-guzik-group/selfies).
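To make the abstract's point about syntactic errors concrete, here is a minimal sketch of two ways a generated SMILES string can be malformed: unbalanced branch parentheses and unpaired ring-closure labels. The function name `smiles_syntax_errors` is hypothetical. It is a toy check under those two assumptions only, not a SMILES parser and not part of the selfies library, and it says nothing about semantic errors such as exceeded valences.

```python
def smiles_syntax_errors(smiles: str) -> list[str]:
    """Toy check (hypothetical helper): balanced branches and paired ring labels only."""
    errors = []
    depth = 0           # current branch nesting depth
    open_rings = set()  # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms such as [13CH4]; digits inside are not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                errors.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                errors.append("unmatched ')'")
                depth = 0
        elif ch in "0123456789" or ch == "%":
            if ch == "%":
                # Two-digit ring-closure label written as %nn.
                label = smiles[i + 1:i + 3]
                i += 2
            else:
                label = ch
            # Toggle the label: first occurrence opens the ring, second closes it.
            open_rings ^= {label}
        i += 1
    if depth > 0:
        errors.append("unclosed '('")
    if open_rings:
        errors.append("unpaired ring closure: " + ", ".join(sorted(open_rings)))
    return errors


if __name__ == "__main__":
    for candidate in ["c1ccccc1", "c1ccccc", "CC(C", "CC)C"]:
        print(candidate, smiles_syntax_errors(candidate))
```

A generative model that emits SMILES one character at a time can produce any of the failing strings above. The abstract's claim is that SELFIES avoids this class of error by construction, so no such post hoc filter is needed.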
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10408573 | PMC |
| http://dx.doi.org/10.1039/d3dd00044c | DOI Listing |
Chemistry
August 2025
IMDEA Nanociencia, C/ Faraday, 9, Madrid, 28049, Spain.
The development of responsive materials for monitoring atmospheric toxic emissions is a growing area of interest. Luminescent metal-organic frameworks (LMOFs), particularly those based on lanthanides (LnMOFs), have emerged as promising candidates due to their sharp emission bands, long luminescence lifetimes, and structural versatility. Despite their potential, the integration of LnMOFs into robust solid-state sensing platforms remains limited.
J Comput Aided Mol Des
July 2025
School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan Province, 646000, China.
In recent years, the emergence of large language models (LLMs), particularly the advent of ChatGPT, has positioned natural language sequence-based representation learning and generative models as the dominant research paradigm in AI for science. Within the domains of drug discovery and computational chemistry, compound representation learning and molecular generation stand out as two of the most significant tasks. Currently, the predominant molecular representation sequences used for molecular characterization and generation include SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature.
Sci Rep
July 2025
Department of Chemical & Petroleum Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
Accurate molecular property prediction requires input representations that preserve substructural details and maintain syntactic consistency. SMILES (Simplified Molecular Input Line Entry System), while widely used, does not guarantee validity and allows multiple representations for the same compound. SELFIES (Self-Referencing Embedded Strings) addresses these limitations through a robust grammar that ensures structural validity.
Brief Bioinform
May 2025
Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, 3663 North Zhongshan Road, Putuo District, Shanghai 200062, China.
Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone in drug discovery and organic synthesis. Sequence-based models have become a predominant approach in retrosynthetic route planning, yet their validity and robustness remain limited by challenges in molecular representation methods. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships crucial for accurate synthesis predictions.
J Chem Inf Model
February 2025
Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94158
Researchers are developing increasingly robust molecular representations, motivating the need for thorough methods to stress-test and validate them. Here, we use a variational auto-encoder (VAE), an unsupervised deep learning model, to generate anomalous examples of SELF-referencIng Embedded Strings (SELFIES), a popular molecular string format. These anomalies defy the assertion that all SELFIES convert into valid SMILES strings.