Fuzz Testing Molecular Representation Using Deep Variational Anomaly Generation.

J Chem Inf Model

Department of Pharmaceutical Chemistry, Department of Bioengineering & Therapeutic Sciences, Institute for Neurodegenerative Diseases, Kavli Institute for Fundamental Neuroscience, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California 94158

Published: February 2025



Article Abstract

Researchers are developing increasingly robust molecular representations, motivating the need for thorough methods to stress-test and validate them. Here, we use a variational auto-encoder (VAE), an unsupervised deep learning model, to generate anomalous examples of SELF-referencIng Embedded Strings (SELFIES), a popular molecular string format. These anomalies defy the assertion that all SELFIES convert into valid SMILES strings. Interestingly, we find specific regions within the VAE's internal landscape (latent space), whose decoding frequently generates inconvertible SELFIES anomalies. The model's internal landscape self-organization helps with exploring factors affecting molecular representation reliability. We show how VAEs and similar anomaly generation methods can empirically stress-test molecular representation robustness. Additionally, we investigate reasons for the invalidity of some discovered SELFIES strings (version 2.1.1) and suggest changes to improve them, aiming to spark ongoing molecular representation improvement.
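The stress-testing idea in the abstract can be illustrated with a minimal fuzzing harness. This is only a sketch under stated assumptions: it replaces the paper's VAE-based anomaly generator with plain random sampling of token strings, and the alphabet `TOY_ALPHABET`, the helper names, and the deliberately buggy `toy_convert` are all hypothetical stand-ins, not part of the SELFIES library or the authors' code.

```python
import random
from typing import Callable, List, Optional

# Hypothetical toy alphabet; real SELFIES alphabets are much larger.
TOY_ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[Branch1]", "[Ring1]"]


def random_token_string(rng: random.Random, max_len: int) -> str:
    """Draw a random string of 1..max_len bracketed tokens."""
    length = rng.randint(1, max_len)
    return "".join(rng.choice(TOY_ALPHABET) for _ in range(length))


def fuzz(
    convert: Callable[[str], Optional[str]],
    trials: int,
    seed: int = 0,
    max_len: int = 12,
) -> List[str]:
    """Return every sampled input for which `convert` raised or returned nothing."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    failures = []
    for _ in range(trials):
        candidate = random_token_string(rng, max_len)
        try:
            result = convert(candidate)
        except Exception:
            failures.append(candidate)
            continue
        if not result:
            failures.append(candidate)
    return failures


def toy_convert(token_string: str) -> str:
    """Stand-in converter with a planted bug: it rejects a trailing branch token."""
    if token_string.endswith("[Branch1]"):
        raise ValueError("dangling branch token")
    return token_string.replace("[", "").replace("]", "")


if __name__ == "__main__":
    anomalies = fuzz(toy_convert, trials=200, seed=1)
    print(f"{len(anomalies)} inconvertible strings out of 200 samples")
```

To test a real representation, `convert` could wrap an actual converter, for example `selfies.decoder` from the `selfies` package if it is installed, with an additional chemical validity check on the returned SMILES. Random sampling is a much weaker generator than the VAE the abstract describes, which learns regions of latent space that decode to anomalies far more often.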

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863373
DOI: http://dx.doi.org/10.1021/acs.jcim.4c01876


Similar Publications

Oral bioavailability property prediction based on task similarity transfer learning.

Mol Divers

September 2025

Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, Nanjing, 211198, China.

Drug absorption significantly influences pharmacokinetics. Accurately predicting human oral bioavailability (HOB) is essential for optimizing drug candidates and improving clinical success rates. HOB is traditionally measured experimentally, but experimental measurement is time-consuming and costly.

Diffuse large B-cell lymphoma is the most common type of non-Hodgkin lymphoma (NHL) in humans, accounting for about 30-40% of NHL cases worldwide. Canine diffuse large B-cell lymphoma (cDLBCL) is the most common lymphoma subtype in dogs and demonstrates an aggressive biologic behaviour. For tissue biopsies, current confirmatory diagnostic approaches for enlarged lymph nodes rely on expert histopathological assessment, which is time-consuming and requires specialist expertise.

SPACE: STRING proteins as complementary embeddings.

Bioinformatics

September 2025

Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark.

Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.

Results: We leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings.

Cancer, with its inherent heterogeneity, is commonly categorized into distinct subtypes based on unique traits, cellular origins, and molecular markers specific to each type. However, current studies primarily rely on complete multi-omics datasets for predicting cancer subtypes, often overlooking predictive performance in cases where some omics data may be missing and neglecting implicit relationships across multiple layers of omics data integration. This paper introduces Multi-Layer Matrix Factorization (MLMF), a novel approach for cancer subtyping that employs multi-omics data clustering.

New tick records in the western Brazilian Amazon, with notes on rickettsial infection and molecular evidence for Amblyomma crassum in Brazil.

Acta Trop

September 2025

Instituto de Ciências Biomédicas, Universidade de São Paulo - ICB5/USP, Monte Negro, RO, Brazil; Instituto Nacional de Epidemiologia da Amazônia Ocidental - INCT-EpiAmO, Porto Velho, RO, Brazil; Centro de Pesquisas em Medicina Tropical - CEPEM, Porto Velho, RO, Brazil; Laboratório de Medicina T

This study evaluated the richness and abundance of ticks collected over two years in forest fragments of the state of Acre, western Brazilian Amazon. Considering all the environmental and host collections, the following 15 tick species were collected: Amblyomma coelebs, Amblyomma crassum, Amblyomma humerale, Amblyomma latepunctatum, Amblyomma longirostre, Amblyomma naponense, Amblyomma nodosum, Amblyomma oblongoguttatum, Amblyomma ovale, Amblyomma pacae, Amblyomma rotundatum, Amblyomma scalpturatum, Haemaphysalis juxtakochi, Ixodes luciae and Rhipicephalus microplus. Data from the two most abundant tick species, A.