Researchers are developing increasingly robust molecular representations, motivating the need for thorough methods to stress-test and validate them. Here, we use a variational auto-encoder (VAE), an unsupervised deep learning model, to generate anomalous examples of SELF-referencIng Embedded Strings (SELFIES), a popular molecular string format. These anomalies defy the assertion that all SELFIES convert into valid SMILES strings. Interestingly, we find specific regions within the VAE's internal landscape (latent space), whose decoding frequently generates inconvertible SELFIES anomalies. The model's internal landscape self-organization helps with exploring factors affecting molecular representation reliability. We show how VAEs and similar anomaly generation methods can empirically stress-test molecular representation robustness. Additionally, we investigate reasons for the invalidity of some discovered SELFIES strings (version 2.1.1) and suggest changes to improve them, aiming to spark ongoing molecular representation improvement.
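The stress-testing idea in the abstract can be illustrated with a minimal sketch: sample points from a latent space, decode each one to a string, attempt the conversion to SMILES, and tally the failure rate per latent region. Everything named here (`failure_rate_by_region`, the `toy_*` stand-ins, the 2-D latent square, and the grid binning) is a hypothetical illustration and not the authors' method. In a real experiment the stand-ins would presumably be replaced by a trained VAE decoder, a SELFIES-to-SMILES converter such as `selfies.decoder`, and a chemistry toolkit's SMILES parser.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Tuple

Cell = Tuple[int, ...]


def failure_rate_by_region(
    decode: Callable[[Tuple[float, float]], str],
    to_smiles: Callable[[str], str],
    is_valid: Callable[[str], bool],
    n_samples: int = 1000,
    bins: int = 4,
    span: float = 3.0,
    seed: int = 0,
) -> Dict[Cell, float]:
    """Sample a 2-D latent square and return the conversion failure rate per grid cell."""
    rng = random.Random(seed)
    totals: Dict[Cell, int] = defaultdict(int)
    failures: Dict[Cell, int] = defaultdict(int)
    for _ in range(n_samples):
        # Draw a latent point uniformly from [-span, span] x [-span, span].
        z = (rng.uniform(-span, span), rng.uniform(-span, span))
        # Map each coordinate to a grid index in 0..bins-1.
        cell = tuple(min(int((c + span) / (2 * span) * bins), bins - 1) for c in z)
        totals[cell] += 1
        try:
            ok = is_valid(to_smiles(decode(z)))
        except Exception:
            # A conversion error counts as an anomaly.
            ok = False
        if not ok:
            failures[cell] += 1
    return {cell: failures[cell] / totals[cell] for cell in totals}


# Toy stand-ins (assumptions for illustration only, not a real model):
def toy_decode(z: Tuple[float, float]) -> str:
    # Pretend that one latent region decodes to an inconvertible string.
    return "[Invalid]" if z[0] > 2.0 else "[C][C][O]"


def toy_to_smiles(selfies_string: str) -> str:
    if selfies_string == "[Invalid]":
        raise ValueError("cannot convert")
    return "CCO"


def toy_is_valid(smiles: str) -> bool:
    return bool(smiles)


if __name__ == "__main__":
    rates = failure_rate_by_region(toy_decode, toy_to_smiles, toy_is_valid)
    for cell, rate in sorted(rates.items()):
        print(cell, round(rate, 2))
```

With the toy stand-ins, failures appear only in the grid cells covering the largest values of the first latent coordinate, mimicking the kind of localized anomaly-rich region the abstract describes.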
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863373 | PMC |
http://dx.doi.org/10.1021/acs.jcim.4c01876 | DOI Listing |
Mol Divers
September 2025
Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, Nanjing, 211198, China.
Drug absorption significantly influences pharmacokinetics. Accurately predicting human oral bioavailability (HOB) is essential for optimizing drug candidates and improving clinical success rates. Experimental measurement is the traditional way to obtain HOB, but it is time-consuming and costly.
Front Vet Sci
August 2025
Pathobiology and Population Science, Royal Veterinary College, Hatfield, United Kingdom.
Diffuse large B-cell lymphoma is the most common type of non-Hodgkin lymphoma (NHL) in humans, accounting for about 30-40% of NHL cases worldwide. Canine diffuse large B-cell lymphoma (cDLBCL) is the most common lymphoma subtype in dogs and demonstrates an aggressive biologic behaviour. For tissue biopsies, current confirmatory diagnostic approaches for enlarged lymph nodes rely on expert histopathological assessment, which is time-consuming and requires specialist expertise.
Bioinformatics
September 2025
Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark.
Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.
Results: We leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings.
Brief Bioinform
August 2025
School of Computer Science, Xi'an Polytechnic University, 710048, Xi'an, China.
Cancer, with its inherent heterogeneity, is commonly categorized into distinct subtypes based on unique traits, cellular origins, and molecular markers specific to each type. However, current studies primarily rely on complete multi-omics datasets for predicting cancer subtypes, often overlooking predictive performance in cases where some omics data may be missing and neglecting implicit relationships across multiple layers of omics data integration. This paper introduces Multi-Layer Matrix Factorization (MLMF), a novel approach for cancer subtyping that employs multi-omics data clustering.
Acta Trop
September 2025
Instituto de Ciências Biomédicas, Universidade de São Paulo - ICB5/USP, Monte Negro, RO, Brazil; Instituto Nacional de Epidemiologia da Amazônia Ocidental - INCT-EpiAmO, Porto Velho, RO, Brazil; Centro de Pesquisas em Medicina Tropical - CEPEM, Porto Velho, RO, Brazil; Laboratório de Medicina T
This study evaluated the richness and abundance of ticks collected during two years in forest fragments of the state of Acre, western Brazilian Amazon. Considering all the environmental and host collections, the following 15 tick species were collected: Amblyomma coelebs, Amblyomma crassum, Amblyomma humerale, Amblyomma latepunctatum, Amblyomma longirostre, Amblyomma naponense, Amblyomma nodosum, Amblyomma oblongoguttatum, Amblyomma ovale, Amblyomma pacae, Amblyomma rotundatum, Amblyomma scalpturatum, Haemaphysalis juxtakochi, Ixodes luciae and Rhipicephalus microplus. Data from the two most abundant tick species, A.