Molecular Assays Simulator to Unravel Predictors Hacking in Goal-Directed Molecular Generations.

Philippe Gendreau, Joseph-André Turk, Nicolas Drizard, Vinicius Barros Ribeiro da Silva, Clarisse Descamps, Yann Gaston-Mathé

J Chem Inf Model

Iktos, 65 rue de Prony, 75017, Paris, France.

Published: July 2023

Generative models are increasingly used in drug discovery, very often coupled with absorption, distribution, metabolism, and excretion (ADME) bioassays or quantitative structure-activity relationship (QSAR) models to optimize a given set of properties. The molecules proposed by these algorithms often turn out to be false positives: they are predicted to be active but prove inactive after synthesis and testing. This is mostly due to overoptimization of the predicted scores, which leads to stagnation or an actual decrease of the real scores. This behavior is also known as the "hacking" of the predictive models by the generative model during the optimization step. The issue is reminiscent of adversarial examples in machine learning and can be seen as an instance of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." It is even more apparent in a multiparameter optimization (MPO) case, where the models need to extrapolate outside the training set distribution because no known molecule in the initial training set satisfies all the objectives simultaneously. Experimental evaluation of this problem is hard and expensive, since it requires synthesis and testing of the generated molecules. Thus, efforts have been made to develop in silico "oracles" (real-valued functions used as proxies for molecular properties) to help evaluate these generative-model-based pipelines. However, these oracles have had limited value so far because they are often too easy to model in comparison with biological assays and are usually limited to mono-objective cases.

In this work, we introduce a simulator of multitarget assays using a smartly initialized neural network (NN) that returns continuous values for any input molecule. We use this oracle to replicate a real-world prospective lead optimization (LO) scenario. First, we trained predictive models on an initial small sample of molecules to predict their oracle values. Afterward, we generated new optimized molecules using the open-source GuacaMol package coupled with the previously built predictive models. Finally, we selected compounds matching the candidate drug target profile (CDTP) according to the predicted values and evaluated them by computing the true oracle values. We observed that even when the predictive models had excellent estimated performance metrics, the final selection still contained multiple false positives according to the NN-based oracle. Then, we evaluated the optimization behavior in mono- and bi-objective scenarios using either a logistic regression or a random forest predictive model. We also propose and evaluate several methods to help mitigate the hacking issue.
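The train, optimize, select, and evaluate loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' method: molecules are replaced by random bit vectors of length `N_BITS`, the oracle is a randomly initialized one-hidden-layer network (not the paper's NN-based simulator), the predictive model is a linear surrogate fitted by gradient descent (the paper uses logistic regression or random forest), and a greedy bit-flip hill climb stands in for the GuacaMol generator. The function names `make_oracle`, `fit_linear`, `hill_climb`, and `run_experiment` are hypothetical.

```python
import math
import random

N_BITS = 32  # length of the toy "fingerprint" (assumption, not from the paper)


def make_oracle(rng, n_hidden=8):
    """Random one-hidden-layer network used as a stand-in assay simulator."""
    w1 = [[rng.gauss(0, 1) for _ in range(N_BITS)] for _ in range(n_hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]

    def oracle(x):
        hidden = [math.tanh(sum(w * b for w, b in zip(row, x))) for row in w1]
        return sum(w * h for w, h in zip(w2, hidden))

    return oracle


def fit_linear(xs, ys, epochs=200, lr=0.01):
    """Linear surrogate model fitted by plain stochastic gradient descent."""
    w, b = [0.0] * N_BITS, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = b + sum(wi * xi for wi, xi in zip(w, x)) - y
            b -= lr * err
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return lambda x: b + sum(wi * xi for wi, xi in zip(w, x))


def hill_climb(score, x, rng, steps=200):
    """Greedy single-bit flips; a flip is kept only if the score does not drop."""
    x = list(x)
    best = score(x)
    for _ in range(steps):
        i = rng.randrange(N_BITS)
        x[i] ^= 1
        new = score(x)
        if new >= best:
            best = new
        else:
            x[i] ^= 1  # revert the flip
    return x


def run_experiment(seed=0, n_train=50, n_starts=20):
    rng = random.Random(seed)
    oracle = make_oracle(rng)
    train = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(n_train)]
    truth = [oracle(x) for x in train]
    model = fit_linear(train, truth)
    # "Active" cut-off: roughly the top 20% of true training values (assumption).
    threshold = sorted(truth)[int(0.8 * n_train)]
    results = []
    for x0 in train[:n_starts]:
        x = hill_climb(model, x0, rng)
        # (predicted score before, predicted score after, true oracle score after)
        results.append((model(x0), model(x), oracle(x)))
    selected = [r for r in results if r[1] >= threshold]
    false_pos = [r for r in selected if r[2] < threshold]
    return results, selected, false_pos


if __name__ == "__main__":
    results, selected, false_pos = run_experiment()
    print(f"selected by prediction: {len(selected)}, false positives: {len(false_pos)}")
```

In this sketch, a candidate counts as a false positive when its predicted score clears the threshold but its oracle score does not. The numbers it prints depend entirely on the random seed and the toy settings, so they say nothing about the results reported in the paper.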

DOI: http://dx.doi.org/10.1021/acs.jcim.3c00195
