Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors.

Ilya E Vorontsov, Ivan Kozin, Sergey Abramov, Alexandr Boytsov, Arttu Jolma, Mihai Albu, Giovanna Ambrosini, Katerina Faltejskova, Antoni J Gralak, Nikita Gryzunov, Sachi Inukai, Semyon Kolmykov, Pavel Kravchenko, Judith F Kribelbauer-Swietek, Kaitlin U Laverty, Vladimir Nozdrin, Zain M Patel, Dmitry Penzar, Marie-Luise Plescher, Sara E Pour

bioRxiv

Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.

Published: November 2024

A DNA sequence pattern, or "motif", is an essential representation of the DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and the computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications. We applied ten different DNA motif discovery tools to generate PWMs from the "Codebook" data, comprising 4,237 experiments from five different platforms profiling the DNA-binding specificity of 394 human proteins, focusing on understudied transcription factors of different structural families. For many of the proteins, there was no prior knowledge of a genuine motif. By benchmarking-supported human curation, we constructed an approved subset of experiments comprising about 30% of all experiments and 50% of tested TFs which displayed consistent motifs across platforms and replicates. We present the Codebook Motif Explorer (https://mex.autosome.org), a detailed online catalog of DNA motifs, including the top-ranked PWMs, and the underlying source and benchmarking data. We demonstrate that in the case of high-quality experimental data, most of the popular motif discovery tools detect valid motifs and generate PWMs that perform well on both genomic and synthetic data. Yet, for each of the algorithms, there were problematic combinations of proteins and platforms, and basic motif properties such as nucleotide composition and information content offered little help in detecting such pitfalls. By combining multiple PWMs in decision trees, we demonstrate how our setup can be readily adapted to train and test binding specificity models more complex than PWMs.
Overall, our study provides a rich motif catalog as a solid baseline for advanced models and highlights the power of the multi-platform multi-tool approach for reliable mapping of DNA binding specificities.
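
For readers unfamiliar with PWMs, the following is a minimal illustrative sketch, not the pipeline used in the study. It assumes a toy count matrix, a uniform background, and a pseudocount of 1; the function names (counts_to_pwm, information_content, best_hit) and the example motif are hypothetical. It shows how a PWM can be derived from base counts, how its information content is computed, and how a sequence is scanned for the best-scoring window.

```python
import math

BASES = "ACGT"

def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights.

    Assumes a uniform background frequency for every base.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def information_content(counts, pseudocount=1.0, background=0.25):
    """Total information content (bits) relative to the uniform background."""
    ic = 0.0
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        for b in BASES:
            p = (column[b] + pseudocount) / total
            ic += p * math.log2(p / background)
    return ic

def best_hit(pwm, sequence):
    """Return (score, offset) of the best-scoring window on the given strand."""
    width = len(pwm)
    best = None
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(pwm[j][window[j]] for j in range(width))
        if best is None or score > best[0]:
            best = (score, i)
    return best

# Toy count matrix for a hypothetical 4-bp motif resembling TGAC.
toy_counts = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 17, "G": 1, "T": 1},
]

toy_pwm = counts_to_pwm(toy_counts)
toy_ic = information_content(toy_counts)
toy_score, toy_offset = best_hit(toy_pwm, "AAATGACAAA")
print(f"information content: {toy_ic:.3f} bits")
print(f"best hit: offset {toy_offset}, score {toy_score:.3f}")
```

A practical scanner would also score the reverse-complement strand and use an empirical background. Per-sequence best-hit scores from several PWMs could then serve as input features for a more complex model such as a decision tree, in the spirit of the combination described in the abstract.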

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11601219
DOI: http://dx.doi.org/10.1101/2024.11.11.619379
