Publications by Dmitry Penzar | LitMetric

Publications by authors named "Dmitry Penzar"

GAME: Genomic API for Model Evaluation.

Ishika Luthra, Satyam Priyadarshi, Rui Guo, Lukas Mahieu, Niklas Kempynck, Dmitry Penzar

bioRxiv

July 2025

The rapid expansion of genomics datasets and the application of machine learning have produced sequence-to-activity genomics models with ever-expanding capabilities. However, benchmarking these models on practical applications has been challenging because individual projects evaluate their models in ad hoc ways, and there is substantial heterogeneity of both model architectures and benchmarking tasks. To address this challenge, we have created GAME, a system for large-scale, community-led standardized model benchmarking on user-defined evaluation tasks.

GENA-LM: a family of open-source foundational DNA language models for long sequences.

Veniamin Fishman, Yuri Kuratov, Aleksei Shmelev, Maxim Petrov, Dmitry Penzar

Nucleic Acids Res

January 2025

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs.

Massively parallel characterization of transcriptional regulatory elements.

Vikram Agarwal, Fumitaka Inoue, Max Schubach, Dmitry Penzar, Beth K Martin

Nature

March 2025

The human genome contains millions of candidate cis-regulatory elements (cCREs) with cell-type-specific activities that shape both health and many disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these cCREs. Here we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of more than 680,000 sequences, representing an extensive set of annotated cCREs among three cell types (HepG2, K562 and WTC11) [...]

A generative framework for enhanced cell-type specificity in rationally designed mRNAs.

Matvei Khoroshkin, Arsenii Zinkevich, Elizaveta Aristova, Hassan Yousefi, Sean B Lee, Dmitry Penzar

bioRxiv

December 2024

mRNA delivery offers new opportunities for disease treatment by directing cells to produce therapeutic proteins. However, designing highly stable mRNAs with programmable cell type-specificity remains a challenge. To address this, we measured the regulatory activity of 60,000 5' and 3' untranslated regions (UTRs) across six cell types and developed PARADE (Prediction And RAtional DEsign of mRNA UTRs), a generative AI framework to engineer untranslated RNA regions with tailored cell type-specific activity.

Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors.

Ilya E Vorontsov, Ivan Kozin, Sergey Abramov, Alexandr Boytsov, Arttu Jolma, Dmitry Penzar

bioRxiv

November 2024

A DNA sequence pattern, or "motif", is an essential representation of DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications.

A community effort to optimize sequence-based deep learning models of gene regulation.

Abdul Muntakim Rafi, Daria Nogina, Dmitry Penzar, Dohoon Lee, Danyeong Lee

Nat Biotechnol

August 2025

Article Synopsis

* A systematic evaluation is necessary to understand how different model architectures and training strategies affect the performance of genomics models, prompting the organization of a DREAM Challenge.
* In the challenge, competitors used a vast dataset of yeast DNA sequences and expression levels to train models, with the best models employing various neural network architectures and training approaches.
* The development of the Prix Fixe framework allowed for an in-depth analysis of these models, leading to improved performance, and demonstrating that top models not only excelled on yeast data but also outperformed existing benchmarks in Drosophila and human datasets.

PhyloBench: A Benchmark for Evaluating Phylogenetic Programs.

Sergey Spirin, Andrey Sigorskikh, Aleksei Efremov, Dmitry Penzar, Anna Karyagina

Mol Biol Evol

June 2024

Phylogenetic inference based on protein sequence alignment is a widely used procedure. Numerous phylogenetic algorithms have been developed, most of which have many parameters and options. Choosing a program, options, and parameters can be a nontrivial task.

Ribonanza: deep learning of RNA structure through dual crowdsourcing.

Shujun He, Rui Huang, Jill Townley, Rachael C Kretsch, Thomas G Karagianes, Dmitry Penzar

bioRxiv

June 2024

Article Synopsis

* The prediction of RNA structure from its sequence is challenging due to a lack of experimental data, which has slowed advancement in the field.
* Researchers have developed a dataset called Ribonanza, consisting of chemical mapping data from two million RNA sequences, collected through crowdsourcing platforms like Eterna.
* Utilizing this dataset, they created a deep learning model named RibonanzaNet, which, when fine-tuned, demonstrates superior performance in predicting various RNA behaviors, potentially improving understanding of RNA structures.

Evaluation and optimization of sequence-based gene regulatory deep learning models.

Abdul Muntakim Rafi, Daria Nogina, Dmitry Penzar, Dohoon Lee, Danyeong Lee

bioRxiv

February 2024

Neural networks have emerged as immensely powerful tools in predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression.

LegNet: a best-in-class deep learning model for short DNA regulatory regions.

Dmitry Penzar, Daria Nogina, Elizaveta Noskova, Arsenii Zinkevich, Georgy Meshcheryakov

Bioinformatics

August 2023

Motivation: The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar.

Results: Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome [...]
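
One common way to recast regression as soft classification is to spread each scalar target over a set of discrete bins, train against those soft labels with cross-entropy, and recover a prediction as the expected bin index. The sketch below illustrates that general idea only. The bin count, the Gaussian weighting, and all function names are assumptions made for illustration; the abstract does not state how LegNet implements this step.

```python
import math

def soft_labels(value, n_bins=18, sigma=1.0):
    """Spread a scalar target over integer-centred bins (hypothetical settings)."""
    # Gaussian weight of each bin centre 0..n_bins-1 around the target value.
    weights = [math.exp(-0.5 * ((k - value) / sigma) ** 2) for k in range(n_bins)]
    total = sum(weights)
    # Normalise so the weights form a probability distribution over bins.
    return [w / total for w in weights]

def expected_value(probs):
    """Collapse a bin distribution back to a scalar: the expected bin index."""
    return sum(k * p for k, p in enumerate(probs))

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Loss between soft targets and a model's predicted bin probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))
```

With targets well inside the bin range, `expected_value(soft_labels(x))` lands very close to `x`, so the scalar can be recovered from the class distribution at prediction time.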

Investigation of the Role of PUFA Metabolism in Breast Cancer Using a Rank-Based Random Forest Algorithm.

Mariia V Guryleva, Dmitry D Penzar, Dmitry V Chistyakov, Andrey A Mironov, Alexander V Favorov

Cancers (Basel)

September 2022

Article Synopsis

* A study utilized a machine learning approach to identify and classify important PUFA metabolism genes altered in breast cancer tissues compared to normal tissues, achieving high accuracy in predictions.
* The research found that PUFA metabolism genes are crucial for breast cancer development, and different molecular subtypes of breast cancer exhibit distinct patterns in the expression of these genes.

Landscape of allele-specific transcription factor binding in the human genome.

Sergey Abramov, Alexandr Boytsov, Daria Bykova, Dmitry D Penzar, Ivan Yevshin

Nat Commun

May 2021

Sequence variants in gene regulatory regions alter gene expression and contribute to phenotypes of individual cells and the whole organism, including disease susceptibility and progression. Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Differential transcription factor binding in heterozygous genomic loci provides a natural source of information on such regulatory variants.

Insights gained from a comprehensive all-against-all transcription factor binding motif benchmarking study.

Giovanna Ambrosini, Ilya Vorontsov, Dmitry Penzar, Romain Groux, Oriol Fornes

Genome Biol

May 2020

Background: Positional weight matrix (PWM) is a de facto standard model to describe transcription factor (TF) DNA binding specificities. PWMs inferred from in vivo or in vitro data are stored in many databases and used in a plethora of biological applications. This calls for comprehensive benchmarking of public PWM models with large experimental reference sets.
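
For readers new to the model: a PWM assigns a weight to each base at each motif position, a candidate site is scored by summing the weights of its bases, and scanning a sequence means taking the best-scoring window. The minimal Python sketch below shows that mechanic only. The 3-bp matrix and its values are invented for illustration, the function names are hypothetical, and only the forward strand is scanned; none of this is taken from the benchmarking study itself.

```python
# Toy log-odds PWM for a 3-bp motif; the numbers are invented for illustration.
PWM = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -1.1, "C": 1.0, "G": -0.3, "T": -0.6},
    {"A": -0.9, "C": -0.7, "G": 1.3, "T": -0.4},
]

def score_site(site, pwm=PWM):
    """Sum the per-position weights of the bases in a site of motif length."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return sum(column[base] for column, base in zip(pwm, site))

def best_hit(sequence, pwm=PWM):
    """Slide the PWM along the forward strand; return (best score, 0-based start)."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence shorter than motif")
    return max(
        (score_site(sequence[i:i + width], pwm), i)
        for i in range(len(sequence) - width + 1)
    )
```

Calling `best_hit("TTACGTT")` scores every 3-bp window and reports where the strongest match starts; a benchmark of the kind described above would compare such scores against experimentally determined binding.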

What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants.

Dmitry D Penzar, Arsenii O Zinkevich, Ilya E Vorontsov, Vasily V Sitnik, Alexander V Favorov

Front Genet

October 2019

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants.

H3K4me3, H3K9ac, H3K27ac, H3K27me3 and H3K9me3 Histone Tags Suggest Distinct Regulatory Evolution of Open and Condensed Chromatin Landmarks.

Anna A Igolkina, Arsenii Zinkevich, Kristina O Karandasheva, Aleksey A Popov, Maria V Selifanova, Dmitry Penzar

Cells

September 2019

Background: Transposons are selfish genetic elements that self-reproduce in host DNA. They were active during evolutionary history and now occupy almost half of mammalian genomes. Close insertions of transposons reshaped structure and regulation of many genes considerably.

Correction: Nikitin, D., et al. Retroelement-Linked Transcription Factor Binding Patterns Point to Quickly Developing Molecular Pathways in Human Evolution. Cells 2019, [...], 130.

Daniil Nikitin, Andrew Garazha, Maxim Sorokin, Dmitry Penzar, Victor Tkachev

Cells

August 2019

In the article 'Retroelement-Linked Transcription Factor Binding Patterns Point to Quickly Developing Molecular Pathways in Human Evolution,' a number of transcription factor binding sites (TFBS) mapped on all retroelement classes were incorrectly calculated as the sum of TFBS numbers separately mapped on LINEs, SINEs and LTR retrotransposons/endogenous retroviruses (LR/ERVs) [...]

Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay.

Dustin Shigaki, Orit Adato, Aashish N Adhikari, Shengcheng Dong, Alex Hawkins-Hooker, Dmitry D Penzar

Hum Mutat

September 2019

The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated human enhancers and nine disease-associated promoters. A library of mutations covering all bases was generated by saturation mutagenesis and altered activity was assessed in a massively parallel reporter assay (MPRA) in relevant cell lines.

Retroelement-Linked Transcription Factor Binding Patterns Point to Quickly Developing Molecular Pathways in Human Evolution.

Daniil Nikitin, Andrew Garazha, Maxim Sorokin, Dmitry Penzar, Victor Tkachev

Cells

February 2019

Retroelements (REs) are transposable elements occupying ~40% of the human genome that can regulate genes by providing transcription factor binding sites (TFBS). RE-linked TFBS profile can serve as a marker of gene transcriptional regulation evolution. This approach allows for interrogating the regulatory evolution of organisms with RE-rich genomes.

PQ, a new program for phylogeny reconstruction.

Dmitry Penzar, Mikhail Krivozubov, Sergey Spirin

BMC Bioinformatics

October 2018

Background: Many algorithms and programs are available for phylogenetic reconstruction of families of proteins. Methods used widely at present use either a number of distance-based principles or character-based principles of maximum parsimony or maximum likelihood.

Results: We developed a novel program, named PQ, for reconstructing protein and nucleic acid phylogenies following a new character-based principle.

Profiling of Human Molecular Pathways Affected by Retrotransposons at the Level of Regulation by Transcription Factor Proteins.

Daniil Nikitin, Dmitry Penzar, Andrew Garazha, Maxim Sorokin, Victor Tkachev

Front Immunol

February 2019

Endogenous retroviruses and retrotransposons, also termed retroelements (REs), are mobile genetic elements that were active until recently in human genome evolution. REs regulate gene expression by actively reshaping chromatin structure or by directly providing transcription factor binding sites (TFBSs). We aimed to identify molecular processes most deeply impacted by the REs in human cells at the level of TFBS regulation.
