Background: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.
Results: Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Moreover, highly tuned supervised models trained from scratch on one-hot encoded sequences achieve performance competitive with or better than pre-trained models across the datasets explored in this study.
Discussion: This work highlights a major gap in current gLMs, raising potential concerns about conventional pre-training strategies for the non-coding genome.
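The Results paragraph above contrasts probing frozen gLM representations with conventional supervised models trained on one-hot encoded sequences. The following minimal sketch illustrates what that comparison looks like in practice; it is not the paper's code: `get_glm_embeddings` is a hypothetical placeholder for any pre-trained genomic language model, ridge regression stands in for the paper's highly tuned supervised baselines (which are CNNs), and the sequences and labels are synthetic so the example runs end to end.

```python
# Sketch: linear "probe" on frozen gLM embeddings vs. the same probe on
# one-hot encoded sequences. Hypothetical stand-ins, not the paper's method.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

ALPHABET = "ACGT"

def one_hot(seqs):
    """One-hot encode equal-length sequences into flat (N, L*4) features."""
    lookup = {b: i for i, b in enumerate(ALPHABET)}
    out = np.zeros((len(seqs), len(seqs[0]), 4), dtype=np.float32)
    for n, s in enumerate(seqs):
        for l, b in enumerate(s):
            if b in lookup:
                out[n, l, lookup[b]] = 1.0
    return out.reshape(len(seqs), -1)

def get_glm_embeddings(seqs, dim=256):
    """Hypothetical placeholder: in practice, run a pre-trained gLM and
    mean-pool its per-token embeddings. Random features are used here only
    so the sketch executes."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(seqs), dim)).astype(np.float32)

def probe(features, labels):
    """Fit a ridge 'probe' on frozen features and report held-out R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

if __name__ == "__main__":
    # Toy data: random 200-bp sequences with synthetic activity labels.
    rng = np.random.default_rng(1)
    seqs = ["".join(rng.choice(list(ALPHABET), size=200)) for _ in range(500)]
    labels = rng.normal(size=500)

    print("gLM-embedding probe R^2:", probe(get_glm_embeddings(seqs), labels))
    print("one-hot probe R^2:      ", probe(one_hot(seqs), labels))
```

In the study itself, the probes are trained on embeddings from real pre-trained gLMs and the baselines are supervised networks tuned per task; the point of the sketch is only the experimental design, not the expected numbers.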
Full text: PMC http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12261763 | DOI http://dx.doi.org/10.1186/s13059-025-03674-8
Genome Biol
July 2025
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
J Biomed Inform
September 2024
Department of Pathology, Seoul National University College of Medicine, Seoul, Republic of Korea; Department of Pathology and Translational Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea.
Background: In oncology, electronic health records contain textual key information for the diagnosis, staging, and treatment planning of patients with cancer. However, text data processing requires a lot of time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research.
Bioinformatics
September 2024
InstaDeep, Cambridge, MA 02142, United States.
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.
bioRxiv
September 2024
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA.
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.