Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Background: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.
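
The probing setup referenced throughout this abstract differs from fine-tuning: only a lightweight predictor is trained on frozen gLM embeddings, while the pre-trained weights stay fixed. A minimal sketch of a linear probe, assuming embeddings have already been extracted from a gLM; the shapes, synthetic data, and logistic-regression head are illustrative assumptions, not details from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative stand-ins: 1,000 sequences embedded by a frozen gLM into
# 768-dimensional vectors, paired with binary functional-activity labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # frozen gLM representations
labels = rng.integers(0, 2, size=1000)     # e.g., accessible chromatin or not

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)

# "Probing": only this linear head is trained; the gLM itself is never updated.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy: {probe.score(X_test, y_test):.3f}")
```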

Results: Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data spanning DNA and RNA regulation across six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Moreover, highly tuned supervised models trained from scratch on one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study.
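
The conventional baseline mentioned above feeds models one-hot encoded DNA, where each nucleotide occupies its own channel. A minimal sketch of that encoding; the A/C/G/T column ordering and the all-zero treatment of ambiguous bases are common conventions assumed here, not choices specified by the paper:

```python
import numpy as np

def one_hot_encode(seq: str) -> np.ndarray:
    """Map a DNA string to an (L, 4) one-hot matrix over A, C, G, T.

    Ambiguous bases (e.g., N) become all-zero rows.
    """
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = lookup.get(base)
        if j is not None:
            encoded[i, j] = 1.0
    return encoded

print(one_hot_encode("ACGTN"))
```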

Discussion: This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12261763
DOI: http://dx.doi.org/10.1186/s13059-025-03674-8

Publication Analysis

Top Keywords

pre-trained glms: 12
representational power: 8
power pre-trained: 8
language models: 8
regulatory genomics: 8
models glms: 8
non-coding genome: 8
functional genomics: 8
one-hot encoded: 8
encoded sequences: 8

Similar Publications

Background: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.


Background: In oncology, electronic health records contain key textual information for the diagnosis, staging, and treatment planning of patients with cancer. However, processing this text data requires considerable time and effort, which limits its utilization. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research.


Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.


The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.
