Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task.

PLoS Comput Biol

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America.

Published: October 2023


Citations: 20

Article Abstract

Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
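The three-nucleotide periodicity mentioned in the abstract can be illustrated with the classical spectral approach to gene discovery: encode the sequence as four binary indicator tracks (one per base) and measure the Fourier power at frequency 1/3. The sketch below is only an illustration of that signal, not the LFNet architecture or any code from the paper; the function name `period3_power` and the normalization by squared length are assumptions made for this example.

```python
import cmath
import math


def period3_power(seq: str) -> float:
    """Fourier power at frequency 1/3, summed over the four base
    indicator tracks and divided by the squared sequence length.

    Hypothetical helper for illustration; not taken from the paper.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    n = len(seq)
    if n == 0:
        return 0.0

    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator track for this base at k = n/3,
        # i.e. one cycle every three nucleotides.
        coeff = sum(
            cmath.exp(-2j * math.pi * pos / 3)
            for pos, nt in enumerate(seq)
            if nt == base
        )
        total += abs(coeff) ** 2
    return total / (n * n)


if __name__ == "__main__":
    periodic = "ATG" * 20   # each base locked to one codon position
    uniform = "A" * 60      # no positional preference
    print(period3_power(periodic), period3_power(uniform))
```

Under this normalization, a sequence whose bases are each locked to a single codon position scores 1/3, while a homopolymer scores approximately zero. Real coding sequences fall between these extremes, which is the kind of weak, distributed signal an architecture with a periodicity-aware inductive bias is meant to pick up.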

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10597526
DOI: http://dx.doi.org/10.1371/journal.pcbi.1011526

Publication Analysis

Top Keywords

protein-coding potential: 12
translation task: 8
sequence patterns: 8
sequence features: 8
distinguishing mrnas: 8
mrnas lncrnas: 8
neural networks: 8
improves classification: 8
translation: 5
improving deep: 4

Similar Publications

Blood transcriptomic analysis reveals a distinct molecular subtype of treatment resistant depression compared to non-treatment resistant depression.

Brain Behav Immun

September 2025

Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology (BIST), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Biomedical Research Networking Center for Rare Diseases (CIBERER), Barcelona 08003, Spain.

Treatment-resistant depression (TRD) is a severe condition characterized by chronic and recurrent depressive symptoms, leading to significant morbidity and a considerable socio-economic impact. Genetic and biological studies suggest that TRD is associated with distinct biological characteristics. In this study, we analysed whole-transcriptome differences in 293 patients with major depressive disorder (MDD) to compare TRD (N = 150) vs non-TRD (N = 143) cases.

Background: The proteome is a valuable resource for pinpointing therapeutic targets. Therefore, we conducted a proteome-wide Mendelian randomization (MR) study aimed at identifying potential protein markers and therapeutic targets for Anti-N-Methyl-D-Aspartate Receptor Encephalitis (NMDAR-E).

Methods: Protein quantitative trait loci (pQTLs) were obtained from seven published genome-wide association studies (GWASs) focusing on the plasma proteome, resulting in summary-level data for 734 circulating protein markers.

Hayata 1916 is a unique bamboo species endemic to Taiwan, typically found at elevations ranging from 500 to 1,500 meters. This study provides a detailed analysis of the complete chloroplast genome of for the first time. The genome spans 139,664 base pairs (bp) and consists of a large single-copy (LSC) region of 83,192 bp, a small single-copy (SSC) region of 12,869 bp, and two inverted repeat (IR) regions, each 21,798 bp in length.

Morphology and molecular phylogeny of from and .

Mycobiology

September 2025

Division of Environmental Science and Ecological Engineering, College of Life Sciences and Biotechnology, Korea University, Seoul, Korea.

The main objective of the present study is to compile and comprehensively reevaluate all known records of in order to establish a standardized framework for the accurate characterization and identification of this species. Nine isolates of obtained from and from various regions of Korea were analyzed. The morphological features of the fungus and isolated colonies were described and illustrated.

Genome imbalance, resulting from varying the dosage of individual chromosomes (aneuploidy), has a more detrimental effect than changes in complete sets of chromosomes (haploidy/polyploidy). This imbalance is likely due to disruptions in stoichiometry and interactions among macromolecular assemblies. Previous research has shown that aneuploidy causes global modulation of protein-coding genes (PCGs), microRNAs, and transposable elements (TEs), affecting both the varied chromosome (cis-located) and unvaried genome regions (trans-located) across various taxa.
