RNA-binding proteins (RBPs) are key regulators of gene expression; however, their RNA-binding specificities, that is, motifs, have not been comprehensively determined. Here we introduce Eukaryotic Protein-RNA Interactions (EuPRI), a freely available resource of RNA motifs for 34,746 RBPs from 690 eukaryotes. EuPRI includes in vitro binding data for 504 RBPs, including newly collected RNAcompete data for 174 RBPs, along with thousands of predicted motifs.
Messenger RNA (mRNA) is central in gene expression, and its half-life, localization, and translation efficiency drive phenotypic diversity in eukaryotic cells. While supervised learning has widely been used to study the mRNA regulatory code, self-supervised foundation models support a wider range of transfer learning tasks. However, the dearth and homogeneity of standardized benchmarks limit efforts to pinpoint the strengths of various models.
Mutational signatures of single-base substitutions (SBSs) characterize somatic mutation processes which contribute to cancer development and progression. However, current mutational signatures do not distinguish the two independent steps that generate SBSs: the initial DNA damage followed by erroneous repair. To address this modelling gap we developed DAMUTA, a hierarchical Bayesian probabilistic model that infers separate signatures for each process, and captures their sample-specific interaction.
Alternative splicing is essential for plants, enabling a single gene to produce multiple transcript variants to boost functional diversity and fine-tune responses to environmental and developmental cues. Arabidopsis thaliana At-RS31, a plant-specific splicing factor in the Serine/Arginine-rich (SR) protein family, responds to light and the Target of Rapamycin (TOR) signalling pathway, yet its downstream targets and regulatory impact remain unknown. To identify At-RS31 targets, we applied individual-nucleotide resolution crosslinking and immunoprecipitation (iCLIP) and RNAcompete assays.
BMC Med Genomics
May 2025
Background: Recent decades have witnessed a steady decrease in the use of race categories in genomic studies. While studies that still include race categories vary in goal and type, these categories already build on a history during which racial color lines have been enforced and adjusted in the service of social and political systems of power and disenfranchisement. For early modern classification systems, data collection was also considerably arbitrary and limited.
PLoS Comput Biol
December 2024
Phylogenies depicting the evolutionary history of genetically heterogeneous subpopulations of cells from the same cancer, i.e., cancer phylogenies, offer valuable insights about cancer development and guide treatment strategies.
Alternative splicing is essential for plants, enabling a single gene to produce multiple transcript variants to boost functional diversity and fine-tune responses to environmental and developmental cues. At-RS31, a plant-specific splicing factor in the Serine/Arginine (SR)-rich protein family, responds to light and the Target of Rapamycin (TOR) signaling pathway, yet its downstream targets and regulatory impact remain unknown. To identify At-RS31 targets, we applied individual-nucleotide resolution crosslinking and immunoprecipitation (iCLIP) and RNAcompete assays.
Most of the human genome is thought to be non-functional, and includes large segments often referred to as "dark matter" DNA. The genome also encodes hundreds of putative and poorly characterized transcription factors (TFs). We determined genomic binding locations of 166 uncharacterized human TFs in living cells.
RNA-binding proteins (RBPs) are key regulators of gene expression. Here, we introduce EuPRI (Eukaryotic Protein-RNA Interactions) - a freely available resource of RNA motifs for 34,736 RBPs from 690 eukaryotes. EuPRI includes binding data for 504 RBPs, including newly collected RNAcompete data for 174 RBPs, along with thousands of reconstructed motifs.
In the face of rapidly accumulating genomic data, our ability to accurately predict key mature RNA properties that underlie transcript function and regulation remains limited. Pre-trained genomic foundation models offer an avenue to adapt learned RNA representations to biological prediction tasks. However, existing genomic foundation models are trained using strategies borrowed from textual or visual domains that do not leverage biological domain knowledge.
Cancers differ in how they establish metastases. These differences can be studied by reconstructing the metastatic spread of a cancer from sequencing data of multiple tumors. Current methods to do so are limited by computational scalability and rely on technical assumptions that do not reflect current clinical knowledge.
Subclonal reconstruction algorithms use bulk DNA sequencing data to quantify parameters of tumor evolution, allowing an assessment of how cancers initiate, progress and respond to selective pressures. We launched the ICGC-TCGA (International Cancer Genome Consortium-The Cancer Genome Atlas) DREAM Somatic Mutation Calling Tumor Heterogeneity and Evolution Challenge to benchmark existing subclonal reconstruction algorithms. This 7-year community effort used cloud computing to benchmark 31 subclonal reconstruction algorithms on 51 simulated tumors.
The basal breast cancer subtype is enriched for triple-negative breast cancer (TNBC) and displays consistent large chromosomal deletions. Here, we characterize evolution and maintenance of chromosome 4p (chr4p) loss in basal breast cancer. Analysis of The Cancer Genome Atlas data shows recurrent deletion of chr4p in basal breast cancer.
Stem cells regulate their self-renewal and differentiation fate outcomes through both symmetric and asymmetric divisions. m6A RNA methylation controls symmetric commitment and inflammation of hematopoietic stem cells (HSCs) through unknown mechanisms. Here, we demonstrate that the nuclear speckle protein SON is an essential m6A target required for murine HSC self-renewal, symmetric commitment, and inflammation control.
Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible, relying on features derived from whole genome sequencing (WGS), or predicting across limited cancer types. We use genomic features from a dataset of 39,787 solid tumors sequenced using a clinical targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks.
The mRNA 3' poly(A) tail plays a critical role in regulating both mRNA translation and turnover. It is bound by the cytoplasmic poly(A) binding protein (PABPC), an evolutionarily conserved protein that can interact with translation factors and mRNA decay machineries to regulate gene expression. Mammalian PABPC1, the prototypical PABPC, is expressed in most tissues and interacts with eukaryotic translation initiation factor 4G (eIF4G) to stimulate translation in specific contexts.
Thousands of RNA-binding proteins (RBPs) crosslink to cellular mRNA. Among these are numerous unconventional RBPs (ucRBPs), proteins that associate with RNA but lack known RNA-binding domains (RBDs). The vast majority of ucRBPs have uncharacterized RNA-binding specificities.
Cancer genomes harbor a catalog of somatic mutations. The type and genomic context of these mutations depend on their causes and allow their attribution to particular mutational signatures. Previous work has shown that mutational signature activities change over the course of tumor development, but investigations of genomic region variability in mutational signatures have been limited.
STAR Protoc
December 2022
Pairtree is a clone tree reconstruction algorithm that uses somatic point mutations to build clone trees describing the evolutionary history of individual cancers. Using the Pairtree software package, we describe steps to preprocess somatic mutation data, cluster mutations into subclones, search for clone trees, and visualize clone trees. Pairtree builds clone trees using up to 100 samples from a single cancer with at least 30 subclonal populations.
The coronavirus disease 2019 (COVID-19) pandemic has caused millions of deaths around the world and revealed the need for data-driven models of pandemic spread. Accurate pandemic caseload forecasting allows informed policy decisions on the adoption of non-pharmaceutical interventions (NPIs) to reduce disease transmission. Using COVID-19 as an example, we present Pandemic conditional Ordinary Differential Equation (PAN-cODE), a deep learning method to forecast daily increases in pandemic infections and deaths.