Generating correlated data for omics simulation.

PLoS Comput Biol

Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.

Published: September 2025




Article Abstract

Simulation of realistic omics data is a key input for benchmarking studies that help users obtain optimal computational pipelines. Omics data involves large numbers of measured features on each sample and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps due to computational and statistical hurdles of doing so. To alleviate this, we describe three approaches for generating omics-scale data with correlated measures which mimic real datasets. These approaches are all based on a Gaussian copula approach with a covariance matrix that decomposes into a diagonal part and a low-rank part. This decomposition allows for extremely efficient simulation, overcoming a hurdle for adoption of past methods. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance when given gene-gene dependencies in some circumstances. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary marginal distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or with an empirical distribution.
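The efficiency claim in the abstract can be illustrated with a small sketch. It is written in standard-library Python for illustration only and is not the dependentsimr API; the function names (`simulate_counts`, `poisson_quantile`, `pearson`) and the toy parameters are hypothetical. It assumes a latent Gaussian with covariance D + L Lᵀ, where D is diagonal and L has k columns. A draw can then be formed as sqrt(D)·ε + L·w with independent standard normal ε and w, which costs about p·k operations per sample and never builds the full p × p matrix. Each latent coordinate is standardized, mapped to a uniform with the normal CDF (the Gaussian copula step), and pushed through the inverse CDF of the chosen marginal, here Poisson.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam); meant for small lam."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < 10_000:  # the cap guards against round-off stalls
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_counts(loadings, diag, means, n_samples, rng):
    """Draw Poisson counts through a Gaussian copula.

    The latent covariance is diag(diag) + L L^T, where L = loadings is a
    list of p rows (features), each holding k factor weights.
    """
    p, k = len(loadings), len(loadings[0])
    std_normal = NormalDist()
    # Marginal standard deviation of each latent coordinate.
    sds = [math.sqrt(diag[j] + sum(w * w for w in loadings[j])) for j in range(p)]
    samples = []
    for _ in range(n_samples):
        factors = [rng.gauss(0.0, 1.0) for _ in range(k)]  # shared low-rank part
        row = []
        for j in range(p):
            shared = sum(loadings[j][f] * factors[f] for f in range(k))
            z = math.sqrt(diag[j]) * rng.gauss(0.0, 1.0) + shared
            u = std_normal.cdf(z / sds[j])           # copula: uniform marginal
            u = min(max(u, 1e-12), 1.0 - 1e-12)      # keep away from 0 and 1
            row.append(poisson_quantile(u, means[j]))
        samples.append(row)
    return samples


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

For example, calling `simulate_counts([[0.9], [0.9], [0.0]], [0.19, 0.19, 1.0], [5.0, 5.0, 5.0], 2000, random.Random(0))` gives three Poisson features in which the first two share one latent factor and should come out strongly correlated, while the third stays roughly independent of both. Other marginals would only change the quantile function.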


Source
http://dx.doi.org/10.1371/journal.pcbi.1013392


Similar Publications

Catalyzing computational biology research at an academic institute through an interest network.

PLoS Comput Biol

September 2025

Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, United States of America.

Biology has been transformed by the rapid development of computing and the concurrent rise of data-rich approaches such as omics or high-resolution imaging. However, there is a persistent computational skills gap in the biomedical research workforce. Inherent limitations of classroom teaching and institutional core support highlight the need for accessible ways for researchers to explore developments in computational biology.


Summary: Causal mediation analysis investigates the role of mediators in the relationship between exposure and outcome. In the analysis of omics or imaging data, mediators are often high-dimensional, presenting challenges such as multicollinearity and interpretability. Existing methods either compromise interpretability or fail to effectively prioritize mediators.


InterVelo: A Mutually Enhancing Model for Estimating Pseudotime and RNA Velocity in Multi-Omic Single-Cell Data.

Bioinformatics

September 2025

Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China.

Motivation: RNA velocity has become a powerful tool for uncovering transcriptional dynamics in snapshot single-cell data. However, current RNA velocity approaches often assume constant transcriptional rates and treat genes independently with gene-specific times, which may introduce biases and deviate from biological realities. Here, we present InterVelo, a novel deep learning framework that simultaneously learns cellular pseudotime and RNA velocity.


The integration of multimodal single-cell omics data is a state-of-the-art strategy for deciphering cellular heterogeneity and gene regulatory mechanisms. Recent advances in single-cell technologies have enabled the comprehensive characterization of cellular states and their interactions. However, integrating these high-dimensional and heterogeneous datasets poses significant computational challenges, including batch effects, sparsity, and modality alignment.


The rapid advancement of single-cell sequencing technology has generated vast amounts of multi-omics data, presenting unprecedented opportunities for single-cell multi-omics clustering analysis. However, existing single-cell clustering algorithms focus on extracting shared representations, overlooking the interactions and correlations among cells. This oversight inevitably leads to biased or confounded cell clustering results.
