Generating correlated data for omics simulation.

Jianing Yang, Gregory R Grant, Thomas G Brooks

PLoS Comput Biol

Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.

Published: September 2025

Simulation of realistic omics data is a key input for benchmarking studies that help users obtain optimal computational pipelines. Omics data involves large numbers of measured features on each sample, and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps because of the computational and statistical hurdles of including them. To address this, we describe three approaches for generating omics-scale data with correlated measures that mimic real datasets. All three are based on a Gaussian copula with a covariance matrix that decomposes into a diagonal part and a low-rank part. This decomposition allows for extremely efficient simulation, overcoming a hurdle to the adoption of past methods. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that the variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance in some circumstances when given gene-gene dependencies. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary marginal distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or empirical distributions.
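To make the diagonal-plus-low-rank idea concrete, the following is a minimal Python sketch of a Gaussian copula whose latent covariance has the form D + U Uᵀ, with Poisson marginals. It is not the dependentsimr API and not the authors' implementation: the function names, the toy loadings, and the choice of Poisson marginals are all assumptions made for illustration. The point it shows is that a latent draw needs only p independent normals plus k shared normals, so the p × p covariance matrix is never formed and the cost per sample is on the order of p·k rather than p².

```python
import math
import random
from statistics import NormalDist


def sample_low_rank_normal(diag, factors, rng):
    """Draw one z ~ N(0, D + U U^T) without building the p x p covariance.

    diag:    length-p list of positive diagonal variances (D). Hypothetical input.
    factors: p rows of k loadings each (U). Hypothetical input.
    """
    k = len(factors[0])
    shared = [rng.gauss(0.0, 1.0) for _ in range(k)]  # k shared latent factors
    return [
        math.sqrt(d) * rng.gauss(0.0, 1.0) + sum(u * s for u, s in zip(row, shared))
        for d, row in zip(diag, factors)
    ]


def poisson_quantile(u, lam):
    """Smallest count n with P(X <= n) >= u for X ~ Poisson(lam).

    Simple running-sum inversion; suitable for moderate lam only,
    since exp(-lam) underflows for very large means.
    """
    n = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u:
        n += 1
        pmf *= lam / n
        cdf += pmf
    return n


def simulate_counts(means, diag, factors, n_samples, seed=0):
    """Simulate correlated Poisson counts through a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Marginal standard deviation of each latent coordinate: sqrt(D_i + sum_j U_ij^2)
    sds = [math.sqrt(d + sum(u * u for u in row)) for d, row in zip(diag, factors)]
    samples = []
    for _ in range(n_samples):
        z = sample_low_rank_normal(diag, factors, rng)
        row = []
        for zi, sd, lam in zip(z, sds, means):
            u = std_normal.cdf(zi / sd)          # latent normal -> uniform
            u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep the inversion loop finite
            row.append(poisson_quantile(u, lam))  # uniform -> Poisson marginal
        samples.append(row)
    return samples


# Toy example: features 0 and 1 load positively on one shared factor,
# feature 3 loads negatively, and feature 2 is independent of the rest.
means = [5.0, 20.0, 1.0, 8.0]
diag = [0.2, 0.2, 1.0, 0.2]
factors = [[0.9], [0.9], [0.0], [-0.9]]
counts = simulate_counts(means, diag, factors, n_samples=2000, seed=1)
```

In this sketch each feature keeps its own Poisson mean, while the shared factor induces positive dependence between features 0 and 1 and negative dependence between those and feature 3. Swapping `poisson_quantile` for another inverse CDF would change the marginal distribution without touching the dependence structure.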

DOI: http://dx.doi.org/10.1371/journal.pcbi.1013392

Similar Publications

Catalyzing computational biology research at an academic institute through an interest network.

PLoS Comput Biol

September 2025

Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, United States of America.

Jaroslav Zak, Ian Newman, Daniel J Montiel Garcia, Daniele Parisi, Janet Joy

Biology has been transformed by the rapid development of computing and the concurrent rise of data-rich approaches such as omics or high-resolution imaging. However, there is a persistent computational skills gap in the biomedical research workforce. Inherent limitations of classroom teaching and institutional core support highlight the need for accessible ways for researchers to explore developments in computational biology.

High-Dimensional Causal Mediation Analysis by Partial Sum Statistic and Sample Splitting Strategy in Imaging Genetics Application.

Bioinformatics

September 2025

Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States.

Hung-Ching Chang, Yusi Fang, Michael T Gorczyca, Kayhan Batmanghelich, George C Tseng

Summary: Causal mediation analysis investigates the role of mediators in the relationship between exposure and outcome. In the analysis of omics or imaging data, mediators are often high-dimensional, presenting challenges such as multicollinearity and interpretability. Existing methods either compromise interpretability or fail to effectively prioritize mediators.

InterVelo: A Mutually Enhancing Model for Estimating Pseudotime and RNA Velocity in Multi-Omic Single-Cell Data.

Bioinformatics

September 2025

Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China.

Yurou Wang, Zhixiang Lin, Tao Wang

Motivation: RNA velocity has become a powerful tool for uncovering transcriptional dynamics in snapshot single-cell data. However, current RNA velocity approaches often assume constant transcriptional rates and treat genes independently with gene-specific times, which may introduce biases and deviate from biological realities. Here, we present InterVelo, a novel deep learning framework that simultaneously learns cellular pseudotime and RNA velocity.

Deep learning methods and applications in single-cell multimodal data integration.

Mol Omics

September 2025

Laboratory of Structural Bioinformatics and Computational Biology, Federal University of Rio Grande do Sul, Av. Bento Gonçalves, 9500, Porto Alegre 91501-970, RS, Brazil.

Franklin Vinny Medina Nunes, Luiza Marques Prates Behrens, Rafael Diogo Weimer, Gabriela Flores Gonçalves, Guilherme da Silva Fernandes

The integration of multimodal single-cell omics data is a state-of-the-art strategy for deciphering cellular heterogeneity and gene regulatory mechanisms. Recent advances in single-cell technologies have enabled the comprehensive characterization of cellular states and their interactions. However, integrating these high-dimensional and heterogeneous datasets poses significant computational challenges, including batch effects, sparsity, and modality alignment.

scSPAF: Cell Similarity Purified Adaptive Fusion Network for Single-Cell Multi-Omics Clustering.

IEEE Trans Comput Biol Bioinform

September 2025

Shanghui Deng, Xiao Zheng, Chang Tang, Xinwang Liu, Yuanyuan Liu

The rapid advancement of single-cell sequencing technology has generated vast amounts of multi-omics data, presenting unprecedented opportunities for single-cell multi-omics clustering analysis. However, existing single-cell clustering algorithms focus on extracting shared representations, overlooking the interactions and correlations among cells. This oversight inevitably leads to biased or confounded cell clustering results.
