The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis.

G3 (Bethesda)

Division of Biostatistics and Data Science, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei 10055, Taiwan.

Published: January 2022


Citations

20

Article Abstract

Gene-set analysis (GSA) is a standard procedure for exploring potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance with an implicit assumption that the multivariate expression values are normally distributed. This assumption is commonly adopted in GSAs, particularly those in the group of functional class scoring (FCS) methods. The validity of the normality assumption, however, has been disputed in several studies, yet no systematic analysis has been carried out to assess the effect of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were considered and applied to a total of 24 RNA data sets. These RNA values were collected from cancer patients as well as normal subjects, and the values were derived from microarray experiments, RNA sequencing, and single-cell RNA sequencing. Our first finding is that the MVN assumption is not always satisfied: it does not hold in many of the applications tested here. In the second part of this research, we evaluated the influence of non-normality on the statistical power of current FCS methods, both parametric and nonparametric ones. Specifically, the scenario of mixture distributions representing more than one population for the RNA values was considered. This second investigation demonstrates that a non-normal distribution of the RNA values causes a loss in the statistical power of these GSA tests, especially when subtypes exist. Among the FCS GSA tools examined here and among the scenarios studied in this research, the N-statistics outperform the others.
Based on the results from these two investigations, we conclude that the assumption of MVN should be used with caution when evaluating new GSA tools, since this assumption cannot be guaranteed and violation may lead to spurious results, loss of power, and incorrect comparison between methods. If a newly proposed GSA tool is to be evaluated, we recommend the incorporation of a wide range of multivariate non-normal distributions or sampling from large databases if available.
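
As an illustration of the kind of check described above, the following is a minimal sketch of one classical MVN test, Mardia's multivariate kurtosis test, applied to simulated data. It is not taken from the study: the abstract does not name its six tests, so the choice of Mardia's statistic is an assumption, and the function name `mardia_kurtosis_test`, the gene-set size of 3, the sample size of 2000, and the mean shift of 3.0 between the two simulated subtypes are all hypothetical values chosen for illustration.

```python
import math
import numpy as np


def mardia_kurtosis_test(x):
    """Return (b2, z, p_value) for Mardia's multivariate kurtosis test.

    x is an (n_samples, n_genes) array. Under multivariate normality,
    b2 is approximately normal with mean p(p+2) and variance 8p(p+2)/n.
    """
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / n  # maximum-likelihood covariance estimate
    # Squared Mahalanobis distance of each sample from the sample mean.
    d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(cov), centered)
    b2 = float(np.mean(d2 ** 2))
    z = (b2 - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return b2, z, p_value


# Simulated data only (hypothetical settings, not the study's data sets).
rng = np.random.default_rng(0)
n, p = 2000, 3
normal_sample = rng.standard_normal((n, p))

# Two equally likely subtypes whose mean expression differs by 3.0 per gene.
subtype = rng.integers(0, 2, size=n)
mixture_sample = rng.standard_normal((n, p)) + 3.0 * subtype[:, None]

for name, sample in [("normal", normal_sample), ("mixture", mixture_sample)]:
    b2, z, p_value = mardia_kurtosis_test(sample)
    print(f"{name}: b2={b2:.2f}, z={z:.2f}, p={p_value:.3g}")
```

In this sketch the mixture sample is bimodal along one direction, which pulls the kurtosis statistic below its expected value under normality, so the test tends to reject MVN for it. A kurtosis test alone does not detect every departure from normality, which is one reason to combine several categories of tests as the study does.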

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8728032
DOI: http://dx.doi.org/10.1093/g3journal/jkab365
