Identification of gene-by-environment interactions (GxE) is crucial to understand the interplay of environmental effects on complex traits. However, current methods evaluating GxE on biobank-scale datasets have limitations. We introduce MonsterLM, a multiple linear regression method that does not rely on model specification and provides unbiased estimates of variance explained by GxE. We demonstrate robustness of MonsterLM through comprehensive genome-wide simulations using real genetic data from 325,989 individuals. We estimate GxE using waist-to-hip-ratio, smoking, and exercise as the environmental variables on 13 outcomes (N = 297,529-325,989) in the UK Biobank. GxE variance is significant for 8 environment-outcome pairs, ranging from 0.009 - 0.071. The majority of GxE variance involves SNPs without strong marginal or interaction associations. We observe modest improvements in polygenic score prediction when incorporating GxE. Our results imply a significant contribution of GxE to complex trait variance and we show MonsterLM to be well-purposed to handle this with biobank-scale data.
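To make the idea of "variance explained by GxE" from a multiple linear regression concrete, here is a minimal sketch on simulated data. It is not the MonsterLM implementation: the function names (`adjusted_r2`, `gxe_variance`, `simulate`), the simulated genotypes, and the choice to measure GxE variance as the gain in adjusted R² when interaction terms are added are all assumptions made for illustration.

```python
import numpy as np


def adjusted_r2(y, X):
    """Adjusted R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1.0 - resid.var() / y.var()
    p = X1.shape[1] - 1  # number of predictors, excluding the intercept
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def gxe_variance(y, G, e):
    """Gain in adjusted R^2 from adding SNP-by-exposure products to a G + E model."""
    base = np.column_stack([G, e])
    full = np.column_stack([base, G * e[:, None]])
    return adjusted_r2(y, full) - adjusted_r2(y, base)


def _scale(x, var):
    """Standardize x, then rescale it to the requested variance."""
    return (x - x.mean()) / x.std() * np.sqrt(var)


def simulate(n=20000, m=50, h2_g=0.2, h2_e=0.05, h2_gxe=0.05, seed=0):
    """Simulate a trait with additive, exposure, and GxE components (toy data)."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.05, 0.5, m)  # hypothetical allele frequencies
    G = rng.binomial(2, maf, size=(n, m)).astype(float)
    G = (G - G.mean(axis=0)) / G.std(axis=0)  # standardize each SNP
    e = _scale(rng.standard_normal(n), 1.0)   # standardized exposure
    g_part = _scale(G @ rng.standard_normal(m), h2_g)
    gxe_part = _scale((G * e[:, None]) @ rng.standard_normal(m), h2_gxe)
    noise = _scale(rng.standard_normal(n), 1.0 - h2_g - h2_e - h2_gxe)
    y = g_part + np.sqrt(h2_e) * e + gxe_part + noise
    return y, G, e


if __name__ == "__main__":
    y, G, e = simulate()
    print(f"estimated GxE variance: {gxe_variance(y, G, e):.3f}")
```

Using adjusted rather than raw R² matters here because the full model has many more predictors than the base model; without the adjustment the interaction terms would appear to explain some variance even when no GxE is present.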
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457310 | PMC |
http://dx.doi.org/10.1038/s41467-023-40913-7 | DOI Listing |
bioRxiv
August 2025
Department of Computational Biology, Cornell University, Ithaca, NY.
Motivation: The Genotype Representation Graph (GRG) [DeHaas et al., 2025] is a graph representation of whole genome polymorphisms, designed to encode the variant hard-call information in phased whole genomes. It encodes the genotypes as an extremely compact graph that can be traversed efficiently, enabling dynamic programming-style algorithms on applications such as genome-wide association studies that run faster on biobank-scale data than existing alternatives.
Sci Adv
August 2025
Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA.
Alzheimer's disease (AD) and Parkinson's disease (PD) are influenced by genetic and environmental factors. We conducted a biobank-scale study to (i) identify endocrine, nutritional, metabolic, and digestive disorders with potential causal or temporal associations with AD/PD risk before diagnosis; (ii) assess plasma biomarkers' specificity for AD/PD in the context of co-occurring gut related traits and disorders; and (iii) integrate multimodal datasets to enhance AD/PD prediction. Our findings show that several disorders were associated with increased AD/PD risk before diagnosis, with variation in the strength and timing of associations across conditions.
Bioinformatics
July 2025
Data Science and Analytics Thrust, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511453, China.
Motivation: Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations.
Proc Natl Acad Sci U S A
June 2025
Department of Human Genetics, University of Chicago, Chicago, IL 60637.
One key component of study design in population genetics is the "geographic breadth" of a sample (i.e., how broad a region individuals are sampled across).
Gigascience
January 2025
Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, OX3 7LF, UK.
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF.