Citations

20

Article Abstract

Identification of gene-by-environment interactions (GxE) is crucial to understand the interplay of environmental effects on complex traits. However, current methods evaluating GxE on biobank-scale datasets have limitations. We introduce MonsterLM, a multiple linear regression method that does not rely on model specification and provides unbiased estimates of variance explained by GxE. We demonstrate robustness of MonsterLM through comprehensive genome-wide simulations using real genetic data from 325,989 individuals. We estimate GxE using waist-to-hip-ratio, smoking, and exercise as the environmental variables on 13 outcomes (N = 297,529-325,989) in the UK Biobank. GxE variance is significant for 8 environment-outcome pairs, ranging from 0.009 to 0.071. The majority of GxE variance involves SNPs without strong marginal or interaction associations. We observe modest improvements in polygenic score prediction when incorporating GxE. Our results imply a significant contribution of GxE to complex trait variance and we show MonsterLM to be well-purposed to handle this with biobank-scale data.
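
To make "variance explained by GxE" in a multiple linear regression concrete, the sketch below compares the adjusted R^2 of a model with genotype, environment, and genotype-by-environment product terms against a model without the product terms. It is a generic nested-regression illustration on simulated data, not the MonsterLM implementation; the function names, the simulated effect sizes, and the use of standard-normal columns as stand-ins for standardized genotypes are all assumptions made for this example.

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    p = X1.shape[1] - 1  # number of predictors, excluding the intercept
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


def gxe_variance_explained(G, E, y):
    """Difference in adjusted R^2 between models with and without G*E terms."""
    GxE = G * E[:, None]  # one product column per genotype column
    reduced = np.column_stack([G, E])
    full = np.column_stack([G, E, GxE])
    return adjusted_r2(full, y) - adjusted_r2(reduced, y)


def simulate(n=5000, m=20, h2_gxe=0.05, seed=0):
    """Simulate an outcome with hypothetical G, E, and GxE variance shares."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, m))  # assumption: stand-in for standardized SNPs
    E = rng.standard_normal(n)       # assumption: standardized environment
    g = G @ rng.standard_normal(m)
    gxe = (G * E[:, None]) @ rng.standard_normal(m)
    # Rescale each component so the shares are roughly 0.3, 0.1, and h2_gxe.
    g = g / g.std() * np.sqrt(0.3)
    e_eff = E / E.std() * np.sqrt(0.1)
    gxe = gxe / gxe.std() * np.sqrt(h2_gxe)
    noise = rng.standard_normal(n) * np.sqrt(1.0 - 0.3 - 0.1 - h2_gxe)
    return G, E, g + e_eff + gxe + noise


if __name__ == "__main__":
    G, E, y = simulate()
    print(f"estimated GxE variance explained: {gxe_variance_explained(G, E, y):.3f}")
```

The adjusted R^2 is used here because the plain R^2 grows with every added predictor, which would inflate the apparent GxE contribution when many product terms are fitted.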

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457310
DOI: http://dx.doi.org/10.1038/s41467-023-40913-7

Publication Analysis

Top Keywords

biobank-scale datasets: 8
gxe: 8
gxe variance: 8
versatile fast: 4
fast unbiased: 4
unbiased method: 4
method estimation: 4
estimation gene-by-environment: 4
gene-by-environment interaction: 4
interaction effects: 4

Similar Publications

Motivation: The Genotype Representation Graph (GRG) [DeHaas et al., 2025] is a graph representation of whole genome polymorphisms, designed to encode the variant hard-call information in phased whole genomes. It encodes the genotypes as an extremely compact graph that can be traversed efficiently, enabling dynamic programming-style algorithms on applications such as genome-wide association studies that run faster on biobank-scale data than existing alternatives.

Gut-brain nexus: Mapping multimodal links to neurodegeneration at biobank scale.

Sci Adv

August 2025

Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA.

Alzheimer's disease (AD) and Parkinson's disease (PD) are influenced by genetic and environmental factors. We conducted a biobank-scale study to (i) identify endocrine, nutritional, metabolic, and digestive disorders with potential causal or temporal associations with AD/PD risk before diagnosis; (ii) assess plasma biomarkers' specificity for AD/PD in the context of co-occurring gut-related traits and disorders; and (iii) integrate multimodal datasets to enhance AD/PD prediction. Our findings show that several disorders were associated with increased AD/PD risk before diagnosis, with variation in the strength and timing of associations across conditions.

Motivation: Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations.

One key component of study design in population genetics is the "geographic breadth" of a sample (i.e., how broad the region is across which individuals are sampled).

Analysis-ready VCF at Biobank scale using Zarr.

Gigascience

January 2025

Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, OX3 7LF, UK.

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF.
