Understanding genetic differences between populations is essential for avoiding confounding in genome-wide association studies and improving polygenic score (PGS) portability. We developed a statistical pipeline to infer fine-scale Ancestry Components and applied it to UK Biobank data. Ancestry Components identify population structure not captured by widely used principal components, improving stratification correction for geographically correlated traits.
Am J Hum Genet
October 2024
Gene-based burden tests are a popular and powerful approach for analysis of exome-wide association studies. These approaches combine sets of variants within a gene into a single burden score that is then tested for association. Typically, a range of burden scores are calculated and tested across a range of annotation classes and frequency bins.
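A minimal sketch of the burden-score idea described above, not the authors' implementation: variants in one gene are collapsed into a single carrier indicator, which is then tested against a quantitative phenotype by simple linear regression. The function names, the allele-frequency threshold `max_af`, and the choice of a carrier-indicator score with an unadjusted regression are all illustrative assumptions; real analyses use several score definitions and adjust for covariates.

```python
import math


def burden_scores(genotypes, max_af=0.01):
    """Collapse one gene's variants into a per-sample carrier indicator.

    genotypes: list of rows (one per sample), each a list of alternate
    allele counts (0, 1 or 2), one entry per variant in the gene.
    max_af: hypothetical frequency threshold defining a "rare" variant.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    # Alternate allele frequency of each variant in this sample set.
    freqs = [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]
    # Keep only variants that are observed and rare.
    keep = [j for j in range(n_variants) if 0 < freqs[j] <= max_af]
    # Burden score: 1 if the sample carries any retained variant, else 0.
    return [1 if any(row[j] > 0 for j in keep) else 0 for row in genotypes]


def linear_association(x, y):
    """Regress y on x (no covariates); return the slope and its t statistic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    residuals = [(b - mean_y) - beta * (a - mean_x) for a, b in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    return beta, beta / math.sqrt(sigma2 / sxx)


# Toy data: 6 samples, 3 variants; the third variant is common and is excluded.
toy_genotypes = [
    [0, 0, 2],
    [1, 0, 2],
    [0, 0, 1],
    [0, 1, 2],
    [0, 0, 2],
    [0, 0, 1],
]
toy_phenotype = [1.0, 3.0, 1.2, 2.8, 0.9, 1.1]

scores = burden_scores(toy_genotypes, max_af=0.1)
beta, t_stat = linear_association(scores, toy_phenotype)
```

Repeating this over several annotation classes and frequency thresholds yields the range of burden scores the abstract refers to, one test per combination.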
Whole-genome sequencing (WGS), whole-exome sequencing (WES) and array genotyping with imputation (IMP) are common strategies for assessing genetic variation and its association with medically relevant phenotypes. To date, there has been no systematic empirical assessment of the yield of these approaches when applied to hundreds of thousands of samples to enable the discovery of complex trait genetic signals. Using data for 100 complex traits from 149,195 individuals in the UK Biobank, we systematically compare the relative yield of these strategies in genetic association studies.
We built a reference panel with 342 million autosomal variants using 78,195 individuals from the Genomics England (GEL) dataset, achieving a phasing switch error rate of 0.18% for European samples and imputation quality of r = 0.75 for variants with minor allele frequencies as low as 2 × 10 in white British samples.
Human genetic studies of smoking behavior have thus far been largely limited to common variants. Studying rare coding variants has the potential to identify drug targets. We performed an exome-wide association study of smoking phenotypes in up to 749,459 individuals and discovered a protective association in CHRNB2, encoding the β2 subunit of the α4β2 nicotinic acetylcholine receptor.
Coding variants that have significant impact on function can provide insights into the biology of a gene but are typically rare in the population. Identifying and ascertaining the frequency of such rare variants requires very large sample sizes. Here, we present the largest catalog of human protein-coding variation to date, derived from exome sequencing of 985,830 individuals of diverse ancestry to serve as a rich resource for studying rare coding variants.
Clonal haematopoiesis involves the expansion of certain blood cell lineages and has been associated with ageing and adverse health outcomes. Here we use exome sequence data on 628,388 individuals to identify 40,208 carriers of clonal haematopoiesis of indeterminate potential (CHIP). Using genome-wide and exome-wide association analyses, we identify 24 loci (21 of which are novel) where germline genetic variation influences predisposition to CHIP, including missense variants in the lymphocytic antigen coding gene LY75, which are associated with reduced incidence of CHIP.
Background: There is large individual variation in both clinical presentation and progression between Parkinson's disease patients. Generation of deeply and longitudinally phenotyped patient cohorts has enormous potential to identify disease subtypes for prognosis and therapeutic targeting.
Methods: Replicating across three large Parkinson's cohorts (Oxford Discovery cohort (n = 842)/Tracking UK Parkinson's study (n = 1807) and Parkinson's Progression Markers Initiative (n = 472)) with clinical observational measures collected longitudinally over 5-10 years, we developed a Bayesian multiple phenotypes mixed model incorporating genetic relationships between individuals able to explain many diverse clinical measurements as a smaller number of continuous underlying factors ("phenotypic axes").
Body fat distribution is a major, heritable risk factor for cardiometabolic disease, independent of overall adiposity. Using exome-sequencing in 618,375 individuals (including 160,058 non-Europeans) from the UK, Sweden and Mexico, we identify 16 genes associated with fat distribution at exome-wide significance. We show 6-fold larger effect for fat-distribution associated rare coding variants compared with fine-mapped common alleles, enrichment for genes expressed in adipose tissue and causal genes for partial lipodystrophies, and evidence of sex-dimorphism.
Background: Exome sequencing in hundreds of thousands of persons may enable the identification of rare protein-coding genetic variants associated with protection from human diseases like liver cirrhosis, providing a strategy for the discovery of new therapeutic targets.
Methods: We performed a multistage exome sequencing and genetic association analysis to identify genes in which rare protein-coding variants were associated with liver phenotypes. We conducted in vitro experiments to further characterize associations.
To better understand the genetics of hearing loss, we performed a genome-wide association meta-analysis with 125,749 cases and 469,497 controls across five cohorts. We identified 53 loci affecting hearing loss risk, including common coding variants in COL9A3 and TMPRSS3. Through exome sequencing of 108,415 cases and 329,581 controls, we observed rare coding associations with 11 Mendelian hearing loss genes, including additive effects in known hearing loss genes GJB2 (Gly12fs; odds ratio [OR] = 1.
A major goal in human genetics is to use natural variation to understand the phenotypic consequences of altering each protein-coding gene in the genome. Here we used exome sequencing to explore protein-altering variants and their consequences in 454,787 participants in the UK Biobank study. We identified 12 million coding variants, including around 1 million loss-of-function and around 1.
Proc Natl Acad Sci U S A
October 2021
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations.
Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19), a respiratory illness that can result in hospitalization or death. We used exome sequence data to investigate associations between rare genetic variants and seven COVID-19 outcomes in 586,157 individuals, including 20,952 with COVID-19. After accounting for multiple testing, we did not identify any clear associations with rare variants either exome wide or when specifically focusing on (1) 13 interferon pathway genes in which rare deleterious variants have been reported in individuals with severe COVID-19, (2) 281 genes located in susceptibility loci identified by the COVID-19 Host Genetics Initiative, or (3) 32 additional genes of immunologic relevance and/or therapeutic potential.
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory.