Accurate variant penetrance estimation is crucial for precision medicine. We constructed machine learning (ML) models for 10 diseases using 1,347,298 participants with electronic health records, then applied them to an independent cohort with linked exome data. Resulting probabilities were used to evaluate ML penetrance of 1648 rare variants in 31 autosomal dominant disease-predisposition genes.
We evaluated whether predicted continuous disease representations could enhance genetic discovery beyond case-control genome-wide association study (GWAS) phenotypes across eight complex diseases in up to 485,448 UK Biobank participants. Predicted phenotypes had high genetic correlations with case-control phenotypes (median r = 0.66) but identified more independent associations (median 306 versus 125).
Understanding the disease risk of genetic variants is fundamental to precision medicine. Estimates of penetrance (the probability of disease for individuals with a variant allele) rely on disease-specific cohorts, clinical testing and emerging electronic health record (EHR)-linked biobanks. These data sources, while valuable, each have limitations in quality, representativeness and analyzability.
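As an illustration of the penetrance definition above, the following is a minimal sketch, not the authors' method: it estimates penetrance as the fraction of variant carriers with the disease and attaches a Wilson score interval. The function name `penetrance_estimate` and the example counts are hypothetical.

```python
import math


def penetrance_estimate(affected_carriers, total_carriers, z=1.96):
    """Return (point estimate, lower bound, upper bound) for penetrance.

    Penetrance is taken here as the proportion of carriers who have the
    disease. The interval is a Wilson score interval; z=1.96 gives an
    approximate 95% interval. This is an illustrative assumption, not the
    estimator used in the study.
    """
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= affected_carriers <= total_carriers:
        raise ValueError("affected_carriers must be between 0 and total_carriers")

    n = total_carriers
    p = affected_carriers / n

    # Wilson score interval for a binomial proportion.
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

    return p, max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 30 affected individuals among 200 carriers.
point, lower, upper = penetrance_estimate(30, 200)
print(f"penetrance = {point:.3f} (approx. 95% CI {lower:.3f} to {upper:.3f})")
```

A raw carrier proportion of this kind inherits the limitations of its data source, such as ascertainment in disease-specific cohorts, which is the concern the abstract raises.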
Atherosclerosis
February 2025
Background And Aims: An in silico quantitative score of coronary artery disease (ISCAD), built using machine learning and clinical data from electronic health records, has been shown to result in gradations of risk of subclinical atherosclerosis, coronary artery disease (CAD) sequelae, and mortality. Large-scale metabolite biomarker profiling provides increased portability and objectivity in machine learning for disease prediction and gradation. However, these models have not been fully leveraged.
Background: Diet is a key modifiable risk factor of coronary artery disease (CAD). However, the causal effects of specific dietary traits on CAD risk remain unclear. With the expansion of dietary data in population biobanks, Mendelian randomization (MR) could help enable the efficient estimation of causality in diet-disease associations.
Studies have shown that drug targets with human genetic support are more likely to succeed in clinical trials. Hence, a tool integrating genetic evidence to prioritize drug target genes is beneficial for drug discovery. We built a genetic priority score (GPS) by integrating eight genetic features with drug indications from the Open Targets and SIDER databases.
Clin Infect Dis
September 2023
Background: Lyme disease is the most prevalent vector-borne disease in the US, yet its host factors are poorly understood and diagnostic tests are limited. We evaluated patients in a large health system to uncover cholesterol's role in the susceptibility, severity, and machine learning-based diagnosis of Lyme disease.
Methods: A longitudinal health system cohort comprised 1 019 175 individuals with electronic health record data and 50 329 with linked genetic data.
Systemic autoimmune rheumatic diseases (SARDs) can lead to irreversible damage if left untreated, yet these patients often endure long diagnostic journeys before being diagnosed and treated. Machine learning may help overcome the challenges of diagnosing SARDs and inform clinical decision-making. Here, we developed and tested a machine learning model to identify patients who should receive rheumatological evaluation for SARDs using longitudinal electronic health records of 161,584 individuals from two institutions.
Background: Causality between plasma triglyceride (TG) levels and atherosclerotic cardiovascular disease (ASCVD) risk remains controversial despite more than four decades of study and two recent landmark trials, STRENGTH, and REDUCE-IT. Further unclear is the association between TG levels and non-atherosclerotic diseases across organ systems.
Methods: Here, we conducted a phenome-wide, two-sample Mendelian randomization (MR) analysis using inverse-variance weighted (IVW) regression to systematically infer the causal effects of plasma TG levels on 2600 disease traits in the European ancestry population of UK Biobank.
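For readers unfamiliar with the inverse-variance weighted (IVW) estimator named above, here is a minimal sketch of the standard fixed-effect form computed from per-variant summary statistics. It is not the authors' pipeline; the function name `ivw_estimate` and the input values are hypothetical, and the sketch assumes independent instruments and ignores uncertainty in the variant-exposure effects.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from two-sample summary statistics.

    beta_exposure: per-variant effects on the exposure (e.g. TG levels)
    beta_outcome:  per-variant effects on the outcome (e.g. a disease trait)
    se_outcome:    standard errors of the per-variant outcome effects

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome**2.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one variant is required")
    if any(se <= 0 for se in se_outcome):
        raise ValueError("standard errors must be positive")

    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("exposure effects are all zero")

    estimate = numerator / denominator
    std_error = math.sqrt(1 / denominator)
    return estimate, std_error


# Hypothetical summary statistics for three independent variants.
bx = [0.10, 0.20, 0.15]
by = [0.05, 0.09, 0.08]
se = [0.01, 0.02, 0.015]
est, err = ivw_estimate(bx, by, se)
print(f"IVW estimate = {est:.3f} (SE {err:.3f})")
```

In a phenome-wide setting, a function like this would be applied once per disease trait, with multiple-testing correction across the traits.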
Background: Binary diagnosis of coronary artery disease does not preserve the complexity of disease or quantify its severity or its associated risk with death; hence, a quantitative marker of coronary artery disease is warranted. We evaluated a quantitative marker of coronary artery disease derived from probabilities of a machine learning model.
Methods: In this cohort study, we developed and validated a coronary artery disease-predictive machine learning model using 95 935 electronic health records and assessed its probabilities as in-silico scores for coronary artery disease (ISCAD; range 0 [lowest probability] to 1 [highest probability]) in participants in two longitudinal biobank cohorts.
Phenome-wide association studies identified numerous loci associated with traits and diseases. To help interpret these associations, we constructed a phenome-wide network map of colocalized genes and phenotypes. We generated colocalized signals using the Genotype-Tissue Expression data and genome-wide association results in UK Biobank.
Genetic risk for coronary artery disease (CAD) is commonly measured with polygenic risk scores (PRS); yet, the relationship of atherosclerotic burden with PRS in healthy individuals not at high clinical risk for CAD (ie, without a high pooled cohort equations [PCE] score) is unknown. Here, we implemented a novel recall-by-PRS strategy to measure coronary artery calcium (CAC) scores prospectively in 53 healthy individuals with extreme high PRS (median [IQR] PRS = 94% [83-98]) and low PRS (median [IQR] PRS = 3.6% [1.
J Am Coll Cardiol
March 2022
Background: Clinical features from electronic health records (EHRs) can be used to build a complementary tool to predict coronary artery disease (CAD) susceptibility.
Objectives: The purpose of this study was to determine whether an EHR score can improve CAD prediction and reclassification 1 year before diagnosis, beyond conventional clinical guidelines as determined by the pooled cohort equations (PCE) and a polygenic risk score for CAD.
Methods: We applied a machine learning framework using clinical features from the EHR in a multiethnic, clinical care cohort (BioMe) comprising 555 CAD cases and 6,349 control subjects and in a population-based cohort (UK Biobank) comprising 3,130 CAD cases and 378,344 control subjects for external validation.
Aims: Individuals with supranormal left ventricular ejection fraction (snLVEF; LVEF >70%) have increased mortality. However, the genetic and phenotypic profile of snLVEF remains unknown. This study aimed to determine the relationship of both snLVEF genetic risk and phenotype with survival and underdiagnosed heart failure (HF).
Importance: Population-based assessment of disease risk associated with gene variants informs clinical decisions and risk stratification approaches.
Objective: To evaluate the population-based disease risk of clinical variants in known disease predisposition genes.
Design, Setting, And Participants: This cohort study included 72 434 individuals with 37 780 clinical variants who were enrolled in the BioMe Biobank from 2007 onwards with follow-up until December 2020 and the UK Biobank from 2006 to 2010 with follow-up until June 2020.
J Am Heart Assoc
November 2021
Background: Despite advances in cardiovascular disease and risk factor management, mortality from ischemic heart failure (HF) in patients with coronary artery disease (CAD) remains high. Given the partial role of genetics in HF and lack of reliable risk stratification tools, we developed and validated a polygenic risk score for HF in patients with CAD, which we term HF-PRS.
Methods and Results: Using summary statistics from a recent genome-wide association study for HF, we developed candidate PRSs in the Mount Sinai BioMe CAD patient cohort (N=6274) by using the pruning and thresholding method and LDPred.
Purpose: Limited mechanical ventilators (MV) during the Coronavirus disease (COVID-19) pandemic have led to the use of non-invasive ventilation (NIV) in hypoxemic patients, which has not been studied well. We aimed to assess the association of NIV versus MV with mortality and morbidity during respiratory intervention among hypoxemic patients admitted with COVID-19.
Methods: We performed a retrospective multi-center cohort study across 5 hospitals during March-April 2020.
Biobanks with exomes linked to electronic health records (EHRs) enable the study of genetic pleiotropy between rare variants and seemingly disparate diseases. We performed robust clinical phenotyping of rare, putatively deleterious variants (loss-of-function [LoF] and deleterious missense variants) in ERCC6, a gene implicated in inherited retinal disease. We analyzed 213,084 exomes, along with a targeted set of retinal, cardiac, and immune phenotypes from two large-scale EHR-linked biobanks.
Diabetic retinopathy (DR) is a common consequence in type 2 diabetes (T2D) and a leading cause of blindness in working-age adults. Yet, its genetic predisposition is largely unknown. Here, we examined the polygenic architecture underlying DR by deriving and assessing a genome-wide polygenic risk score (PRS) for DR.
Adverse side effects often account for the failure of drug clinical trials. We evaluated whether a phenome-wide association study (PheWAS) of 1167 phenotypes in >360,000 U.K.