Publications by authors named "Nathan D Olson"

Recent advances in genome sequencing have improved variant calling in complex regions of the human genome. However, it is difficult to quantify variant calling performance because existing standards often focus on specificity, neglecting completeness in difficult-to-analyze regions. To create a more comprehensive truth set, we used Mendelian inheritance in a large pedigree (CEPH-1463) to filter variants across PacBio high-fidelity (HiFi), Illumina and Oxford Nanopore Technologies platforms.


The Genome in a Bottle Consortium (GIAB), hosted by the National Institute of Standards and Technology (NIST), is developing new matched tumor-normal samples, the first explicitly consented for public dissemination of genomic data and cell lines. Here, we describe a comprehensive genomic dataset from the first individual, HG008, including DNA from an adherent, epithelial-like pancreatic ductal adenocarcinoma (PDAC) tumor cell line and matched normal cells from duodenal and pancreatic tissues. Data for the tumor-normal matched samples come from seventeen distinct state-of-the-art whole genome measurement technologies, including high-depth short- and long-read bulk whole genome sequencing (WGS), single-cell WGS, Hi-C, and karyotyping.


The sex chromosomes contain complex, important genes impacting medical phenotypes, but differ from the autosomes in their ploidy and large repetitive regions. To enable technology developers along with research and clinical laboratories to evaluate variant detection on male sex chromosomes X and Y, we create a small variant benchmark set with 111,725 variants for the Genome in a Bottle HG002 reference material. We develop an active evaluation approach to demonstrate the benchmark set reliably identifies errors in challenging genomic regions and across short and long read callsets.


Somatic mosaicism is an important cause of disease, but mosaic and somatic variants are often challenging to detect because they exist in only a fraction of cells. To address the need for benchmarking subclonal variants in normal cell populations, we developed a benchmark containing mosaic variants in the Genome in a Bottle Consortium (GIAB) HG002 reference material DNA from a large batch of a normal lymphoblastoid cell line. First, we used a somatic variant caller with high coverage (300x) Illumina whole genome sequencing data from the Ashkenazi Jewish trio to detect variants in HG002 not detected in at least 5% of cells from the combined parental data.

Article Synopsis
  • The authors introduce "stratifications," BED files that delineate different genomic contexts for GRCh37/38 and the new T2T-CHM13 reference, which includes regions that were previously challenging to sequence.
  • They also compare the performance of sequencing benchmarks across these references, showing how difficult regions in CHM13 affect overall performance, and provide a Snakemake pipeline for generating stratifications to aid in optimizing sequencing platforms.
Article Synopsis
  • Current genomic variant calling pipelines are not one-size-fits-all, requiring developers and researchers to make subjective tradeoffs based on their specific applications.
  • StratoMod is introduced as a machine-learning tool that predicts germline variant calling errors in a data-driven way, improving the accuracy of variant detection, especially in complex genomic regions.
  • It offers insights into the impact of different reference methods on recall rates and helps identify clinically relevant variants that might be overlooked by existing pipelines, facilitating better decision-making in pipeline design.
Article Synopsis
  • The 1000 Genomes Project and Oxford Nanopore Technologies are working together to produce long-read sequencing (LRS) data from at least 800 samples to enhance the identification of genetic variations and better understand human genetic diversity.
  • Initial analysis of 100 samples shows high accuracy in detecting genetic variants, including structural variants that disrupt gene function, and provides valuable data for the clinical genetics community to advance research on pathogenic variations.
Article Synopsis
  • The Genome in a Bottle Consortium (GIAB) is creating matched tumor-normal samples that are publicly consented for sharing genomic data and cell lines, focusing on pancreatic ductal adenocarcinoma (PDAC).
  • They provide a comprehensive genomic dataset from the first individual, combining high-depth DNA from tumor and normal cells using advanced whole genome sequencing technologies.
  • This open-access resource aims to help develop benchmarks for detecting genetic variants in cancer, fostering innovation in genome measurement and analysis tools.

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies.

Article Synopsis
  • The 1000 Genomes Project ONT Sequencing Consortium is working to generate long-read sequencing (LRS) data from at least 800 samples to better understand human genetic variation and improve variant detection.
  • Initial data from the first 100 samples show high accuracy in identifying structural variants and methylation signatures, creating a useful public resource for finding disease-related genetic changes.

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples.


The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region.


In response to the COVID-19 pandemic, the National Institute of Standards and Technology released a synthetic RNA material for SARS-CoV-2 in June 2020. The goal was to rapidly produce a material to support molecular diagnostic testing applications. This material, referred to as Research Grade Test Material 10169, was shipped free of charge to laboratories across the globe to provide a non-hazardous material for assay development and assay calibration.


Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels.


Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations.

Article Synopsis
  • The text discusses the growing need for precise engineering of biological functions in synthetic biology, especially for programmed sensing that regulates gene expression based on stimuli.
  • It introduces two innovative methods, in silico selection and machine-learning-enabled forward engineering, that leverage a comprehensive dataset to develop genetic sensors with specifically defined dose-response characteristics.
  • The methods demonstrate the capability to fine-tune genetic sensors for various performance metrics, such as sensitivity and output, and to predictively engineer new sensor mutations beyond the existing dataset.

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as .


The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome.


The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications.


Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery.

Article Synopsis
  • The Telomere-to-Telomere Consortium has completed the human reference genome, addressing the previously unfinished heterochromatic regions and offering a sequence of 3.055 billion base pairs.
  • This new genome assembly, T2T-CHM13, includes gapless sequences for nearly all chromosomes, correcting errors found in earlier genome references.
  • The update introduces nearly 200 million new base pairs and includes important genomic features like centromeric satellite arrays and gene predictions, enabling more comprehensive genetic studies.

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly.
