Publications by authors named "Erik Garrison"

Recent advances in genome sequencing have improved variant calling in complex regions of the human genome. However, it is difficult to quantify variant calling performance because existing standards often focus on specificity, neglecting completeness in difficult-to-analyze regions. To create a more comprehensive truth set, we used Mendelian inheritance in a large pedigree (CEPH-1463) to filter variants across PacBio high-fidelity (HiFi), Illumina and Oxford Nanopore Technologies platforms.

Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r)-space, where r is the number of runs in the Burrows-Wheeler transform.
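
To make the quantity r concrete, here is a minimal sketch that computes a Burrows-Wheeler transform and counts its runs. It is an illustration only, not Moni-align's implementation: the function names `bwt` and `count_runs` are hypothetical, and the sorted-rotations construction is a textbook method that is far too slow for real genomes.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustrative, not scalable)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort every cyclic rotation of s, then read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of identical characters (the r in O(r))."""
    if not s:
        return 0
    # A new run starts wherever a character differs from its predecessor.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

For collections of highly similar genomes, r tends to grow much more slowly than the total sequence length, which is why an index whose size depends on r rather than on the text length is attractive for pangenomes.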

Understanding the human de novo mutation (DNM) rate requires complete sequence information. Here using five complementary short-read and long-read sequencing technologies, we phased and assembled more than 95% of each diploid human genome in a four-generation, twenty-eight-member family (CEPH 1463). We estimate 98-206 DNMs per transmission, including 74.

The most dynamic and repetitive regions of great ape genomes have traditionally been excluded from comparative studies. Consequently, our understanding of the evolution of our species is incomplete. Here we present haplotype-resolved reference genomes and comparative analyses of six ape species: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and siamang.

Summary: With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole-genome alignment formats, offering practical tools for conversion, processing, evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics.

Availability and Implementation: wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide.

The HXB/BXH family of recombinant inbred rat strains is a unique genetic resource that has been extensively phenotyped over 25 years, resulting in a vast dataset of quantitative molecular and physiological phenotypes. We built a pangenome graph from 10x Genomics Linked-Read data for 31 recombinant inbred rats to study genetic variation and association mapping. The pangenome includes 0.

Structural variants (SVs) are widespread in vertebrate genomes, yet their evolutionary dynamics remain poorly understood. Using 45 long-read de novo genome assemblies and pangenome tools, we analyze SVs within three closely related species of North American jays (Aphelocoma, scrub-jays) displaying a 60-fold range in effective population size. We find rapid evolution of genome architecture, including ~100 Mb variation in genome size driven by dynamic satellite landscapes with unexpectedly long (> 10 kb) repeat units and widespread variation in gene content, influencing gene expression.

The tetraploid genome and clonal propagation of the cultivated potato (Solanum tuberosum L.) dictate a slow, non-accumulative breeding mode of the most important tuber crop. Transitioning potato breeding to a seed-propagated hybrid system based on diploid inbred lines has the potential to greatly accelerate its improvement.

Article Synopsis
  • Using only one linear reference genome limits the understanding of genomic diversity; the draft human pangenome shows the need for pangenomics to address these gaps and capture more genetic variation.
  • A new tool called Panacus (pangenome-abacus) has been developed to efficiently analyze pangenomes, capable of processing large human pangenome graphs quickly, producing interactive visualizations in under an hour.
  • Panacus is open-source and built in Rust, available for installation through Bioconda, with its source code and documentation accessible on GitHub.

The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy.

The current reference genome is the backbone of diverse and rich annotations. Simple text formats, like VCF or BED, have been widely adopted and helped the critical exchange of genomic information. There is a dire need for tools and formats enabling pangenomic annotation to facilitate such enrichment of pangenomic references.

We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval-augmented generation (RAG) system with a focus on aging, dementia, Alzheimer's and diabetes. We uploaded a corpus of three thousand peer-reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT 'hallucinations', we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers.

Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.

Article Synopsis
  • Pangenome graphs are useful for capturing genomic variability but current methods often come with biases and resource inefficiencies.
  • The new nf-core/pangenome provides a reference-unbiased approach that is efficient, scalable, and uses biocontainers for easy deployment on high-performance computing (HPC) systems.
  • This tool has been shown to be significantly faster than existing methods, managing to process extensive genomic data with a lower environmental impact, and is openly available on GitHub and Zenodo for public use.

Language models (LMs) have been extensively utilized for learning DNA sequence patterns and generating synthetic sequences. In this paper, we present a novel approach for the generation of synthetic DNA data using pangenomes in combination with LMs. We introduce three innovative pangenome-based tokenization schemes, including two that can decouple from private data, while enhancing long DNA sequence generation.

Robertsonian chromosomes are a type of variant chromosome found commonly in nature. Present in one in 800 humans, these chromosomes can underlie infertility, trisomies, and increased cancer incidence. Recognized cytogenetically for more than a century, their origins have remained mysterious.

Ribosomal RNA (rRNA) genes exist in multiple copies arranged in tandem arrays known as ribosomal DNA (rDNA). The total number of gene copies is variable, and the mechanisms buffering this copy number variation remain unresolved. We surveyed the number, distribution, and activity of rDNA arrays at the level of individual chromosomes across multiple human and primate genomes.

Article Synopsis
  • The 1000 Genomes Project and Oxford Nanopore Technologies are working together to produce long-read sequencing (LRS) data from at least 800 samples to enhance the identification of genetic variations and better understand human genetic diversity.
  • Initial analysis of 100 samples shows high accuracy in detecting genetic variants, including structural variants that disrupt gene function, and provides valuable data for the clinical genetics community to advance research on pathogenic variations.
Article Synopsis
  • A new toolkit has been created to facilitate whole genome alignment, processing, and analysis as long-read sequencing technologies advance, making individual complete genomes more accessible.
  • This toolkit supports various formats and offers features like alignment-based variant calling and visualization, enabling effective population-level genome analysis.
  • Developed in Rust for efficiency and safety, the software is free and open-source, available on GitHub, and capable of handling large datasets of numerous genomes.

The adoption of agriculture triggered a rapid shift towards starch-rich diets in human populations. Amylase genes facilitate starch digestion, and increased amylase copy number has been observed in some modern human populations with high-starch intake, although evidence of recent selection is lacking. Here, using 94 long-read haplotype-resolved assemblies and short-read data from approximately 5,600 contemporary and ancient humans, we resolve the diversity and evolutionary history of structural variation at the amylase locus.

Using five complementary short- and long-read sequencing technologies, we phased and assembled >95% of each diploid human genome in a four-generation, 28-member family (CEPH 1463), allowing us to systematically assess de novo mutations (DNMs) and recombination. From this family, we estimate an average of 192 DNMs per generation, including 75.5 single-nucleotide variants (SNVs), 7.

Article Synopsis
  • The study presents detailed genomes of six ape species, achieving high accuracy and complete sequencing of all their chromosomes.
  • It addresses complex genomic regions, leading to enhanced understanding of evolutionary relationships among these species.
  • The findings will serve as a crucial resource for future research on human evolution and our closest ape relatives.

Motivation: The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them.

Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence.

De novo genome assemblies are common tools for examining novel biological phenomena in non-model organisms. Here, we present a protocol for preparing Drosophila genomic DNA to create chromosome-level de novo genome assemblies. We describe steps for high-molecular-weight DNA preparation with phenol or Genomic-tips, quality control, long-read nanopore sequencing, short-read DNA library preparation, and sequencing.
