Publications by authors named "Siavash Mirarab"

Lateral gene transfer is a major evolutionary process in Bacteria and Archaea. Despite its importance, lateral gene transfer quantification in nature using traditional phylogenetic methods has been hampered by the rarity of most genes within the enormous microbial pangenomes. Here, we estimated lateral gene transfer rates within the epipelagic tropical and subtropical ocean using a global, randomized collection of single amplified genomes and a non-phylogenetic computational approach.

Many algorithms are available for inferring species trees from various input types while accounting for gene tree discordance. Several quartet-based species tree inference methods, collectively known as the ASTRAL family, are based on similar ideas and are in wide use. Here, we integrate all ASTRAL-like methods into a single package called ASTER, comprising several tools, each designed for a different input type: (i) ASTRAL for single-copy gene tree topologies, (ii) weighted ASTRAL (wASTRAL) for single-copy gene trees with branch length and/or support, (iii) ASTRAL-Pro for multi-copy gene tree topologies, (iv) CASTER for multiple sequence alignments, including genome alignments, and (v) WASTER for short reads and assembled genomes.

Comparing each sequencing read in a sample to large databases of known genomes has become a fundamental tool with wide-ranging applications, including metagenomics. These comparisons can be based on read-to-genome alignment, which is relatively slow, especially if done with the high sensitivity needed to characterize queries without a close representation in the reference dataset. A more scalable alternative is assigning taxonomic labels to reads using signatures such as k-mer presence/absence.
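
As a rough illustration of labeling reads by k-mer presence/absence, the toy sketch below assigns a read to the reference that shares the most k-mers with it. It is not the method of the paper; the function names `kmers` and `classify`, the parameters `k` and `min_shared`, and the example sequences are all hypothetical choices made for this sketch.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, references, k=5, min_shared=1):
    """Label a read with the reference sharing the most k-mers.

    references maps a (hypothetical) taxonomic label to a genome string.
    Returns None when no reference shares at least min_shared k-mers.
    """
    read_kmers = kmers(read, k)
    best_label, best_shared = None, 0
    for label, genome in references.items():
        # Presence/absence only: count k-mers of the read seen in the genome.
        shared = len(read_kmers & kmers(genome, k))
        if shared > best_shared:
            best_label, best_shared = label, shared
    return best_label if best_shared >= min_shared else None


# Made-up reference sequences, for illustration only.
references = {
    "taxonA": "ACGTACGTTAGC",
    "taxonB": "TTTTGGGGCCCCAAAA",
}
label_known = classify("GTACGTTA", references)
label_novel = classify("GGGGGGGG", references)
```

A practical tool would index the reference k-mers once rather than rebuilding each set per read; the sketch recomputes them only to stay short.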

Current genome sequencing initiatives across a wide range of life forms offer significant potential to enhance our understanding of evolutionary relationships and support transformative biological and medical applications. Species trees play a central role in many of these applications; however, despite the widespread availability of genome assemblies, accurate inference of species trees remains challenging due to the limited automation, substantial domain expertise, and computational resources required by conventional methods. To address this limitation, we present ROADIES, a fully automated pipeline to infer species trees starting from raw genome assemblies.

Species trees need to be dated for many downstream applications. Typical molecular dating methods take a phylogenetic tree with branch lengths in substitution units as well as a set of calibrations as input and convert the branch lengths of the species tree to the unit of time while being consistent with the pre-specified calibrations. When dating species trees from multi-locus genome-scale datasets, the branch lengths and sometimes the topology of the species tree are estimated using concatenation.

Phylogenetic branch lengths are essential for many analyses, such as estimating divergence times, analyzing rate changes, and studying adaptation. However, true gene tree heterogeneity due to incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT) can complicate the estimation of species tree branch lengths. While several tools exist for estimating the topology of a species tree addressing various causes of gene tree discordance, much less attention has been paid to branch length estimation on multi-locus datasets.

Genomes contain mosaics of discordant evolutionary histories, challenging the accurate inference of the tree of life. Although genome-wide data are routinely used for discordance-aware phylogenomic analyses, because of modeling and scalability limitations, the current practice leaves out large chunks of genomes. As more high-quality genomes become available, we urgently need discordance-aware methods to infer the tree directly from a multiple genome alignment.

Using k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. Although the increased density provides hope for improvements in accuracy, scalability is a concern.

Using k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern.

Dating phylogenetic trees to obtain branch lengths in time units is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation.

Motivation: Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data.

Inference of species trees plays a crucial role in advancing our understanding of evolutionary relationships and has immense significance for diverse biological and medical applications. Extensive genome sequencing efforts are currently in progress across a broad spectrum of life forms, holding the potential to unravel the intricate branching patterns within the tree of life. However, estimating species trees starting from raw genome sequences is quite challenging, and the current cutting-edge methodologies require a series of error-prone steps that are neither entirely automated nor standardized.

The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, most attempts at comparing and matching trees have focused on tree topology. Comparing branch lengths has been more elusive due to several challenges.

In this protocol paper, we review a set of methods developed in recent years for analyzing nuclear reads obtained from genome skimming. As the cost of sequencing drops, genome skimming (low-coverage shotgun sequencing of a sample) becomes an increasingly cost-effective method of measuring biodiversity at high resolution. While most practitioners only use assembled over-represented organelle reads from a genome skim, the vast majority of the reads are nuclear.

Article Synopsis
  • Relationships among avian lineages remain unresolved due to factors like species diversity, phylogenetic methods, and selection of genomic regions.
  • An analysis of 363 bird species' genomes reveals a well-supported evolutionary tree but highlights significant discrepancies among certain groups.
  • Findings suggest that after the Cretaceous-Palaeogene extinction, birds experienced increased population size and diversification, which offers a new foundational understanding for future research in avian evolution.

Genomes are typically mosaics of regions with different evolutionary histories. When speciation events are closely spaced in time, recombination makes the regions sharing the same history small, and the evolutionary history changes rapidly as we move along the genome. When examining rapid radiations such as the early diversification of Neoaves 66 Mya, typically no consistent history is observed across segments exceeding kilobases of the genome.

Motivation: Taxonomic classification of short reads and taxonomic profiling of metagenomic samples are well-studied yet challenging problems. The presence of species belonging to groups without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they tend to have reduced accuracy for such novel species.

Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance.
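
To make the quartet score concrete, the sketch below counts, by brute force, how many four-taxon subsets are resolved identically in a candidate species tree and in each gene tree. It only evaluates the score of one given tree and is not how quartet-based tools search for the maximizing tree. The nested-tuple tree encoding, the function names, and the example trees are assumptions made for this illustration.

```python
from itertools import combinations


def clusters(tree):
    """Return the leaf set below every internal node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):  # a leaf is a taxon name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def quartet_topology(tree_clusters, four):
    """Return the split (e.g. {A,B}|{C,D}) the tree induces on four taxa.

    Returns None when the tree leaves the quartet unresolved.
    """
    four = frozenset(four)
    for cluster in tree_clusters:
        side = cluster & four
        if len(side) == 2:  # some edge separates two taxa from the other two
            return frozenset([side, four - side])
    return None


def quartet_score(species_tree, gene_trees):
    """Count quartets resolved the same way in the species tree and gene trees."""
    sp_clusters = clusters(species_tree)
    taxa = sorted(max(sp_clusters, key=len))  # the root cluster holds all taxa
    score = 0
    for gene_tree in gene_trees:
        g_clusters = clusters(gene_tree)
        g_taxa = max(g_clusters, key=len)
        for four in combinations(taxa, 4):
            if not set(four) <= g_taxa:  # skip quartets missing from this gene
                continue
            topology = quartet_topology(sp_clusters, four)
            if topology is not None and topology == quartet_topology(g_clusters, four):
                score += 1
    return score


# Made-up example: one gene tree matches the species tree, one swaps B and C.
species = ((("A", "B"), "C"), ("D", "E"))
genes = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
total = quartet_score(species, genes)
```

Enumerating all four-taxon subsets grows with the fourth power of the number of taxa, so this approach is only practical for small trees.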

Principal Component Analysis (PCA) is a workhorse of modern data science. While PCA assumes the data conforms to Euclidean geometry, for specific data types, such as hierarchical and cyclic data structures, other spaces are more appropriate. We study PCA in space forms; that is, those with constant curvatures.

Phylogenetic trees provide a framework for organizing evolutionary histories across the tree of life and aid downstream comparative analyses such as metagenomic identification. Methods that rely on single-marker genes such as 16S rRNA have produced trees of limited accuracy with hundreds of thousands of organisms, whereas methods that use genome-wide data are not scalable to large numbers of genomes. We introduce updating trees using divide-and-conquer (uDance), a method that enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability.

Studies using 16S rRNA and shotgun metagenomics typically yield different results, usually attributed to PCR amplification biases. We introduce Greengenes2, a reference tree that unifies genomic and 16S rRNA databases in a consistent, integrated resource. By inserting sequences into a whole-genome phylogeny, we show that 16S rRNA and shotgun metagenomic data generated from the same samples agree in principal coordinates space, taxonomy and phenotype effect size when analyzed with the same tree.

Motivation: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix.

Motivation: The phylogenetic signal of structural variation informs a more comprehensive understanding of evolution. As (near-)complete genome assembly becomes more commonplace, the next methodological challenge for inferring genome rearrangement trees is the identification of syntenic blocks of orthologous sequences. In this article, we studied 94 reference quality genomes of primarily Mycobacterium tuberculosis (Mtb) isolates as a benchmark to evaluate these methods.
