SCARAP: scalable cross-species comparative genomics of prokaryotes.

Bioinformatics

Lab of Applied Microbiology and Biotechnology, Department of Bioscience Engineering, University of Antwerp, Antwerpen 2020, Belgium.

Published: December 2024



Article Abstract

Motivation: Much of prokaryotic comparative genomics currently relies on two critical computational tasks: pangenome inference and core genome inference. Pangenome inference involves clustering genes from a set of genomes into gene families, enabling genome-wide association studies and evolutionary history analysis. The core genome represents gene families present in nearly all genomes and is required to infer a high-quality phylogeny. For species-level datasets, fast pangenome inference tools have been developed. However, tools applicable to more diverse datasets are currently slow and scale poorly.
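To make the two definitions above concrete, here is a minimal Python sketch of extracting a core genome from an already-inferred pangenome. It is not SCARAP's implementation or API. The tuple layout `(gene_id, genome_id, family_id)`, the function name `core_families`, and the 95% default threshold for "nearly all genomes" are assumptions chosen for illustration.

```python
from collections import defaultdict


def core_families(gene_table, min_fraction=0.95):
    """Return the gene families present in at least `min_fraction` of genomes.

    gene_table: iterable of (gene_id, genome_id, family_id) tuples, i.e. a
    pangenome in long format (hypothetical layout, assumed for this sketch).
    """
    genomes = set()
    genomes_per_family = defaultdict(set)
    for _gene, genome, family in gene_table:
        genomes.add(genome)
        genomes_per_family[family].add(genome)

    n_genomes = len(genomes)
    # A family counts as core when enough distinct genomes carry it;
    # multiple copies within one genome are counted once.
    return sorted(
        family
        for family, members in genomes_per_family.items()
        if len(members) / n_genomes >= min_fraction
    )


# Toy pangenome: three genomes (A, B, C) and three families.
table = [
    ("g1", "A", "F1"), ("g2", "A", "F2"),
    ("g3", "B", "F1"), ("g4", "B", "F3"),
    ("g5", "C", "F1"), ("g6", "C", "F2"),
]
print(core_families(table, min_fraction=1.0))  # ['F1']
```

In the toy data only F1 occurs in all three genomes; lowering `min_fraction` to 0.6 would also admit F2, which occurs in two of three.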

Results: Here, we introduce SCARAP, a program containing three modules for comparative genomics analyses: a fast and scalable pangenome inference module, a direct core genome inference module, and a module for subsampling representative genomes. When benchmarked against existing tools, the SCARAP pan module proved up to an order of magnitude faster with comparable accuracy. The core module was validated by comparing its result against a core genome extracted from a full pangenome. The sample module demonstrated the rapid sampling of genomes with decreasing novelty. Applied to a dataset of over 31 000 Lactobacillales genomes, SCARAP showcased its ability to derive a representative pangenome. Finally, we applied the novel concept of gene fixation frequency to this pangenome, showing that Lactobacillales genes that are prevalent but rarely fixate in species often encode bacteriophage functions.
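The abstract does not define gene fixation frequency. As a rough illustration only, the sketch below assumes one plausible reading: among the species in which a gene family occurs, the fraction in which it is "fixed", meaning present in at least a chosen share of that species' genomes. The record layout, the function name `fixation_frequency`, and the 0.9 threshold are all hypothetical, and the definition in the paper may differ.

```python
from collections import defaultdict


def fixation_frequency(records, fixed_fraction=0.9):
    """Map each family to (species where fixed) / (species where present).

    records: iterable of (genome_id, species_id, family_ids) tuples
    (hypothetical layout). A family is treated as fixed in a species when
    it occurs in at least `fixed_fraction` of that species' genomes; this
    threshold is an assumption, not a value from the paper.
    """
    genomes_per_species = defaultdict(int)
    # family -> species -> number of genomes of that species carrying it
    carriers = defaultdict(lambda: defaultdict(int))
    for _genome, species, family_ids in records:
        genomes_per_species[species] += 1
        for family in set(family_ids):
            carriers[family][species] += 1

    result = {}
    for family, per_species in carriers.items():
        n_fixed = sum(
            1
            for species, count in per_species.items()
            if count / genomes_per_species[species] >= fixed_fraction
        )
        result[family] = n_fixed / len(per_species)
    return result


# Toy data: two species with two genomes each.
records = [
    ("A", "S1", {"core", "phage"}),
    ("B", "S1", {"core"}),
    ("C", "S2", {"core", "phage"}),
    ("D", "S2", {"core"}),
]
freqs = fixation_frequency(records)
print(freqs["core"], freqs["phage"])  # 1.0 0.0
```

Under this reading, the toy "phage" family is prevalent (it occurs in both species) yet fixed in neither, which is the kind of pattern the abstract reports for bacteriophage-associated genes.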

Availability and Implementation: The SCARAP toolkit is publicly available at https://github.com/swittouck/scarap.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681940
DOI: http://dx.doi.org/10.1093/bioinformatics/btae735

