The khmer software package: enabling efficient nucleotide sequence analysis.

Michael R Crusoe, Hussien F Alameldin, Sherine Awad, Elmar Boucher, Adam Caldwell, Reed Cartwright, Amanda Charbonneau, Bede Constantinides, Greg Edvenson, Scott Fay, Jacob Fenton, Thomas Fenzl, Jordan Fish, Leonor Garcia-Gutierrez, Phillip Garland, Jonathan Gluck, Iván González, Sarah Guermond, Jiarong Guo, Aditi Gupta

F1000Res

Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA; Population Health and Reproduction, University of California, Davis, Davis, CA, USA; Computer Science and Engineering, Michigan State University, East Lansing, MI, USA.

Published: November 2015

The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/.
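
To make the ideas of probabilistic k-mer counting and digital normalization concrete, here is a minimal pure-Python sketch. It is not khmer's API or implementation: the names CountMinSketch, kmers, and normalize_reads, the table sizes, and the hashing scheme are all assumptions chosen for illustration. The counter is a Count-Min Sketch style structure, which can overestimate a count because of hash collisions but never underestimates it. The normalization step keeps a read only while the median estimated count of its k-mers is below a cutoff. For brevity the sketch ignores reverse complements.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Illustrative probabilistic counter (hypothetical, not khmer's class)."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        # One fixed-size row of counters per hash function.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # Derive one column index per row by salting the hash with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The minimum across rows is the least collision-inflated estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(seq, k):
    """Return every overlapping length-k word of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_reads(reads, k=5, cutoff=2, sketch=None):
    """Keep a read only if its median estimated k-mer count is below cutoff."""
    if sketch is None:
        sketch = CountMinSketch()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(sketch.count(w) for w in words) < cutoff:
            kept.append(read)
            # Only retained reads contribute to the counts.
            for w in words:
                sketch.add(w)
    return kept
```

Calling normalize_reads on a list of read strings returns the retained subset. Memory use is fixed by width and depth rather than by the number of distinct k-mers, which is the trade-off a probabilistic counter makes in exchange for occasional overestimates.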

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4608353
DOI: http://dx.doi.org/10.12688/f1000research.6924.1

Similar Publications

Autocycler: long-read consensus assembly for bacterial genomes.

Bioinformatics

August 2025

Department of Microbiology and Immunology, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia.

Ryan R Wick, Benjamin P Howden, Timothy P Stinear

Motivation: Long-read sequencing enables complete bacterial genome assemblies, but individual assemblers are imperfect and often produce sequence-level and structural errors. Consensus assembly using Trycycler can improve accuracy, but its lack of automation limits scalability. There is a need for an automated method to generate high-quality consensus bacterial genome assemblies from long-read data.

Integrated histopathologic modeling of detailed tumor subtypes and actionable biomarkers.

bioRxiv

August 2025

Kevin M Boehm, Madison Darmofal, Arfath Pasha, Andrew Aukerman, Raymond Lim

Accurate cancer subtyping with accompanying molecular characterization is critical for precision oncology. While machine learning approaches have been applied to both digital pathology and cancer genomics, previous work has been limited in sample size and has typically aggregated granular cancer subtypes into coarse groupings, likely obfuscating informative molecular and prognostic associations and phenotypic variation of more detailed tumor subtypes. Accordingly, we collated 378,123 hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) with matched targeted DNA clinical sequencing results and OncoTree detailed cancer subtypes from a real-world cohort of 71,142 patients.

On the Coverage Required for Diploid Genome Assembly.

IEEE Trans Comput Biol Bioinform

July 2025

Daanish Mahajan, Chirag Jain, Navin Kashyap

The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome up to switch errors.

Metagenomic CRISPR Array Analysis Tool: a novel graph-based approach to finding CRISPR arrays in metagenomic datasets.

Microlife

July 2025

University of Stuttgart, Institute of Biomedical Genetics, Department of RNA Biology and Bioinformatics, Allmandring 31, 70569 Stuttgart, Germany.

Fikrat Talibli, Björn Voß

Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated genes (CRISPR-Cas) is a bacterial immune system also famous for its use in genome editing. The diversity of known systems could be significantly increased by metagenomic data. Here we present the Metagenomic CRISPR Array Analysis Tool (MCAAT), a highly sensitive algorithm for finding CRISPR arrays in unassembled metagenomic data.

Current Progress in Phased Genome Assembly from Long-Read DNA Sequencing Data.

Methods Mol Biol

July 2025

Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia.

Jorge Ivan Diaz-Riaño, Jorge Duitama

Genome assembly is a core task in the field of genomics. The availability of long-read sequencing technologies enabled the construction of high-quality complex genomes, including phasing of heterozygous contigs. This chapter provides an overview of the main algorithmic techniques for genome assembly.
