The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is available under the BSD license at https://github.com/dib-lab/khmer/.
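To make the idea of probabilistic k-mer counting and digital normalization more concrete, here is a minimal pure-Python sketch. It does not use khmer and is not khmer's implementation or API: the class `KmerCountMinSketch`, the function `digital_normalization`, the table sizes, and the cutoff are all hypothetical names and values chosen for illustration. It assumes a Count-Min Sketch style structure (several hash tables, with the minimum across tables taken as the count estimate) and the median-abundance rule commonly used to describe digital normalization.

```python
import hashlib
from statistics import median


class KmerCountMinSketch:
    """Toy Count-Min Sketch for k-mers (illustrative only, not khmer's code)."""

    def __init__(self, k, table_sizes):
        self.k = k
        self.table_sizes = list(table_sizes)
        # One counter table per size; distinct sizes reduce shared collisions.
        self.tables = [[0] * size for size in self.table_sizes]

    def _slots(self, kmer):
        # Derive one slot per table by salting the hash with the table index.
        for i, size in enumerate(self.table_sizes):
            digest = hashlib.blake2b(
                kmer.encode("ascii"), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % size

    def add(self, kmer):
        for i, slot in self._slots(kmer):
            self.tables[i][slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is an upper-bounded estimate that never undercounts.
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))

    def kmers(self, sequence):
        return [sequence[i:i + self.k] for i in range(len(sequence) - self.k + 1)]

    def consume(self, sequence):
        for kmer in self.kmers(sequence):
            self.add(kmer)


def digital_normalization(reads, sketch, cutoff):
    """Keep a read only while its median k-mer count is below the cutoff."""
    kept = []
    for read in reads:
        kmers = sketch.kmers(read)
        if not kmers:
            continue  # read shorter than k
        if median(sketch.count(kmer) for kmer in kmers) < cutoff:
            kept.append(read)
            sketch.consume(read)
    return kept


sketch = KmerCountMinSketch(k=4, table_sizes=[1009, 1013, 1019])
reads = ["ACGTACGTAC"] * 5
kept = digital_normalization(reads, sketch, cutoff=2)
print(len(kept), "of", len(reads), "reads kept")
```

The memory footprint is fixed by the table sizes rather than by the number of distinct k-mers, which is the trade-off such a structure makes: counts may be overestimated when k-mers collide, but they are never underestimated.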
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4608353 | PMC |
http://dx.doi.org/10.12688/f1000research.6924.1 | DOI Listing |
Bioinformatics
August 2025
Department of Microbiology and Immunology, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia.
Motivation: Long-read sequencing enables complete bacterial genome assemblies, but individual assemblers are imperfect and often produce sequence-level and structural errors. Consensus assembly using Trycycler can improve accuracy, but its lack of automation limits scalability. There is a need for an automated method to generate high-quality consensus bacterial genome assemblies from long-read data.
Accurate cancer subtyping with accompanying molecular characterization is critical for precision oncology. While machine learning approaches have been applied to both digital pathology and cancer genomics, previous work has been limited in sample size and has typically aggregated granular cancer subtypes into coarse groupings, likely obfuscating informative molecular and prognostic associations and phenotypic variation of more detailed tumor subtypes. Accordingly, we collated 378,123 hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) with matched targeted DNA clinical sequencing results and OncoTree detailed cancer subtypes from a real-world cohort of 71,142 patients.
The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome up to switch errors.
Microlife
July 2025
University of Stuttgart, Institute of Biomedical Genetics, Department of RNA Biology and Bioinformatics, Allmandring 31, 70569 Stuttgart, Germany.
Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated genes (CRISPR-Cas) is a bacterial immune system also famous for its use in genome editing. The diversity of known systems could be significantly increased by metagenomic data. Here we present the Metagenomic CRISPR Array Analysis Tool (MCAAT), a highly sensitive algorithm for finding CRISPR arrays in unassembled metagenomic data.
Methods Mol Biol
July 2025
Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia.
Genome assembly is a core task in the field of genomics. The availability of long-read sequencing technologies enabled the construction of high-quality complex genomes, including phasing of heterozygous contigs. This chapter provides an overview of the main algorithmic techniques for genome assembly.