Article Abstract

Motivation: Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, the corresponding software tools suffer from long runtimes, large memory requirements, or low accuracy.

Results: We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory than the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory, while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves as the amount of utilized reference genome data increases.
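
The idea of subsampling k-mers in both reads and reference windows can be illustrated with a minimal sketch. This is not MetaCache's actual implementation. The function names, the hash function (BLAKE2b), and the parameter values (`k=8`, sketch size `s=4`, window length 32) are assumptions chosen only to keep the example small. Each reference genome is cut into overlapping windows, each window is reduced to its `s` smallest k-mer hashes, and a read is assigned to the genome whose window shares the most sketch hashes with the read.

```python
import hashlib
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=8, s=4):
    """Minhash sketch: the s smallest distinct k-mer hashes found in seq."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]


def build_index(references, k=8, s=4, window=32):
    """Map each sketch hash to the (genome name, window number) it came from."""
    index = defaultdict(set)
    stride = window - k + 1  # neighbouring windows overlap by k-1 bases
    for name, genome in references.items():
        starts = range(0, max(len(genome) - k + 1, 1), stride)
        for w, start in enumerate(starts):
            for h in sketch(genome[start:start + window], k, s):
                index[h].add((name, w))
    return index


def classify(read, index, k=8, s=4):
    """Return the genome whose window shares the most sketch hashes with the read."""
    hits = Counter()
    for h in sketch(read, k, s):
        for location in index.get(h, ()):
            hits[location] += 1
    if not hits:
        return None  # no sketch hash of the read occurs in the index
    (name, _window), _count = hits.most_common(1)[0]
    return name
```

Because only `s` hashes are stored per window rather than every k-mer, the index grows with the number of windows and not with the number of distinct k-mers, which is the intuition behind the memory savings described above. A real classifier would additionally map the best-matching genome to a taxonomic label and handle reverse complements, both of which are omitted here.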

Availability and Implementation: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache.

Contact: bertil.schmidt@uni-mainz.de.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source: http://dx.doi.org/10.1093/bioinformatics/btx520

