MetaCache: context-aware classification of metagenomic reads using minhashing.

André Müller, Christian Hundt, Andreas Hildebrandt, Thomas Hankeln, Bertil Schmidt

Bioinformatics

Department of Computer Science.

Published: December 2017


Motivation: Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy.

Results: We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory than the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory, while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves as the amount of utilized reference genome data increases.
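The idea of subsampling k-mers within reads and within local reference regions can be illustrated with a minimal Python sketch. It is not MetaCache's implementation (which is written in C++): the function names, the bottom-s minhash variant, the BLAKE2b hash, and the parameter values `k`, `s` and `window` are all assumptions chosen for illustration. Each reference window and each read is reduced to its `s` smallest k-mer hashes, and a read is assigned to the reference whose window shares the most of those hashes.

```python
import hashlib
import random
from collections import Counter, defaultdict


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(seq: str, k: int = 8, s: int = 4) -> list:
    """Bottom-s minhash: the s smallest distinct k-mer hashes of seq."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def build_index(references: dict, window: int = 32, k: int = 8, s: int = 4) -> dict:
    """Map each sketch hash to the (reference name, window number) pairs it came from."""
    index = defaultdict(set)
    stride = window - k + 1  # consecutive windows overlap by k-1 bases, so no k-mer is skipped
    for name, genome in references.items():
        starts = range(0, max(len(genome) - k + 1, 1), stride)
        for w, start in enumerate(starts):
            for h in sketch(genome[start:start + window], k, s):
                index[h].add((name, w))
    return index


def classify(read: str, index: dict, k: int = 8, s: int = 4):
    """Return the reference whose single window shares the most sketch hashes with the read."""
    hits = Counter()
    for h in sketch(read, k, s):
        for location in index.get(h, ()):
            hits[location] += 1
    if not hits:
        return None  # no shared sketch hash with any reference window
    (name, _window), _count = hits.most_common(1)[0]
    return name


# Toy references over disjoint alphabets, so they cannot share any k-mer.
rng = random.Random(0)
references = {
    "A": "".join(rng.choice("AC") for _ in range(96)),
    "B": "".join(rng.choice("GT") for _ in range(96)),
}
index = build_index(references)
print(classify(references["A"][0:32], index))
```

Because only `s` hashes are stored per window rather than every k-mer, the index grows with the number of windows instead of the number of distinct k-mers, which is the general reason a sketch-based database can be much smaller than an exhaustive k-mer table.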

Availability and Implementation: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache.

Contact: bertil.schmidt@uni-mainz.de.

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btx520

