Motivation: Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from long runtimes, large memory requirements, or low accuracy.
Results: We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory than the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory, while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves as the amount of utilized reference genome data increases.
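The minhashing scheme described in the Results can be illustrated with a minimal Python sketch. It is not MetaCache's actual implementation: the function names (`kmer_hash`, `sketch`, `build_index`, `classify`), the hash function (BLAKE2b), and the parameter values for k-mer length, sketch size, and window length are all assumptions chosen for readability. The sketch keeps only the smallest few k-mer hashes of each reference window and of each read, then votes for the genome whose windows share the most of those hashes with the read.

```python
import hashlib
from collections import Counter, defaultdict


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int = 8, s: int = 4) -> list[int]:
    """Return the s smallest distinct k-mer hashes of seq (its minhash sketch)."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def build_index(genomes: dict[str, str], window: int = 32,
                k: int = 8, s: int = 4) -> dict[int, set]:
    """Map each sketch hash to the (genome, window number) pairs containing it."""
    index = defaultdict(set)
    stride = window - k + 1  # windows overlap by k-1 so no k-mer is skipped
    for name, seq in genomes.items():
        starts = range(0, max(len(seq) - k + 1, 1), stride)
        for w, start in enumerate(starts):
            for feature in sketch(seq[start:start + window], k, s):
                index[feature].add((name, w))
    return index


def classify(read: str, index: dict[int, set], k: int = 8, s: int = 4):
    """Return the genome with the most sketch hits, or None if nothing matches."""
    votes = Counter()
    for feature in sketch(read, k, s):
        for name, _window in index.get(feature, ()):
            votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Because only `s` hashes are stored per window rather than every k-mer, the index grows with the number of windows instead of the number of distinct k-mers, which is the intuition behind the memory savings reported above. A real classifier would additionally map hits to taxonomic labels and use the window positions to require that hits cluster in a contiguous region.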
Availability And Implementation: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache.
Contact: bertil.schmidt@uni-mainz.de.
Supplementary Information: Supplementary data are available at Bioinformatics online.
DOI: http://dx.doi.org/10.1093/bioinformatics/btx520
Stud Health Technol Inform
September 2025
Department of Computer Science, Kempten University of Applied Sciences, Kempten, Germany.
Introduction: Manual ICD-10 coding of German clinical texts is time-consuming and error-prone. This project aims to develop a semi-automated pipeline for efficient coding of unstructured medical documentation.
State Of The Art: Existing approaches often rely on fine-tuned language models that require large datasets and perform poorly on rare codes, particularly in low-resource languages such as German.
Data Brief
October 2025
Department of Computer Science and Engineering, Daffodil International University, Dhaka 1216, Bangladesh.
Sarcasm is a form of sentiment often used for comedic effect. Its widespread use contributes to frequent misinterpretation of humour-based comments among native Bengali speakers. The growing prevalence of sarcasm in the Bengali language necessitates further study using natural language processing, as detecting Bengali sarcasm remains particularly challenging.
Sensors (Basel)
August 2025
Management Science and Technology Department, Democritus University of Thrace, 65404 Kavala, Greece.
The increased usage of smart sensors has introduced both opportunities and complexities in managing residential energy consumption. Despite advancements in sensor data analytics and machine learning (ML), existing energy management systems (EMS) remain limited in interpretability, adaptability, and user engagement. This paper presents EnergiQ, an intelligent, end-to-end platform that leverages sensors and Large Language Models (LLMs) to bridge the gap between technical energy analytics and user comprehension.
PLoS One
August 2025
Department of Informatics, Faculty of Natural Science and Informatics, Constantine the Philosopher University in Nitra, Nitra, Slovakia.
This study investigates the impact of back-translation on topic classification, comparing its effects on static word vector representations (FastText) and contextual word embeddings (RoBERTa). Our objective was to determine whether back-translation improves classification performance across both types of embeddings. In experiments involving Logistic Regression, Support Vector Machine (SVM), Random Forest, and RNN-LSTM classifiers, we evaluated original datasets against those augmented with back-translated data in six languages.
Med Image Anal
August 2025
School of Computation, Information and Technology, Technical University of Munich, Munich, Germany; School of Medicine, Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany. Electronic address:
Cardiac magnetic resonance (CMR) imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the heart's anatomy and physiology. Patient-level health factors, such as demographic, metabolic, and lifestyle characteristics, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health and to enable the best possible interpretation of an individual's disease risk, CMR and patient-level factors must be jointly exploited within an integrated framework.