BEETL-fastq: a searchable compressed archive for DNA reads.

Lilian Janin, Ole Schulz-Trieglaff, Anthony J Cox

Bioinformatics

Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK.

Published: October 2014

Motivation: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
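The search behaviour described above (k-mer queries that return the full FASTQ record of each matching read) can be illustrated with a small sketch. This is not BEETL-fastq's implementation: it swaps in a plain in-memory dictionary from k-mers to read ids, with no compression at all, and every name in it (`FastqRecord`, `parse_fastq`, `build_kmer_index`, `search`, `format_fastq`, `EXAMPLE`) is hypothetical. It also assumes simple four-line FASTQ records and ignores reverse complements and read pairing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


# Toy input, invented for illustration only.
EXAMPLE = """@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTTTGGGGCC
+
JJJJJJJJJJ
@read3
GGACGTACGG
+
HHHHHHHHHH
"""


def parse_fastq(text: str) -> List[FastqRecord]:
    """Parse four-line FASTQ records (assumes sequences are not line-wrapped)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        records.append(FastqRecord(header[1:], seq, qual))
    return records


def build_kmer_index(records: List[FastqRecord], k: int) -> Dict[str, Set[int]]:
    """Map every k-mer to the ids of the reads that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for read_id, rec in enumerate(records):
        for start in range(len(rec.sequence) - k + 1):
            index[rec.sequence[start:start + k]].add(read_id)
    return index


def search(records: List[FastqRecord],
           index: Dict[str, Set[int]],
           kmers: Iterable[str]) -> List[FastqRecord]:
    """Return the full record of every read matching at least one query k-mer."""
    hits: Set[int] = set()
    for kmer in kmers:
        hits |= index.get(kmer, set())
    return [records[i] for i in sorted(hits)]


def format_fastq(rec: FastqRecord) -> str:
    """Serialize a record back to FASTQ so it can be piped to other tools."""
    return f"@{rec.name}\n{rec.sequence}\n+\n{rec.quality}\n"


if __name__ == "__main__":
    reads = parse_fastq(EXAMPLE)
    kmer_index = build_kmer_index(reads, k=5)
    for record in search(reads, kmer_index, ["CGTAC"]):
        print(format_fastq(record), end="")
```

Because the matches are written back out as complete FASTQ records, the output of such a search can be fed unchanged to any downstream tool that reads FASTQ, which is the property the abstract emphasizes. A dictionary index like this one grows with the data, so it only conveys the interface, not how a compressed archive achieves it.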

Results: We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from which we can search for 1, 10, 100, 1000 and 1 million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
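The figures reported above can be combined into a rough cost estimate. The sketch below uses only the numbers stated in the abstract (archive sizes, search times for the five reported query counts, and 20 ms per output read); the function names and the example of 5000 output reads are assumptions made for illustration, and no interpolation between the reported query counts is attempted.

```python
# Search times (seconds) for the query counts reported in the abstract.
REPORTED_SEARCH_SECONDS = {1: 3, 10: 8, 100: 14, 1_000: 45, 1_000_000: 567}

# Additional cost reported per output read: 20 ms.
SECONDS_PER_OUTPUT_READ = 0.020


def compression_ratio(original_tb: float, archived_tb: float) -> float:
    """Size of the original FASTQ data relative to the indexed archive."""
    return original_tb / archived_tb


def estimated_total_seconds(num_queries: int, num_output_reads: int) -> float:
    """Reported search time plus the per-read output cost.

    Only the query counts listed in the abstract are supported.
    """
    if num_queries not in REPORTED_SEARCH_SECONDS:
        raise KeyError(f"no reported timing for {num_queries} queries")
    search_time = REPORTED_SEARCH_SECONDS[num_queries]
    return search_time + SECONDS_PER_OUTPUT_READ * num_output_reads


if __name__ == "__main__":
    # 6.6 TB of FASTQ stored as 1.7 TB of indexed files: a bit under 4x.
    print(f"compression ratio: {compression_ratio(6.6, 1.7):.2f}x")
    # Hypothetical workload: 1000 queries that together return 5000 reads.
    print(f"estimated time: {estimated_total_seconds(1_000, 5_000):.0f} s")
```

Under these figures the output cost quickly dominates: for the hypothetical 5000 returned reads, writing the results takes about 100 s against 45 s of search time.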

Availability and Implementation: BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.

DOI: http://dx.doi.org/10.1093/bioinformatics/btu387

Similar Publications

DeepMobilome: predicting mobile genetic elements using sequencing reads of microbiomes.

Brief Bioinform

September 2025

Department of Computer Science, Hanyang University, 222 Wangsimni-ro, Seoul 04763, Republic of Korea.

Youna Cho, Erin Kim, Minyoung Kim, Mina Rho

Motivation: Mobile genetic elements (MGEs) play an important role in facilitating the acquisition of antibiotic resistance genes (ARGs) within microbial communities, significantly impacting the evolution of antibiotic resistance. Understanding the mechanism and trajectory of ARG acquisition requires a comprehensive analysis of the ARG-carrying mobilome, a collective set of MGEs carrying ARGs. However, identifying the mobilome within complex microbiomes poses considerable challenges.

Chromatin-associated condensates as an inspiration for the system architecture of future DNA computers.

Ann N Y Acad Sci

September 2025

Institute of Biological and Chemical Systems, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany.

Lennart Hilbert, Aaron Gadzekpo, Simon Lo Vecchio, Mona Wellhäusser, Xenia Tschurikow

The genome stores and processes approximately 1.5 gigabytes of encoded information. In this article, we propose that the eukaryotic genome and its adaptable three-dimensional packing in the form of chromatin offer a valuable template for the system architecture of DNA-based digital computers.

The cardiac pacemakers: A paradigm of robustness in evolutionary biology.

J Physiol

September 2025

Department of Physiology, Anatomy & Genetics, University of Oxford and Daegu-Gyeongbuk Institute of Science and Technology, South Korea.

Denis Noble

The cardiac pacemaker activity is formed from multiple interlocking physiological networks, any one of which can generate rhythm. The interlocking is reciprocal so that they automatically replace each other. In such interlocking control systems, the association scores for individual components are necessarily low, even though causation, measured by the electric current carried by the relevant ion channels, is large.

A Highly Contiguous Genome Assembly for the Wrentit (Chamaea fasciata), the Sole Representative of the Babbler Radiation in the Americas.

J Hered

September 2025

Museum of Vertebrate Zoology, University of California Berkeley, Berkeley, CA 94720, United States.

Phred M Benham, Carla Cicero, Kevin Burns, Merly Escalona, Eric Beraut

The wrentit (Chamaea fasciata) is a chaparral and scrub specialist bird found from coastal Oregon to northern Baja California. We generated a draft reference assembly for the species using PacBio HiFi long read and Omni-C chromatin-proximity sequencing data as part of the California Conservation Genomics Project (CCGP). Sequenced reads were assembled into 1342 scaffolds totaling 1.

Integrated ambient modeling and genetic demultiplexing of single-cell RNA+ATAC multiome experiments with Ambimux.

bioRxiv

August 2025

Marcus Alvarez, Terence Li, Seung Hyuk T Lee, Uma Thanigai Arasu, Ilakya Selvarajan

Single cell technologies have advanced at a rapid pace, providing assays for various molecular phenotypes. Droplet-based single cell technologies, particularly those based on nuclei isolation, such as simultaneous RNA+ATAC single-cell multiome, are susceptible to exogenous ambient molecule contamination, which can increase noise in cell type-level associations. We reasoned that genotype-based sample multiplexing can provide an opportunity to infer this ambient contamination by leveraging DNA variation in sequenced reads.
