BEETL-fastq: a searchable compressed archive for DNA reads.

Bioinformatics

Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK.

Published: October 2014


Article Abstract

Motivation: FASTQ is a standard file format for DNA sequencing data that stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as The Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden, but they have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
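To make the search contract concrete, the following is a minimal Python sketch of what "return the full FASTQ record of each matching read" means. It is not BEETL-fastq's method: it scans uncompressed, in-memory records linearly, whereas the tool searches a compressed index. The names parse_fastq, search_kmer, format_fastq and SAMPLE are hypothetical and chosen only for illustration. The sketch assumes four-line FASTQ records and does not check the reverse complement of the query.

```python
from typing import Iterator, List, Tuple

# A FASTQ record as four lines: header, sequence, separator, quality string.
FastqRecord = Tuple[str, str, str, str]

# Hypothetical two-read example, used only for illustration.
SAMPLE = """@read1
ACGTACGTGA
+
IIIIIIIIII
@read2
TTGACCAGTA
+
FFFFFFFFFF
"""


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (multi-line records unsupported)."""
    lines = text.strip().splitlines()
    # Step through the lines in groups of four; a trailing partial group is ignored.
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 2], lines[i + 3]


def search_kmer(records: Iterator[FastqRecord], kmer: str) -> List[FastqRecord]:
    """Naive linear scan: keep every record whose sequence contains the k-mer."""
    return [rec for rec in records if kmer in rec[1]]


def format_fastq(records: List[FastqRecord]) -> str:
    """Serialise records back to FASTQ text so hits can be piped to other tools."""
    return "".join("\n".join(rec) + "\n" for rec in records)
```

Calling format_fastq(search_kmer(parse_fastq(SAMPLE), "GACC")) yields the complete four-line record for read2, including its quality string, which is the property that lets search results feed directly into tools that expect FASTQ input.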

Results: We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, which can then be searched for 1, 10, 100, 1000 and 1 million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
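The figures reported above can be turned into a quick back-of-the-envelope calculation. The Python sketch below uses only the numbers stated in the abstract; the function names and the additive cost model (lookup time plus 20 ms per returned read) are an illustrative reading of the "plus 20 ms per output read" statement, not something taken from the tool itself.

```python
# Search times reported in the abstract: number of 30-mers -> seconds.
REPORTED_SEARCH_SECONDS = {1: 3, 10: 8, 100: 14, 1000: 45, 1_000_000: 567}

# Reported extra cost for each read written to the output, in seconds.
SECONDS_PER_OUTPUT_READ = 0.020


def compression_ratio(raw_tb: float, archived_tb: float) -> float:
    """Ratio of raw FASTQ size to indexed archive size."""
    return raw_tb / archived_tb


def estimated_query_seconds(num_kmers: int, num_output_reads: int) -> float:
    """Reported search time for a benchmarked batch size, plus output cost."""
    return REPORTED_SEARCH_SECONDS[num_kmers] + SECONDS_PER_OUTPUT_READ * num_output_reads


def seconds_per_kmer(num_kmers: int) -> float:
    """Amortised search time per k-mer for a benchmarked batch size."""
    return REPORTED_SEARCH_SECONDS[num_kmers] / num_kmers
```

Under this reading, 6.6 TB shrinking to 1.7 TB is a reduction of roughly 3.9-fold, a batch of 100 k-mers that returns 500 reads would take about 14 + 10 = 24 s, and the amortised cost per k-mer falls from 3 s for a single query to under a millisecond in a batch of one million.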

Availability and Implementation: BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.

Source
http://dx.doi.org/10.1093/bioinformatics/btu387
