Motivation: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.
Results: We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from which we can search for 1, 10, 100, 1000 and one million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.
Availability and implementation: BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.
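To make the search behaviour described in the abstract concrete, the following is a minimal sketch of a k-mer query that returns the full FASTQ record of every matching read. It is not BEETL-fastq's method or interface: it scans uncompressed records linearly, matches the forward strand only, and assumes four-line FASTQ records. The names `FastqRecord`, `parse_fastq`, `search_kmers` and `to_fastq` are hypothetical and chosen for illustration only.

```python
from typing import Iterable, List, NamedTuple


class FastqRecord(NamedTuple):
    """One read: name (without the leading '@'), bases, and quality string."""
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> List[FastqRecord]:
    """Parse four-line FASTQ text (assumes sequences are not line-wrapped)."""
    lines = [ln for ln in text.splitlines() if ln]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        records.append(FastqRecord(header[1:], sequence, quality))
    return records


def search_kmers(records: Iterable[FastqRecord],
                 kmers: Iterable[str]) -> List[FastqRecord]:
    """Return every record whose sequence contains at least one query k-mer.

    Naive linear scan for illustration; an indexed archive would avoid
    reading every record.
    """
    wanted = set(kmers)
    lengths = {len(k) for k in wanted}
    hits = []
    for rec in records:
        seq = rec.sequence
        if any(seq[i:i + k] in wanted
               for k in lengths
               for i in range(len(seq) - k + 1)):
            hits.append(rec)
    return hits


def to_fastq(rec: FastqRecord) -> str:
    """Serialise a record back to four-line FASTQ text."""
    return f"@{rec.name}\n{rec.sequence}\n+\n{rec.quality}\n"
```

Because each hit is written back out as a complete FASTQ record, the result of such a query can be passed straight to any downstream tool that reads FASTQ, which is the property the abstract emphasises.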
DOI: http://dx.doi.org/10.1093/bioinformatics/btu387
Brief Bioinform
September 2025
Department of Computer Science, Hanyang University, 222 Wangsimni-ro, Seoul 04763, Republic of Korea.
Motivation: Mobile genetic elements (MGEs) play an important role in facilitating the acquisition of antibiotic resistance genes (ARGs) within microbial communities, significantly impacting the evolution of antibiotic resistance. Understanding the mechanism and trajectory of ARG acquisition requires a comprehensive analysis of the ARG-carrying mobilome, a collective set of MGEs carrying ARGs. However, identifying the mobilome within complex microbiomes poses considerable challenges.
Ann N Y Acad Sci
September 2025
Institute of Biological and Chemical Systems, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany.
The genome stores and processes approximately 1.5 gigabytes of encoded information. In this article, we propose that the eukaryotic genome and its adaptable three-dimensional packing in the form of chromatin offer a valuable template for the system architecture of DNA-based digital computers.
J Physiol
September 2025
Department of Physiology, Anatomy & Genetics, University of Oxford and Daegu-Gyeongbuk Institute of Science and Technology, South Korea.
The cardiac pacemaker activity is formed from multiple interlocking physiological networks, any one of which can generate rhythm. The interlocking is reciprocal so that they automatically replace each other. In such interlocking control systems, the association scores for individual components are necessarily low, even though causation, measured by the electric current carried by the relevant ion channels, is large.
J Hered
September 2025
Museum of Vertebrate Zoology, University of California Berkeley, Berkeley, CA 94720, United States.
The wrentit (Chamaea fasciata) is a chaparral and scrub specialist bird found from coastal Oregon to northern Baja California. We generated a draft reference assembly for the species using PacBio HiFi long read and Omni-C chromatin-proximity sequencing data as part of the California Conservation Genomics Project (CCGP). Sequenced reads were assembled into 1342 scaffolds totaling 1.
Single cell technologies have advanced at a rapid pace, providing assays for various molecular phenotypes. Droplet-based single cell technologies, particularly those based on nuclei isolation, such as simultaneous RNA+ATAC single-cell multiome, are susceptible to exogenous ambient molecule contamination, which can increase noise in cell type-level associations. We reasoned that genotype-based sample multiplexing can provide an opportunity to infer this ambient contamination by leveraging DNA variation in sequenced reads.