Whole-genome long-read sequencing downsampling and its effect on variant-calling precision and recall.

William T Harvey, Peter Ebert, Jana Ebler, Peter A Audano, Katherine M Munson, Kendra Hoekzema, David Porubsky, Christine R Beck, Tobias Marschall, Kiran Garimella, Evan E Eichler

Genome Res

Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195-5065, USA;

Published: December 2023

Advances in long-read sequencing (LRS) technologies continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy, and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant-calling precision and recall of the Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage, with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant-calling precision and recall of SVs and indels in HiFi data sets, with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant call sets. While both technologies continue to evolve, our work offers guidance for designing cost-effective experimental strategies that do not compromise on discovering novel biology.
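
The abstract reports accuracy as precision, recall, and an F score against a coverage target, so a small worked example may help make those quantities concrete. The sketch below is illustrative only and is not the authors' pipeline: the function names and the example counts are hypothetical, the F score is assumed to be the standard F1 (harmonic mean of precision and recall), and the downsampling fraction assumes reads are subsampled uniformly at random so that coverage scales linearly with the fraction kept.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative variant counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def downsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep to go from current to target coverage,
    assuming uniform random subsampling (hypothetical helper)."""
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    # Never ask for more reads than are available.
    return min(1.0, target_coverage / current_coverage)


if __name__ == "__main__":
    # Made-up counts for illustration, not results from the study.
    p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
    # Example: reduce a 30-fold data set to the 12-fold level discussed above.
    print(f"keep fraction={downsample_fraction(30.0, 12.0):.2f}")
```

Under these assumptions, a call set clears the "F1 above 0.5" bar only when neither precision nor recall is very low, because the harmonic mean is dominated by the smaller of the two.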

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10760522
DOI: http://dx.doi.org/10.1101/gr.278070.123

Similar Publications

TMBquant: an explainable AI-powered caller advancing tumor mutation burden quantification across heterogeneous samples.

Brief Bioinform

August 2025

Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University, No. 157, Xiwu Road, Xincheng District, Xi'an 710004, China.

Shenjie Wang, Xiaonan Wang, Xiaoyan Zhu, Xuwen Wang, Yuqian Liu

Accurate tumor mutation burden (TMB) quantification is critical for immunotherapy stratification, yet remains challenging due to variability across sequencing platforms, tumor heterogeneity, and variant calling pipelines. Here, we introduce TMBquant, an explainable AI-powered caller designed to optimize TMB estimation through dynamic feature selection, ensemble learning, and automated strategy adaptation. Built upon the H2O AutoML framework, TMBquant integrates variant features, minimizes classification errors, and enhances both accuracy and stability across diverse datasets.

Performance comparison of germline variant calling tools in sporadic disease cohorts.

Mol Genet Genomics

September 2025

Human Phenome Institute, MOE Key Laboratory of Contemporary Anthropology, Zhangjiang Fudan International Innovation Center, Fudan University, 825 Zhangheng Road, Shanghai, 201203, China.

Qiaofeng Song, Jinglan Zhai, Changshui Chen, Haibo Li, Aihua Cao

Accurate variant calling is essential for next-generation sequencing (NGS)-based diagnosis of rare diseases, yet most benchmarking studies have focused on standard cell lines or trio-based samples, with limited relevance to sporadic cases. Here, we systematically compared the performance of DeepVariant and GATK HaplotypeCaller in two Chinese cohorts of patients with sporadic epilepsy (EP) and autism spectrum disorder (ASD). DeepVariant exhibited higher precision and sensitivity in detecting single nucleotide variants (SNVs), while GATK showed a distinct advantage in identifying rare variants, which are often key to understanding the genetic basis of rare diseases.

Modular and cloud-based bioinformatics pipelines for high-confidence biomarker detection in cancer immunotherapy clinical trials.

PLoS One

August 2025

The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, Maryland, United States of America.

Cu Nguyen, Trinh Nguyen, Gloria Trivitt, Brian Capaldo, Chunhua Yan

Background: The Cancer Immune Monitoring and Analysis Centers - Cancer Immunologic Data Center (CIMAC-CIDC) network aims to improve cancer immunotherapy by providing harmonized molecular assays and standardized bioinformatics analysis.

Results: In response to evolving bioinformatics standards and the migration of the CIDC to the National Cancer Institute (NCI), we undertook the enhancement of the CIDC's extant whole exome sequencing (WES) and RNA sequencing (RNA-Seq) pipelines. Leveraging open-source tools and cloud-based technologies, we implemented modular workflows using Snakemake and Docker for efficient deployment on the Google Cloud Platform (GCP).

RnaXtract, a tool for extracting gene expression, variants, and cell-type composition from bulk RNA sequencing.

Sci Rep

August 2025

Endocrinology and Nephrology Axis, Centre de recherche du CHU de Québec-Université Laval, Québec, Québec, G1V 4G2, Canada.

Sophiane G Bouirdene, Simon Gotty, Mickaël Leclercq, Charles Joly-Beauparlant, Emeric Texeraud

RNA sequencing (RNA-seq) is a widely used method in transcriptomics research, offering insights into gene expression, variant discovery, and, when deconvoluted, the cellular composition of complex tissues. However, existing RNA-seq pipelines frequently emphasize gene expression analysis and often lack cell deconvolution and variant calling. To address these limitations, we present RnaXtract, a comprehensive and user-friendly pipeline designed to maximize extraction of valuable information from bulk RNA-seq data.

SV-MeCa: an XGBoost-based meta-caller approach for structural variant calling from short-read data.

BMC Bioinformatics

August 2025

Cologne Center for Genomics, University of Cologne, Medical Faculty, Cologne, Germany.

Rudel Christian Nkouamedjo Fankep, Arda Söylev, Anna-Lena Kobiela, Jochen Blom, Corinna Ernst

Background: Calling structural variants (SVs), i.e., genomic alterations of ≥50 bp, from whole-genome short-read data remains challenging, as existing callers are known to lack accuracy and robustness.
