VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications.

Bioinformatics

Department of Electrical Engineering, Stanford University, Stanford, CA 94035, USA, Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065, USA, Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA, Mayo Clinics, Department of Health Science

Published: May 2015



Article Abstract

Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing using simulated or real data. Rather than simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next-generation sequencing.
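
The idea of comparing variants binned in size ranges can be illustrated with a minimal Python sketch. This is not VarSim's code or its actual algorithm: the bin boundaries, the exact-match rule, the size definition and all function names below are assumptions made only for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin lower bounds in base pairs; not VarSim's actual ranges.
BIN_STARTS = [1, 2, 50, 1000]
BIN_LABELS = ["1", "2-49", "50-999", ">=1000"]


def size_bin(size):
    """Return the label of the bin that contains a variant of `size` bp."""
    if size < 1:
        raise ValueError("variant size must be at least 1 bp")
    return BIN_LABELS[bisect_right(BIN_STARTS, size) - 1]


def variant_size(variant):
    """Size of a (chrom, pos, ref, alt) variant.

    Assumption: size is the length difference between ref and alt,
    with equal-length substitutions counted as 1 bp.
    """
    _, _, ref, alt = variant
    return max(abs(len(ref) - len(alt)), 1)


def binned_accuracy(truth, called):
    """Count TP/FP/FN per size bin and derive precision and recall.

    Matching is exact on (chrom, pos, ref, alt), which is a
    simplification; a real validator would tolerate small offsets.
    """
    truth, called = set(truth), set(called)
    tp, fp, fn = Counter(), Counter(), Counter()
    for v in called:
        (tp if v in truth else fp)[size_bin(variant_size(v))] += 1
    for v in truth - called:
        fn[size_bin(variant_size(v))] += 1

    report = {}
    for label in BIN_LABELS:
        t, p, n = tp[label], fp[label], fn[label]
        if t + p + n == 0:
            continue  # skip bins with no variants at all
        report[label] = {
            "tp": t,
            "fp": p,
            "fn": n,
            "precision": t / (t + p) if t + p else None,
            "recall": t / (t + n) if t + n else None,
        }
    return report
```

Reporting accuracy per bin keeps the far more numerous small variants from masking a caller's performance on large structural variants, which is the motivation for binning in the first place.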

Availability And Implementation: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim.

Contact: rd@bina.com

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4410653
DOI: http://dx.doi.org/10.1093/bioinformatics/btu828

