As the number of cloud platforms supporting scientific research grows, there is an increasing need to support interoperability between two or more cloud platforms. A well accepted core concept is to make data in cloud platforms Findable, Accessible, Interoperable and Reusable (FAIR). We introduce a companion concept that applies to cloud-based computing environments that we call a ecure and uthorized AIR nvironment (SAFE).
View Article and Find Full Text PDFThe Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution.
View Article and Find Full Text PDFDockstore (https://dockstore.org/) is an open source platform for publishing, sharing, and finding bioinformatics tools and workflows. The platform has facilitated large-scale biomedical research collaborations by using cloud technologies to increase the Findability, Accessibility, Interoperability and Reusability (FAIR) of computational resources, thereby promoting the reproducibility of complex bioinformatics analyses.
View Article and Find Full Text PDFThe Pan-Cancer Analysis of Whole Genomes (PCAWG) project generated a vast amount of whole-genome cancer sequencing resource data. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we provide a user's guide to the five publicly available online data exploration and visualization tools introduced in the PCAWG marker paper. These tools are ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout.
View Article and Find Full Text PDFThe need for portable and reproducible genomics analysis pipelines is growing globally as well as in Africa, especially with the growth of collaborative projects like the Human Health and Heredity in Africa Consortium (H3Africa). The Pan-African H3Africa Bioinformatics Network (H3ABioNet) recognized the need for portable, reproducible pipelines adapted to heterogeneous computing environments, and for the nurturing of technical expertise in workflow languages and containerization technologies. Building on the network's Standard Operating Procedures (SOPs) for common genomic analyses, H3ABioNet arranged its first Cloud Computing and Reproducible Workflows Hackathon in 2016, with the purpose of translating those SOPs into analysis pipelines able to run on heterogeneous computing environments and meeting the needs of H3Africa research projects.
View Article and Find Full Text PDFBackground: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Health and Heredity in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments.
View Article and Find Full Text PDFAs genomic datasets continue to grow, the feasibility of downloading data to a local organization and running analysis on a traditional compute environment is becoming increasingly problematic. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S.
View Article and Find Full Text PDFThe Tyrolean Iceman, a 5,300-year-old Copper age individual, was discovered in 1991 on the Tisenjoch Pass in the Italian part of the Ötztal Alps. Here we report the complete genome sequence of the Iceman and show 100% concordance between the previously reported mitochondrial genome sequence and the consensus sequence generated from our genomic data. We present indications for recent common ancestry between the Iceman and present-day inhabitants of the Tyrrhenian Sea, that the Iceman probably had brown eyes, belonged to blood group O and was lactose intolerant.
View Article and Find Full Text PDFMultiple self-healing squamous epithelioma (MSSE), also known as Ferguson-Smith disease (FSD), is an autosomal-dominant skin cancer condition characterized by multiple squamous-carcinoma-like locally invasive skin tumors that grow rapidly for a few weeks before spontaneously regressing, leaving scars. High-throughput genomic sequencing of a conservative estimate (24.2 Mb) of the disease locus on chromosome 9 using exon array capture identified independent mutations in TGFBR1 in three unrelated families.
View Article and Find Full Text PDFBMC Bioinformatics
December 2010
Background: Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces.
View Article and Find Full Text PDFU87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30x genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library.
View Article and Find Full Text PDFBackground: The emergence of next-generation sequencing technology presents tremendous opportunities to accelerate the discovery of rare variants or mutations that underlie human genetic disorders. Although the complete sequencing of the affected individuals' genomes would be the most powerful approach to finding such variants, the cost of such efforts make it impractical for routine use in disease gene research. In cases where candidate genes or loci can be defined by linkage, association, or phenotypic studies, the practical sequencing target can be made much smaller than the whole genome, and it becomes critical to have capture methods that can be used to purify the desired portion of the genome for shotgun short-read sequencing without biasing allelic representation or coverage.
View Article and Find Full Text PDFPolymorphisms in the interleukin-4 receptor alpha chain (IL-4R alpha) have been linked to asthma incidence and severity, but a causal relationship has remained uncertain. In particular, a glutamine to arginine substitution at position 576 (Q576R) of IL-4R alpha has been associated with severe asthma, especially in African Americans. We show that mice carrying the Q576R polymorphism exhibited intense allergen-induced airway inflammation and remodeling.
View Article and Find Full Text PDFNat Genet
September 2009
Autosomal recessive cutis laxa (ARCL) describes a group of syndromal disorders that are often associated with a progeroid appearance, lax and wrinkled skin, osteopenia and mental retardation. Homozygosity mapping in several kindreds with ARCL identified a candidate region on chromosome 17q25. By high-throughput sequencing of the entire candidate region, we detected disease-causing mutations in the gene PYCR1.
View Article and Find Full Text PDFThe Generic Model Organism Database (GMOD) initiative provides species-agnostic data models and software tools for representing curated model organism data. Here we describe GMODWeb, a GMOD project designed to speed the development of model organism database (MOD) websites. Sites created with GMODWeb provide integration with other GMOD tools and allow users to browse and search through a variety of data types.
View Article and Find Full Text PDFCelsius is a data warehousing system to aggregate Affymetrix CEL files and associated metadata. It provides mechanisms for importing, storing, querying, and exporting large volumes of primary and pre-processed microarray data. Celsius contains ten billion assay measurements and affiliated metadata.
View Article and Find Full Text PDFThe wealth of available genomic data has spawned a corresponding interest in computational methods that can impart biological meaning and context to these experiments. Traditional computational methods have drawn relationships between pairs of proteins or genes based on notions of equality or similarity between their patterns of occurrence or behavior. For example, two genes displaying similar variation in expression, over a number of experiments, may be predicted to be functionally related.
View Article and Find Full Text PDFPLoS Biol
September 2005
Thermophilic organisms flourish in varied high-temperature environmental niches that are deadly to other organisms. Recently, genomic evidence has implicated a critical role for disulfide bonds in the structural stabilization of intracellular proteins from certain of these organisms, contrary to the conventional view that structural disulfide bonds are exclusively extracellular. Here both computational and structural data are presented to explore the occurrence of disulfide bonds as a protein-stabilization method across many thermophilic prokaryotes.
View Article and Find Full Text PDFThe Genomic Disulfide Analysis Program (GDAP) provides web access to computationally predicted protein disulfide bonds for over one hundred microbial genomes, including both bacterial and achaeal species. In the GDAP process, sequences of unknown structure are mapped, when possible, to known homologous Protein Data Bank (PDB) structures, after which specific distance criteria are applied to predict disulfide bonds. GDAP also accepts user-supplied protein sequences and subsequently queries the PDB sequence database for the best matches, scans for possible disulfide bonds and returns the results to the client.
View Article and Find Full Text PDF