Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

20

Article Abstract

Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.

Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.

Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4502323PMC
http://dx.doi.org/10.1186/1471-2164-16-S8-S2DOI Listing

Publication Analysis

Top Keywords

gencode basic
16
basic set
16
gencode refseq
12
functional annotation
12
reference transcripts
12
gencode comprehensive
12
gencode
9
refseq
8
annotation
8
gene annotation
8

Similar Publications

The Data Distillery Knowledge Graph (DDKG) is a framework for semantic integration and querying of biomedical data across domains. Built for the NIH Common Fund Data Ecosystem, it supports translational research by linking clinical and experimental datasets in a unified graph model. Clinical standards such as ICD-10, SNOMED, and DrugBank are integrated through UMLS, while genomics and basic science data are structured using ontologies and standards such as HPO, GENCODE, Ensembl, STRING, and ClinVar.

View Article and Find Full Text PDF

Deep learning analyses of splicing variants identify the link of PCP4 with amyotrophic lateral sclerosis.

Brain

July 2025

State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Clinical Center for Brain and Spinal Cord Research, School of Medicine, Tongji University, Shanghai 200331, China.

Amyotrophic lateral sclerosis (ALS) is a severe motor neuron disease, with most sporadic cases lacking clear genetic causes. Abnormal pre-mRNA splicing is a fundamental mechanism in neurodegenerative diseases. For example, TAR DNA-binding protein 43 (TDP-43) loss of function causes widespread RNA mis-splicing events in ALS.

View Article and Find Full Text PDF

Recently, a new reference transcript dataset [Matched Annotation from the NCBI and EMBL-EBI (MANE) select] was released by NCBI and EMBL-EBI to make available a new unified representative transcript for human protein-coding genes. While the main purpose of MANE project is to provide a harmonized gene and transcript information standard, there is no explicit tissue expression information about these MANE select transcripts. In this report, we tried to provide useful expression profiles of MANE select transcripts in various normal human tissues to allow further interrogation of their molecular modulations and functional significance.

View Article and Find Full Text PDF

Abundant long, noncoding RNAs (lncRNAs) in mammals can bind to DNA sequences and recruit histone- and DNA-modifying enzymes to binding sites to epigenetically regulate target genes. However, most lncRNAs' binding motifs and target sites are unknown. The large numbers of lncRNAs and target sites in the whole genome make it infeasible to examine lncRNA binding to DNA purely experimentally.

View Article and Find Full Text PDF

We performed proteomic analyses of human olfactory epithelial tissue to identify missing proteins using liquid chromatography-tandem mass spectrometry. Using a next-generation proteomic pipeline with a < 1.0% false discovery rate at the peptide and protein levels, we identified 3731 proteins, among which five were missing proteins (P0C7M7, P46721, P59826, Q658L1, and Q8N434).

View Article and Find Full Text PDF