Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction.

Adam Frankish, Barbara Uszczynska, Graham R S Ritchie, Jose M Gonzalez, Dmitri Pervouchine, Robert Petryszak, Jonathan M Mudge, Nuno Fonseca, Alvis Brazma, Roderic Guigo, Jennifer Harrow

BMC Genomics

Published: March 2016

Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation, and a wide range of tools is available to this end. McCarthy et al. recently demonstrated large differences in the prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.

Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs and novel exons, and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNA-seq data, we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation when GENCODE and RefSeq are used as reference transcripts, although these differences are predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically relevant filter for further reducing the complexity of the transcriptome.

Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
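As a rough illustration of the kind of geneset comparison the abstract describes, the sketch below counts exons shared between two annotations and exons unique to each. It is not the authors' pipeline: the function names (`exon_set`, `compare_exons`) and the toy records are hypothetical, and it assumes both inputs are GTF-style, tab-separated lines whose chromosome names have already been harmonized.

```python
from typing import Dict, Iterable, Set, Tuple

# An exon is identified by its coordinates only (an assumption of this sketch).
Exon = Tuple[str, int, int, str]  # (chromosome, start, end, strand)


def exon_set(gtf_lines: Iterable[str]) -> Set[Exon]:
    """Collect the distinct exon coordinates from GTF-style lines."""
    exons: Set[Exon] = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[2] != "exon":
            continue  # keep only exon features
        exons.add((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return exons


def compare_exons(a: Set[Exon], b: Set[Exon]) -> Dict[str, Set[Exon]]:
    """Split exons into those shared by both sets and those unique to each."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical toy records, not taken from GENCODE or RefSeq.
GENESET_A = [
    "chr1\ttoyA\tgene\t100\t650\t.\t+\t.\tgene_id \"G1\";",
    "chr1\ttoyA\texon\t100\t200\t.\t+\t.\tgene_id \"G1\";",
    "chr1\ttoyA\texon\t300\t400\t.\t+\t.\tgene_id \"G1\";",
    "chr1\ttoyA\texon\t500\t650\t.\t+\t.\tgene_id \"G1\";",
]
GENESET_B = [
    "chr1\ttoyB\texon\t100\t200\t.\t+\t.\tgene_id \"G1\";",
    "chr1\ttoyB\texon\t300\t420\t.\t+\t.\tgene_id \"G1\";",
]

result = compare_exons(exon_set(GENESET_A), exon_set(GENESET_B))
for label, exons in result.items():
    print(label, len(exons))
```

Matching on exact coordinates is the strictest possible definition of a shared exon. An exon that differs by a single boundary, such as the 300-400 versus 300-420 pair in the toy data, counts as unique to each set.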

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4502323
DOI: http://dx.doi.org/10.1186/1471-2164-16-S8-S2

Publication Analysis

Top Keywords: gencode basic, basic set, gencode refseq, functional annotation, reference transcripts, gencode comprehensive, gencode, refseq, annotation, gene annotation

Similar Publications

The Data Distillery: A Graph Framework for Semantic Integration and Querying of Biomedical Data.

bioRxiv

August 2025

Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA USA.

Taha Mohseni Ahooyi, Benjamin Stear, J Alan Simmons, Vincent T Metzger, Praveen Kumar

The Data Distillery Knowledge Graph (DDKG) is a framework for semantic integration and querying of biomedical data across domains. Built for the NIH Common Fund Data Ecosystem, it supports translational research by linking clinical and experimental datasets in a unified graph model. Clinical standards such as ICD-10, SNOMED, and DrugBank are integrated through UMLS, while genomics and basic science data are structured using ontologies and standards such as HPO, GENCODE, Ensembl, STRING, and ClinVar.

Deep learning analyses of splicing variants identify the link of PCP4 with amyotrophic lateral sclerosis.

Brain

July 2025

State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Clinical Center for Brain and Spinal Cord Research, School of Medicine, Tongji University, Shanghai 200331, China.

Xuelin Tang, Yan Chen, Yongfei Ren, Wanli Yang, Wendi Yu

Amyotrophic lateral sclerosis (ALS) is a severe motor neuron disease, with most sporadic cases lacking clear genetic causes. Abnormal pre-mRNA splicing is a fundamental mechanism in neurodegenerative diseases. For example, TAR DNA-binding protein 43 (TDP-43) loss of function causes widespread RNA mis-splicing events in ALS.

TEx-MST: tissue expression profiles of MANE select transcripts.

Database (Oxford)

September 2022

Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan, R.O.C.

Kuo-Feng Tung, Wen-Chang Lin

Recently, a new reference transcript dataset [Matched Annotation from the NCBI and EMBL-EBI (MANE) select] was released by NCBI and EMBL-EBI to make available a new unified representative transcript for human protein-coding genes. While the main purpose of the MANE project is to provide a harmonized standard for gene and transcript information, there is no explicit tissue expression information about these MANE select transcripts. In this report, we provide expression profiles of MANE select transcripts in various normal human tissues to allow further interrogation of their molecular modulations and functional significance.

Pipelines for cross-species and genome-wide prediction of long noncoding RNA binding.

Nat Protoc

March 2019

Bioinformatics Section, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China.

Jie Lin, Yujian Wen, Sha He, Xiaoxue Yang, Hai Zhang

Abundant long, noncoding RNAs (lncRNAs) in mammals can bind to DNA sequences and recruit histone- and DNA-modifying enzymes to binding sites to epigenetically regulate target genes. However, most lncRNAs' binding motifs and target sites are unknown. The large numbers of lncRNAs and target sites in the whole genome make it infeasible to examine lncRNA binding to DNA purely experimentally.

Identification of Missing Proteins in Human Olfactory Epithelial Tissue by Liquid Chromatography-Tandem Mass Spectrometry.

J Proteome Res

December 2018

Biomedical Omics Research, Korea Basic Science Institute, Cheongju, Korea.

Heeyoun Hwang, Ji Eun Jeong, Hyun Kyoung Lee, Ki Na Yun, Hyun Joo An

We performed proteomic analyses of human olfactory epithelial tissue to identify missing proteins using liquid chromatography-tandem mass spectrometry. Using a next-generation proteomic pipeline with a <1.0% false discovery rate at the peptide and protein levels, we identified 3731 proteins, among which five were missing proteins (P0C7M7, P46721, P59826, Q658L1, and Q8N434).
