On InChI and evaluating the quality of cross-reference links.

J Cheminform

Institute of Organic Chemistry and Biochemistry, Academy of Sciences of the Czech Republic, Flemingovo nam. 2, 166 10 Prague 6, Czech Republic.

Published: May 2014

Background: There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.
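To illustrate what an automatically generated cross-reference link can look like, here is a minimal sketch that links entries of two databases whenever their InChI strings are identical. The database contents and the entry IDs (`A1`, `B7`, and so on) are hypothetical, and exact string matching is an assumption made for illustration, not necessarily the procedure used in the study.

```python
from collections import defaultdict


def auto_links(db_a, db_b):
    """Return the set of (id_a, id_b) pairs whose InChI strings are identical.

    db_a and db_b map entry IDs to InChI strings (hypothetical layout).
    """
    # Index the second database by InChI so each lookup is O(1).
    by_inchi = defaultdict(list)
    for entry_id, inchi in db_b.items():
        by_inchi[inchi].append(entry_id)

    return {
        (a_id, b_id)
        for a_id, inchi in db_a.items()
        for b_id in by_inchi.get(inchi, [])
    }


# Hypothetical entries: ethanol and methane in A, ethanol and water in B.
db_a = {
    "A1": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "A2": "InChI=1S/CH4/h1H4",
}
db_b = {
    "B7": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "B9": "InChI=1S/H2O/h1H2",
}

links = auto_links(db_a, db_b)
print(links)
```

Only the two ethanol entries share an InChI here, so they form the single generated link.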

Results: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by ambiguity in how the input structural data are interpreted. InChI identifiers were used successfully to find duplicate entries in databases. We found that the rate of InChI inconsistency in the manually curated links was very high (28.85% in the worst case). Even under a weaker definition of consistency, the measured inconsistency values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.
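The abstract does not give the exact definitions of the measured quantities, so the following sketch rests on assumed ones: a duplicate is a group of entries in one database sharing an InChI, the inconsistency rate is the fraction of curated links whose two endpoints have different InChI strings, and completeness is the fraction of InChI-derived links that also appear among the curated links. All data and IDs are hypothetical.

```python
from collections import defaultdict


def find_duplicates(db):
    """Group entry IDs of one database that share the same InChI string."""
    groups = defaultdict(list)
    for entry_id, inchi in db.items():
        groups[inchi].append(entry_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def inconsistency_rate(curated, db_a, db_b):
    """Assumed definition: share of curated links joining different InChIs."""
    # Skip links whose endpoints have no InChI available.
    checked = [(a, b) for a, b in curated if a in db_a and b in db_b]
    if not checked:
        return 0.0
    mismatched = sum(1 for a, b in checked if db_a[a] != db_b[b])
    return mismatched / len(checked)


def completeness(curated, automatic):
    """Assumed definition: share of automatic links also present as curated."""
    if not automatic:
        return 1.0
    return len(curated & automatic) / len(automatic)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANE = "InChI=1S/CH4/h1H4"
WATER = "InChI=1S/H2O/h1H2"

db_a = {"A1": ETHANOL, "A2": METHANE, "A3": ETHANOL}  # A1 and A3 are duplicates
db_b = {"B7": ETHANOL, "B9": WATER}

curated = {("A1", "B7"), ("A2", "B9")}    # hypothetical curated links
automatic = {("A1", "B7"), ("A3", "B7")}  # links implied by identical InChIs

duplicates = find_duplicates(db_a)
bad_share = inconsistency_rate(curated, db_a, db_b)
coverage = completeness(curated, automatic)
print(duplicates, bad_share, coverage)
```

In this toy data, one of the two curated links joins methane to water, and one of the two InChI-derived links is missing from the curated set, so both measures come out at one half.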

Conclusions: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links when they are measured using InChI identifiers. However, inconsistency can be caused by both errors in the manually curated links and the inherent limitations of the InChI method.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4005828
DOI: http://dx.doi.org/10.1186/1758-2946-6-15
