An Automated Approach for Domain-Specific Knowledge Graph Generation─Graph Measures and Characterization.

Connor O'Ryan, Kevin D Hayes, Francis G VanGessel, Ruth M Doherty, William Wilson, John Fischer, Zois Boukouvalas, Peter W Chung

J Chem Inf Model

Center for Engineering Concepts Development, Department of Mechanical Engineering, University of Maryland, College Park, Maryland 20742, United States.

Published: February 2025

In 2020, nearly 3 million scientific and engineering papers were published worldwide (White, K. Publications Output: U.S. Trends And International Comparisons). The size of the existing literature, the accelerating rate of new publications, and the timely translation of artificial intelligence methods into scientific and engineering communities have spurred the development of automated methods for mining and extracting information from technical documents. However, domain-specific approaches for extracting knowledge graph representations from semantic information remain limited. In this paper, we develop a natural language processing (NLP) approach that extracts knowledge graphs, resulting in a semantically structured network (SSN) that can be queried. After a detailed exposition of the modeling method, the approach is demonstrated for the synthetic chemistry of organic molecules using the text of approximately 100,000 full-length patents. We focus on characterizing the knowledge graph, both to develop insights into the linguistic patterns and trends within the data and to establish objective graph characteristics that may enable comparisons among text-based knowledge graphs across domains. Graph characterization is performed for network motif structures, assortativity, and eigenvector centrality. The structural information provided by these measures reveals language tendencies commonly employed by authors when describing chemical reactions. These include the prevalence of descriptions of specific compound names, the observation that common solvents and drying agents cut across large numbers of chemical synthesis approaches, and the clear emergence of power-law trends in the limit of larger corpora. The findings provide quantitative characterizations of knowledge graphs for use in validation in large data settings.

DOI: http://dx.doi.org/10.1021/acs.jcim.4c01904
