Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text.

Bioinform Adv

Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States.

Published: September 2024



Article Abstract

Motivation: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. As biomedical literature continues to grow rapidly, there is an increasing need for automated and accurate extraction of these interactions to facilitate scientific discovery. Pretrained language models, such as generative pretrained transformers and bidirectional encoder representations from transformers, have shown promising results in natural language processing tasks.

Results: We evaluated the performance of PPI identification using multiple transformer-based models across three manually curated gold-standard corpora: Learning Language in Logic with 164 interactions in 77 sentences, Human Protein Reference Database with 163 interactions in 145 sentences, and Interaction Extraction Performance Assessment with 335 interactions in 486 sentences. Models based on bidirectional encoder representations achieved the best overall performance, with BioBERT achieving the highest recall of 91.95% and F1 score of 86.84% on the Learning Language in Logic dataset. Despite not being explicitly trained for biomedical texts, GPT-4 showed commendable performance, comparable to the bidirectional encoder models. Specifically, GPT-4 achieved the highest precision of 88.37%, a recall of 85.14%, and an F1 score of 86.49% on the same dataset. These results suggest that GPT-4 can effectively detect protein interactions from text, offering valuable applications in mining biomedical literature.
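
To make the encoder-based setup concrete, here is a minimal sketch of the general approach described above: treating PPI identification as binary sentence classification with a BioBERT checkpoint and scoring predictions with precision, recall, and F1. This is not the authors' pipeline; the checkpoint name, label convention, and toy sentences are assumptions, and the classification head would need fine-tuning on the gold-standard corpora before the scores mean anything.

```python
# Minimal sketch, NOT the authors' pipeline: sentence-level PPI identification
# framed as binary classification with a BioBERT checkpoint, scored with
# precision/recall/F1. Checkpoint name, label convention, and toy data are
# assumptions; the classification head is untrained until fine-tuned.
import torch
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

# Toy examples: label 1 = the sentence asserts an interaction between proteins.
sentences = [
    "GerE binds to the promoter region of cotB.",
    "SpoIIID and sigma K are both expressed in the mother cell.",
]
gold = [1, 0]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    preds = model(**batch).logits.argmax(dim=-1).tolist()

# Precision, recall, and F1 for the positive (interaction) class.
p, r, f1, _ = precision_recall_fscore_support(gold, preds, average="binary")
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```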

Availability And Implementation: The source code and datasets used in this study are available at https://github.com/hurlab/PPI-GPT-BERT.
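
The repository above contains the actual implementation. As a rough illustration only, the sketch below shows how a prompting-based GPT-4 check for a single candidate protein pair might be set up; the prompt wording, model name, and helper function are hypothetical and not taken from the paper or its repository.

```python
# Rough illustration only: a prompting-based GPT-4 check for one candidate
# protein pair. The prompt wording, model name, and helper are hypothetical.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ppi_answer(sentence: str, protein_a: str, protein_b: str) -> str:
    """Ask the model whether the sentence states that the two proteins interact."""
    prompt = (
        f"Sentence: {sentence}\n"
        f"According to this sentence, do the proteins {protein_a} and {protein_b} "
        "interact? Answer with a single word: yes or no."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(ppi_answer("GerE binds to the promoter region of cotB.", "GerE", "cotB"))
```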


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11419952
DOI: http://dx.doi.org/10.1093/bioadv/vbae133

Publication Analysis

Top Keywords

bidirectional encoder (12), encoder representations (8), learning language (8), language logic (8), interactions (6), models (5), evaluating gpt (4), gpt bert (4), bert models (4), models protein-protein (4)

Similar Publications

Analyzing Reddit Social Media Content in the United States Related to H5N1: Sentiment and Topic Modeling Study.

J Med Internet Res

September 2025

Artificial Intelligence and Mathematical Modeling Lab, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.

Background: The H5N1 avian influenza A virus represents a serious threat to both animal and human health, with the potential to escalate into a global pandemic. Effective monitoring of social media during H5N1 avian influenza outbreaks could potentially offer critical insights to guide public health strategies. Social media platforms like Reddit, with their diverse and region-specific communities, provide a rich source of data that can reveal collective attitudes, concerns, and behavioral trends in real time.

View Article and Find Full Text PDF

The widespread dissemination of fake news presents a critical challenge to the integrity of digital information and erodes public trust. This urgent problem necessitates the development of sophisticated and reliable automated detection mechanisms. This study addresses this gap by proposing a robust fake news detection framework centred on a transformer-based architecture.

View Article and Find Full Text PDF

Knowledge tracing can reveal students' level of knowledge in relation to their learning performance. Recently, many machine learning algorithms have been proposed to implement knowledge tracing and have achieved promising outcomes. However, most previous approaches cannot cope with long-sequence time-series prediction, which is more valuable than the short-sequence prediction used extensively in current knowledge-tracing studies.

View Article and Find Full Text PDF

Purpose: Large language models (LLMs) can assist patients who seek medical knowledge online to guide their own glaucoma care. Understanding the differences in LLM performance on glaucoma-related questions can inform patients about the best resources to obtain relevant information.

Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries.

View Article and Find Full Text PDF

RNA-binding proteins play a pivotal role in the complex process of gene expression and regulation. Accurate prediction of RNA-protein binding sites can help researchers better understand RNA-binding proteins and their related mechanisms, and prediction techniques based on machine learning algorithms offer a cost-effective and efficient way to identify these binding sites.

View Article and Find Full Text PDF