Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning.

Luca Cappelletti, Lauren Rekerle, Tommaso Fontana, Peter Hansen, Elena Casiraghi, Vida Ravanmehr, Christopher J Mungall, Jeremy J Yang, Leonard Spranger, Guy Karlebach, J Harry Caufield, Leigh Carmody, Ben Coleman, Tudor I Oprea, Justin Reese, Giorgio Valentini, Peter N Robinson

Bioinform Adv

The Jackson Laboratory for Genomic Medicine, CT 06032, United States.

Published: March 2024

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations, called embeddings, of nodes and other graph elements. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges, which represent relationships between pairs of entities, but little to no knowledge is available about negative edges, which represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches sample negative edges for training and evaluating classifiers by uniformly sampling pairs of nodes.

Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users apply graph machine-learning models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then the performance of the models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement.
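The contrast between uniform and degree-aware negative sampling can be illustrated with a minimal Python sketch. The abstract does not spell out the exact procedure, so this sketch assumes one plausible reading: endpoints of negative edges are drawn with probability proportional to node degree, so that negatives resemble positives in degree. The function names, the toy hub-and-leaf graph, and the rejection loop are all hypothetical and are not taken from the authors' repository.

```python
import random
from collections import Counter


def node_degrees(edges):
    """Count how many positive edges touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def sample_negatives(edges, n, degree_aware, seed=0):
    """Draw n distinct node pairs that are not known positive edges.

    degree_aware=False: both endpoints are drawn uniformly at random.
    degree_aware=True: endpoints are drawn proportionally to node degree
    (an assumption of this sketch, not the paper's stated algorithm).
    """
    rng = random.Random(seed)
    deg = node_degrees(edges)
    nodes = sorted(deg)
    weights = [deg[x] for x in nodes] if degree_aware else None
    known = {frozenset(e) for e in edges}
    seen, negatives = set(), []
    tries, max_tries = 0, 1000 * n  # guard against dense graphs
    while len(negatives) < n and tries < max_tries:
        tries += 1
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        # Reject self-pairs, known positives, and repeated negatives.
        if u == v or pair in known or pair in seen:
            continue
        seen.add(pair)
        negatives.append((u, v))
    return negatives


def mean_endpoint_degree(pairs, deg):
    """Average degree over all endpoints of the given node pairs."""
    return sum(deg[u] + deg[v] for u, v in pairs) / (2 * len(pairs))


# Toy graph (hypothetical): 10 hubs, each linked to 20 private leaves.
edges = [(f"hub{h}", f"leaf{h}_{i}") for h in range(10) for i in range(20)]
deg = node_degrees(edges)

uniform = sample_negatives(edges, 100, degree_aware=False)
aware = sample_negatives(edges, 100, degree_aware=True)

print("positives     :", mean_endpoint_degree(edges, deg))
print("uniform negs  :", mean_endpoint_degree(uniform, deg))
print("degree-aware  :", mean_endpoint_degree(aware, deg))
```

In this toy setting, uniformly sampled negatives consist mostly of low-degree leaf nodes, whereas the positive edges always involve a high-degree hub. A classifier could therefore separate the two classes on degree alone, which is the kind of inflation the abstract describes. Weighting the draw by degree brings the negatives' degree profile closer to that of the positives.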

Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10994718
DOI: http://dx.doi.org/10.1093/bioadv/vbae036
