PDF Entity Annotation Tool (PEAT).

J Open Source Softw

Office of Research and Development. United States Environmental Protection Agency.

Published: April 2025


Article Abstract

While text mining approaches, including Artificial Intelligence (AI) and other machine-based methods, continue to expand at a rapid pace, the tools researchers use to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of cars and trucks would generally require a set of images with a quantitative description of the underlying features of each vehicle type. Development of labeled textual data that can be used to build natural language machine learning (ML) models for scientific literature is not currently integrated into the manual workflows used by domain experts. Published literature is rich with important information, such as different types of embedded text, plots, and tables, all of which can be used as inputs to train ML/natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult from formats such as PDFs, which are optimized for layout and human readability, not machine readability.

The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications within their current print format, while also allowing those annotations to be captured in a machine-readable format. One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses all structure and provenance that was inherent in the underlying PDF. Also, textual data extraction from PDFs can be an error-prone process. Challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result of these challenges, using existing tools for development of NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable.

We created a system that allows annotations to be completed on the original PDF document structure, with no plain text extraction. The result is an application that allows for easier and more accurate annotations. In addition, by including a feature that lets the user easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning.
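To make the idea of schema-driven, layout-anchored annotation concrete, the following is a minimal Python sketch of how such a record could be represented and exported in a machine-readable form. It is an illustration only and does not reflect PEAT's actual data model or export format: the `Schema` and `Annotation` classes, the `make_annotation` and `export_annotations` helpers, the field names, the "toxicology" tags, and the file name `study.pdf` are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Schema:
    """Hypothetical domain schema: a name plus the tags an annotator may apply."""
    name: str
    tags: tuple


@dataclass
class Annotation:
    """Hypothetical record anchored to the PDF layout rather than to extracted text."""
    document: str  # file name of the annotated PDF
    page: int      # 1-based page number
    bbox: tuple    # (x0, y0, x1, y1) of the highlighted region, in PDF points
    tag: str       # one of the schema's tags
    text: str      # the highlighted span, kept alongside its location


def make_annotation(schema, document, page, bbox, tag, text):
    """Create an annotation, rejecting tags that the schema does not define."""
    if tag not in schema.tags:
        raise ValueError(f"tag {tag!r} is not part of schema {schema.name!r}")
    if page < 1:
        raise ValueError("page numbers are 1-based")
    return Annotation(document, page, tuple(bbox), tag, text)


def export_annotations(schema, annotations):
    """Serialize a schema and its annotations to a JSON string."""
    payload = {
        "schema": {"name": schema.name, "tags": list(schema.tags)},
        "annotations": [asdict(a) for a in annotations],
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Hypothetical schema and annotation, for illustration only.
    toxicology = Schema("toxicology", ("chemical", "species", "dose"))
    note = make_annotation(
        toxicology, "study.pdf", 3, (72.0, 540.5, 210.0, 552.5),
        "chemical", "bisphenol A",
    )
    print(export_annotations(toxicology, [note]))
```

Storing the page number and bounding box with each tag is one way to retain the provenance that is lost when a PDF is flattened to plain text, and validating tags against a schema is one way to let different domains define their own vocabularies.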

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12180754
DOI: http://dx.doi.org/10.21105/joss.05336
