ExtractPDF: A data extraction tool for scientific papers applied to a systematic scoping review in public health.

Gunn E Vist, Trine Husøy, Michael Guy Diemar, Hubert Dirven, Erwin L Roggen, Maria E Kalyva

Comput Methods Programs Biomed

Norwegian Institute of Public Health, Division of Climate and Environmental Health, Oslo, Norway.

Published: October 2025

Background and Objectives: Systematic reviews are widely used to identify the evidence and obtain an overview of the available knowledge on questions related to public health and medicine. They can summarize all the available data and support knowledge-based decisions about policy, practice, and academic research. Conducting systematic reviews is often time-consuming and costly.

Methods: We have developed a command-line tool in R to automatically extract data from full-text scientific papers. ExtractPDF is a data extraction tool that provides a reliable computational workflow for extracting words or combinations of words from large numbers of portable document format (PDF) files.
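
The core idea of searching text for words or word combinations can be illustrated with a minimal sketch. It is written in Python for illustration only; ExtractPDF itself is written in R and its actual implementation is not shown in this abstract. The sketch assumes the text has already been extracted from each PDF, and the names `find_terms`, `extract_table`, the file names, and the term groups are all hypothetical.

```python
import re

def find_terms(text, terms):
    """Count case-insensitive whole-phrase matches of each term in text."""
    # Collapse whitespace so phrases split across line breaks still match.
    normalized = re.sub(r"\s+", " ", text)
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, normalized, flags=re.IGNORECASE))
    return counts

def extract_table(documents, term_groups):
    """Build one row per (file, information type) listing the terms found."""
    rows = []
    for filename, text in sorted(documents.items()):
        for info_type, terms in term_groups.items():
            counts = find_terms(text, terms)
            found = sorted(t for t, n in counts.items() if n > 0)
            rows.append({"file": filename, "type": info_type, "found": found})
    return rows

# Hypothetical inputs: text already extracted from two PDF files.
documents = {
    "paper_a.pdf": "Rats were exposed by oral\ngavage. An in vitro assay was also used.",
    "paper_b.pdf": "Zebrafish embryos were exposed via water.",
}
# Hypothetical information types and search terms.
term_groups = {
    "species": ["rats", "zebrafish", "mice"],
    "method": ["in vitro", "oral gavage"],
}
table = extract_table(documents, term_groups)
for row in table:
    print(row)
```

Normalizing whitespace before matching is one simple way to handle phrases broken across lines, which is common in text taken from PDF files.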

Results: The software was applied to extract information from 299 papers that had been screened and included in a published systematic scoping review in the field of risk assessment in public health. The software outputs tables of extracted information, organized by type of information of interest and by PDF file. During the data extraction stage, the tables served as a second reviewer alongside a human reviewer, to assist with and/or validate the extracted data items.
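
Using tool output as a second reviewer amounts to comparing two sets of extracted items and flagging disagreements for follow-up. The sketch below shows one way this could look; it is an assumption for illustration, not the procedure used in the study, and the function name `compare_extractions` and the example items are hypothetical.

```python
def compare_extractions(tool_items, human_items):
    """Compare items found by a tool with items recorded by a human reviewer."""
    # Lowercase both sides so differences in capitalization are ignored.
    tool_set = {item.lower() for item in tool_items}
    human_set = {item.lower() for item in human_items}
    return {
        "agreed": sorted(tool_set & human_set),
        "tool_only": sorted(tool_set - human_set),    # candidates the human may have missed
        "human_only": sorted(human_set - tool_set),   # candidates the tool may have missed
    }

# Hypothetical extractions for a single paper.
result = compare_extractions(["rats", "In vitro"], ["in vitro", "mice"])
print(result)
```

Items listed under `tool_only` or `human_only` would be the ones worth a second look before finalizing the extraction.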

Conclusions: The ExtractPDF tool has a novel pipeline architecture that automates the extraction of information from unstructured formats such as PDF files. It helped expedite the data extraction stage and reduced both the human resources required and the number of errors. The tool's performance and reliability were found to be very good, with average values of 0.89 for precision, 0.92 for recall, 0.86 for accuracy and 0.91 for F1-score.
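
For readers unfamiliar with these metrics, the sketch below computes them from confusion-matrix counts using their standard definitions. The counts are hypothetical and are not the study's data; the abstract does not state how the reported averages were computed.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, accuracy and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives, 8 true negatives.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```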

DOI: http://dx.doi.org/10.1016/j.cmpb.2025.108962

