DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates.

Ben Cardoen, Hanene Ben Yedder, Sieun Lee, Ivan Robert Nabi, Ghassan Hamarneh

Bioinform Adv

Department of Computing Science, Simon Fraser University, 8888 University Dr W, Burnaby, British Columbia V5A1S6, Canada.

Published: June 2023

Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, it takes domain experts time and effort to correct. Poor data curation can also interrupt processing jobs on large computing clusters, causing frustration and delays. We introduce DataCurator, a portable software package that verifies arbitrarily complex datasets of mixed formats and works equally well on clusters and on local systems. Human-readable TOML recipes are converted into executable, machine-verifiable templates, enabling users to easily verify datasets using custom rules without writing code. Recipes can be used to transform and validate data, for pre- or post-processing, selection of data subsets, sampling and aggregation, such as summary statistics. Processing pipelines no longer need to be burdened by laborious data validation, with data curation and validation replaced by human- and machine-verifiable recipes specifying rules and actions. Multithreaded execution ensures scalability on clusters, and existing Julia, R and Python libraries can be reused. DataCurator enables efficient remote workflows, offering integration with Slack and the ability to transfer curated data to clusters using OwnCloud and SCP. Code available at: https://github.com/bencardoen/DataCurator.jl.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10290225	PMC
http://dx.doi.org/10.1093/bioadv/vbad068	DOI Listing

Similar Publications

Time below range alone is insufficient to identify severe hypoglycaemia risk in type 1 diabetes-the critical role of hypoglycaemia awareness: results from the SFDT1 study.

Diabetologia

September 2025

Centre Universitaire de Diabétologie et de ses Complications, AP-HP, Hôpital Lariboisière, Paris, France.

Dulce Canha, Pratik Choudhary, Emmanuel Cosson, Isabela Banu, Sara Barraud

Aims/hypothesis: Severe hypoglycaemia events (SHE) remain frequent in people with type 1 diabetes despite advanced diabetes technologies. We examined whether time below range (TBR) 3.9 mmol/l (70 mg/dl; TBR70) or 3.

Harnessing historical genebank data to accelerate pea breeding.

Theor Appl Genet

September 2025

Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.

Lanique Niels, Jochen Christoph Reif, Lars-Gernot Otto, Vilson Mirdita, Markus Oppermann

The German Federal Ex Situ Genebank for Agricultural and Horticultural Crops (IPK) harbours over 3000 pea plant genetic resources (PGRs), backed up by corresponding information across 16 key agronomic and economical traits. The unbalanced structure and inconsistent format of this historical data has precluded effective leverage of genebank accessions, despite the opportunities contained in its genetic diversity. Therefore, a three-step statistical approach founded in linear mixed models was implemented to enable a rigorous and targeted data curation.

PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modeling and machine learning.

Acta Crystallogr F Struct Biol Commun

October 2025

Science and Technology Facilities Council, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom.

Beatriz Costa-Gomes, Joel Greer, Nikolai Juraschko, James Parkhurst, Jola Mirecka

Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated data sets. Being able to easily access and utilize these is crucial to allow researchers to make optimal use of their research effort.

Published Population Pharmacokinetic Models of Imatinib Perform Poorly on TDM Data from Pediatric Patients.

Target Oncol

September 2025

Department of Drug Design and Pharmacology, University of Copenhagen, Copenhagen, Denmark.

Tianwu Yang, Anna Sofie Buhl Rasmussen, Allan Weimann, Maria Thastrup, Cecilie Utke Rank

Background: Population pharmacokinetic models can potentially provide suggestions for an initial dose and the magnitude of dose adjustment during therapeutic drug monitoring procedures of imatinib. Several population pharmacokinetic models for imatinib have been developed over the last two decades. However, their predictive performance is still unknown when extrapolated to different populations, especially children.

GEMsembler: consensus model assembly and structural comparison of genome-scale metabolic models across tools improve functional performance.

mSystems

September 2025

Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.

Elena K Matveishina, Bartosz J Bartmanski, Sara Benito-Vaquerizo, Maria Zimmermann-Kogadeeva

Genome-scale metabolic models (GEMs) are widely used in systems biology to investigate metabolism and predict perturbation responses. Automatic GEM reconstruction tools generate GEMs with different properties and predictive capacities for the same organism. Since different models can excel at different tasks, combining them can increase metabolic network certainty and enhance model performance.
