J Comput Graph Stat
July 2025
Building spatial process models that capture nonstationary behavior while delivering computationally efficient inference is challenging. Nonstationary spatially varying kernels (see, e.g.
View Article and Find Full Text PDFHigh-dimensional datasets, where the number of variables ' ' is much larger than the number of samples ' ', are ubiquitous and often render standard classification techniques unreliable due to overfitting. An important research problem is feature selection, which ranks candidate variables based on their relevance to the outcome variable and retains those that satisfy a chosen criterion. This article proposes a computationally efficient variable selection method based on principal component analysis tailored to a binary classification problem or case-control study.
View Article and Find Full Text PDFThe selection of best variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal components analysis and partial least squares. Both approaches are popular linear dimension-reduction methods with numerous applications in several fields including in genomics, biology, environmental science, and engineering.
View Article and Find Full Text PDFAdv Exp Med Biol
November 2024
We present a gentle introduction to elementary mathematical notation with the focus of communicating deep learning principles. This is a "math crash course" aimed at quickly enabling scientists with understanding of the building blocks used in many equations, formulas, and algorithms that describe deep learning. While this short presentation cannot replace solid mathematical knowledge that needs multiple courses and years to solidify, our aim is to allow nonmathematical readers to overcome hurdles of reading texts that also use such mathematical notation.
View Article and Find Full Text PDFThrough spectral unmixing, hyperspectral imaging (HSI) in fluorescence-guided brain tumor surgery has enabled the detection and classification of tumor regions invisible to the human eye. Prior unmixing work has focused on determining a minimal set of viable fluorophore spectra known to be present in the brain and effectively reconstructing human data without overfitting. With these endmembers, non-negative least squares regression (NNLS) was commonly used to compute the abundances.
View Article and Find Full Text PDFNAR Genom Bioinform
September 2023
Cross-phenotype association using gene-set analysis can help to detect pleiotropic genes and inform about common mechanisms between diseases. Although there are an increasing number of statistical methods for exploring pleiotropy, there is a lack of proper pipelines to apply gene-set analysis in this context and using genome-scale data in a reasonable running time. We designed a user-friendly pipeline to perform cross-phenotype gene-set analysis between two traits using GCPBayes, a method developed by our team.
View Article and Find Full Text PDFReal-time monitoring using in-situ sensors is becoming a common approach for measuring water-quality within watersheds. High-frequency measurements produce big datasets that present opportunities to conduct new analyses for improved understanding of water-quality dynamics and more effective management of rivers and streams. Of primary importance is enhancing knowledge of the relationships between nitrate, one of the most reactive forms of inorganic nitrogen in the aquatic environment, and other water-quality variables.
View Article and Find Full Text PDFCompositional data are a special kind of data, represented as a proportion carrying relative information. Although this type of data is widely spread, no solution exists to deal with the cases where the classes are not well balanced. After describing compositional data imbalance, this paper proposes an adaptation of the original Synthetic Minority Oversampling TEchnique (SMOTE) to deal with compositional data imbalance.
View Article and Find Full Text PDFBMC Med Res Methodol
January 2022
Background: Genome-wide association studies (GWAS) have identified genetic variants associated with multiple complex diseases. We can leverage this phenomenon, known as pleiotropy, to integrate multiple data sources in a joint analysis. Often integrating additional information such as gene pathway knowledge can improve statistical efficiency and biological interpretation.
View Article and Find Full Text PDFInt J Environ Res Public Health
December 2021
In situ sensors that collect high-frequency data are used increasingly to monitor aquatic environments. These sensors are prone to technical errors, resulting in unrecorded observations and/or anomalous values that are subsequently removed and create gaps in time series data. We present a framework based on generalized additive and auto-regressive models to recover these missing data.
View Article and Find Full Text PDFBackground: Heterogeneous respiratory system static compliance (C) values and levels of hypoxemia in patients with novel coronavirus disease (COVID-19) requiring mechanical ventilation have been reported in previous small-case series or studies conducted at a national level.
Methods: We designed a retrospective observational cohort study with rapid data gathering from the international COVID-19 Critical Care Consortium study to comprehensively describe C-calculated as: tidal volume/[airway plateau pressure-positive end-expiratory pressure (PEEP)]-and its association with ventilatory management and outcomes of COVID-19 patients on mechanical ventilation (MV), admitted to intensive care units (ICU) worldwide.
Results: We studied 745 patients from 22 countries, who required admission to the ICU and MV from January 14 to December 31, 2020, and presented at least one value of C within the first seven days of MV.
Background: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated to multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level.
View Article and Find Full Text PDFSemi-Markov models are widely used for survival analysis and reliability analysis. In general, there are two competing parameterizations and each entails its own interpretation and inference properties. On the one hand, a semi-Markov process can be defined based on the distribution of sojourn times, often via hazard rates, together with transition probabilities of an embedded Markov chain.
View Article and Find Full Text PDFAn increasing number of genome-wide association studies (GWAS) summary statistics is made available to the scientific community. Exploiting these results from multiple phenotypes would permit identification of novel pleiotropic associations. In addition, incorporating prior biological information in GWAS such as group structure information (gene or pathway) has shown some success in classical GWAS approaches.
View Article and Find Full Text PDFIntroduction: There is a paucity of data that can be used to guide the management of critically ill patients with COVID-19. In response, a research and data-sharing collaborative-The COVID-19 Critical Care Consortium-has been assembled to harness the cumulative experience of intensive care units (ICUs) worldwide. The resulting observational study provides a platform to rapidly disseminate detailed data and insights crucial to improving outcomes.
View Article and Find Full Text PDFEnviron Sci Technol
November 2020
Anomaly detection (AD) in high-volume environmental data requires one to tackle a series of challenges associated with the typical low frequency of anomalous events, the broad-range of possible anomaly types, and local nonstationary environmental conditions, suggesting the need for flexible statistical methods that are able to cope with unbalanced high-volume data problems. Here, we aimed to detect anomalies caused by technical errors in water-quality (turbidity and conductivity) data collected by automated in situ sensors deployed in contrasting riverine and estuarine environments. We first applied a range of artificial neural networks that differed in both learning method and hyperparameter values, then calibrated models using a Bayesian multiobjective optimization procedure, and selected and evaluated the "best" model for each water-quality variable, environment, and anomaly type.
View Article and Find Full Text PDFIdentification of biomarkers is an emerging area in oncology. In this article, we develop an efficient statistical procedure for the classification of protein markers according to their effect on cancer progression. A high-dimensional time-course dataset of protein markers for 80 patients motivates us for developing the model.
View Article and Find Full Text PDFAnticipating future changes of an ecosystem's dynamics requires knowledge of how its key communities respond to current environmental regimes. The Great Barrier Reef (GBR) is under threat, with rapid changes of its reef-building hard coral (HC) community structure already evident across broad spatial scales. While several underlying relationships between HC and multiple disturbances have been documented, responses of other benthic communities to disturbances are not well understood.
View Article and Find Full Text PDFBackground: In medical research, explanatory continuous variables are frequently transformed or converted into categorical variables. If the coding is unknown, many tests can be used to identify the "optimal" transformation. This common process, involving the problems of multiple testing, requires a correction of the significance level.
View Article and Find Full Text PDFIntegrative analysis of high dimensional omics datasets has been studied by many authors in recent years. By incorporating prior known relationships among the variables, these analyses have been successful in elucidating the relationships between different sets of omics data. In this article, our goal is to identify important relationships between genomic expression and cytokine data from a human immunodeficiency virus vaccine trial.
View Article and Find Full Text PDF