Publications by authors named "Abhirup Datta"

Cholera outbreaks cause substantial morbidity and mortality in Africa, yet changes in the geographic distribution of cholera burden over time remain uncharacterized. We used surveillance data and spatial statistical models to estimate the mean annual incidence of reported suspected cholera for 2011-2015 and 2016-2020 on a 20-km grid across Africa. Across 43 countries, mean annual incidence rates remained at 11 cases per 100,000 population, with 125,701 cases estimated annually (95% credible interval (CrI): 124,737-126,717) from 2016 to 2020.

Introduction: Noncommunicable diseases (NCDs) cause substantial mortality worldwide among people under 70 years of age. India has launched a national program, the National Programme for Prevention and Control of NCDs (NPNCD), that mainly targets individuals over 30 years of age. Nearly 200 million Indians are young adults (ages 18-30 years).

Analysis of geospatial data has traditionally been model-based, with a mean model, customarily specified as a linear regression on the covariates, and a Gaussian process covariance model encoding the spatial dependence. While nonlinear machine learning algorithms like neural networks are increasingly being used for spatial analysis, current approaches depart from the model-based setup and cannot explicitly incorporate spatial covariance. We propose embedding neural networks directly within the traditional Gaussian process (GP) geostatistical model to accommodate non-linear mean functions while retaining all other advantages of GP, like explicit modeling of the spatial covariance and predicting at new locations via kriging.
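
The kriging step mentioned above can be illustrated with a minimal sketch: a nonlinear mean prediction plus a spatially interpolated residual. This is not the authors' implementation. The exponential covariance, its parameters, and `mean_fn` (a hypothetical stand-in for a fitted neural network) are all assumptions made for illustration.

```python
import numpy as np

def exp_cov(a, b, sigma2=1.0, phi=2.0):
    """Exponential covariance sigma2 * exp(-phi * distance) between two sets of 2-D locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def mean_fn(X):
    # Hypothetical stand-in for a fitted neural network mean function (assumption).
    return np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

def krige(mean_fn, X_obs, s_obs, y_obs, X_new, s_new, tau2=0.0):
    """Predict at new locations: nonlinear mean plus kriged spatial residual."""
    resid = y_obs - mean_fn(X_obs)                           # residuals at observed sites
    K = exp_cov(s_obs, s_obs) + tau2 * np.eye(len(s_obs))    # covariance among observed sites
    k_new = exp_cov(s_new, s_obs)                            # cross-covariance new vs observed
    return mean_fn(X_new) + k_new @ np.linalg.solve(K, resid)

# Simulated data: 30 sites with two covariates and spatially correlated errors.
rng = np.random.default_rng(0)
s_obs = rng.uniform(size=(30, 2))
X_obs = rng.normal(size=(30, 2))
y_obs = mean_fn(X_obs) + rng.multivariate_normal(np.zeros(30), exp_cov(s_obs, s_obs))

# With no nugget (tau2 = 0), kriging at the observed sites interpolates the data.
pred_at_obs = krige(mean_fn, X_obs, s_obs, y_obs, X_obs, s_obs)
```

Setting `tau2` above zero adds a nugget, in which case the predictions smooth the observations rather than reproduce them.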

The manuscript considers multivariate functional data analysis with a known graphical model among the functional variables representing their conditional relationships (e.g., brain region-level fMRI data with a prespecified connectivity graph among brain regions).

The ever-increasing popularity of machine learning methods in virtually all areas of science, engineering, and beyond is poised to call established statistical modeling approaches into question. Environmental statistics is no exception, as popular constructs such as neural networks and decision trees are now routinely used to provide forecasts of physical processes ranging from air pollution to meteorology. This presents both challenges and opportunities to the statistical community, which could contribute to the machine learning literature with a model-based approach with formal uncertainty quantification.

Article Synopsis
  • This manuscript introduces a novel method for scalar-on-distribution regression, where subject-specific distributions serve as covariates to predict a single outcome, bypassing the need for prior estimation of these distributions.
  • The proposed approach uses observed repeated measures directly as covariates and applies a Gaussian process prior, achieving efficient Bayesian inference without needing intermediate density estimates.
  • The method shows superior performance in simulation studies compared to traditional regression that requires estimating densities first, especially when there are limited repeated measures per subject, and it also accommodates various forms of data dependencies.

We present a new method for constructing valid covariance functions of Gaussian processes for spatial analysis in irregular, non-convex domains such as bodies of water. Standard covariance functions based on geodesic distances are not guaranteed to be positive definite on such domains, while existing non-Euclidean approaches fail to respect the partially Euclidean nature of these domains where the geodesic distance agrees with the Euclidean distances for some pairs of points. Using a visibility graph on the domain, we propose a class of covariance functions that preserve Euclidean-based covariances between points that are connected in the domain while incorporating the non-convex geometry of the domain via conditional independence relationships.

When studying the impact of policy interventions or natural experiments on air pollution, such as new environmental policies or the opening or closing of an industrial facility, careful statistical analysis is needed to separate causal changes from other confounding factors. Using COVID-19 lockdowns as a case study, we present a comprehensive framework for estimating and validating causal changes from such perturbations. We propose using flexible machine learning-based comparative interrupted time series (CITS) models for estimating such a causal effect.

Low-cost air pollution sensors, offering hyper-local characterization of pollutant concentrations, are becoming increasingly prevalent in environmental and public health research. However, low-cost air pollution data can be noisy, biased by environmental conditions, and usually need to be field-calibrated by collocating low-cost sensors with reference-grade instruments. We show, theoretically and empirically, that the common procedure of regression-based calibration using collocated data systematically underestimates high air pollution concentrations, which are critical to diagnose from a health perspective.
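
The underestimation described above can be reproduced in a small simulation. The sketch below is not the authors' analysis. It assumes a linear sensor response with Gaussian noise, and all numeric values are invented for illustration. Regressing the reference values on the noisy sensor readings shrinks predictions toward the mean, so the highest true concentrations are underpredicted on average.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Simulated collocation data (all parameter values are illustrative assumptions).
true_conc = rng.gamma(shape=2.0, scale=10.0, size=n)        # reference-grade values
sensor = 5.0 + 1.2 * true_conc + rng.normal(0.0, 10.0, n)   # noisy low-cost readings

# Regression-based calibration: regress the reference on the sensor reading.
slope, intercept = np.polyfit(sensor, true_conc, deg=1)
calibrated = intercept + slope * sensor

# Average calibration error overall and among the top 10% of true concentrations.
high = true_conc >= np.quantile(true_conc, 0.9)
bias_all = np.mean(calibrated - true_conc)
bias_high = np.mean(calibrated[high] - true_conc[high])

print(f"mean error, all hours:       {bias_all:.3f}")
print(f"mean error, top-decile hours: {bias_high:.3f}")
```

The overall error is zero by construction of least squares, while the error in the top decile is negative, which is the pattern the abstract describes.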

Low-cost air quality monitors are growing in popularity among both researchers and community members to understand variability in pollutant concentrations. Several studies have produced calibration approaches for these sensors for ambient air. These calibrations have been shown to depend primarily on relative humidity, particle size distribution, and particle composition, which may be different in indoor environments.

Graphical models have witnessed significant growth and usage in spatial data science for modeling data referenced over a massive number of spatial-temporal coordinates. Much of this literature has focused on a single or relatively few spatially dependent outcomes. Recent attention has turned to modeling and inference for a substantially larger number of outcomes.

In recent years, the field of neuroimaging has undergone a paradigm shift, moving away from the traditional brain mapping approach towards the development of integrated, multivariate brain models that can predict categories of mental events. However, large interindividual differences in both brain anatomy and functional localization after standard anatomical alignment remain a major limitation in performing this type of analysis, as it leads to feature misalignment across subjects in subsequent predictive models. This article addresses this problem by developing and validating a new computational technique for reducing misalignment across individuals in functional brain systems by spatially transforming each subject's functional data to a common latent template map.

Feature selection to identify spatially variable genes or other biologically informative genes is a key step during analyses of spatially-resolved transcriptomics data. Here, we propose nnSVG, a scalable approach to identify spatially variable genes based on nearest-neighbor Gaussian processes. Our method (i) identifies genes that vary in expression continuously across the entire tissue or within a priori defined spatial domains, (ii) uses gene-specific estimates of length scale parameters within the Gaussian process models, and (iii) scales linearly with the number of spatial locations.
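
The linear scaling in point (iii) comes from the nearest-neighbor Gaussian process idea of conditioning each location on only a small set of previously ordered neighbors. The sketch below shows only that neighbor-set construction and is not nnSVG's code. The function name `neighbor_sets` and the choice of `m = 5` are assumptions for illustration.

```python
import numpy as np

def neighbor_sets(coords, m):
    """For each location i, return indices of up to m nearest locations among 0..i-1."""
    sets = []
    for i in range(len(coords)):
        if i == 0:
            sets.append(np.array([], dtype=int))  # first location has no predecessors
            continue
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        sets.append(np.argsort(d)[:m])            # nearest previously ordered locations
    return sets

rng = np.random.default_rng(2)
coords = rng.uniform(size=(50, 2))   # 50 simulated spatial locations
sets = neighbor_sets(coords, m=5)

# Each conditional term involves at most m neighbors, so the total is at most n * m.
total_neighbors = sum(len(s) for s in sets)
```

Because every location conditions on at most `m` others, the number of terms in the likelihood grows like `n * m` rather than `n ** 2`. The brute-force search used here is itself quadratic, so a practical implementation would use a faster neighbor search.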

Functional magnetic resonance imaging (fMRI) has provided invaluable insight into our understanding of human behavior. However, large inter-individual differences in both brain anatomy and functional localization after standard anatomical alignment remain a major limitation in conducting group analyses and performing population-level inference. This paper addresses this problem by developing and validating a new computational technique for reducing misalignment across individuals in functional brain systems by spatially transforming each subject's functional data to a common reference map.

Low-cost sensors are often co-located with reference instruments to assess their performance and establish calibration equations, but limited discussion has focused on whether the duration of this calibration period can be optimized. We placed a multipollutant monitor that contained sensors that measure particulate matter smaller than 2.5 μm (PM2.5), carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and nitric oxide (NO) at a reference field site for one year.

Historically, the two primary criticisms statisticians have had of machine learning and deep neural models are their lack of uncertainty quantification and their inability to do inference (i.e., to explain what inputs are important).

Low-cost sensors enable finer-scale spatiotemporal measurements within the existing methane (CH4) monitoring infrastructure and could help cities mitigate CH4 emissions to meet their climate goals. While initial studies of low-cost CH4 sensors have shown potential for effective CH4 measurement at ambient concentrations, sensor deployment remains limited due to questions about interferences and calibration across environments and seasons. This study evaluates sensor performance across seasons with specific attention paid to the sensor's understudied carbon monoxide (CO) interferences and environmental dependencies through long-term ambient co-location in an urban environment.

Sub-Saharan Africa lacks timely, reliable, and accurate national data on mortality and causes of death (CODs). In 2018 Mozambique launched a sample registration system (Countrywide Mortality Surveillance for Action [COMSA]-Mozambique), which collects continuous birth, death, and COD data from 700 randomly selected clusters, a nationally representative population of 828,663 persons. Verbal and social autopsy interviews are conducted for COD determination.

Verbal autopsies (VAs) are extensively used to determine cause of death (COD) in many low- and middle-income countries. However, COD determination from VA can be inaccurate. Computer coded verbal autopsy (CCVA) algorithms used for this task are imperfect and misclassify COD for a large proportion of deaths.

The Countrywide Mortality Surveillance for Action platform is collecting verbal autopsy (VA) records from a nationally representative sample in Mozambique. These records are used to estimate the national and subnational cause-specific mortality fractions (CSMFs) for children (1-59 months) and neonates (1-28 days). Cross-tabulation of VA-based cause-of-death (COD) determination against that from the minimally invasive tissue sampling (MITS) from the Child Health and Mortality Prevention project revealed important misclassification errors for all the VA algorithms, which if not accounted for will lead to bias in the estimates of CSMF from VA.
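
To see why misclassification biases VA-based CSMF estimates, consider a toy example with three causes. This is a naive matrix-inversion sketch, not the authors' method. The misclassification matrix `M` and the true fractions are invented numbers, and `M` is treated as known exactly.

```python
import numpy as np

# Hypothetical misclassification matrix: rows = true cause, columns = VA-assigned cause.
# Entry M[i, j] is the probability that a death from cause i is labeled j by VA.
M = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

true_csmf = np.array([0.5, 0.3, 0.2])   # hypothetical true cause-specific fractions

# What VA would report on average: true fractions pushed through the misclassification.
va_csmf = M.T @ true_csmf

# Naive correction: invert the relationship va_csmf = M^T @ true_csmf.
corrected = np.linalg.solve(M.T, va_csmf)

print("uncorrected VA CSMF:", va_csmf)
print("corrected CSMF:     ", corrected)
```

In practice the misclassification rates are estimated from a limited number of paired VA and MITS records, so direct inversion can be unstable and may return negative fractions. That instability is one reason to prefer a model-based correction over this shortcut.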

For multivariate spatial Gaussian process (GP) models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "Graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables.

Background: Low-cost sensor networks for monitoring air pollution are an effective tool for expanding spatial resolution beyond the capabilities of existing state and federal reference monitoring stations. However, low-cost sensor data commonly exhibit non-linear biases with respect to environmental conditions that cannot be captured by linear models, therefore requiring extensive lab calibration. Further, these calibration models traditionally produce point estimates or uniform variance predictions, which limits their downstream use in exposure assessment.

As part of our low-cost sensor network, we colocated multipollutant monitors containing sensors for particulate matter, carbon monoxide, ozone, nitrogen dioxide, and nitrogen monoxide at a reference field site in Baltimore, MD, for 1 year. The first 6 months were used for training multiple regression models, and the second 6 months were used to evaluate the models. The models produced accurate hourly concentrations for all sensors except ozone, which likely requires nonlinear methods to capture peak summer concentrations.
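
The train-then-evaluate design described above can be sketched on simulated data. This is an illustration only, not the study's data or model. The covariates (raw sensor signal, temperature, relative humidity), their coefficients, and the noise levels are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
hours = 24 * 365
t = np.arange(hours)

# Simulated hourly series for one year (all values are illustrative assumptions).
temp = 15 + 10 * np.sin(2 * np.pi * t / hours) + rng.normal(0, 2, hours)
rh = 60 + 15 * np.cos(2 * np.pi * t / hours) + rng.normal(0, 5, hours)
raw = rng.gamma(2.0, 5.0, hours)                      # raw low-cost sensor signal
reference = 1.0 + 0.8 * raw + 0.1 * temp - 0.05 * rh + rng.normal(0, 1.0, hours)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(hours), raw, temp, rh])

# Chronological split: first half for training, second half for evaluation.
split = hours // 2
beta, *_ = np.linalg.lstsq(X[:split], reference[:split], rcond=None)

# Out-of-sample error on the held-out second half.
pred = X[split:] @ beta
rmse = np.sqrt(np.mean((pred - reference[split:]) ** 2))
print(f"held-out RMSE: {rmse:.2f}")
```

Splitting by time rather than at random keeps the evaluation honest about seasonal shift: the held-out months can have conditions the training months never saw, which matters for a pollutant with strong seasonal peaks.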

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit more similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region.
