PyDAP Revisited: Exploiting OPeNDAP’s Data-Proximate Transformation Tools to Accelerate Scientific Workflows
Presented at: AGU Annual Meeting 2024
Abstract
OPeNDAP is open-source software widely used by the scientific community, government agencies, educational and research institutions, and the private sector to share data freely and efficiently across the web or the commercial cloud. While many existing client APIs can access OPeNDAP-served data, the vast majority either fall short of, or are unaware of, OPeNDAP’s server-side capabilities, such as data-proximate spatial subsetting, which can be specified directly in the URL via Constraint Expressions. Among the many client APIs that can communicate with OPeNDAP servers is PyDAP, an open-source Python package already used internally within xarray to access remote datasets. Despite PyDAP’s popularity, it has lagged behind the newer and continuously evolving Python practices and developments introduced in recent years, resulting in sub-optimal performance in many scenarios.
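As a rough illustration of specifying a subset in the URL, the sketch below assembles a DAP4-style constraint expression using only the Python standard library. It assumes the `dap4.ce` query parameter with inclusive `[start:stride:stop]` index slices; the helper name `dap4_constraint_url`, the server URL, and the variable name `sst` are hypothetical and not taken from the presentation.

```python
from urllib.parse import quote


def dap4_constraint_url(base_url, subsets):
    """Build a URL carrying a DAP4-style constraint expression (hypothetical helper).

    subsets maps a variable path to a list of (start, stop) index ranges,
    one per dimension, with inclusive bounds and a stride of 1.
    """
    terms = []
    for var, ranges in subsets.items():
        # One [start:stride:stop] slice per dimension of the variable.
        slices = "".join(f"[{start}:1:{stop}]" for start, stop in ranges)
        terms.append(f"/{var.lstrip('/')}{slices}")
    # Multiple variables are separated by semicolons in the expression.
    expression = ";".join(terms)
    # Percent-encode the square brackets so the query string is URL-safe.
    return f"{base_url}?dap4.ce={quote(expression, safe='/:;')}"


# Hypothetical server and variable: first time step, indices 10 to 20 of the second dimension.
url = dap4_constraint_url(
    "http://example.org/opendap/data.nc",
    {"sst": [(0, 0), (10, 20)]},
)
print(url)
```

Because the server applies the slices before sending anything, a client that opens such a URL only transfers the requested region rather than the full array.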
In this presentation we describe work done to upgrade and revamp PyDAP's client to enable accelerated workflows, with a focus on archival datasets available through OPeNDAP servers. Broadly speaking, the upgrades to PyDAP’s client fall into three areas: a) complete DAP4 / NetCDF4 compatibility to allow access to hierarchical group data, b) efficient parallelism for aggregating and processing multiple granules/files, and c) upstream integration with xarray, including the newer datatree features. We will show how such developments, combined with data-proximate subsetting via Constraint Expressions in URLs, can lead to speedups of nearly two orders of magnitude in various scientific workflows. Examples include analyzing PACE satellite data and ECCOv4 model output (on its native grid), all while using the familiar and widely popular xarray API. We hope that our efforts to revamp PyDAP, along with appropriate use of data-proximate subsetting via OPeNDAP’s servers, will have a significant positive impact on the scientific community, data democratization, and open science. Lastly, we will outline ongoing development efforts on PyDAP's server for cloud-ready and cloud-optimized data workflows.
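The idea of retrieving many granules in parallel can be sketched generically with a thread pool. This is not PyDAP's implementation; the helper `fetch_granules` and the stand-in `fake_fetch` function are hypothetical, and the granule URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_granules(urls, fetch, max_workers=8):
    """Apply `fetch` to every URL concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `urls`, not completion order.
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real remote read. In practice this could be something like
    # lambda u: xarray.open_dataset(u, engine="pydap"), which needs network access.
    return url.rsplit("/", 1)[-1]


# Placeholder granule URLs on a hypothetical server.
granule_urls = [f"http://example.org/opendap/granule_{i}.nc" for i in range(3)]
results = fetch_granules(granule_urls, fake_fetch)
print(results)
```

Threads suit this pattern because each request spends most of its time waiting on the network, so the waits overlap instead of adding up.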