gdalxarray cookbook

Recipes for opening real cloud-native and remote datasets. These exercise the full composition of gdalxarray's three open modes with GDAL's virtual-path layer. Many are dataset-specific - paths, regions, codec support, and auth requirements change over time.

For an introduction to the package itself see the README.

Network and config setup

Most recipes here read from anonymous public buckets or HTTPS-served files. A few common configs:

import os

# Public AWS buckets requiring anonymous read:
os.environ.setdefault("AWS_NO_SIGN_REQUEST", "YES")
os.environ.setdefault("AWS_REGION", "us-west-2")  # adjust per bucket

# Skip HEAD probes for /vsicurl/ - cleaner request patterns:
os.environ.setdefault("CPL_VSIL_CURL_USE_HEAD", "NO")

These can also be set per-call via gdal.SetConfigOption(...) from the osgeo.gdal module; environment variables are easier to make reproducible in notebooks.
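
When a notebook mixes buckets in different regions, it can help to scope the environment settings to a single block rather than the whole session. The following is a minimal sketch using only the standard library; `scoped_env` is a hypothetical helper name, not part of gdalxarray or GDAL, and it assumes GDAL picks the values up from the environment at open time.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env(**options):
    """Temporarily set environment variables, restoring prior values on exit."""
    previous = {key: os.environ.get(key) for key in options}
    os.environ.update(options)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)  # variable did not exist before
            else:
                os.environ[key] = old      # put the earlier value back


with scoped_env(AWS_NO_SIGN_REQUEST="YES", AWS_REGION="us-east-1"):
    region_inside = os.environ["AWS_REGION"]
    # xr.open_dataset(...) calls for this bucket would go here
```

Unlike `os.environ.setdefault`, this overrides an existing value for the duration of the block and then restores it.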

NCI THREDDS (rate-limited HTTPS source)

Some institutional servers rate-limit aggressive parallel reads. The recipe is to cap parallelism; we find that 8 workers is safe for NCI THREDDS.

import dask
dask.config.set(scheduler='threads', num_workers=8)
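
The same cap can be applied when reads are driven by hand rather than through dask. This is a sketch with the standard library only: `fetch` is a stand-in that sleeps instead of making a request, and the counter exists purely to show that no more than 8 calls overlap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # the cap quoted above for NCI THREDDS

_lock = threading.Lock()
_active = 0
peak_active = 0


def fetch(item):
    """Stand-in for one remote read; records how many calls overlap."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # placeholder for network latency
    with _lock:
        _active -= 1
    return item * 2


# The pool size is the parallelism cap: at most MAX_WORKERS fetches run at once.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(fetch, range(32)))
```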

NOAA OISST sea surface temperature

A single daily file over plain HTTPS, no auth.

import xarray as xr

ds = xr.open_dataset(
    "/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-"
    "optimum-interpolation/v2.1/access/avhrr/202501/"
    "oisst-avhrr-v02r01.20250103.nc",
    engine="gdalxarray", multidim=True,
)

Compact daily product, useful for quick-look examples and as a non-throttled HTTPS sanity check.


CMEMS sea level via virtualized Zarr (Pawsey)

Copernicus Marine Service NRT sea-level product, virtualized as Zarr behind a Pawsey HTTPS proxy.

import xarray as xr

url = (
    "ZARR:\"/vsicurl/https://s3.waw3-1.cloudferro.com/mdl-arco-time-045/"
    "arco/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_"
    "allsat-l4-duacs-0.125deg_P1D_202411/timeChunked.zarr\""
)
ds = xr.open_dataset(url, engine="gdalxarray", multidim=True)
ds["adt"].sel(time=slice("2024-06-01", "2024-06-10")).mean(dim="time").values

The ZARR:"..." prefix wraps a /vsicurl/ HTTPS URL - GDAL's virtualization composes through.


BRAN2023 ocean reanalysis via kerchunk-Parquet

The Australian Bureau of Meteorology + CSIRO + AAD ocean reanalysis, ~135 TB of NetCDF on NCI THREDDS, virtualized as Zarr via a kerchunk-Parquet reference manifest hosted at Pawsey. Per-variable yearly manifests; daily 4-D ocean fields.

import xarray as xr

import dask
dask.config.set(scheduler='threads', num_workers=8)

url = (
    "vrt:///vsicurl/https://projects.pawsey.org.au/aad-index/"
    "vzarr/ocean_temp_2023.parq"
)
ds = xr.open_dataset(
    url, engine="gdalxarray", multidim=True,
    drop_variables=["Time_bnds", "DT", "average_DT", "average_T1", "average_T2"],
)

# Date-string slicing over a virtualized store
ds.sel(Time="2010-01-30").temp.values

The store opens in seconds; .values triggers the actual byte-range reads via kerchunk's reference manifest.


ECMWF AIFS forecast via Icechunk on S3

ECMWF's AIFS AI weather forecasts, published as an Icechunk store on anonymous S3 by Earthmover/dynamical.org. ~14 TB across 21 variables, 6-hourly initialisations, CC-BY-4.0.

import os
import xarray as xr

os.environ.setdefault("AWS_NO_SIGN_REQUEST", "YES")
os.environ.setdefault("AWS_REGION", "us-west-2")

ds = xr.open_dataset(
    "/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/"
    "v0.1.0.icechunk",
    engine="gdalxarray", multidim=True,
)

# Single-point wind at Hobart, first forecast hour of first init time:
hobart_u10 = (
    ds["wind_u_10m"]
    .isel(init_time=0, lead_time=0)
    .sel(latitude=-42.9, longitude=147.3, method="nearest")
    .values
)

Requires a GDAL build with the Icechunk driver compiled in. The Icechunk driver is in active development; check your GDAL build's capabilities with gdalinfo --formats | grep -i icechunk.
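
The same check can be scripted so a notebook fails early with a clear message. A minimal sketch, assuming `gdalinfo` is on the PATH; both function names are hypothetical, and the filter is a plain case-insensitive line match, the same thing `grep -i` does.

```python
import subprocess


def gdal_formats_text():
    """Return the raw output of `gdalinfo --formats` (requires GDAL on PATH)."""
    result = subprocess.run(
        ["gdalinfo", "--formats"], capture_output=True, text=True, check=True
    )
    return result.stdout


def matching_formats(formats_text, needle):
    """Case-insensitive line filter, equivalent to `grep -i needle`."""
    needle = needle.lower()
    return [ln.strip() for ln in formats_text.splitlines() if needle in ln.lower()]


# Usage (not run here, since it needs a GDAL install):
#   if not matching_formats(gdal_formats_text(), "icechunk"):
#       raise RuntimeError("this GDAL build has no Icechunk driver")
```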


Earthmover ERA5 (Icechunk on S3, partially readable)

ECMWF ERA5 reanalysis published by Earthmover as an Icechunk store. Coordinate and mask variables open cleanly; data variables currently fail because they use the numcodecs.pcodec codec, which GDAL doesn't yet support.

import os
import xarray as xr

os.environ.setdefault("AWS_NO_SIGN_REQUEST", "YES")
os.environ.setdefault("AWS_REGION", "us-east-1")

# Note the nested group syntax via the /vsiicechunk/ VFS:
url = (
    "/vsiicechunk/{/vsis3/earthmover-icechunk-era5/icechunkV2}"
    "/pressure/spatial"
)
ds = xr.open_dataset(url, engine="gdalxarray", multidim=True)

# `ds` will be a Dataset with all dimensions and coordinate variables,
# but reading data variables raises:
#   RuntimeError: Unsupported codec: numcodecs.pcodec

This is a gdalxarray limitation inherited from GDAL. It is tracked upstream at osgeo/gdal - once GDAL adds pcodec support, this recipe should work without changes here.
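
For a partially readable store it is handy to find out up front which variables decode. The sketch below reads a single element from each data variable and records the ones that raise. `probe_variables` is a hypothetical helper, and it assumes the codec failure surfaces as a `RuntimeError`, as in the comment above.

```python
def probe_variables(ds):
    """Try reading one element of each data variable.

    Returns (ok, failed): a list of readable names and a dict mapping
    each unreadable name to its error message.
    """
    ok, failed = [], {}
    for name, var in ds.data_vars.items():
        try:
            # Index 0 along every dimension keeps the read as small as possible.
            var.isel({dim: 0 for dim in var.dims}).values
        except RuntimeError as err:
            failed[name] = str(err)
        else:
            ok.append(name)
    return ok, failed
```

The keys of `failed` can then be passed as `drop_variables=` when reopening the store.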


Single 2D slice of a multidim Icechunk store

GDAL supports a composable VRT-style path syntax for picking a single 2D face out of an N-D Icechunk store, usable from any classic GDAL tool:

import xarray as xr

# init_time=0, lead_time=0, latitude x longitude of one variable:
url = (
    'ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/'
    'ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/wind_u_10m:{0}:{0}'
)
ds = xr.open_dataset(url, engine="gdalxarray", multidim=False)
# ds["band_data"] has shape (1, 721, 1440): one band, y is latitude, x is longitude

This composition lets gdal_translate, gdalwarp, etc. consume specific forecast frames as if they were GeoTIFFs.
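
Building these paths by hand is error-prone because of the nested quotes and braces. A small string builder, sketched here, reproduces the path used above. `zarr_slice_path` is a hypothetical name, and reading the `{i}` entries as indices into the leading dimensions is inferred from the recipe's comment rather than from GDAL documentation.

```python
def zarr_slice_path(store, array, indices):
    """Build ZARR:"<store>":/<array>:{i}:{j}... from its parts.

    store   - inner path, e.g. a /vsiicechunk/{...} path
    array   - array name inside the store
    indices - one integer per leading dimension to fix
    """
    suffix = "".join(f":{{{int(i)}}}" for i in indices)
    return f'ZARR:"{store}":/{array}{suffix}'


store = (
    "/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/"
    "ecmwf-aifs-single-forecast/v0.1.0.icechunk}"
)
url = zarr_slice_path(store, "wind_u_10m", [0, 0])
```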


GEBCO bathymetry (Cloud-Optimised GeoTIFF over HTTPS)

A 14 GB COG of global bathymetry, served via Pawsey's HTTPS endpoint. Classic raster mode; gdalxarray will lazily read only the windows you ask for.

import xarray as xr

ds = xr.open_dataset(
    "/vsicurl/https://projects.pawsey.org.au/idea-gebco-tif/GEBCO_2024.tif",
    engine="gdalxarray", multidim=False,
)
# Window around Hobart:
hobart_box = ds["band_data"].sel(
    x=slice(146.5, 148.5),
    y=slice(-42.5, -43.5),
)
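
Note that the y slice runs from north to south. Label-based slicing follows the order of the coordinate, and the recipe above implies a north-up raster whose y values decrease. A hypothetical helper that builds both slices from a bounding box, under that assumption:

```python
def bbox_slices(west, east, south, north, y_descending=True):
    """Return (x_slice, y_slice) for label-based .sel() on a regular grid.

    y_descending=True matches a north-up raster, where the first row is
    the northernmost one and y labels decrease down the array.
    """
    x = slice(min(west, east), max(west, east))
    lo, hi = min(south, north), max(south, north)
    y = slice(hi, lo) if y_descending else slice(lo, hi)
    return x, y


x_sel, y_sel = bbox_slices(146.5, 148.5, -43.5, -42.5)
# hobart_box = ds["band_data"].sel(x=x_sel, y=y_sel)
```

If a window comes back empty, the y order is the first thing to check.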

Patterns worth knowing

Composing virtual paths: all GDAL virtual paths can be nested. ZARR:"/vsis3/bucket/...store.zarr" opens a Zarr inside an S3 bucket. ZARR:"/vsicurl/https://.../store.zarr" opens one served over HTTPS. ZARR:"/vsiicechunk/{/vsis3/.../v0.icechunk}" opens an Icechunk via its virtual filesystem.

Picking subdatasets: NETCDF:path:var and vrt://path?sd_name=var both work for picking a single subdataset from a multi-variable file. The vrt:// form composes with other GDAL options (resampling, output size).

Checking driver name post-open: ds.encoding["gdal_driver"] is the driver GDAL used. Useful for distinguishing Zarr from Icechunk from NetCDF when the same data is accessible multiple ways.

Skip a known-bad variable at open: drop_variables=["X", "Y"]. Faster than open-then-drop because the BackendArray is never built.
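
The path patterns above can be sketched as small string builders, which makes the nesting explicit. All function names are hypothetical and nothing here calls GDAL. The `r` option in the usage comment is an assumed example of a vrt:// option - check which ones your GDAL version accepts.

```python
def vsicurl(url):
    """Wrap an HTTP(S) URL for GDAL's curl-based virtual filesystem."""
    return f"/vsicurl/{url}"


def vsis3(bucket, key):
    """Address an object or prefix in an S3 bucket."""
    return f"/vsis3/{bucket}/{key}"


def vsiicechunk(inner):
    """Wrap an inner path in the /vsiicechunk/{...} form used above."""
    return f"/vsiicechunk/{{{inner}}}"


def zarr(inner):
    """Prefix an inner path with the ZARR:"..." connection syntax."""
    return f'ZARR:"{inner}"'


def netcdf_subdataset(path, var):
    """NETCDF:path:var form for one variable of a multi-variable file."""
    return f"NETCDF:{path}:{var}"


def vrt_subdataset(path, var, **options):
    """vrt://path?sd_name=var, with any extra options appended as key=value."""
    pairs = [f"sd_name={var}"] + [f"{k}={v}" for k, v in options.items()]
    return f"vrt://{path}?" + "&".join(pairs)


era5 = zarr(vsiicechunk(vsis3("earthmover-icechunk-era5", "icechunkV2")))
# vrt_subdataset("file.nc", "sst", r="bilinear") -> adds "&r=bilinear"
```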