1. Accessing cloud-hosted ITS_LIVE data

Introduction

This notebook demonstrates how to query and access cloud-hosted Inter-mission Time Series of Land Ice Velocity and Elevation (ITS_LIVE) data from Amazon Web Services (AWS) S3 buckets. These data are stored as Zarr data cubes, a cloud-optimized format for array data. They are read into memory as Xarray Datasets.

Note

This tutorial was updated in January 2025 to reflect changes to ITS_LIVE data URLs and various software libraries.

Outline

A. Read ITS_LIVE data from AWS S3 using Xarray

    1. Overview of ITS_LIVE data storage and catalog

    2. Read ITS_LIVE data from S3 storage into memory

    3. Check spatial footprint of data

B. Query ITS_LIVE catalog

    1. Find ITS_LIVE granule for a point of interest

    2. Read + visualize spatial footprint of ITS_LIVE data

C. Overview of ITS_LIVE data


Learning goals

Concepts

  • Query and access cloud-optimized dataset from cloud object storage

  • Create a vector data object representing the footprint of a raster dataset

  • Preliminary visualization of data extent

Techniques

Expand the next cell to see specific packages used in this notebook and relevant system and version information.

%xmode minimal

from typing import Union

import geopandas as gpd
import hvplot.pandas
import s3fs
import xarray as xr
from shapely.geometry import Point, Polygon
Exception reporting mode: Minimal
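
The cell above lists the packages imported in this notebook. If you also want to capture the version of each library (an optional addition, not part of the original cell), one way is via importlib.metadata:

# Optional: record the versions of the main libraries for reproducibility
import importlib.metadata as im

for pkg in ['xarray', 'geopandas', 'hvplot', 's3fs', 'zarr', 'dask']:
    try:
        print(pkg, im.version(pkg))
    except im.PackageNotFoundError:
        print(pkg, 'not installed')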

A. Read ITS_LIVE data from AWS S3 using Xarray

1) Overview of ITS_LIVE data storage and catalog

The ITS_LIVE project details a number of data access options on their website. Here, we will be accessing ITS_LIVE data in the form of Zarr data cubes that are stored in S3 buckets hosted by Amazon Web Services (AWS).

Let’s begin by looking at the GeoJSON data cube catalog (available at https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json). This catalog contains spatial information and properties for each ITS_LIVE data cube, as well as the URL used to access that cube. Let’s take a look at the entry for a single data cube and the information that it contains:

[Figure: an entry from the ITS_LIVE data cube catalog, showing the cube's spatial extent and its properties]

The top portion of the image shows the spatial extent of the data cube in lat/lon units. Below that are properties such as the EPSG code of the coordinate reference system, the spatial footprint in projected units, and the URL of the Zarr object.
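
If you prefer to inspect these properties programmatically rather than in a text editor, a minimal sketch is to read the catalog with GeoPandas (this is the same catalog URL used later in this notebook) and look at a single entry:

# Read the data cube catalog and inspect the properties of one entry
catalog = gpd.read_file('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json')

print(catalog.columns.tolist())   # available properties, including the zarr_url of each cube
catalog.iloc[[0]]                 # the first catalog entry as a one-row GeoDataFrame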

Let’s take a closer look at the URL:

[Figure: the URL of an ITS_LIVE Zarr data cube, annotated to show its components]

From this URL we can see that we are looking at ITS_LIVE data located in an S3 bucket hosted by AWS. We can see that we’re looking in the datacubes directory, at version 2 of the data. The next piece (N40E080) tells us the approximate global location of the cube. The file name itself, ITS_LIVE_vel_EPSG32645_G0120_X250000_Y4750000.zarr, tells us that we are looking at ice velocity data (ITS_LIVE also has elevation data) in the CRS associated with EPSG:32645 (this code indicates UTM zone 45N). X250000_Y4750000 tells us more about the spatial footprint of the data cube within that UTM zone.
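
To make the naming convention concrete, here is a small, purely illustrative snippet that pulls the EPSG code and grid offsets out of a file name like the one above. The parsing is an assumption based on the pattern described here, not an official ITS_LIVE utility:

# Illustrative only: split an ITS_LIVE data cube file name into its components
example_url = 'http://its-live-data.s3.amazonaws.com/datacubes/v2/N40E080/ITS_LIVE_vel_EPSG32645_G0120_X250000_Y4750000.zarr'

filename = example_url.split('/')[-1]             # ITS_LIVE_vel_EPSG32645_G0120_X250000_Y4750000.zarr
parts = filename.replace('.zarr', '').split('_')  # ['ITS', 'LIVE', 'vel', 'EPSG32645', 'G0120', 'X250000', 'Y4750000']

epsg_code = int(parts[3].replace('EPSG', ''))     # 32645 -> UTM zone 45N
x_offset, y_offset = parts[5], parts[6]           # 'X250000', 'Y4750000'
print(epsg_code, x_offset, y_offset)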

2) Read ITS_LIVE data from S3 storage into memory

We’ve found the URL associated with the tile we want to access; let’s try to open the data cube using xr.open_dataset():

url = 'http://its-live-data.s3.amazonaws.com/datacubes/v2/N30E090/ITS_LIVE_vel_EPSG32646_G0120_X750000_Y3350000.zarr'


dc = xr.open_dataset(url, decode_timedelta=True) 
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>its-live-data.s3-us-west-2.amazonaws.com</Endpoint><Bucket>its-live-data</Bucket><RequestId>4HK62KBQ4G8PSQ19</RequestId><HostId>LWVOX3GLVAh5U47Zgww1YyjDbweZ0yttI1WsU3rGhAkNv8nnB/zfTqGY9gRdOY8fr0gBRlGWC/4=</HostId></Error>
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('http://its-live-data.s3.amazonaws.com/datacubes/v2/N30E090/ITS_LIVE_vel_EPSG32646_G0120_X750000_Y3350000.zarr',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'f9076f5d-9102-4a73-9e43-2aaface595c2']


During handling of the above exception, another exception occurred:

OSError: [Errno -72] NetCDF: Malformed or inaccessible DAP2 DDS or DAP4 DMR response: 'http://its-live-data.s3.amazonaws.com/datacubes/v2/N30E090/ITS_LIVE_vel_EPSG32646_G0120_X750000_Y3350000.zarr'

As you can see, this doesn’t quite work. If a backend isn’t specified, xr.open_dataset() expects a netCDF file. Because we’re trying to open a Zarr data cube, we need to pass an additional engine='zarr' argument to xr.open_dataset(), shown in the next code cell. You can find more information here.

In the following cell we also pass chunks='auto', which introduces Dask into our workflow. With chunks='auto', chunk sizes are chosen to match the underlying data structure; this is generally a good default, but sometimes you may need to specify a different chunking scheme. You can read more about choosing good chunk sizes here; subsequent notebooks in this tutorial will explore different approaches to Dask chunking.

dc = xr.open_dataset(url, engine='zarr', chunks='auto', decode_timedelta=True)
                                   # storage_options={'anon': True} <-- as of Fall 2023 this is no longer needed
dc
<xarray.Dataset> Size: 771GB
Dimensions:                     (mid_date: 25243, y: 833, x: 833)
Coordinates:
  * mid_date                    (mid_date) datetime64[ns] 202kB 2022-06-07T04...
  * x                           (x) float64 7kB 7.001e+05 7.003e+05 ... 8e+05
  * y                           (y) float64 7kB 3.4e+06 3.4e+06 ... 3.3e+06
Data variables: (12/60)
    M11                         (mid_date, y, x) float32 70GB dask.array<chunksize=(25243, 30, 30), meta=np.ndarray>
    M11_dr_to_vr_factor         (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    M12                         (mid_date, y, x) float32 70GB dask.array<chunksize=(25243, 30, 30), meta=np.ndarray>
    M12_dr_to_vr_factor         (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    acquisition_date_img1       (mid_date) datetime64[ns] 202kB dask.array<chunksize=(25243,), meta=np.ndarray>
    acquisition_date_img2       (mid_date) datetime64[ns] 202kB dask.array<chunksize=(25243,), meta=np.ndarray>
    ...                          ...
    vy_error_modeled            (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    vy_error_slow               (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    vy_error_stationary         (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    vy_stable_shift             (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    vy_stable_shift_slow        (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
    vy_stable_shift_stationary  (mid_date) float32 101kB dask.array<chunksize=(25243,), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.8
    GDAL_AREA_OR_POINT:         Area
    author:                     ITS_LIVE, a NASA MEaSUREs project (its-live.j...
    autoRIFT_parameter_file:    http://its-live-data.s3.amazonaws.com/autorif...
    datacube_software_version:  1.0
    date_created:               25-Sep-2023 22:00:23
    ...                         ...
    s3:                         s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    skipped_granules:           s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    time_standard_img1:         UTC
    time_standard_img2:         UTC
    title:                      ITS_LIVE datacube of image pair velocities
    url:                        https://its-live-data.s3.amazonaws.com/datacu...

This one worked! Let’s stop here and define a function that we can use to read additional S3 objects into memory as Xarray Datasets. This will come in handy later in this notebook and in subsequent notebooks.

def read_in_s3(http_url: str, chunks: Union[None, str, dict] = 'auto') -> xr.Dataset:
    """I'm a function that takes a URL pointing to the location of a Zarr data cube
    and returns an Xarray Dataset. I take an optional chunks argument that specifies
    how the data will be chunked when read into memory."""

    # As of Fall 2023, http URLs can be passed directly to xr.open_dataset();
    # converting to an s3:// URL and passing storage_options={'anon': True} is no longer needed.
    datacube = xr.open_dataset(http_url, engine='zarr', chunks=chunks)

    return datacube

3) Check spatial footprint of data

We just read in a very large dataset. We’d like an easy way to visualize the footprint of this data and confirm that we specified the correct location, without plotting a data variable over the entire footprint, which would be much more computationally and time-intensive. The following function creates a geopandas.GeoDataFrame describing the spatial footprint of an xr.Dataset.

def get_bounds_polygon(input_xr: xr.Dataset) -> gpd.GeoDataFrame:
    """I'm a function that takes an Xarray Dataset and returns a GeoPandas GeoDataFrame
    of the bounding box of the Xarray Dataset."""

    xmin = input_xr.coords['x'].data.min()
    xmax = input_xr.coords['x'].data.max()

    ymin = input_xr.coords['y'].data.min()
    ymax = input_xr.coords['y'].data.max()

    pts_ls = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]

    crs = f"epsg:{input_xr.mapping.spatial_epsg}"

    polygon_geom = Polygon(pts_ls)
    polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[polygon_geom])

    return polygon

Now let’s take a look at the cube we’ve already read:

bbox = get_bounds_polygon(dc)

get_bounds_polygon() returns a geopandas.GeoDataFrame object in the same projection as the velocity data object (local UTM). Re-project to latitude/longitude to view the object more easily on a map:

bbox = bbox.to_crs('EPSG:4326')

To visualize the footprint, we use the interactive plotting library, hvPlot.

poly = bbox.hvplot(legend=True,alpha=0.3, tiles='ESRI', color='red', geo=True)
poly

B. Query ITS_LIVE catalog

1) Find ITS_LIVE granule for a point of interest

Let’s look in a different region and see how we could search the ITS_LIVE data cube catalog for the granule that covers our location of interest. There are many ways to do this; what follows is just one example.

First, we read in the catalog geojson file:

itslive_catalog = gpd.read_file('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json')

Below is a function that queries the catalog for the S3 URL of the granule covering a given point. You could easily tweak this function (or write your own!) to select granules based on different properties; a variant along those lines is sketched after the function below. Play around with the itslive_catalog object to become more familiar with the data it contains and different options for indexing.

Note: since this tutorial was originally written, the ITS_LIVE Python Client was released. This is a great way to access ITS_LIVE data cubes.

fs = s3fs.S3FileSystem(anon=True)
fs
<s3fs.core.S3FileSystem at 0x7fe871fd1e50>
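
The filesystem object isn’t required for opening cubes via their HTTP URLs, but it is a convenient way to browse the bucket directly. For example (the path below is taken from the data cube URLs used earlier):

# List a few entries in the version-2 data cube directory (anonymous access)
fs.ls('its-live-data/datacubes/v2')[:5]
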
def find_granule_by_point(input_point: list) -> str:
    """I take a point in [lon, lat] format and return the URL of the granule containing
    the specified point. The point must be passed in EPSG:4326."""

    catalog = gpd.read_file('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json')

    # make a shapely Point from the input point
    p = gpd.GeoSeries([Point(input_point[0], input_point[1])], crs='EPSG:4326')
    # make a GeoDataFrame of the point
    gdf = gpd.GeoDataFrame({'label': 'point',
                            'geometry': p})
    # find the catalog row whose footprint contains the point
    granule = catalog.sjoin(gdf, how='inner')

    url = granule['zarr_url'].values[0]
    return url
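
As one example of adapting the query, here is a hedged sketch of a variant that returns every data cube whose footprint intersects a lon/lat bounding box rather than a single point (find_granules_by_bbox is a hypothetical helper, not part of the ITS_LIVE tooling):

def find_granules_by_bbox(min_lon: float, min_lat: float, max_lon: float, max_lat: float) -> list:
    """Hypothetical variant: return the zarr_url of every data cube whose footprint
    intersects the given lon/lat bounding box (EPSG:4326)."""

    catalog = gpd.read_file('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json')

    # build a polygon for the bounding box and wrap it in a GeoDataFrame
    region = Polygon([(min_lon, min_lat), (max_lon, min_lat),
                      (max_lon, max_lat), (min_lon, max_lat)])
    gdf = gpd.GeoDataFrame({'label': ['region']}, geometry=[region], crs='EPSG:4326')

    # spatial join keeps every catalog entry that intersects the box
    granules = catalog.sjoin(gdf, how='inner')
    return granules['zarr_url'].tolist()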

Choose a location in Alaska:

url = find_granule_by_point([-138.958776, 60.748561])
url
'http://its-live-data.s3.amazonaws.com/datacubes/v2-updated-october2024/N60W130/ITS_LIVE_vel_EPSG3413_G0120_X-3250000_Y250000.zarr'

Great, this function returned a single URL corresponding to the data cube covering the point we supplied. Let’s use the read_in_s3 function we defined to open the data cube as an xarray.Dataset.

datacube = read_in_s3(url)

2) Read + visualize spatial footprint of ITS_LIVE data

Use the get_bounds_polygon function to take a look at the footprint, visualized with hvplot.

bbox_dc = get_bounds_polygon(datacube)
poly = bbox_dc.to_crs('EPSG:4326').hvplot(legend=True, alpha=0.5, tiles='ESRI', color='red', geo=True)
poly

C. Overview of ITS_LIVE data

Let’s briefly take a look at this ITS_LIVE time series object within the context of the Xarray data model. If you’re new to working with Xarray, the Data Structures documentation is very useful for getting the hang of the different components that are the building blocks of xarray.Dataset objects.

datacube
<xarray.Dataset> Size: 4TB
Dimensions:                     (mid_date: 138421, y: 834, x: 834)
Coordinates:
  * mid_date                    (mid_date) datetime64[ns] 1MB 2020-03-16T08:4...
  * x                           (x) float64 7kB -3.3e+06 -3.3e+06 ... -3.2e+06
  * y                           (y) float64 7kB 2.999e+05 2.998e+05 ... 2e+05
Data variables: (12/60)
    M11                         (mid_date, y, x) float32 385GB dask.array<chunksize=(40000, 20, 20), meta=np.ndarray>
    M11_dr_to_vr_factor         (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    M12                         (mid_date, y, x) float32 385GB dask.array<chunksize=(40000, 20, 20), meta=np.ndarray>
    M12_dr_to_vr_factor         (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    acquisition_date_img1       (mid_date) datetime64[ns] 1MB dask.array<chunksize=(138421,), meta=np.ndarray>
    acquisition_date_img2       (mid_date) datetime64[ns] 1MB dask.array<chunksize=(138421,), meta=np.ndarray>
    ...                          ...
    vy_error_modeled            (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    vy_error_slow               (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    vy_error_stationary         (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    vy_stable_shift             (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    vy_stable_shift_slow        (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
    vy_stable_shift_stationary  (mid_date) float32 554kB dask.array<chunksize=(138421,), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.8
    GDAL_AREA_OR_POINT:         Area
    author:                     ITS_LIVE, a NASA MEaSUREs project (its-live.j...
    autoRIFT_parameter_file:    http://its-live-data.s3.amazonaws.com/autorif...
    datacube_software_version:  1.0
    date_created:               25-Sep-2023 22:54:32
    ...                         ...
    s3:                         s3://its-live-data/datacubes/v2/N60W130/ITS_L...
    skipped_granules:           s3://its-live-data/datacubes/v2/N60W130/ITS_L...
    time_standard_img1:         UTC
    time_standard_img2:         UTC
    title:                      ITS_LIVE datacube of image pair velocities
    url:                        https://its-live-data.s3.amazonaws.com/datacu...

Data structure overview

Dimensions + Coordinates

  • This object has three dimensions, mid_date, x, and y. Each dimension has a corresponding coordinate variable; the short check below confirms this from the object itself.

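A small illustrative check, using only the dimension and coordinate names shown in the repr above:

# Inspect the dimensions and one of the coordinate variables
print(datacube.sizes)              # mapping of dimension name -> length (mid_date, y, x)
print(datacube['mid_date'][:3])    # the first few values of the time coordinate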

Data variables

  • Expanding the ‘Data Variables’ label, you can see that there are many (60!) variables. Each variable exists along one or more dimensions (e.g. (mid_date, x, y)), has an associated data type (e.g. float32), and has an underlying array that holds that variable’s data; the snippet below pulls one of them out.
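
As a small example, M11 is one of the variables listed in the repr above:

# Select a single data variable; it is an xarray.DataArray backed by a lazy Dask array
m11 = datacube['M11']
print(m11.dims, m11.dtype)   # ('mid_date', 'y', 'x') float32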

Attributes

  • Whereas data variables are by definition dimensional, there is usually a collection of scalar metadata that informs how data variables should be manipulated and interpreted. Scalar metadata is stored in attributes. All array-based Xarray objects (data variables, coordinate variables, DataArrays and Datasets) can have attributes attached to them.
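
For example, the dataset-level attributes shown in the repr above live in a plain Python dictionary:

# Dataset-level attributes are stored in the .attrs dictionary
print(datacube.attrs['title'])    # 'ITS_LIVE datacube of image pair velocities'
print(list(datacube.attrs)[:5])   # the first few attribute names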


Conclusion

This notebook demonstrated how to query and access a cloud-optimized remote sensing time series dataset stored in an AWS S3 bucket. The subsequent notebooks in this tutorial will go into much more detail on how to organize, examine, and analyze these data.