4. Exploratory data analysis of multiple glaciers

Introduction

The previous notebooks in this tutorial demonstrated how to use Xarray to access, inspect, manipulate, and analyze raster time series data at the scale of an individual glacier. In this notebook, we shift our focus to a sub-regional scale, looking at all of the glaciers within a given ITS_LIVE data cube. This workflow draws on elements from the past notebooks while introducing new tools for examining raster data along temporal and spatial dimensions (i.e., across multiple glaciers).

Outline

A. Read and organize data

    1. Raster data

    2. Vector data

B. Combine raster and vector data to create a vector data cube

    1. Make a vector data cube

    2. Add attribute data to vector cube

    3. Write vector cube to disk

C. Data visualization

    1. Read vector data cube into memory

    2. Visualize velocity data

    3. Visualize associations between velocity and attribute data

Learning goals

Concepts

  • Querying and accessing raster data from cloud object storage

  • Accessing and manipulating vector data

  • Handling coordinate reference information

  • Calculating and visualizing summary statistics

Techniques

  • Accessing cloud-hosted Zarr data cubes using Xarray

  • Reading GeoParquet vector data using GeoPandas

  • Rasterizing vector objects using Geocube

  • Performing spatial joins of vector datasets using GeoPandas

  • Using Dask to work with larger-than-memory data

  • Calculating summary statistics of Xarray and Pandas objects

  • Visualizing data using Pandas

  • Interactively visualizing data with GeoPandas

Expand the next cell to see specific packages used in this notebook and relevant system and version information.

%xmode minimal

import contextily as cx
import dask
from dask.diagnostics import ProgressBar
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rioxarray as rxr
import xarray as xr
import xvec

import itslivetools
    
%config InlineBackend.figure_format='retina'
Exception reporting mode: Minimal

A. Read and organize data

1) Raster data

As in past notebooks, we use itslivetools.find_granule_by_point() to query the ITS_LIVE catalog for the correct URL.

itslive_catalog = gpd.read_file('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json')
url = itslivetools.find_granule_by_point([95.180191, 30.645973])
url
'http://its-live-data.s3.amazonaws.com/datacubes/v2-updated-october2024/N30E090/ITS_LIVE_vel_EPSG32646_G0120_X750000_Y3350000.zarr'
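
As an aside, the lookup performed by itslivetools.find_granule_by_point() amounts to a point-in-polygon query against the catalog. A minimal sketch of what such a function might look like is below; it assumes the catalog stores each cube's URL in a 'zarr_url' column and that its footprints are in lon/lat (EPSG:4326), so the real implementation may differ in detail.

from shapely.geometry import Point

def find_granule_by_point_sketch(point_lonlat, catalog=itslive_catalog):
    """Return the URL of the catalog entry whose footprint contains a (lon, lat) point.

    A sketch only: assumes a 'zarr_url' column and EPSG:4326 footprints;
    the actual itslivetools implementation may differ.
    """
    pt = Point(point_lonlat)                       # shapely point from (lon, lat)
    hits = catalog[catalog.geometry.contains(pt)]  # footprints containing the point
    return None if hits.empty else hits['zarr_url'].iloc[0]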

Following the example shown in the initial data inspection notebook, we first read the data in without Dask so that we can lazily sort the dataset into chronological order.

dc = itslivetools.read_in_s3(url, chunks_arg=None)
dc = dc.sortby('mid_date')
dc
<xarray.Dataset> Size: 1TB
Dimensions:                     (mid_date: 47892, y: 833, x: 833)
Coordinates:
  * mid_date                    (mid_date) datetime64[ns] 383kB 1986-09-11T03...
  * x                           (x) float64 7kB 7.001e+05 7.003e+05 ... 8e+05
  * y                           (y) float64 7kB 3.4e+06 3.4e+06 ... 3.3e+06
Data variables: (12/60)
    M11                         (mid_date, y, x) float32 133GB ...
    M11_dr_to_vr_factor         (mid_date) float32 192kB ...
    M12                         (mid_date, y, x) float32 133GB ...
    M12_dr_to_vr_factor         (mid_date) float32 192kB ...
    acquisition_date_img1       (mid_date) datetime64[ns] 383kB ...
    acquisition_date_img2       (mid_date) datetime64[ns] 383kB ...
    ...                          ...
    vy_error_modeled            (mid_date) float32 192kB ...
    vy_error_slow               (mid_date) float32 192kB ...
    vy_error_stationary         (mid_date) float32 192kB ...
    vy_stable_shift             (mid_date) float32 192kB ...
    vy_stable_shift_slow        (mid_date) float32 192kB ...
    vy_stable_shift_stationary  (mid_date) float32 192kB ...
Attributes: (12/19)
    Conventions:                CF-1.8
    GDAL_AREA_OR_POINT:         Area
    author:                     ITS_LIVE, a NASA MEaSUREs project (its-live.j...
    autoRIFT_parameter_file:    http://its-live-data.s3.amazonaws.com/autorif...
    datacube_software_version:  1.0
    date_created:               25-Sep-2023 22:00:23
    ...                         ...
    s3:                         s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    skipped_granules:           s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    time_standard_img1:         UTC
    time_standard_img2:         UTC
    title:                      ITS_LIVE datacube of image pair velocities
    url:                        https://its-live-data.s3.amazonaws.com/datacu...

Now we add a chunking scheme, converting the underlying Numpy arrays to Dask arrays.

# First, check the preferred chunk sizes
dc['v'].encoding
{'chunks': (20000, 10, 10),
 'preferred_chunks': {'mid_date': 20000, 'y': 10, 'x': 10},
 'compressor': Blosc(cname='zlib', clevel=2, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 'missing_value': -32767,
 'dtype': dtype('int16')}
#Then, chunk dataset
dc = dc.chunk({'mid_date':20000,
               'x':10, 'y':10})
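
As a quick optional check, we can confirm that the data variables are now backed by lazy Dask arrays rather than in-memory NumPy arrays:

# The data variables should now be lazy Dask arrays using the chunking above
print(type(dc['v'].data))        # dask.array.core.Array
print(dc['v'].data.chunksize)    # chunk shape along (mid_date, y, x)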

We’ll resample the time series to 3-month resolution.

dc_resamp = dc.resample(mid_date='3ME').mean()
                        

Create a CRS string from the projection data variable of the data cube (dc). We'll use this later.

crs = f'EPSG:{dc.projection}'
crs
'EPSG:32646'

2) Vector data

se_asia = gpd.read_parquet('../data/rgi7_region15_south_asia_east.parquet')
se_asia.head(3)
rgi_id o1region o2region glims_id anlys_id subm_id src_date cenlon cenlat utm_zone ... zmin_m zmax_m zmed_m zmean_m slope_deg aspect_deg aspect_sec dem_source lmax_m geometry
0 RGI2000-v7.0-G-15-00001 15 15-01 G078088E31398N 866850 752 2002-07-10T00:00:00 78.087891 31.398046 44 ... 4662.2950 4699.2095 4669.4720 4671.4253 13.427070 122.267290 4 COPDEM30 173 POLYGON Z ((78.08905 31.39784 0.00000, 78.0889...
1 RGI2000-v7.0-G-15-00002 15 15-01 G078125E31399N 867227 752 2002-07-10T00:00:00 78.123699 31.397796 44 ... 4453.3584 4705.9920 4570.9473 4571.2770 22.822983 269.669144 7 COPDEM30 1113 POLYGON Z ((78.12556 31.40257 0.00000, 78.1255...
2 RGI2000-v7.0-G-15-00003 15 15-01 G078128E31390N 867273 752 2000-08-05T00:00:00 78.128510 31.390287 44 ... 4791.7593 4858.6807 4832.1836 4827.6700 15.626262 212.719681 6 COPDEM30 327 POLYGON Z ((78.12960 31.39093 0.00000, 78.1296...

3 rows × 29 columns

What coordinate reference system is this dataframe in?

se_asia.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

The vector dataset is in WGS 84, meaning that its coordinates are in degrees latitude and longitude rather than meters N and E. We will project this dataset to match the projection of the raster dataset.

se_asia_prj = se_asia.to_crs(crs)
print(len(se_asia_prj))
18587

The vector dataframe of glacier outlines is very large. For now, we're only interested in glaciers that lie within the footprint of the ITS_LIVE granule we're working with and that are larger than 5 square kilometers in area. To subset the full dataset to those conditions, we first make a GeoDataFrame of the ITS_LIVE granule footprint and perform a spatial join, selecting only the glaciers from the RGI dataframe (se_asia_prj) that fall within the granule; we then filter on area:

dc_bbox = itslivetools.get_bounds_polygon(dc)
dc_bbox['Label'] = ['Footprint of ITS_LIVE granule']
#Spatial join
rgi_subset = gpd.sjoin(se_asia_prj, 
                         dc_bbox,
                         how='inner')
#Select only glaciers where area >= 5 km2
rgi_subset = rgi_subset.loc[rgi_subset['area_km2'] >= 5.]
print(f'Now, we are looking at {len(rgi_subset)} glaciers.')
Now, we are looking at 28 glaciers.
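
The footprint polygon used in the spatial join above is straightforward to build from the cube's own coordinates. A rough sketch of what itslivetools.get_bounds_polygon() might do is shown below; it ignores any half-pixel offset at the grid edges and assumes the CRS is recorded in the cube's projection variable (as it is for ITS_LIVE cubes), so the actual implementation may differ.

from shapely.geometry import box

def get_bounds_polygon_sketch(ds):
    """Build a one-row GeoDataFrame covering the raster's x/y extent (a sketch)."""
    xmin, xmax = float(ds.x.min()), float(ds.x.max())
    ymin, ymax = float(ds.y.min()), float(ds.y.max())
    footprint = box(xmin, ymin, xmax, ymax)  # rectangular footprint in the cube's CRS
    return gpd.GeoDataFrame(geometry=[footprint], crs=f'EPSG:{ds.projection}')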

B. Combine raster and vector data to create a vector data cube

A vector data cube is a data structure similar to a raster data cube, but with one dimension represented by an array of geometries. For a detailed explanation of vector data cubes, see Edzer Pebesma's write-up. Xvec is a relatively new Python package that implements vector data cubes within the Xarray ecosystem. This is an exciting development that can drastically simplify workflows that examine data along both spatial and temporal dimensions and involve spatial features represented by vector points, lines and polygons.

To explain this in more detail: we currently have a raster data cube (the ITS_LIVE time series) that covers the entire spatial footprint shown in blue below, but the locations we are actually interested in are the glaciers outlined in red. Working with the data as a 3-dimensional raster cube is not a very efficient way of accessing it at those locations.

fig, ax = plt.subplots(figsize=(12,6))
dc_bbox.to_crs('EPSG:3857').plot(ax=ax, alpha=0.5)
rgi_subset.to_crs('EPSG:3857').plot(ax=ax, facecolor='none', edgecolor='r')

cx.add_basemap(ax=ax, crs='EPSG:3857', 
               source=cx.providers.Esri.WorldImagery,)
fig.suptitle('Spatial extent of ITS_LIVE granule, shown in blue \n outlines of glaciers of interest, shown in red.');
[Figure: spatial extent of the ITS_LIVE granule (blue) and outlines of the glaciers of interest (red) over a satellite basemap.]

Instead, we use Xvec's zonal_stats() method to convert the 3-dimensional cube into a 2-dimensional cube with a time dimension and a geometry dimension. Each element of the geometry dimension is a glacier from the rgi_subset dataframe.
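
To get a rough sense of the reduction, compare how many spatial elements each representation holds per time step:

# Each time step currently holds a full x-by-y grid of values; after
# zonal_stats() it will hold one aggregated value per glacier polygon.
n_raster_cells = dc_resamp.sizes['x'] * dc_resamp.sizes['y']
n_glaciers = len(rgi_subset)
print(f'{n_raster_cells:,} raster cells -> {n_glaciers} glacier geometries per time step')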

Note

Because we are working with polygon geometries, we use zonal_stats(), which performs a reduction over the area of each polygon. If our vector data were made up of point features, we could instead use Xvec's extract_points(), as in the brief sketch below.
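
For illustration only, here is a minimal sketch of point extraction at two made-up locations (any coordinates inside the granule footprint, expressed in its EPSG:32646 CRS, would do):

import shapely

# Two hypothetical locations inside the granule footprint, in EPSG:32646 (meters)
sample_points = shapely.points([[720000, 3.38e6], [760000, 3.34e6]])

point_cube = dc_resamp.xvec.extract_points(sample_points, x_coords='x', y_coords='y')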

1) Make a vector data cube

dask.config.set({"array.slicing.split_large_chunks": True})
vector_data_cube = dc_resamp.xvec.zonal_stats(rgi_subset.geometry,
                                       x_coords='x', y_coords='y',
).drop_vars('index')
vector_data_cube
<xarray.Dataset> Size: 898kB
Dimensions:                     (geometry: 28, mid_date: 154)
Coordinates:
  * mid_date                    (mid_date) datetime64[ns] 1kB 1986-09-30 ... ...
  * geometry                    (geometry) object 224B POLYGON Z ((698995.534...
Data variables: (12/48)
    M11                         (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    M12                         (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    chip_size_height            (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    chip_size_width             (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    interp_mask                 (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    v                           (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    ...                          ...
    vy_error_modeled            (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_error_slow               (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_error_stationary         (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_stable_shift             (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_stable_shift_slow        (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_stable_shift_stationary  (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=EPSG:32646)
Attributes: (12/19)
    Conventions:                CF-1.8
    GDAL_AREA_OR_POINT:         Area
    author:                     ITS_LIVE, a NASA MEaSUREs project (its-live.j...
    autoRIFT_parameter_file:    http://its-live-data.s3.amazonaws.com/autorif...
    datacube_software_version:  1.0
    date_created:               25-Sep-2023 22:00:23
    ...                         ...
    s3:                         s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    skipped_granules:           s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    time_standard_img1:         UTC
    time_standard_img2:         UTC
    title:                      ITS_LIVE datacube of image pair velocities
    url:                        https://its-live-data.s3.amazonaws.com/datacu...

Great, now we’ve gone from a 3-d object with (mid_date, x, y) dimensions to a 2-d object with (mid_date, geometry) dimensions. However, in addition to the geometry data stored in the vector dataframe, we’d also like to add some of its attribute data to the ITS_LIVE time series vector cube. The following cell adds attributes as coordinate variables to the vector data cube.

2) Add attribute data to vector cube

# Define the attributes to add: new coordinate name -> RGI dataframe column
rgi_attrs_dict = {'RGIId': 'rgi_id',
                  'Area_km2': 'area_km2',
                  'Slope_deg': 'slope_deg'}

def update_cube_attrs(ds, gdf, attrs_dict):
    """Attach selected GeoDataFrame columns to ds as coordinates along the geometry dimension."""
    for k, v in attrs_dict.items():
        ds[k] = (('geometry'), gdf[v].values)  # add column values along the geometry dim
        ds = ds.assign_coords({k: ds[k]})      # promote the new variable to a coordinate
    return ds

vector_data_cube = update_cube_attrs(vector_data_cube, rgi_subset, rgi_attrs_dict)
vector_data_cube 
<xarray.Dataset> Size: 899kB
Dimensions:                     (geometry: 28, mid_date: 154)
Coordinates:
  * mid_date                    (mid_date) datetime64[ns] 1kB 1986-09-30 ... ...
  * geometry                    (geometry) object 224B POLYGON Z ((698995.534...
    RGIId                       (geometry) object 224B 'RGI2000-v7.0-G-15-116...
    Area_km2                    (geometry) float64 224B 60.49 5.03 ... 8.753
    Slope_deg                   (geometry) float64 224B 21.73 19.25 ... 17.63
Data variables: (12/48)
    M11                         (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    M12                         (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    chip_size_height            (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    chip_size_width             (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    interp_mask                 (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    v                           (geometry, mid_date) float32 17kB dask.array<chunksize=(1, 148), meta=np.ndarray>
    ...                          ...
    vy_error_modeled            (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_error_slow               (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_error_stationary         (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_stable_shift             (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_stable_shift_slow        (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
    vy_stable_shift_stationary  (geometry, mid_date) float32 17kB dask.array<chunksize=(28, 148), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=EPSG:32646)
Attributes: (12/19)
    Conventions:                CF-1.8
    GDAL_AREA_OR_POINT:         Area
    author:                     ITS_LIVE, a NASA MEaSUREs project (its-live.j...
    autoRIFT_parameter_file:    http://its-live-data.s3.amazonaws.com/autorif...
    datacube_software_version:  1.0
    date_created:               25-Sep-2023 22:00:23
    ...                         ...
    s3:                         s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    skipped_granules:           s3://its-live-data/datacubes/v2/N30E090/ITS_L...
    time_standard_img1:         UTC
    time_standard_img2:         UTC
    title:                      ITS_LIVE datacube of image pair velocities
    url:                        https://its-live-data.s3.amazonaws.com/datacu...

3) Write vector cube to disk

Later in this notebook, we will want to perform operations that require this object to be loaded into memory. All of our steps so far have happened lazily, and triggering that computation is quite time-consuming.

To avoid repeating it, I’ve computed the vector data cube once and written it to disk, so that the (much more manageable) result can simply be read back in from file. Because this step is computationally intensive, I won’t execute it here, but the code is included below for completeness.

Steps to write vector data cube to disk

  1. Encode the vector data cube geometries as CF geometries. This is required because Shapely geometries (how geometries are represented in memory) are not compatible with array-based file formats such as Zarr. To get around this, Xvec uses a package called cf-xarray to encode the Shapely geometries as CF geometries. For more detail on reading and writing vector data cubes, see the Xvec documentation.

    encoded_vdc = vector_data_cube.xvec.encode_cf()
    
  2. Write the encoded data cube to Zarr. This triggers .compute(), loading the data into memory before writing it to disk; on my computer, this took about 12 minutes.

    with ProgressBar():
        encoded_vdc.to_zarr('data/regional_glacier_velocity_vector_cube.zarr', mode='w')

    Output: [########################################] | 100% Completed | 12m 15s

C. Data visualization

1) Read vector data cube into memory

The vector data cube is stored on disk with CF geometries. We go through the reverse of the process we used to write the object to disk, decoding the CF geometries to Shapely geometries with xvec.decode_cf().

vector_data_cube_cf = xr.open_zarr('../data/regional_glacier_velocity_vector_cube.zarr')
vector_data_cube = vector_data_cube_cf.xvec.decode_cf().compute()

2) Visualize velocity data

Xvec has a method, to_geodataframe(), that allows us to easily convert the xr.Dataset vector cube to a gpd.GeoDataFrame. We can then use the GeoPandas .explore() method to visualize the data interactively.

We will look at mean velocities over time.

vector_data_cube_mean = vector_data_cube.mean(dim='mid_date')

vector_data_cube_mean['vmag'] = np.sqrt(vector_data_cube_mean['vx']**2 + vector_data_cube_mean['vy']**2)

vector_data_cube_mean['vmag'].xvec.to_geodataframe(geometry='geometry').explore('vmag')
[Interactive map: mean velocity magnitude ('vmag') for each glacier outline.]

We could also look at single seasons:

vector_cube_seasonal_mean = vector_data_cube.groupby('mid_date.season').mean()

vector_cube_seasonal_mean['vmag'] = np.sqrt(vector_cube_seasonal_mean['vx']**2 + vector_cube_seasonal_mean['vy']**2)

vector_cube_seasonal_mean.sel(season='JJA')['vmag'].xvec.to_geodataframe(geometry='geometry').explore('vmag')
[Interactive map: mean summer (JJA) velocity magnitude ('vmag') for each glacier outline.]
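
If an interactive map isn’t needed (or the notebook isn’t trusted), the same seasonal means can be drawn as static maps, one panel per season. A possible sketch:

fig, axs = plt.subplots(2, 2, figsize=(12, 10))

for ax, season in zip(axs.ravel(), ['DJF', 'MAM', 'JJA', 'SON']):
    # Convert one season's means to a GeoDataFrame and plot it as a choropleth
    gdf_season = (vector_cube_seasonal_mean.sel(season=season)['vmag']
                  .xvec.to_geodataframe(geometry='geometry'))
    gdf_season.plot('vmag', ax=ax, legend=True, legend_kwds={'label': 'Speed (m/yr)'})
    ax.set_title(season)

fig.suptitle('Seasonal mean velocity magnitude');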

3) Visualize associations between velocity and attribute data

In addition to using GeoPandas plotting features, we can still use Xarray’s built-in plotting features to visualize the non-spatial elements of the data. Below, we visualize the relationships between the magnitude of velocity (y-axis), glacier slope (x-axis), and glacier area (hue).

fig, ax = plt.subplots()

fig.suptitle('Mean velocity in the context of glacier area and slope', fontsize=16)

sc = vector_data_cube_mean.plot.scatter(
    x = 'Slope_deg', y='vmag', hue='Area_km2',
    ax=ax, add_colorbar=True, cbar_kwargs={'label': 'Glacier area (sq. km.)'})
ax.set_ylabel('Magnitude of velocity (m/yr)', fontsize=14)
ax.set_xlabel('Glacier slope (degrees)', fontsize=14);
[Figure: scatter plot of mean velocity magnitude against glacier slope, colored by glacier area.]
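
Finally, the per-glacier means can be pulled into a pandas DataFrame for quick summary statistics. A sketch, assuming to_geodataframe() carries the attribute coordinates (Slope_deg, Area_km2) through as columns alongside vmag:

# Convert the per-glacier means to a (Geo)DataFrame for tabular summaries
summary_df = vector_data_cube_mean.xvec.to_geodataframe(geometry='geometry')

cols = ['vmag', 'Slope_deg', 'Area_km2']   # assumes coordinates appear as columns
print(summary_df[cols].describe())         # per-glacier summary statistics
print(summary_df[cols].corr())             # pairwise correlations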