# 3.1 Accessing cloud-hosted ITS_LIVE data

## Introduction

This notebook demonstrates how to query and access cloud-hosted Inter-mission Time Series of Land Ice Velocity and Elevation ([ITS_LIVE](https://its-live.jpl.nasa.gov/#access)) data from Amazon Web Services (AWS) S3 buckets. These data are stored as [Zarr](https://zarr.readthedocs.io/en/stable/) data cubes, a cloud-optimized format for array data. They are read into memory as [Xarray](https://docs.xarray.dev/en/stable/) Datasets.

{{break}}

::::{tab-set}   
:::{tab-item} Outline

(content:section_A)= 
**[A. Overview of ITS_LIVE data](#a-overview-of-its_live-data)**
- 1) Data structure overview

(content:Section_B)=
**[B. Read ITS_LIVE data from AWS S3 using Xarray](#b-read-its_live-data-from-aws-s3-using-xarray)**
- 1) Overview of ITS_LIVE data storage and catalog
- 2) Read ITS_LIVE data from S3 storage into memory
- 3) Check spatial footprint of data

(content:Section_C)=
**[C. Query ITS_LIVE catalog](#c-query-its_live-catalog)**

- 1) Find ITS_LIVE granule for a point of interest
- 2) Read + visualize spatial footprint of ITS_LIVE data
:::

:::{tab-item} Learning Goals
#### Concepts
- Understand how data is organized in AWS S3 buckets
- Query and access a cloud-optimized dataset from cloud object storage
- Create a vector data object representing the footprint of a raster dataset
- Preliminary visualization of data extent
  
#### Techniques
- Use [Xarray](https://xarray.dev/) to open [Zarr](https://zarr.readthedocs.io/en/stable/) data cubes stored in an [AWS S3 bucket](https://aws.amazon.com/s3/)
- Interactive data visualization with [hvplot](https://hvplot.holoviz.org/)
- Create a [GeoPandas](https://geopandas.org/en/stable/) `GeoDataFrame` from an Xarray `xr.Dataset` object
:::
::::

Expand the next cell to see specific packages used in this notebook and relevant system and version information. 

{{break}}

In [3]:
%xmode minimal
import geopandas as gpd
import hvplot.pandas
import xarray as xr
from shapely.geometry import Point, Polygon

Exception reporting mode: Minimal


## A. Overview of ITS_LIVE data

Skipping ahead a few steps, let's take a look at an ITS_LIVE data cube so that we have some expectations about what we'll see in the data catalog and once we read a data cube into memory. 

Specifically, we want to understand an ITS_LIVE time series data cube in the context of the Xarray data model. If you're new to working with Xarray, the [Data Structures](https://docs.xarray.dev/en/latest/user-guide/data-structures.html) documentation is very useful for getting the hang of the different components that are the building blocks of `xarray.Dataset` objects.

In [4]:
init_url = "http://its-live-data.s3.amazonaws.com/datacubes/v2-updated-october2024/N60W130/ITS_LIVE_vel_EPSG3413_G0120_X-3250000_Y250000.zarr"
datacube = xr.open_dataset(init_url, engine="zarr", decode_timedelta=True, chunks="auto")

In [5]:
datacube['satellite_img1'].encoding

{'chunks': (100426,),
 'preferred_chunks': {'mid_date': 100426},
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 'dtype': dtype('<U2')}

### 1) Data structure overview

#### Dimensions
- This object has 3 *dimensions*, `mid_date`, `x`, and `y`.
- Each dimension has a corresponding coordinate variable of the same name. Think of these as "axis ticks" on a figure if you were to plot the data.

#### Data Variables
- Expanding the 'Data Variables' label, you can see that there are many (60!) variables.
- Each variable exists along one or more dimensions (e.g. `(mid_date,x,y)`), has an associated data type (e.g. `float32`), and has an underlying array that holds that variable's data.

#### Attributes
- Data is commonly associated with related "metadata" -- data that describes data. For example, the `floatingice` variable has an attribute `description : floating ice mask, 0 = non-floating-ice, 1 = floating-ice` that tells you how to interpret its values. All array-based Xarray objects (data variables, coordinate variables, DataArrays and Datasets) can have attributes attached to them.

In [6]:
datacube.floatingice

| | Array | Chunk |
|---|---|---|
| Bytes | 2.65 MiB | 2.65 MiB |
| Shape | (834, 834) | (834, 834) |
| Dask graph | 1 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |
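
The dimensions, data variables, and attributes described above can be reproduced on a small scale. The following is a minimal sketch that builds a synthetic dataset rather than reading the real cube: the variable names `v` and `floatingice` mirror ITS_LIVE names, but the grid size, coordinate values, and data values are all made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for an ITS_LIVE cube: 3 time steps on a 4 x 5 grid.
# All coordinate and data values below are made up.
mid_date = np.array(["2020-01-01", "2020-02-01", "2020-03-01"], dtype="datetime64[ns]")
x = np.arange(5) * 120.0        # projected x coordinate (m)
y = np.arange(4)[::-1] * 120.0  # projected y coordinate (m)

toy = xr.Dataset(
    data_vars={
        # a variable that varies in time and space
        "v": (("mid_date", "y", "x"), np.ones((3, 4, 5), dtype="float32")),
        # a variable that only varies in space, with an attribute attached
        "floatingice": (
            ("y", "x"),
            np.zeros((4, 5), dtype="float32"),
            {"description": "floating ice mask, 0 = non-floating-ice, 1 = floating-ice"},
        ),
    },
    coords={"mid_date": mid_date, "x": x, "y": y},
)

print(dict(toy.sizes))                          # dimension names and lengths
print(list(toy.data_vars))                      # data variable names
print(toy["floatingice"].attrs["description"])  # attribute lookup
```

Each data variable is declared as a `(dimensions, array, attributes)` tuple, which makes the link between a variable, the dimensions it exists along, and its metadata explicit.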


#### Other Coordinate Variables

Metadata can take the form of dimensional arrays too. For example, the `satellite_img1` and `satellite_img2` arrays record the satellite sources for the image pair used to construct the velocity data. This is important *metadata* about the observed velocity fields. Such variables can be set as "non-dimension coordinate variables" if desired, though we will not do so here. 

In [7]:
datacube.satellite_img1

| | Array | Chunk |
|---|---|---|
| Bytes | 1.06 MiB | 1.06 MiB |
| Shape | (138421,) | (138421,) |
| Dask graph | 1 chunks in 2 graph layers | |
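
For readers curious what promoting such a variable to a non-dimension coordinate looks like, here is a minimal sketch on a synthetic dataset. The sensor labels and velocity values are invented for illustration, and the notebook itself does not take this step.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: one sensor label per time step (all values are made up).
toy = xr.Dataset(
    data_vars={
        "v": (("mid_date",), np.array([10.0, 12.0, 11.0], dtype="float32")),
        "satellite_img1": (("mid_date",), np.array(["1A", "8", "2B"])),
    },
    coords={
        "mid_date": np.array(
            ["2020-01-01", "2020-02-01", "2020-03-01"], dtype="datetime64[ns]"
        )
    },
)

# Promote the sensor label from a data variable to a non-dimension coordinate.
toy_coords = toy.set_coords("satellite_img1")

# The label now travels with every variable that shares the mid_date dimension,
# so it can be used to filter observations by sensor.
only_1a = toy_coords["v"].where(toy_coords["satellite_img1"] == "1A", drop=True)
print(only_1a)
```

After `set_coords()`, `satellite_img1` is listed under the coordinates of `toy_coords` and of `toy_coords["v"]` instead of under the data variables.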


:::{tip}
If you haven't yet, review the {term}`Metadata naming` and {term}`Climate Forecast (CF) Metadata Conventions` sections of the [Relevant Concepts](../../background/6_relevant_concepts.md) page.
:::

## B. Read ITS_LIVE data from AWS S3 using Xarray

Now that we know a bit more about the ITS_LIVE dataset, we can start querying the catalog to access the data we're interested in.

### 1) Overview of ITS_LIVE data storage and catalog

The ITS_LIVE project details a number of data access options on their [website](https://its-live.jpl.nasa.gov/#access). Here, we will access ITS_LIVE data in the form of [Zarr](https://zarr.readthedocs.io/en/stable/) data cubes that are stored in [S3 buckets](https://registry.opendata.aws/its-live-data/) hosted by Amazon Web Services (AWS). You can browse the contents of the bucket with the [AWS S3 Explorer index](https://its-live-data.s3.amazonaws.com/index.html). The data cube catalog is stored in the bucket as a GeoJSON file, `catalog_v02.json`, which we will use to query the available cubes. Click this [link](https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json) to download it.

:::{tip}
You can also use the ITS_LIVE API to access ITS_LIVE data cube urls corresponding to different search conditions as well as Python code provided on the ITS_LIVE [website](https://its-live.jpl.nasa.gov/#api). We go through the steps of looking at the catalog in order to get a better understanding of how S3 buckets are organized. 
:::

To query the data stored in the bucket, we will download the `catalog_v02.json` that is located in the bucket linked above. 

#### Understanding the data

The first step in working with a new dataset is understanding how it is organized. The catalog contains spatial information and properties of ITS_LIVE data cubes as well as the URL used to access each cube. Let's take a look at the entry for a single data cube and the information that it contains:

```{image} ../imgs/screengrab_itslive_catalog_entry.png
:align: center
```

The top portion of the picture shows the spatial extent of the data cube in lat/lon units. Below that, we have properties such as the [EPSG code of the coordinate reference system](https://en.wikipedia.org/wiki/EPSG_Geodetic_Parameter_Dataset), the spatial footprint in projected units, and the url of the zarr object. 

Let's take a look at the url more in-depth: 
```
http://its-live-data.s3.amazonaws.com/datacubes/v2-updated-october2024/S40E170/ITS_LIVE_vel_EPSG32759_G0120_X450000_Y5250000.zarr
```

From this link we can see that we are looking at ITS_LIVE data located in an S3 bucket hosted by Amazon Web Services (AWS). We also see that we're looking in the version 2 data cube directory (`v2-updated-october2024`). The next part of the path gives us information about the global location of the cube (`S40E170`). The actual file name, `ITS_LIVE_vel_EPSG32759_G0120_X450000_Y5250000.zarr`, tells us that we are looking at ice velocity data (ITS_LIVE also has elevation data) in the CRS associated with EPSG 32759 (this code indicates UTM zone 59S). `X450000_Y5250000` tells us more about the spatial footprint of the data cube within the UTM zone.
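
The same reading of the URL can be done programmatically. The sketch below is an illustration only: the regular expression is inferred from the example URLs in this notebook, not from an official naming specification, and the helper names `parse_cube_url` and `utm_zone` are hypothetical.

```python
import re

url = (
    "http://its-live-data.s3.amazonaws.com/datacubes/v2-updated-october2024/"
    "S40E170/ITS_LIVE_vel_EPSG32759_G0120_X450000_Y5250000.zarr"
)

# Pattern inferred from the example URLs only; other products may not match it.
CUBE_URL = re.compile(
    r"/datacubes/(?P<version>[^/]+)/(?P<tile>[NS]\d{2}[EW]\d{3})/"
    r"ITS_LIVE_vel_EPSG(?P<epsg>\d+)_G(?P<grid>\d+)_X(?P<x>-?\d+)_Y(?P<y>-?\d+)\.zarr$"
)


def parse_cube_url(cube_url: str) -> dict:
    """Split a data cube URL into its directory and file name components."""
    match = CUBE_URL.search(cube_url)
    if match is None:
        raise ValueError(f"unrecognized data cube URL: {cube_url}")
    parts = match.groupdict()
    for key in ("epsg", "x", "y"):
        parts[key] = int(parts[key])
    return parts


def utm_zone(epsg: int) -> str:
    """Name the WGS 84 / UTM zone for EPSG codes 326xx (north) and 327xx (south)."""
    prefix, zone = divmod(epsg, 100)
    if prefix in (326, 327) and 1 <= zone <= 60:
        return f"{zone}{'N' if prefix == 326 else 'S'}"
    raise ValueError(f"EPSG:{epsg} is not a WGS 84 / UTM zone code")


parts = parse_cube_url(url)
print(parts)
print(utm_zone(parts["epsg"]))  # 59S
```

Cubes in polar projections, such as the EPSG 3413 cube opened in section A, still parse, but `utm_zone()` raises a `ValueError` for them because the code is not a UTM zone.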


### 2) Read ITS_LIVE data from S3 storage into memory

We've found the URL associated with the tile we want to access. Let's try to open the data cube using `xr.open_dataset()`:

In [8]:
url = "http://its-live-data.s3.amazonaws.com/datacubes/v2/N30E090/ITS_LIVE_vel_EPSG32646_G0120_X750000_Y3350000.zarr"

When we read the data, in addition to passing `url` to `xr.open_dataset()` we will include `chunks="auto"`. This introduces [dask](https://www.dask.org/) into our workflow. With `chunks="auto"`, chunk sizes are chosen automatically, taking the chunking of the stored data into account; this is often ideal, but sometimes you may need to specify a different chunking scheme. You can read more about choosing good chunk sizes [here](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes); subsequent notebooks in this tutorial will explore different approaches to dask chunking. The first attempt below passes only `url` and `decode_timedelta=True`.
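
To get a feel for why chunking matters, it helps to work out how large the arrays are. The following sketch is plain arithmetic, with the shapes taken from the reprs in section A (an 834 x 834 grid and 138421 time steps). The last chunk shape is a hypothetical example, not the chunking used by ITS_LIVE.

```python
from math import prod


def nbytes(shape: tuple, itemsize: int) -> int:
    """Uncompressed size in bytes of an array (or chunk) with the given shape."""
    return prod(shape) * itemsize


MIB = 2**20
GIB = 2**30
FLOAT32 = 4  # bytes per element

# A single 2-D layer, matching the floatingice repr in section A.
layer = nbytes((834, 834), FLOAT32)
print(f"{layer / MIB:.2f} MiB")  # 2.65 MiB

# A full 3-D float32 variable, assuming 138421 time steps on the same grid.
full = nbytes((138421, 834, 834), FLOAT32)
print(f"{full / GIB:.1f} GiB")

# A hypothetical chunk of 20000 time steps over a 10 x 10 pixel window.
chunk = nbytes((20000, 10, 10), FLOAT32)
print(f"{chunk / MIB:.2f} MiB")
```

A single three-dimensional variable is far larger than the memory of a typical laptop, which is why the cube is read lazily and processed one chunk at a time.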

In [9]:
dc = xr.open_dataset(url, decode_timedelta=True)

syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>its-live-data.s3-us-west-2.amazonaws.com</Endpoint><Bucket>its-live-data</Bucket><RequestId>N4K18PDJ5AXTVHG0</RequestId><HostId>T09JW3wysyNegZJIgK2n0olT8TKkjRaENRVqr5Qi7b4Nbe2HwA8PvFf+bPMaQizLkJmP5SYQqQs=</HostId></Error>


OSError: [Errno -72] NetCDF: Malformed or inaccessible DAP2 DDS or DAP4 DMR response: 'http://its-live-data.s3.amazonaws.com/datacubes/v2/N30E090/ITS_LIVE_vel_EPSG32646_G0120_X750000_Y3350000.zarr'

As you can see, this doesn't quite work. When passing the URL to `xr.open_dataset()`, if a backend isn't specified, Xarray will expect a NetCDF file. Because we're trying to open a Zarr store, we need to add an additional argument, `engine="zarr"`, to `xr.open_dataset()`, as shown in the next code cell. You can find more information [here](https://docs.xarray.dev/en/stable/user-guide/io.html#cloud-storage-buckets). Another approach we could use is to read the data with the Zarr-specific method [`xr.open_zarr()`](https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html) instead of [`xr.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html).

We set `decode_coords="all"` so that Xarray will auto-detect a number of variables as coordinate variables --- these are variables that are usually describing properties that are common to many "data variables". In our case, it picks up the `mapping` variable which describes the Coordinate Reference System for this datacube.

In [None]:
dc = xr.open_dataset(url, engine="zarr", chunks="auto", decode_timedelta=False, decode_coords="all")
dc

This one worked! Let's stop here and define a function that we can use to read additional s3 objects into memory as Xarray Datasets. This will come in handy later in this notebook and in subsequent notebooks. We will store this and other utility functions in `itslive_tools.py` for reuse across notebooks.

In [None]:
def read_in_s3(http_url: str, chunks: str | dict | None = "auto") -> xr.Dataset:
    """I'm a function that takes a url pointing to the location of a zarr data cube.
    I return an Xarray Dataset. I can take an optional chunk argument which specifies
    how the data will be chunked when read into memory"""
    datacube = xr.open_dataset(
        http_url,
        engine="zarr",
        chunks=chunks,
        decode_coords="all",
        decode_timedelta=False,
    )
    return datacube

### 3) Check spatial footprint of data

We just read in a very large dataset. 

We'd like an easy way to be able to visualize the footprint of this data to ensure we specified the correct location without plotting a data variable over the entire footprint, which would be much more computationally and time-intensive. 

To do so, we need to understand the coordinate system of the data, and its bounds.

This dataset has its coordinate system info stored in an array named `mapping`. How would you know that? Scroll through the Xarray Dataset repr, and check the attributes. Variables with CRS information tend to have `crs_wkt`, `grid_mapping_name`, `GeoTransform` and related attributes that describe the coordinate system, and data variables usually point to them through a `grid_mapping` attribute.
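
Instead of scrolling through the repr, the same search can be scripted. This is a minimal sketch on a synthetic dataset: the helper `find_crs_variables` and the set of attribute names it checks are assumptions chosen for illustration, and the attribute values are placeholders.

```python
import numpy as np
import xarray as xr

# Attribute names that commonly mark a variable as carrying CRS information
# (an illustrative list, not an exhaustive or official one).
CRS_ATTRS = {"crs_wkt", "spatial_ref", "grid_mapping_name", "GeoTransform", "spatial_epsg"}


def find_crs_variables(ds: xr.Dataset) -> list[str]:
    """Return names of variables whose attributes look like CRS metadata."""
    return [name for name, var in ds.variables.items() if CRS_ATTRS & set(var.attrs)]


# Synthetic stand-in for a cube; attribute values are placeholders.
toy = xr.Dataset(
    {
        "v": (("y", "x"), np.zeros((2, 2), dtype="float32"), {"grid_mapping": "mapping"}),
        "mapping": (
            (),
            0,
            {"grid_mapping_name": "universal_transverse_mercator", "spatial_epsg": 32646},
        ),
    }
)

print(find_crs_variables(toy))  # ['mapping']
```

The helper deliberately skips the `grid_mapping` attribute on `v`, since that attribute only names the variable holding the CRS rather than describing the CRS itself.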

In [None]:
dc.mapping

The following function creates a `GeoPandas.GeoDataFrame` describing the spatial footprint of an `xr.Dataset`. 

In [None]:
def get_bounds_polygon(input_xr: xr.Dataset) -> gpd.GeoDataFrame:
    """I'm a function that takes an Xarray Dataset and returns a GeoPandas GeoDataFrame of the bounding box of the Xarray Dataset."""

    xmin = input_xr.coords["x"].data.min()
    xmax = input_xr.coords["x"].data.max()

    ymin = input_xr.coords["y"].data.min()
    ymax = input_xr.coords["y"].data.max()

    pts_ls = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]

    crs = f"epsg:{input_xr.mapping.spatial_epsg}"

    polygon_geom = Polygon(pts_ls)
    polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[polygon_geom])

    return polygon

Now let's take a look at the cube we've already read:

In [None]:
bbox = get_bounds_polygon(dc)

`get_bounds_polygon()` returns a geopandas.GeoDataFrame object in the same projection as the velocity data object (local UTM). Re-project to latitude/longitude to view the object more easily on a map:

In [None]:
bbox = bbox.to_crs("EPSG:4326")

To visualize the footprint, we use the interactive plotting library, [hvPlot](https://hvplot.holoviz.org/).

In [None]:
poly = bbox.hvplot(legend=True, alpha=0.3, tiles="ESRI", color="red", geo=True)
poly

## C. Query ITS_LIVE catalog

### 1) Find ITS_LIVE granule for a point of interest
Let's look in a different region and see how we could search the ITS_LIVE data cube catalog for the granule that covers our location of interest. There are many ways to do this; this is just one example.

First, we read in the catalog GeoJSON file with geopandas:

In [None]:
itslive_catalog = gpd.read_file("https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json")
itslive_catalog

Below is a function to query the catalog for the s3 url covering a given point. You could easily tweak this function (or write your own!) to select granules based on different properties. Play around with the `itslive_catalog` object to become more familiar with the data it contains and different options for indexing.

:::{note}
Since this tutorial was originally written, the [ITS_LIVE Python Client](https://github.com/nasa-jpl/itslive-py) was released. This is a great way to access ITS_LIVE data cubes. 
:::


In [None]:
def find_granule_by_point(input_point: list) -> str:
    """I take a point in [lon, lat] format and return the url of the granule containing specified point.
    Point must be passed in EPSG:4326."""

    catalog = gpd.read_file("https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json")

    # make a GeoSeries holding the input point
    p = gpd.GeoSeries([Point(input_point[0], input_point[1])], crs="EPSG:4326")
    # make a GeoDataFrame from the point
    gdf = gpd.GeoDataFrame({"label": "point", "geometry": p})
    # find row of granule
    granule = catalog.sjoin(gdf, how="inner")

    url = granule["zarr_url"].values[0]
    return url
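
To make the logic of the spatial join concrete, here is a pure-Python sketch of the same idea: test the point against each catalog footprint and collect the URLs of the footprints that contain it. The two catalog entries, their footprints, and their URLs are hypothetical placeholders, and the helper names are made up. GeoPandas performs this far more efficiently.

```python
def point_in_ring(lon: float, lat: float, ring: list) -> bool:
    """Ray-casting test: is (lon, lat) inside the closed ring of (lon, lat) vertices?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Count edges crossed by a ray running from the point toward +lon.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_urls_by_point(features: list, lon: float, lat: float) -> list:
    """Return the zarr_url of every feature whose exterior ring contains the point."""
    return [
        feat["properties"]["zarr_url"]
        for feat in features
        if point_in_ring(lon, lat, feat["geometry"]["coordinates"][0])
    ]


# Two hypothetical GeoJSON-style catalog entries (placeholder footprints and URLs).
features = [
    {
        "geometry": {
            "type": "Polygon",
            "coordinates": [[(-140, 60), (-138, 60), (-138, 61), (-140, 61), (-140, 60)]],
        },
        "properties": {"zarr_url": "http://example.com/cube_a.zarr"},
    },
    {
        "geometry": {
            "type": "Polygon",
            "coordinates": [[(-138, 60), (-136, 60), (-136, 61), (-138, 61), (-138, 60)]],
        },
        "properties": {"zarr_url": "http://example.com/cube_b.zarr"},
    },
]

print(find_urls_by_point(features, -138.958776, 60.748561))
print(find_urls_by_point(features, 0.0, 0.0))  # []
```

Returning a list makes the "no match" case explicit. In `find_granule_by_point()`, a point outside every footprint leaves the join result empty, so indexing `.values[0]` raises an `IndexError`.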

Choose a point of interest in the glaciated mountains near the Alaska-Yukon border:

In [None]:
url = find_granule_by_point([-138.958776, 60.748561])
url

Great, this function returned a single URL corresponding to the data cube covering the point we supplied. Let's use the `read_in_s3()` function we defined to open the data cube as an `xarray.Dataset`.

In [None]:
datacube = read_in_s3(url)
datacube

### 2) Read + visualize spatial footprint of ITS_LIVE data

Use the `get_bounds_polygon()` function to take a look at the footprint using `hvplot()`.

In [None]:
bbox_dc = get_bounds_polygon(datacube)

In [None]:
poly = bbox_dc.to_crs("EPSG:4326").hvplot(legend=True, alpha=0.5, tiles="ESRI", color="red", geo=True)
poly

## Conclusion
This notebook demonstrated how to query and access a cloud-optimized remote sensing time series dataset stored in an AWS S3 bucket. The subsequent notebooks in this tutorial will go into much more detail on how to organize, examine and analyze this data. 