5.3 Data Cubes Revisited
In this book, we saw a range of real-world datasets and the steps required to prepare them for analysis. Several guiding principles for assembling and using analysis-ready data cubes in Xarray can be drawn from these examples.
Let’s first return to the Xarray building blocks described in the background section; we can now provide more-detailed definitions of what they are and how they should be used:
Dissecting data cubes
Dimensions - What do the axes of the data represent? The answer should be the set of dimensions, frequently (x, y, time). Dimensions are orthogonal to each other.
Dimensional coordinate variables - Typically 1-d arrays describing the range and resolution of the data along each dimension. Think of these as the axis tick labels on a plot.
Non-dimensional coordinate variables - Metadata about the physical observable that varies along one or more dimensions. These can be 1-d up to n-d, where n is the length of .dims.
Data variables - Physical observable(s) whose values are known at every point on the grid formed by the dimensions.
Attributes - Metadata that can be assigned to a given xr.Dataset or xr.DataArray that is invariant along that object’s dimensions.
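To make these building blocks concrete, here is a minimal sketch of a toy cube that contains each of them. The variable names (backscatter, orbit_direction), the coordinate values, and the sensor attribute are hypothetical and chosen only for illustration; they are not taken from any of the tutorial datasets.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy cube: 3 time steps on a 2 x 4 (y, x) grid.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    # Data variable: a value exists at every (time, y, x) grid point.
    data_vars={"backscatter": (("time", "y", "x"), rng.random((3, 2, 4)))},
    coords={
        # Dimensional coordinates: 1-d labels along each dimension.
        "time": pd.date_range("2021-01-01", periods=3, freq="12D"),
        "y": [10.0, 20.0],
        "x": [100.0, 110.0, 120.0, 130.0],
        # Non-dimensional coordinate: metadata that varies along 'time'.
        "orbit_direction": ("time", ["ascending", "descending", "ascending"]),
    },
    # Attribute: invariant along all of the object's dimensions.
    attrs={"sensor": "hypothetical-sar"},
)

print(dict(ds.sizes))                      # dimension names and lengths
print(list(ds.indexes))                    # dimensional (indexed) coordinates
print(set(ds.coords) - set(ds.indexes))    # non-dimensional coordinates
```

Printing the object itself (print(ds)) shows the same pieces grouped under the Dimensions, Coordinates, Data variables, and Attributes headings.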
Tidying guidelines for Xarray
At the beginning of the book, we also discussed ‘tidy data’ as it is defined for tabular data. Now that we’ve worked through a number of examples preparing n-dimensional array data for analysis, we can enumerate a few best practices that apply in most cases when tidying data with Xarray. The guiding question when thinking about how to tidy data is always:
How can this data be structured to simplify subsequent analysis?
Keep in mind that one organization of the data need not make all analyses equally ergonomic. We must be open to transforming the data between equivalent representations, depending on the task at hand.
Here are a few guidelines:
Data variables
These are the measurements or estimates in your dataset. If there are multiple independent measurements in the dataset, they should be stored as data variables in an xr.Dataset; if the data are univariate, use an xr.DataArray.
Guiding Question |
What physical observable(s) is my dataset measuring? |
Relevant examples
In Sentinel-1, notebook 3 - Exploratory analysis of ASF data, we saw examples of treating multiple backscatter polarizations as data variables versus a single variable along a band dimension. We can convert between the two representations using Dataset.to_array and DataArray.to_dataset.
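The conversion between the two representations can be sketched as follows. The cube here is synthetic (random values on a tiny grid), and the variable name backscatter is an assumption made for the example rather than the name used in the tutorial.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
dims = ("time", "y", "x")

# Hypothetical backscatter cube with one data variable per polarization.
ds = xr.Dataset(
    {"vv": (dims, rng.random((3, 2, 2))), "vh": (dims, rng.random((3, 2, 2)))},
    coords={"time": [0, 1, 2], "y": [0.0, 1.0], "x": [0.0, 1.0]},
)

# Variables -> dimension: stack the data variables along a new 'band' dimension.
da = ds.to_array(dim="band", name="backscatter")

# Dimension -> variables: split 'band' back into one variable per polarization.
ds_roundtrip = da.to_dataset(dim="band")

# With a 'band' dimension, one polarization is a label-based selection.
vv_only = da.sel(band="vv")
```

The labels along the new band dimension are the original variable names, so no information is lost in either direction.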
Takeaway |
Data variables should be independent of one another. |
Dimensions
Many earth observation data cubes have the same core structure of (x, y, time) or (longitude, latitude, time) dimensions. However, the optimal structure for a given dataset depends on its intended use-case, and there is considerable room to reshape data cubes to fit specific analytical needs.
In these situations, a guiding question could be:
Guiding Question |
What ‘shape’ will help me answer the questions I have about this dataset? |
Relevant examples
1. Expanding dimensions v. adding variables
Formatting the Sentinel-1 backscatter cube to have vv and vh data variables versus a band dimension with the following coordinate array: ('band', ['vh','vv']) (Sentinel-1 tutorial, notebook 3 - exploratory analysis of ASF data).
If you are only interested in a single polarization of the dataset, or in looking at backscatter from different polarizations independently of one another, treating backscatter from each polarization as a data variable is suitable and maybe even optimal; it can be simpler to perform operations on a single variable than on an entire dimension.
If you are interested in examining backscatter across different polarizations, the polarizations are most appropriately represented as elements of a dimension.
Takeaway |
Structure your datacube’s dimensions so that data variables are independent of one another. |
Tip
If you are working with a dataset where information about how the variables relate to one another is included in the variable name (e.g. a year, or a band wavelength), this is a sign that there should be an additional dimension.
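As an illustration of this tip, the sketch below starts from a hypothetical dataset whose variable names encode a year (ndvi_2019, ndvi_2020, ndvi_2021; these names and values are invented for the example) and moves that information into a year dimension.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical dataset where the year is encoded in each variable name.
ds = xr.Dataset(
    {f"ndvi_{year}": (("y", "x"), rng.random((2, 3))) for year in (2019, 2020, 2021)},
    coords={"y": [0.0, 1.0], "x": [0.0, 1.0, 2.0]},
)

# Parse the year out of each name and stack the variables along a new dimension.
names = list(ds.data_vars)
years = [int(name.split("_")[1]) for name in names]
ndvi = xr.concat(
    [ds[name] for name in names],
    dim=pd.Index(years, name="year"),
).rename("ndvi")

# The relationship between variables is now explicit and usable in operations.
yearly_change = ndvi.diff("year")
```

Once the year is a coordinate rather than part of a name, label-based selection (ndvi.sel(year=2020)) and reductions along the dimension work without any string parsing.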
2. Compare two datasets by combining them into a single cube with an additional dimension
To compare data from different satellites within the ITS_LIVE dataset, we create a new data cube with a 'sensor' dimension (ITS_LIVE tutorial, notebook 4 - exploratory analysis of a single glacier).
Adding a source dimension when comparing ASF and PC backscatter datasets (Sentinel-1 tutorial, notebook 5 - comparing backscatter datasets).
In this example, the goal of our analysis changes from observing backscatter to observing how measurements of backscatter from two processing pipelines differ from one another. This implies a different shape for the data; the appropriate dimensions change from (x, y, time, band) to (x, y, time, band, source).
Adding a source dimension lets us index the combined dataset by ‘source’ and compare the two ‘source’ elements on a common grid and scale.
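A minimal sketch of this pattern is shown below. It assumes the two cubes have already been aligned to the same grid, time steps, and bands; the arrays are synthetic, and the labels 'asf' and 'pc' are simply illustrative names for the two elements of the new dimension.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
dims = ("time", "y", "x", "band")
coords = {
    "time": pd.date_range("2021-01-01", periods=3),
    "y": [0.0, 1.0],
    "x": [0.0, 1.0],
    "band": ["vh", "vv"],
}

# Two hypothetical backscatter cubes that share a grid (assumed pre-aligned).
asf = xr.DataArray(rng.random((3, 2, 2, 2)), coords=coords, dims=dims, name="backscatter")
pc = xr.DataArray(rng.random((3, 2, 2, 2)), coords=coords, dims=dims, name="backscatter")

# Concatenate along a new, labeled 'source' dimension.
combined = xr.concat([asf, pc], dim=pd.Index(["asf", "pc"], name="source"))

# Index by source label to compare the two products directly.
difference = combined.sel(source="asf") - combined.sel(source="pc")
```

If the inputs were not on a common grid, they would need to be regridded or aligned first; by default xr.concat takes the union of any mismatched coordinate labels and fills the gaps with NaN.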
Takeaway |
Consider either concatenating two cubes along a new dimension, or splitting a dimension into multiple cubes. One approach may be more ergonomic than the other depending on the problem at hand. |
Dimensional & non-dimensional coordinate variables
A tidy dataset should have a dimensional coordinate variable for each dimension in ds.dims. Additional non-dimensional coordinate variables should be added when relevant metadata varies over dimension(s) of the dataset.
Guiding Question |
What are the dimensions of the dataset? What information (separate from the measurement variable) varies over those dimensions? |
Relevant examples
1. Handling time-varying metadata
Metadata that varies over (time) should be stored as coordinate variables along the time dimension (e.g. whether a scene was taken during an ascending or descending pass).
Metadata that varies over time, x, and y should be coordinate variables that exist along those dimensions.
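The sketch below shows one way to attach time-varying metadata as a non-dimensional coordinate and then use it. The coordinate name orbital_pass and its values are hypothetical, and the data are random.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical backscatter time series with four scenes.
da = xr.DataArray(
    rng.random((4, 2, 2)),
    dims=("time", "y", "x"),
    coords={"time": pd.date_range("2021-01-01", periods=4, freq="6D")},
    name="backscatter",
)

# Attach per-scene metadata as a non-dimensional coordinate along 'time'.
da = da.assign_coords(
    orbital_pass=("time", ["ascending", "descending", "ascending", "descending"])
)

# The coordinate travels with the data and can drive selection and grouping.
ascending = da.where(da.orbital_pass == "ascending", drop=True)
mean_by_pass = da.groupby("orbital_pass").mean()
```

Because orbital_pass is a coordinate rather than a separate array, it is subset automatically whenever the cube is indexed along time.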
Takeaway |
Assign metadata that varies along a given dimension as a non-dimensional coordinate of that dimension. |
Attributes
attrs can be assigned to the dataset as a whole or to any of the xr.DataArray objects within it. Many fields have their own conventions for attribute metadata, e.g. Climate & Forecast Conventions (CF).
Guiding Question |
Does a piece of attribute information apply to this entire object (e.g. a data variable, a coordinate variable, or a dataset)? If so, it should be stored as an attribute of that object. Attributes must conform to an existing standard if possible. |
and
Guiding Question |
What tools exist that can help perform the operations that I need to with this dataset? How must attribute data be stored to use them? |
Relevant examples
1. Attributes must conform to accepted metadata conventions like CF and STAC in order to take advantage of tools built off these specifications
Using cf_xarray with appropriately formatted metadata enables more streamlined access to and interpretation of metadata (ITS_LIVE tutorial, data access notebook).
Having appropriate CF metadata enables reading and writing vector data cubes to disk (ITS_LIVE tutorial, exploratory analysis of a group of glaciers notebook).
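As a rough sketch of what convention-style attributes look like in practice, the example below assigns attributes at the dataset, data variable, and coordinate level using plain Xarray only. The specific attribute values (the Conventions string, the long_name, the units) are illustrative assumptions and should be checked against the CF documentation before being used on real data; the final lookup imitates, in a simplified way, how attribute-aware tools can locate variables without relying on their names.

```python
import numpy as np
import xarray as xr

# Hypothetical velocity cube on a projected grid.
ds = xr.Dataset(
    {"v": (("time", "y", "x"), np.zeros((2, 2, 2)))},
    coords={"time": [0, 1], "y": [0.0, 1.0], "x": [0.0, 1.0]},
)

# Dataset-level attribute: applies to the object as a whole.
ds.attrs["Conventions"] = "CF-1.8"  # illustrative value

# Variable-level attributes: apply to a single data or coordinate variable.
ds["v"].attrs.update({"long_name": "ice surface speed", "units": "m yr-1"})
ds["x"].attrs.update({"standard_name": "projection_x_coordinate", "units": "m"})
ds["y"].attrs.update({"standard_name": "projection_y_coordinate", "units": "m"})

# Simplified attribute-based lookup: find the x coordinate by its standard_name.
x_names = [
    name
    for name, var in ds.variables.items()
    if var.attrs.get("standard_name") == "projection_x_coordinate"
]
```

Libraries such as cf_xarray build on this idea, which is why consistent attribute metadata matters more than consistent variable naming.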
Takeaway |
Wherever possible, use metadata that conforms to existing conventions. |
Collections
Independent objects should be represented as unique xr.Datasets (if multivariate) or xr.DataArrays (if univariate). If you are working with a collection of independent objects but would like to organize, keep track of, and work with them in relation to one another, use xr.DataTree to assign hierarchical relationships among objects.
Relevant examples
1. Apply a function to every object in a collection with xr.DataTree.map_over_datasets()
Using xr.DataTree to apply a function to each dataset within a collection in order to make a new datacube that includes all objects along an expanded dimension (ITS_LIVE tutorial, notebook 4 - exploratory analysis of a single glacier).
2. If you’re working with a collection of objects that can be defined by vector geometries, a vector data cube may be an appropriate way to represent the data
Use Xvec to build a vector data cube that has a 'geometry' dimension; each element of the geometry dimension is a cube that varies over the other dimensions of the cube (frequently time) (ITS_LIVE tutorial, exploratory analysis of a group of glaciers notebook).