Getting Started with Cloud-Native HLS Data in Python:

Extracting an EVI Time Series from Harmonized Landsat-8 Sentinel-2 (HLS) data in the Cloud using CMR's SpatioTemporal Asset Catalog (CMR-STAC)

This tutorial demonstrates how to work with the HLS S30 (HLSS30.015) data product.

The Harmonized Landsat and Sentinel-2 (HLS) project produces seamless, harmonized surface reflectance data from the Operational Land Imager (OLI) and Multi-Spectral Instrument (MSI) aboard the Landsat-8 and Sentinel-2 Earth-observing satellites, respectively. The aim is to produce seamless products with normalized parameters, including atmospheric correction, cloud and cloud-shadow masking, geographic co-registration and common gridding, bidirectional reflectance distribution function (BRDF) normalization, and spectral band adjustment. The result is global observations of the Earth's surface every 2-3 days at 30 meter spatial resolution. One of the major applications that will benefit from HLS is agricultural assessment and monitoring, which serves as the use case for this tutorial.

NASA's Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes HLS products in the LP DAAC Cumulus cloud archive as Cloud Optimized GeoTIFFs (COGs). This tutorial demonstrates how to query and subset HLS data using the NASA Common Metadata Repository (CMR) SpatioTemporal Asset Catalog (STAC) application programming interface (API). Because these data are stored as COGs, you can load subsets of individual files into memory for just the bands you are interested in--a paradigm shift from the more common workflow of downloading a .zip/HDF file containing every band over the entire scene/tile. This tutorial also covers how to process HLS data (quality filtering and EVI calculation), visualize the results, "stack" the scenes over a region of interest into an xarray data array, calculate statistics for an EVI time series, and export the results as a comma-separated values (CSV) file--providing all of the information you need for your area of interest without having to download the source data files. The Enhanced Vegetation Index (EVI) is a vegetation index similar to NDVI that has been found to be more sensitive to ground cover below the vegetated canopy and saturates less over areas of dense green vegetation.

Disclaimer: This tutorial currently uses the PROVISIONAL Version 1.5 daily 30 meter (m) global Harmonized Landsat Sentinel-2 (HLS) Sentinel-2 Multi-spectral Instrument Surface Reflectance (HLSS30) data. Future iterations of this tutorial will also include the daily 30 meter (m) global Harmonized Landsat Sentinel-2 (HLS) Landsat-8 OLI Surface Reflectance (HLSL30) data.


Use Case Example:

This tutorial was developed using an example use case for crop monitoring over a single large farm field in northern California. The goal of the project is to observe HLS-derived mean EVI over a farm field in northern California without downloading the entirety of the HLS source data.

This tutorial will show how to use the CMR-STAC API to investigate the HLS collection available in the cloud; search for and subset to the specific time period, bands (layers), and region of interest for our use case; load subsets of the desired COGs into a Jupyter Notebook directly from the cloud; quality filter and calculate EVI; stack and visualize the time series; and export a CSV of statistics on the EVI of the single farm field.


Data Used in the Example:


Topics Covered:

  1. Getting Started
    1.1 Import Packages and Set up the Working Environment
  2. Navigating the CMR-STAC API
    2.1 Introduction to the CMR-STAC API
  3. CMR-STAC API: Searching for Items
    3.1 Spatial Querying via Bounding Box
    3.2 Temporal Querying
  4. Extracting HLS COGs from the Cloud
    4.1 Subset by Band
    4.2 Load a Spatially Subset HLS COG into Memory
  5. Processing HLS Data
    5.1 Apply Scale Factor and Calculate EVI
    5.2 Quality Filtering
    5.3 Export to COG
  6. Automation
  7. Stacking HLS Data
    7.1 Open COGs and Stack Using Xarray
    7.2 Visualize Stacked Time Series
    7.3 Export Statistics

Before Starting this Tutorial:

Setup and Dependencies

It is recommended to use Conda, an environment manager, to set up a compatible Python environment. Download Conda for your OS here: https://www.anaconda.com/download/. Once you have Conda installed, follow the instructions below to successfully set up a Python environment on Linux, MacOS, or Windows.

This Python Jupyter Notebook tutorial has been tested using Python version 3.7. Conda was used to create the Python environment.

If you do not have jupyter notebook installed, you may need to run:

conda install jupyter notebook

TIP: Having trouble activating your environment, or loading specific packages once you have activated your environment? Try the following:

Type: conda update conda or conda update --all

Still having trouble getting a compatible Python environment set up? Contact LP DAAC User Services at: https://lpdaac.usgs.gov/lpdaac-contact-us/

If you prefer not to install Conda, the same setup and dependencies can be achieved by using another package manager such as pip.


You will need to have a netrc file set up in your home directory in order to successfully run the code below. Check out the Setting up a netrc File section in the README.

Source Code used to Generate this Tutorial:

The repository containing all of the required files is located at: https://git.earthdata.nasa.gov/projects/LPDUR/repos/hls-tutorial/browse

NOTE: The data used in this tutorial is PROVISIONAL and should be used for educational purposes only. These data have not yet been validated for their science quality and should not be used in science research or applications.


1. Getting Started

1.1 Import Packages and Set up the Working Environment

Import the required packages and set the input/working directory to run this Jupyter Notebook locally.
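The exact import list is not shown here; a minimal sketch, assuming the packages referenced throughout this tutorial, might look like the following (your environment may not need all of them):

```python
# A sketch of the imports this notebook relies on, inferred from the packages
# referenced throughout the tutorial
import os
from datetime import datetime

import requests as r
import numpy as np
import pandas as pd
import geopandas as gp
import rasterio as rio
from rasterio.mask import mask
from rasterio.enums import Resampling
import xarray as xr
from skimage import io
import matplotlib.pyplot as plt
import geoviews as gv
import hvplot.xarray  # noqa: F401 (registers the .hvplot accessor on xarray objects)
import holoviews as hv
gv.extension('bokeh', 'matplotlib')

# Set the input/working directory (update this path if your files live elsewhere)
inDir = os.getcwd()
os.chdir(inDir)
```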


2. Navigating the CMR-STAC API

Learn about navigating NASA's Common Metadata Repository (CMR) SpatioTemporal Asset Catalog (STAC) API.

2.1 Introduction to the CMR-STAC API

What is STAC?

STAC is a specification that provides a common language for interpreting geospatial information in order to standardize indexing and discovering data.

Four STAC Specifications:

  1. STAC API
  2. STAC Catalog
  3. STAC Collection
  4. STAC Item
In the section below, we will walk through an example of each. For additional information, check out: https://stacspec.org/.

1. STAC API: Endpoint that enables the querying of STAC items.

Below, set the CMR-STAC API Endpoint to a variable, and use the requests package to send a GET request to the endpoint, and set the response to a variable.
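A minimal sketch of that request (the 'title' and 'stac_version' field names are assumptions about the CMR-STAC response structure):

```python
# Set the CMR-STAC API endpoint and send a GET request; the response is a
# STAC Catalog JSON document
stac = 'https://cmr.earthdata.nasa.gov/stac/'
stac_response = r.get(stac).json()
print(f"You are using the {stac_response['title']} API (STAC version {stac_response['stac_version']}).")
```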

You will notice above that the CMR-STAC API contains many different endpoints--not just for NASA's LP DAAC, but also for the other NASA ESDIS DAACs.

2. STAC Catalog: A JSON document of links that organizes and groups related STAC Collections and Items.

Below, search for LP DAAC Catalogs, and print the information contained in the Catalog that we will be using today, LPCLOUD.
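For example, something along these lines (the link fields, such as 'title', are assumptions about the CMR-STAC response structure):

```python
# Filter the root catalog's child links down to LP DAAC catalogs, then
# request the LPCLOUD catalog
stac_lp = [s for s in stac_response['links'] if 'LP' in s.get('title', '')]

lp_cloud = r.get([s for s in stac_lp if s['title'] == 'LPCLOUD'][0]['href']).json()
for l in lp_cloud['links']:
    print(l)
```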

3. STAC Collection: An extension of a STAC Catalog containing additional information that describes the STAC Items in that Collection.

Below, get a response from the LPCLOUD Collection and print the information included in the response.

Currently there is only one collection available, but more will be added in the future.

In CMR, Concept ID is used to query by a specific product, so be sure to save the Concept ID for the HLS S30 V1.5 product below:
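A sketch of listing the LPCLOUD collections and saving the HLSS30 collection (concept) ID; the '/collections' route follows the STAC API spec, and the response structure here is an assumption:

```python
# Request the collections in the LPCLOUD catalog and save the collection id
# for the HLS S30 v1.5 product
lp_collections = r.get(f"{stac}LPCLOUD/collections").json()['collections']
for c in lp_collections:
    print(c['id'])

s30_id = [c['id'] for c in lp_collections if 'HLSS30' in c['id']][0]
print(s30_id)
```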

Explore the attributes contained in the HLSS30 Collection.
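For example (a sketch, assuming the collection document follows the STAC Collection spec):

```python
# Inspect the HLSS30 collection document, including its spatial and temporal extent
s30_collection = r.get(f"{stac}LPCLOUD/collections/{s30_id}").json()
print(s30_collection['extent'])
```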

Here we can see that the extent is global, and we can also see the temporal range, where "None" means ongoing (i.e., up to present).

4. STAC Item: Represents data and metadata assets that are spatiotemporally coincident.

Below, query the collection for items and return the first item in the collection.
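A sketch of that query (the '/items' route follows the STAC API spec; the response structure is an assumption):

```python
# Request the items in the HLSS30 collection and grab the first one
s30_items_response = r.get(f"{stac}LPCLOUD/collections/{s30_id}/items").json()
s30_items = s30_items_response['features']
first_item = s30_items[0]
print(first_item['id'])
```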

STAC metadata provides valuable information on the item, including a unique ID, when it was acquired, the location of the observation, and a cloud cover assessment.

Below, print out the ten items and the percent cloud cover--we will use this to decide which item to visualize in the next section.
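For example ('eo:cloud_cover' is the STAC Electro-Optical extension property; its presence here is an assumption about the HLS item metadata):

```python
# Print the id and cloud cover of the ten items returned by default
for i, itm in enumerate(s30_items[:10]):
    print(i, itm['id'], f"-- {itm['properties']['eo:cloud_cover']}% cloud cover")
```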

Using the information printed above, set item_index below to the least cloudy observation.

Below, print out the names of all of the assets included in this item.

Notice that each HLS item includes a browse image. Read the browse file into memory and visualize the HLS acquisition.

Use the skimage package to load the browse image into memory and matplotlib to quickly visualize it.
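A sketch of those steps (index 0 is a hypothetical choice, and the 'browse' asset key is an assumption about the HLS item assets):

```python
# Pick the least cloudy observation from the printout above
item_index = 0
item = s30_items[item_index]
print(list(item['assets'].keys()))   # names of all assets in this item

# Read the browse image into memory with scikit-image and plot it
image = io.imread(item['assets']['browse']['href'])
plt.figure(figsize=(10, 10))
plt.imshow(image)
plt.axis('off')
plt.show()
```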

Congrats! You have pulled your first HLS asset from the cloud using STAC!

Below, we will remove variables that were set in sections 1-2 that we will no longer use for the duration of the tutorial. This is good practice in order to keep a clean environment and to save space in memory.
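For example (these names match the hypothetical variables used in the sketches above):

```python
# Clean up variables from sections 1-2 that are no longer needed
del stac_response, stac_lp, lp_cloud, lp_collections, s30_items_response, first_item, image
```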


3. CMR-STAC API: Searching for Items

In this section, instead of simply navigating through the structure of a STAC Catalog, use the search endpoint to query the API by region of interest and time period of interest.

3.1 Spatial Querying via Bounding Box

Grab the search endpoint for the LPCLOUD STAC Catalog and send a GET request to retrieve items.

If we just call the search endpoint directly, it will default to returning the first 10 granules. Below, set a limit to return up to the first 100 matching items. Additional information on the spec for adding parameters to a search query can be found at: https://github.com/radiantearth/stac-api-spec/blob/master/api-spec.md#filter-parameters-and-fields.
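A minimal sketch, assuming the LPCLOUD search endpoint accepts a POSTed JSON body per the STAC API spec:

```python
# Grab the LPCLOUD search endpoint and bump the page size from 10 to 100 items
search_endpoint = f"{stac}LPCLOUD/search"
params = {'limit': 100}
search_response = r.post(search_endpoint, json=params).json()
print(f"{len(search_response['features'])} items returned")
```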

Next, load in the spatial region of interest for our use case using geopandas. You will need to have downloaded the Field_Boundary.geojson from the repo, and it must be stored in the current working directory in order to continue.
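```python
# Read the field boundary; the file must be in the current working directory
field = gp.read_file('Field_Boundary.geojson')
fieldShape = field['geometry'][0]   # shapely polygon of the farm field
fieldShape
```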

Note: If the cell above fails, make sure that you have downloaded the Field_Boundary.geojson file from the HLS-Tutorial Repository and that it is saved in the same directory that you are running the tutorial from. If you are still encountering issues, you can provide the full path to the file (e.g., field = gp.read_file('C:/Username/HLS-Tutorial/Field_Boundary.geojson')) and try again.

Plot the geometry of the farm field boundaries.

Below, combine a plot of the farm field boundary (combine two geoviews plots using *) with a basemap layer.
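A sketch of both plots combined (the styling options are illustrative choices):

```python
# Overlay the field boundary (yellow outline) on an ESRI imagery basemap;
# the * operator combines the two geoviews layers
base = gv.tile_sources.EsriImagery.opts(width=650, height=500)
farmField = gv.Polygons([fieldShape]).opts(line_color='yellow', fill_color=None, line_width=3)
base * farmField
```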

The farm field used in this example use case is located northwest of Chico, CA.

Now, add the bounding box of the region of interest to the CMR-STAC API search query using the bounding box (bbox) parameter.
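For example (the STAC search spec names this parameter 'bbox' and expects [xmin, ymin, xmax, ymax] in lon/lat):

```python
# Add the ROI's bounding box to the query and re-run the search
params['bbox'] = [float(v) for v in field.total_bounds]
search_response = r.post(search_endpoint, json=params).json()
print(f"{len(search_response['features'])} items intersect the ROI")
```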

3.2 Temporal Querying

Finally, you can narrow your search to a specific time period of interest using the datetime parameter. Here we have set the time period of interest to all of 2020. Additional information on setting temporal searches can be found in the NASA CMR Documentation.
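A sketch using an RFC 3339 start/end interval:

```python
# Restrict the search to all of 2020 with the datetime parameter
params['datetime'] = '2020-01-01T00:00:00Z/2020-12-31T23:59:59Z'
search_response = r.post(search_endpoint, json=params).json()
s30_items = search_response['features']
print(f"{len(s30_items)} items match the spatiotemporal query")
```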

Currently only the PROVISIONAL daily 30 meter (m) global Harmonized Landsat Sentinel-2 (HLS) Sentinel-2 Multi-spectral Instrument Surface Reflectance (HLSS30) Version 1.5 data are publicly available. Once the HLS Operational Land Imager Surface Reflectance and TOA Brightness (HLSL30) product is provisionally released, this tutorial will be updated to show how to combine observations from both products into a time series. To be the first to know once those data have been made available, and once the tutorial has been updated, please sign up for the LP DAAC listserv.


4. Extracting HLS COGs from the Cloud

In this section, configure GDAL and rasterio to use /vsicurl/ to access the cloud assets that we are interested in, and read them directly into memory without needing to download the files.
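A typical configuration sketch; these are real GDAL configuration options, and the cookie file persists the Earthdata Login redirect between requests:

```python
# GDAL configuration for authenticated, efficient /vsicurl/ access
os.environ['GDAL_HTTP_COOKIEFILE'] = os.path.join(inDir, 'cookies.txt')
os.environ['GDAL_HTTP_COOKIEJAR'] = os.path.join(inDir, 'cookies.txt')
os.environ['GDAL_DISABLE_READDIR_ON_OPEN'] = 'EMPTY_DIR'
os.environ['CPL_VSIL_CURL_ALLOWED_EXTENSIONS'] = 'TIF'
```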

4.1 Subset by Band

View the contents of an S30 item.

Sentinel-2:

- "narrow" NIR = B8A
- Red = B04
- Blue = B02  
- Quality = Fmask
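Using the band names above, grab the asset URLs needed to compute and quality filter EVI (the asset keys are assumptions based on the HLS S30 layer names):

```python
# Map band names to asset URLs for one example item
s30_bands = ['B8A', 'B04', 'B02', 'Fmask']   # NIR, Red, Blue, Quality
item = s30_items[0]                          # hypothetical example item
evi_band_links = [item['assets'][b]['href'] for b in s30_bands]
for e in evi_band_links:
    print(e)
```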

Remember from above that you can always quickly load in the browse image to get a quick view of the item.

Above, we see a nice cloud-free observation over the northern Central Valley of California.

4.2 Load a Spatially Subset HLS COG into Memory

Before loading the COGs into memory, run the cell below to check and make sure that you have a netrc file set up with your NASA Earthdata Login credentials, which will be needed to access the HLS files in the cells that follow. If you do not have a netrc file set up on your OS, the cell below should prompt you for your NASA Earthdata Login username and password.
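A minimal sketch of that check (on Windows the file is often named _netrc; adapt as needed):

```python
# Check for (or create) a netrc file with NASA Earthdata Login credentials
from netrc import netrc
from getpass import getpass
from pathlib import Path

urs = 'urs.earthdata.nasa.gov'
netrc_path = Path.home() / '.netrc'
try:
    netrc(netrc_path).authenticators(urs)[0]   # raises if no entry exists
except (FileNotFoundError, TypeError):
    with open(netrc_path, 'a') as f:
        f.write(f"machine {urs} "
                f"login {input('NASA Earthdata Login username: ')} "
                f"password {getpass('NASA Earthdata Login password: ')}\n")
```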

Read the file using rasterio.
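```python
# Open the NIR band COG directly from the cloud; only metadata and requested
# windows are fetched, not the whole file
nir = rio.open(evi_band_links[0])
print(nir.meta)
```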

Getting an error in the section above? Accessing these files in the cloud requires you to authenticate using your NASA Earthdata Login account. You will need to have a netrc file set up containing those credentials in your home directory in order to successfully run the code below. Check out the Setting up a netrc File section in the README.

Below, take the farm field shapely polygon and convert it from lat/lon (EPSG:4326) into the native projection of HLS, UTM (aligned to the Military Grid Reference System). This must be done in order to use the region of interest (ROI) to subset the COG that is being pulled into memory--the ROI must be in the native projection of the data being extracted.
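For example:

```python
# Reproject the field boundary from EPSG:4326 to the UTM CRS of the HLS tile
fieldShape_utm = field.to_crs(nir.crs.to_string())['geometry'][0]
```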

Now, we can use the ROI to mask any pixels that fall outside of it and crop to the bounding box using rasterio. This greatly reduces the amount of data that needs to be loaded into memory.
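```python
# Mask pixels outside the polygon and crop to its bounding box; only the
# windowed portion of the COG is read over the network
nir_array, nir_transform = mask(nir, [fieldShape_utm], crop=True)
print(nir_array.shape)
```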

Above, you can see that the data have been loaded into memory already subset to our ROI. Next, load in the red and blue bands.
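```python
# Repeat for the red and blue bands
red = rio.open(evi_band_links[1])
red_array, _ = mask(red, [fieldShape_utm], crop=True)
blue = rio.open(evi_band_links[2])
blue_array, _ = mask(blue, [fieldShape_utm], crop=True)
```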


5. Processing HLS Data

In this section, read the file metadata to retrieve and apply the scale factor, filter out nodata values, define a function to calculate EVI, and execute the EVI function on the data loaded into memory. After that, perform quality filtering to screen out any poor quality observations.

5.1 Apply Scale Factor and Calculate EVI

Read the file metadata.

Apply the scale factor to each array, and set the no-data values to nan.
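A sketch of both steps (the 'scale_factor' tag name is an assumption about the HLS file metadata; scale_band is a hypothetical helper):

```python
# Read the scale factor and fill value from the file metadata, apply the
# scale, and set fill values to NaN
scaleFactor = float(nir.tags()['scale_factor'])   # 0.0001 for HLS reflectance
fillValue = nir.nodata                            # -9999 for HLS reflectance

def scale_band(arr):
    band = arr[0].astype(float)        # drop the band axis, cast to float
    band[band == fillValue] = np.nan   # screen out fill values
    return band * scaleFactor

nir_scaled = scale_band(nir_array)
red_scaled = scale_band(red_array)
blue_scaled = scale_band(blue_array)
```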

Define a function for calculating EVI from the NIR, Red, and Blue bands.

Below, apply the EVI function on the scaled data.
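The standard EVI formulation uses the coefficients G=2.5, C1=6, C2=7.5, and L=1:

```python
# EVI = G * (NIR - Red) / (NIR + C1 * Red - C2 * Blue + L)
def evi(nir, red, blue):
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

evi_array = evi(nir_scaled, red_scaled, blue_scaled)
```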

Go back to the STAC Item and grab the observation date to add to the plot below.

Next, plot the results using matplotlib.
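For example (the 'datetime' property name follows the STAC Item spec):

```python
# Grab the acquisition date from the STAC item properties and plot the result
evi_date = item['properties']['datetime'].split('T')[0]

plt.figure(figsize=(9, 7))
plt.imshow(evi_array, cmap='YlGn', vmin=0, vmax=1)
plt.colorbar(label='EVI')
plt.title(f"HLS S30-derived EVI, {evi_date}")
plt.show()
```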

Above, notice that our farm field boundary actually comprises two separate fields, one being much greener than the other.

5.2 Quality Filtering

In this section, load in the Fmask quality layer, breakdown the bit-encoded information, define good quality observations, and apply a mask to the EVI layer using pixels with good quality as defined by the Fmask.

[Figure: Fmask quality table listing the bit numbers, bitword lengths, and descriptions for the HLS quality layer]

The quality table above can be found in section 6.5 of the HLS V1.5 User Guide.

Looking at the table above, using the left column, you can see there are 7 bitwords, or bit groupings, for this quality layer. Below, set the number of bitwords and the number of bits per bitword.
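A sketch (the bit ordering, six 1-bit flags plus a 2-bit aerosol level, is an assumption based on the HLS V1.5 User Guide):

```python
# Bits per bitword, ordered from the least significant bit: cirrus, cloud,
# adjacent cloud/shadow, cloud shadow, snow/ice, water, aerosol level
bitword_order = [1, 1, 1, 1, 1, 1, 2]
num_bitwords = len(bitword_order)   # 7 bitwords
total_bits = sum(bitword_order)     # 8 bits in the Fmask byte
```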

Next, list all of the unique values present in the quality layer, and convert from decimal to binary form. The section below will loop through and convert each quality value present in the list to its binary form, and use the bitword order and information from the quality table above to determine if a value is deemed "good quality".

For this exercise, we will define the following as good quality:

- Cloud == 0 (No Cloud)  
- Cloud shadow == 0 (No Cloud shadow)  
- Snow/ice == 0 (No Snow/ice present)  
- Water == 0 (No Water present)
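A sketch of that loop (the bit positions for cloud, cloud shadow, snow/ice, and water are assumptions based on the quality table above):

```python
# Open and crop the Fmask layer, then keep only the values whose cloud (bit 1),
# cloud shadow (bit 3), snow/ice (bit 4), and water (bit 5) bits are all 0
fmask = rio.open(evi_band_links[3])
fmask_array, _ = mask(fmask, [fieldShape_utm], crop=True)

goodQuality = []
for v in np.unique(fmask_array):
    bits = np.binary_repr(int(v), width=8)[::-1]   # LSB-first bit string
    if bits[1] == '0' and bits[3] == '0' and bits[4] == '0' and bits[5] == '0':
        goodQuality.append(v)
print(goodQuality)
```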

So if we look above, a quality value of 0 decodes to:

Cloud = No
Cloud Shadow = No
Snow/Ice = No
Water = No

meaning that any pixel with a quality value of 0 is good quality.

Above, only pixels with an Fmask value of 0 will be included after masking.

Below, mask the EVI array using the quality mask. Notice that invert is set to "True" here, meaning that you do not want to exclude the good quality value of 0, but rather exclude all other values.
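For example:

```python
# Mask the EVI array to good-quality pixels only; invert=True flags every
# pixel whose Fmask value is NOT in the good-quality list
evi_quality = np.ma.MaskedArray(
    evi_array,
    np.in1d(fmask_array[0], goodQuality, invert=True).reshape(evi_array.shape))
```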

5.3 Export to COG

In this section, create an output filename and export the quality filtered EVI to COG.

The standard format for HLS S30 V1.5 filenames is as follows:

HLS.S30: Product Short Name
T10TEK: MGRS Tile ID (T+5-digits)
2020190T184919: Julian Date and Time of Acquisition (YYYYDDDTHHMMSS)
v1.5: Product Version
B8A: Spectral Band
.tif: Data Format (Cloud Optimized GeoTIFF)

For additional information on HLS naming conventions, be sure to check out the HLS Overview Page.
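Following those conventions, an output name can be derived from the source COG's URL; a small sketch:

```python
# Swap the band name for 'EVI' in the source file name
original_name = evi_band_links[0].rsplit('/', 1)[-1]   # e.g. HLS.S30...B8A.tif
out_name = original_name.replace('B8A', 'EVI')
print(out_name)
```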

Writing a valid COG must be done in a specific order. Below, add the overviews to the output GeoTIFF first, and then add tiling and compression.

Now that the overviews have been created, re-open the file, copy the file attributes, add tiling and LZW compression, and write back to the output filename.
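One common recipe for this, sketched below (newer GDAL/rasterio versions can also write a COG directly via the COG driver):

```python
# Write the array, build overviews, then copy with tiling and LZW compression
from rasterio.shutil import copy as rio_copy

out_meta = nir.meta.copy()
out_meta.update(dtype='float64', nodata=np.nan, transform=nir_transform,
                width=evi_quality.shape[1], height=evi_quality.shape[0])

tmp_name = f"tmp_{out_name}"
with rio.open(tmp_name, 'w', **out_meta) as dst:
    dst.write(evi_quality.filled(np.nan), 1)

with rio.open(tmp_name, 'r+') as dst:
    dst.build_overviews([2, 4, 8], Resampling.average)   # overviews first

rio_copy(tmp_name, out_name, copy_src_overviews=True, driver='GTiff',
         tiled=True, blockxsize=256, blockysize=256, compress='LZW')
os.remove(tmp_name)
```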


6. Automation

In this section, automate sections 4-5 for each HLS item that intersects our spatiotemporal subset of interest. Loop through each item and subset to the desired bands, load the spatial subset into memory, apply the scale factor, calculate EVI, quality filter, and export as a cloud optimized GeoTIFF.

Note: Be patient with the for loop below, it may take a few minutes to complete.
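A condensed sketch of sections 4-5 applied to every matching item, reusing the hypothetical helpers defined above (scale_band, evi, fieldShape_utm):

```python
for j, itm in enumerate(s30_items):
    try:
        links = [itm['assets'][b]['href'] for b in s30_bands]
        arrays = {}
        for b, link in zip(s30_bands, links):
            with rio.open(link) as src:
                arrays[b], _ = mask(src, [fieldShape_utm], crop=True)
        # Scale, compute EVI, and quality filter as in section 5
        e = evi(scale_band(arrays['B8A']), scale_band(arrays['B04']),
                scale_band(arrays['B02']))
        good = [v for v in np.unique(arrays['Fmask'])
                if all(np.binary_repr(int(v), width=8)[::-1][k] == '0'
                       for k in (1, 3, 4, 5))]
        e = np.ma.MaskedArray(
            e, np.in1d(arrays['Fmask'][0], good, invert=True).reshape(e.shape))
        # ...export e to COG exactly as in section 5.3...
        print(f"Processed {j + 1} of {len(s30_items)}")
    except Exception as ex:
        print(f"Skipping {itm['id']}: {ex}")
```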

Now there should be multiple COGs exported to your working directory, which will be used in Section 7 to stack into a time series.


7. Stacking HLS Data

In this section, open multiple HLS S30-derived EVI COGs and stack them into an xarray data array.

7.1 Open COGs and Stack Using Xarray

Xarray extends and combines much of the core functionality from both the Pandas library and NumPy, making it very good at handling multi-dimensional (N-dimensional) datasets that contain labels (e.g., variable names or dimension names).

List the files in the current working directory.

Loop through and open each file, and add to an xarray data array.
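A sketch of both steps, parsing the Julian date from the HLS-style filename and building pixel-center coordinates from the affine transform:

```python
# Open each exported EVI COG and concatenate along a new 'time' dimension
evi_files = sorted(f for f in os.listdir() if f.endswith('EVI.tif'))
print(evi_files)

das = []
for f in evi_files:
    date = datetime.strptime(f.split('.')[3].split('T')[0], '%Y%j')  # YYYYDDD
    with rio.open(f) as src:
        arr = src.read(1)
        x = src.transform[2] + (np.arange(src.width) + 0.5) * src.transform[0]
        y = src.transform[5] + (np.arange(src.height) + 0.5) * src.transform[4]
    das.append(xr.DataArray(arr, dims=('y', 'x'),
                            coords={'y': y, 'x': x, 'time': date}, name='EVI'))
EVI = xr.concat(das, dim='time')
EVI
```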

Xarray has two fundamental data structures. A Dataset holds multiple variables that potentially share the same coordinates and global metadata for the file (see above). A DataArray contains a single multi-dimensional variable and its coordinates, attributes, and metadata. Data values can be pulled out of the DataArray as a numpy.ndarray using the values attribute.

Above, notice we have created a data array called EVI that has lat (y) and lon (x) coordinates, as well as the z dimension, which is time. This allows us to plot and visualize the HLS S30-derived EVI data as a time series sequentially by time and in geographic space using the lat/lon coordinates.

Looking to learn more specifically about using xarray for time series analysis? Check out the Using the AρρEEARS API in a Landsat ARD Workflow - Getting Started Jupyter Notebook tutorial.

7.2 Visualize Stacked Time Series

Below, use the hvPlot and holoviews packages to create an interactive time series plot of the HLS S30-derived EVI data.

Next, add a basemap layer to provide better context of the areas surrounding our region of interest.
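A sketch of both plots; supplying the UTM CRS lets geoviews overlay the basemap (cartopy and the zone number for this example tile are assumptions):

```python
# Interactive time series map with a time slider, over an ESRI imagery basemap
import cartopy.crs as ccrs
evi_map = EVI.hvplot(x='x', y='y', groupby='time', cmap='YlGn', clim=(0, 1),
                     crs=ccrs.UTM(10), width=600, height=500)
gv.tile_sources.EsriImagery * evi_map
```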

Looking at the time series above, this farm field is likely a field of walnut trees. Notice the higher EVI (greens) during the summer and browning that is likely occurring from dry conditions as the time series progresses later into autumn.

Since the data are in an xarray data array, we can intuitively slice or reduce the dataset. Let's select a single time slice from the EVI variable.

You can use slicing to plot data only for a specific observation, for example.
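For example:

```python
# Slice a single observation out of the stack by integer index and plot it
evi_slice = EVI.isel(time=0)
evi_slice.hvplot.image(x='x', y='y', cmap='YlGn', clim=(0, 1))
```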

Now, plot the entire time series as boxplots showing the distribution of EVI values for our farm field.
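```python
# Boxplots of the EVI distribution over the field, one box per observation
EVI.hvplot.box(by='time', rot=90, width=800, height=450)
```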

Again, the statistics appear to support the conclusion that greener conditions prevailed in the summer of 2020 vs. later in autumn.

7.3 Export Statistics

Next, calculate statistics for each observation and export to CSV.
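```python
# Collapse the spatial dimensions to one statistic per observation
# (xarray skips NaNs by default)
evi_min = EVI.min(dim=['y', 'x'])
evi_max = EVI.max(dim=['y', 'x'])
evi_mean = EVI.mean(dim=['y', 'x'])
evi_std = EVI.std(dim=['y', 'x'])
```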

We now have the mean and standard deviation for each time slice as well as the maximum and minimum values. Let's do some plotting! We will use the hvPlot package to create simple but interactive charts/plots. Hover your cursor over the visualization to see the data values.
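```python
# Interactive line chart of mean EVI through time; hover to inspect values
evi_mean.hvplot.line(x='time', ylabel='Mean EVI', width=700, height=400)
```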

Remember that these graphs are also interactive--hover over the line to see the value for a given date.

Finally, create a pandas dataframe with the statistics, and export to a CSV file.
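```python
# Assemble the statistics into a DataFrame indexed by date and export to CSV
stats_df = pd.DataFrame({'Min EVI': evi_min.values, 'Max EVI': evi_max.values,
                         'Mean EVI': evi_mean.values, 'Std EVI': evi_std.values},
                        index=pd.Index(EVI['time'].values, name='Date'))
stats_df.to_csv('HLS-S30_EVI_Stats.csv')
stats_df
```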

Success! You have now not only learned how to get started with HLS V1.5 data, but have also learned how to navigate cloud-native data using STAC, how to access subsets of COGs, and how to write COGs for your own outputs. Using this Jupyter Notebook as a workflow, you should now be able to switch to your specific region of interest and re-run the notebook. Good luck!

Contact Information

Material written by Cole Krehbiel¹

    Contact: LPDAAC@usgs.gov
    Voice: +1-605-594-6116
    Organization: Land Processes Distributed Active Archive Center (LP DAAC)
    Website: https://lpdaac.usgs.gov/
    Date last modified: 11-20-2020
¹KBR Inc., contractor to the U.S. Geological Survey, Earth Resources Observation and Science (EROS) Center, Sioux Falls, South Dakota, 57198-001, USA. Work performed under USGS contract G15PD00467 for LP DAAC². ²LP DAAC work performed under NASA contract NNG14HH33I.