Introduction

This is an intermediate-level course that covers tools and techniques for working with climate and earth observation datasets using a modern cloud-native approach. With its growing ecosystem of powerful packages like XArray and Dask, Python provides a powerful toolkit for working with large spatio-temporal raster datasets. This workshop will show you how to effectively process large volumes of earth observation data using cloud-based datasets.

View Presentation

View the Presentation ↗

Notebooks and Datasets

This workshop uses Google Colab for all exercises. You do not need to install any packages or download any datasets.

The notebooks can be accessed by clicking on the Open In Colab buttons at the beginning of each section. Once you have opened the notebook in Colab, you can copy it to your own account by going to File → Save a Copy in Drive.

Once the notebooks are saved to your drive, you will be able to modify the code and save the updated copy. You can also click the Share button and share a link to the notebook with others.

Hello Colab

Open In Colab

Google Colab is a hosted Jupyter notebook environment that allows anyone to run Python code via a web browser. It provides free computation and data storage that can be utilized by your Python code.

You can click the +Code button to create a new cell and enter a block of code. To run the code, click the run button next to the cell, or press Shift+Enter.

print('Hello World')

Package Management

Colab comes pre-installed with many Python packages. You can use a package by simply importing it.

import geopandas as gpd
import pandas as pd

Each Colab notebook instance runs on an Ubuntu Linux machine in the cloud. If you want to install additional packages, you can run a shell command by prefixing it with a !. For example, you can install third-party packages via pip using the !pip command.

Tip: If you want to list all pre-installed packages and their versions in your Colab environment, you can run !pip list -v.
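
For example, run the following cell to see the installed packages and their versions:

!pip list -v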

!pip install rioxarray
import rioxarray

Data Management

Colab provides 100GB of disk space along with your notebook. This can be used to store your data, intermediate outputs and results.

The code below will create 2 folders named ‘data’ and ‘output’ in your local filesystem.

import os

data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

We can download some data from the internet and store it in the Colab environment. Below is a helper function to download a file from a URL.

import requests

def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    # Skip the download if the file already exists
    if not os.path.exists(filename):
        # Stream the response and write it to disk in chunks
        with requests.get(url, stream=True, allow_redirects=True) as r:
            with open(filename, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print('Downloaded', filename)

Let’s download the Populated Places dataset from Natural Earth.

download('https://naciscdn.org/naturalearth/10m/cultural/' +
         'ne_10m_populated_places_simple.zip')

The file is now in our local filesystem. We can construct the path to the data folder and read it using GeoPandas.

file = 'ne_10m_populated_places_simple.zip'
filepath = os.path.join(data_folder, file)
places = gpd.read_file(filepath)

Let’s do some data processing and write the results to a new file. The code below will filter all places which are also country capitals.

capitals = places[places['adm0cap'] == 1]
capitals

We can write the results to the disk as a GeoPackage file.

output_file = 'capitals.gpkg'
output_path = os.path.join(output_folder, output_file)
capitals.to_file(driver='GPKG', filename=output_path)

You can open the Files tab from the left-hand panel in Colab and browse to the output folder. Locate the capitals.gpkg file, click the menu (⋮) next to it, and select Download to save the file to your computer.
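
You can also trigger the download from code. Below is a minimal sketch using the files helper from the google.colab package (it only works while the notebook is running in Colab):

from google.colab import files
# Prompt the browser to download the file we just wrote
files.download(output_path)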

Rather than saving it to the temporary machine where Colab is running, we can save it to our own Google Drive. This will ensure the file remains available to us even after exiting Google Colab.

Run the following cell to authenticate and mount the Google Drive.

if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
drive_folder_root = 'MyDrive'
output_folder = 'data'
drive_folder_path = os.path.join(
    '/content/drive', drive_folder_root, output_folder)

# Check if Google Drive is mounted
if not os.path.exists('/content/drive'):
    print("Google Drive is not mounted. Please run the cell above to mount your drive.")
else:
    if not os.path.exists(drive_folder_path):
        os.makedirs(drive_folder_path)
output_path = os.path.join(drive_folder_path, output_file)
capitals.to_file(driver='GPKG', filename=output_path)

Module 1: Cloud Native Geospatial Fundamentals

1.1 XArray Basics

Open In Colab

Overview

XArray has emerged as one of the key Python libraries for working with gridded raster datasets. It can natively handle time-series data, making it ideal for working with remote sensing datasets. It builds on NumPy and Pandas, providing fast array operations and label-based indexing for multi-dimensional data. It has a growing ecosystem of extensions such as rioxarray, xarray-spatial and XEE, allowing it to be used for geospatial analysis. XArray offers the flexibility to work seamlessly with local datasets as well as cloud-hosted datasets in a variety of optimized data formats.

In this section, we will learn about XArray basics and learn how to work with a time-series of Sentinel-2 satellite imagery to create and visualize a median composite image.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask['distributed'] botocore
import matplotlib.pyplot as plt
import os
import pystac_client
import rioxarray as rxr
import xarray as xr
from odc.stac import configure_s3_access, stac_load
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Get Satellite Imagery

We define a location and time of interest to get some satellite imagery.

latitude = 27.163
longitude = 82.608
year = 2023

Let’s use the Element84 search endpoint to look for items from the sentinel-2-c1-l2a collection on AWS.

catalog = pystac_client.Client.open(
    'https://earth-search.aws.element84.com/v1')

# Configure settings for reading from Earth Search STAC
configure_s3_access(
    aws_unsigned=True,
)

# Define a small bounding box around the chosen point
km2deg = 1.0 / 111
x, y = (longitude, latitude)
r = 1 * km2deg  # radius in degrees
bbox = (x - r, y - r, x + r, y + r)

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={'eo:cloud_cover': {'lt': 30}},
)
items = search.item_collection()

Load the matching images as an XArray Dataset.

ds = stac_load(
    items,
    bands=['red', 'green', 'blue', 'nir'],
    resolution=10,
    crs='utm',
    bbox=bbox,
    chunks={},  # <-- use Dask
    groupby='solar_day',
)
ds
%%time
ds = ds.compute()

XArray Terminology

We now have a xarray.Dataset object. Let’s understand what is contained in a Dataset.

  • Variables: This is similar to a band in a raster dataset. Each variable contains an array of values.
  • Dimensions: This is similar to the number of array axes. Each dimension has a name, such as time, x or y.
  • Coordinates: These are the labels for values in each dimension.
  • Attributes: This is the metadata associated with the dataset.
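
A quick way to inspect these components is to print them. The snippet below is a minimal sketch using standard xarray attributes on the Dataset we loaded:

print(ds.dims)    # dimension names and their sizes
print(ds.coords)  # coordinate labels for each dimension
print(ds.attrs)   # metadata attributes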

A Dataset consists of one or more xarray.DataArray objects. The DataArray is the main object, holding a single variable along with its dimension names, coordinates and attributes. You can access each variable using the dataset.variable_name syntax.

da = ds.red
da

Selecting Data

XArray provides a very powerful way to select subsets of data, using a framework similar to Pandas. Analogous to Pandas’ loc and iloc methods, XArray provides sel and isel methods. Since DataArray dimensions have names, these methods allow you to specify which dimension to query.

Let’s select the image for the last time step. Since we know the index (-1) of the data, we can use the isel() method.

da.isel(time=-1)

You can call .values on a DataArray to get an array of the values.

da.isel(time=-1).values

You can query for values using multiple dimensions.

da.isel(time=-1, x=-1, y=-1).values

We can also query using the coordinate values via the sel() method.

Let’s first see the values of the time coordinate.

dates = da.time.values
dates

We can query using one of these coordinate values.

da.sel(time='2023-12-16')

The sel() method also supports nearest neighbor lookups. This is useful when you do not know the exact label of the dimension, but want to find the closest one.

Tip: You can use interp() instead of sel() to interpolate the value instead of picking the closest one.

da.sel(time='2023-01-01', method='nearest')
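
For example, a minimal sketch of an interpolated lookup at an arbitrary date (the timestamp here is just an illustration; by default the value is linearly interpolated between the two neighbouring observations):

da.interp(time='2023-06-15')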

The sel() method also allows specifying a range of values using Python’s built-in slice() function. The code below will select all observations during January 2023.

da.sel(time=slice('2023-01-01', '2023-01-31'))

Aggregating Data

A very powerful feature of XArray is the ability to easily aggregate data across dimensions - making it ideal for many remote sensing analyses. Let’s create a median composite from all the individual images.

We apply the .median() aggregation across the time dimension.

median = ds.median(dim='time')
median

Visualizing Data

XArray provides a plot.imshow() method based on Matplotlib to plot DataArrays.

Reference : xarray.plot.imshow

To visualize our Dataset, we first convert it to a DataArray using the to_array() method. All the variables will be converted to a new dimension. Since our variables are image bands, we name the new dimension band.

median_da = median.to_array('band')
median_da

An easy way to visualize the data without the outliers is to pass the parameter robust=True. This will use the 2nd and 98th percentiles of the data to compute the color limits. We also call set_aspect('equal') to ensure the original aspect ratio is maintained and the image is not stretched.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
median_da.sel(band=['red', 'green', 'blue']).plot.imshow(
    ax=ax,
    robust=True)
ax.set_title('RGB Visualization')
ax.set_axis_off()
ax.set_aspect('equal')
plt.show()

Exercise

Display the median composite for the month of May.

The snippet below takes our time-series and aggregates it into monthly median composites using the groupby() method.

monthly = ds.groupby('time.month').median(dim='time')
monthly

You now have a new dimension named month. Start your exercise by first converting the Dataset to a DataArray. Then extract the data for the chosen month using the sel() method and plot it.

1.2 STAC and Dask Basics

Open In Colab

Overview

In this section, we will learn the basics of querying cloud-hosted data via STAC and leverage parallel computing via Dask.

We will learn how to query a catalog of Sentinel-2 images to find the least-cloudy scene over a chosen area, visualize it and download it as a GeoTIFF file.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask['distributed'] jupyter-server-proxy
import matplotlib.pyplot as plt
import os
import pandas as pd
import pystac_client
import rioxarray as rxr
import xarray as xr
from odc import stac

Dask

Dask is a Python library for running your computations in parallel across many cores or machines. Dask has built-in support for key geospatial packages like XArray and Pandas, allowing you to scale your computation easily. You can choose to run your code in parallel on your laptop, on a machine in the cloud, or on a local or cloud-hosted cluster of machines.

from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))

Spatio Temporal Asset Catalog (STAC)

Spatio Temporal Asset Catalog (STAC) is an open standard for specifying and querying geospatial data. Data providers can share catalogs of satellite imagery, climate datasets, LiDAR data, vector data etc. and specify asset metadata according to the STAC specification. All STAC catalogs can be queried to find matching assets by time, location or metadata.

You can browse all available catalogs at https://stacindex.org/

Let’s use the Earth Search by Element 84 STAC API Catalog to look for items from the sentinel-2-c1-l2a collection on AWS.

catalog = pystac_client.Client.open(
    'https://earth-search.aws.element84.com/v1')

We define a location and time of interest to get some satellite imagery.

latitude = 27.163
longitude = 82.608
year = 2023
# Define a small bounding box around the chosen point
km2deg = 1.0 / 111
x, y = (longitude, latitude)
r = 1 * km2deg  # radius in degrees
bbox = (x - r, y - r, x + r, y + r)

Search the catalog for matching items.

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}'
)
items = search.item_collection()
items

We can apply some additional metadata filters to look for images with less cloud cover and granules with fewer nodata pixels.

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={'eo:cloud_cover': {'lt': 30}, 's2:nodata_pixel_percentage': {'lt': 10}}
)
items = search.item_collection()
items

We can also sort the results by some metadata. Here we sort by cloud cover.

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={'eo:cloud_cover': {'lt': 30}, 's2:nodata_pixel_percentage': {'lt': 10}},
    sortby=[{'field': 'properties.eo:cloud_cover', 'direction': 'asc'}]

)
items = search.item_collection()
items

Load STAC Images to XArray

Load the matching images as an XArray Dataset.

ds = stac.load(
    items,
    bands=['red', 'green', 'blue', 'nir'],
    resolution=10,
    crs='utm',
    chunks={},  # <-- use Dask
    groupby='solar_day',
    preserve_original_order=True
)
ds

Use the xarray.Dataset.nbytes property to check the size of the loaded dataset.

print(f'DataSet size: {ds.nbytes/1e6:.2f} MB.')

Select a Single Scene

Let’s work with a single scene for now. We will use the first item from our search (the least cloudy scene). When the items are loaded as an XArray Dataset, the time dimension is sorted, so we get the timestamp of the least cloudy scene and select it from the dataset.

timestamp = pd.to_datetime(items[0].properties['datetime']).tz_convert(None)
scene = ds.sel(time=timestamp)
scene
print(f'Scene size: {scene.nbytes/1e6:.2f} MB.')

This scene is small enough to fit into RAM, so let’s call compute() to load it into memory. Dask will query the cloud-hosted dataset to fetch the required pixels. As we set up a Dask LocalCluster, the processing will be parallelized across all available cores of the machine. Once you run the cell, look at the Dask Diagnostic Dashboard to see the data processing in action.

%%time
scene = scene.compute()
scene

The Sentinel-2 scenes come with a NoData value of 0, so we mask these pixels before further processing.

scene = scene.where(scene != 0)
scene

Each band of the scene is saved with integer pixel values (data type uint16). This helps save storage costs, as storing the reflectance values as floating point numbers (data type float64) would require more storage. We need to convert the raw pixel values to reflectances by applying the scale and offset values. The Earth Search STAC API does not apply the scale/offset automatically to Sentinel-2 scenes; they are supplied in the raster:bands metadata for each band. The scale and offset for Sentinel-2 scenes captured after Jan 25, 2022 are 0.0001 and -0.1, respectively.

scale = 0.0001
offset = -0.1
scene = scene*scale + offset

Visualize the Scene

To visualize our Dataset, we first convert it to a DataArray using the to_array() method. All the variables will be converted to a new dimension. Since our variables are image bands, we name the new dimension band.

scene_da = scene.to_array('band')
scene_da

We can create a low-resolution preview by resampling the DataArray from its native resolution. The raster metadata is stored in the rio accessor. This is enabled by the rioxarray library which provides geospatial functions on top of xarray.

print('CRS:', scene_da.rio.crs)
print('Resolution:', scene_da.rio.resolution())

This is a fairly large scene with a lot of pixels. For visualizing, we resample it to a lower-resolution preview. When plotting the image, the robust=True option applies a 2nd-98th percentile stretch to find appropriate min/max values for visualization.

preview = scene_da.rio.reproject(
    scene_da.rio.crs, resolution=300
)

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
preview.sel(band=['red', 'green', 'blue']).plot.imshow(
    ax=ax,
    robust=True)
ax.set_title('RGB Visualization')
ax.set_axis_off()
ax.set_aspect('equal')
plt.show()

Exercise

The items variable contains a list of STAC Items returned by the query. The code below iterates through each item and prints the metadata stored in its properties. Extract the Sentinel-2 Product ID stored in the s2:product_uri property and print a list of all image IDs returned by the query.

for item in items:
  print(item.properties)

1.3 DuckDB Basics

Open In Colab

Overview

In this section, we will explore cloud-native vector datasets and use DuckDB to query and load data directly from the cloud.

Overview of the Task

We will query and load an Administrative Boundaries dataset provided by FieldMaps as a cloud-native GeoParquet file. We will interactively query and save the required boundary polygon for our analysis.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install leafmap
import duckdb
import geopandas as gpd
import leafmap.foliumap as leafmap
import os
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

DuckDB

DuckDB is a modern high-performance database engine that allows querying large files easily. It has built-in support for spatial data and can be used to query large remote spatial files without downloading them first.

We initialize DuckDB and enable the spatial extension.

con = duckdb.connect()
con.install_extension('spatial')
con.load_extension('spatial')

Query Remote Dataset

FieldMaps provides open datasets of global administrative boundaries from multiple providers. We will query the Admin2 boundaries from GeoBoundaries in the GeoParquet format. The source data is a 2 GB .parquet file containing 48,000+ polygons. Instead of downloading it, we can query it and extract just the subset we require.

parquet_url = 'https://data.fieldmaps.io/edge-matched/open/intl/adm2_polygons.parquet'

DuckDB supports standard SQL syntax for querying. Let’s check some basic information about the dataset.

query = f'''
SELECT COUNT(*) FROM read_parquet('{parquet_url}')
'''
result = con.sql(query).fetchone()
print('Total Features', result)

We can use the DESCRIBE clause to get the available columns. We can turn the results of any query into a DataFrame using the .df() method.

query = f'''
    DESCRIBE SELECT * FROM read_parquet('{parquet_url}')
'''
columns = con.sql(query).df()
columns

We can now form a query to find all Admin1 names (States/Provinces) in a specific country. We will use this in the next step to fetch all Admin2 regions within a specific Admin1 area. The adm0_src column uses the 3-letter ISO code for each country.

country = 'USA'

query = f'''
SELECT DISTINCT adm1_name
FROM read_parquet('{parquet_url}')
WHERE adm0_src = '{country}'
ORDER BY adm1_name
'''

admin1_df = con.sql(query).df()
admin1_df

Notice that the dataset has a geometry column which stores the layer geometry. We can now query for all Admin2 polygons within our chosen Admin1 region. Since Parquet is a columnar format, it is efficient to fetch only the required columns.

adm1_name = 'California'

query = f'''
SELECT adm1_name, adm1_id, adm2_name, adm2_id, ST_AsText(geometry) AS geometry
FROM read_parquet('{parquet_url}')
WHERE
  adm0_src = '{country}' and
  adm1_name = '{adm1_name}'
'''

admin2_df = con.sql(query).df()

We turn the results into a GeoPandas GeoDataFrame by specifying the geometry column and the CRS.

admin2_gdf = gpd.GeoDataFrame(
    admin2_df, geometry=gpd.GeoSeries.from_wkt(admin2_df.geometry), crs='EPSG:4326'
)
admin2_gdf

We can visualize the Admin2 polygons.

m = leafmap.Map(width=800, height=500)
m.add_gdf(admin2_gdf, layer_name='Admin2', style={'color':'blue', 'weight':0.5})
m.zoom_to_gdf(admin2_gdf)
m

Select a single Admin2 region and extract the geometry.

adm2_name = 'San Francisco'
selected = admin2_gdf[admin2_gdf['adm2_name'] == adm2_name]
geometry = selected.geometry.union_all()
geometry

Save the Results

We can save the selected subset as a GeoPackage. Rather than saving it to the temporary machine where Colab is running, we can save it to our own Google Drive. This will ensure the file remains available to us even after exiting Google Colab.

Run the following cell to authenticate and mount the Google Drive.

if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
if 'google.colab' in str(get_ipython()):
  drive_folder_root = 'MyDrive'
  output_folder = 'data'
  output_folder_path = os.path.join(
      '/content/drive', drive_folder_root, output_folder)

  # Check if Google Drive is mounted
  if not os.path.exists('/content/drive'):
      print('Google Drive is not mounted. Please run the cell above to mount your drive.')
  else:
      if not os.path.exists(output_folder_path):
          os.makedirs(output_folder_path)
else:
  # Use the local folder
  output_folder_path = output_folder
output_filename = 'admin2.gpkg'
output_path = os.path.join(output_folder_path, output_filename)
admin2_gdf.to_file(output_path)

Exercise

Overture Maps provides free and open map data curated from sources like OpenStreetMap. The entire dataset is available in the cloud-native GeoParquet format. We can use it to query and extract the municipal boundary for any city.

Search for your city/region of interest using the Overture Explorer and replace the name, country and region below with your own area of interest.

Tips:

  • Cities are not uniformly represented across the world. Some cities are tagged as locality while others are tagged as county or localadmin. The SQL query below tries to capture all the variations, but if you get no matches, you can relax the query by commenting out some lines (prefix them with --).
    • Comment out the line containing region = '{region}' to search across all regions in the country.
    • By default, the boundary tagged as locality will be picked. To see other options, comment out the line starting with LIMIT 1.
country_iso2 = 'US'
city_name = 'New York'
region = 'US-NY'
# Overture does monthly releases of their dataset
# Find the latest release at https://stac.overturemaps.org/
OVERTURE_RELEASE = '2026-04-15.0'

s3_path = (
    f's3://overturemaps-us-west-2/release/{OVERTURE_RELEASE}/'
    'theme=divisions/type=division_area/*'
)

query = f'''
  SELECT
      id,
      names.primary AS name,
      subtype,
      country,
      region,
      ST_AsText(geometry) AS geometry
  FROM read_parquet(
      '{s3_path}',
      filename=true,
      hive_partitioning=1
  )
  WHERE subtype in ('locality', 'county', 'localadmin', 'region') AND
  country = '{country_iso2}' AND
  region = '{region}' AND
  names.primary ILIKE '{city_name}'
  AND is_land = true          -- exclude maritime extensions
  ORDER BY
    -- prefer 'locality' over other types
    CASE subtype WHEN 'locality' THEN 0 ELSE 1 END
  LIMIT 1
'''

results = con.sql(query).df()
results

View the resulting boundary.

city_gdf = gpd.GeoDataFrame(
    results, geometry=gpd.GeoSeries.from_wkt(results.geometry), crs='EPSG:4326'
)

m = leafmap.Map(width=800, height=500)
m.add_gdf(city_gdf, layer_name='City', style={'color':'blue', 'weight':0.5})
m.zoom_to_gdf(city_gdf)
m

1.4 Creating a Median Composite

Open In Colab

Overview

We are now ready to perform a larger computation: creating a median composite image for a city using XArray and Dask, leveraging STAC and DuckDB to query cloud-hosted data sources.

Overview of the Task

We will query the Overture Maps catalog for the boundary of the city of Bengaluru, India, use it to query a STAC catalog for Sentinel-2 imagery for a chosen period, and then use XArray and Dask to create a median composite image using distributed processing.

Setup

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask['distributed']
import duckdb
import geopandas as gpd
import matplotlib.pyplot as plt
import os
import pystac_client
import rioxarray as rxr
import xarray as xr
from odc.stac import stac_load
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Set up a local Dask cluster. This distributes the computation across multiple workers on your computer.

from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))

Search and Load City Boundary

We will query the Overture Maps dataset for the city boundary, following the same approach as the previous section.

We initialize DuckDB and enable the spatial extension.

con = duckdb.connect()
con.install_extension('spatial')
con.load_extension('spatial')
country_iso2 = 'IN'
city_name = 'Bengaluru'
region = 'IN-KA'
# Overture does monthly releases of their dataset
# Find the latest release at https://stac.overturemaps.org/
OVERTURE_RELEASE = '2026-04-15.0'

s3_path = (
    f's3://overturemaps-us-west-2/release/{OVERTURE_RELEASE}/'
    'theme=divisions/type=division_area/*'
)

query = f'''
  SELECT
      id,
      names.primary AS name,
      subtype,
      country,
      region,
      ST_AsText(geometry) AS geometry
  FROM read_parquet(
      '{s3_path}',
      filename=true,
      hive_partitioning=1
  )
  WHERE subtype in ('locality', 'county', 'localadmin', 'region') AND
  country = '{country_iso2}' AND
  region = '{region}' AND
  names.primary ILIKE '{city_name}'
  AND is_land = true          -- exclude maritime extensions
  ORDER BY
    -- prefer 'locality' over other types
    CASE subtype WHEN 'locality' THEN 0 ELSE 1 END
  LIMIT 1
'''

results = con.sql(query).df()
results
city_gdf = gpd.GeoDataFrame(
    results, geometry=gpd.GeoSeries.from_wkt(results.geometry), crs='EPSG:4326'
)

geometry = city_gdf.geometry.union_all()
geometry

Search and Load Sentinel-2 Imagery

Let’s use the Element84 search endpoint to look for items from the sentinel-2-c1-l2a collection on AWS. We search for the imagery collected within the date range and intersecting the AOI geometry.

We also specify additional filters to select scenes based on metadata. The parameter eo:cloud_cover contains the overall cloud cover percentage and we use it to select imagery with < 30% overall cloud cover.

catalog = pystac_client.Client.open(
    'https://earth-search.aws.element84.com/v1')

# Search for images for the year
year = 2024
time_range = f'{year}'

# Optionally, we can specify a range of months
# start_month = 4
# end_month = 6
# time_range = f'{year}-{start_month:02d}/{year}-{end_month:02d}'

filters = {
    'eo:cloud_cover': {'lt': 30},
}

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    intersects=geometry,
    datetime=time_range,
    query=filters,
)
items = search.item_collection()
len(items)

Visualize the resulting image footprints. You can see that our AOI covers only a small part of a single scene. When we process the data for our AOI, we will stream only the required pixels to create the composite instead of downloading entire scenes.

items_gdf = gpd.GeoDataFrame.from_features(items.to_dict(), crs='EPSG:4326')

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
items_gdf.plot(
    ax=ax,
    facecolor='none',
    edgecolor='black',
    alpha=0.5)

city_gdf.plot(
    ax=ax,
    facecolor='blue',
    alpha=0.5
)
ax.set_axis_off()
ax.set_title('STAC Query Results')
plt.show()

Load the matching images as an XArray Dataset.

ds = stac_load(
    items,
    bands=['red', 'green', 'blue'],
    resolution=10,
    bbox=geometry.bounds,
    chunks={},  # <-- use Dask
    groupby='solar_day',
)
ds

The Sentinel-2 scenes come with a NoData value of 0, so we mask these pixels before further processing.

# Mask nodata values
ds = ds.where(ds != 0)

Apply scale and offset to all spectral bands.

# Apply scale/offset
scale = 0.0001
offset = -0.1
# Select spectral bands (all except 'scl')
data_bands = [band for band in ds.data_vars if band != 'scl']
for band in data_bands:
  ds[band] = ds[band] * scale + offset

Convert the Dataset to a DataArray using the to_array() method. All the variables will be converted to a new dimension. Since our variables are image bands, we name the new dimension band.

da = ds.to_array('band')
da

Create a Median Composite

A very powerful feature of XArray is the ability to easily aggregate data across dimensions - making it ideal for many remote sensing analyses. Let’s create a median composite from all the individual images.

We apply the .median() aggregation across the time dimension.

rgb_composite = da \
  .sel(band=['red', 'green', 'blue']) \
  .median(dim='time')
rgb_composite

So far, all the operations have only built a computation graph. To run this computation on the local Dask cluster, we must call .compute().

%%time
rgb_composite = rgb_composite.compute()

Visualizing and Exporting the Results

The composite is created from all the pixels within the bounding box of the geometry. We can use rioxarray to clip the image to the city boundary and remove pixels outside the polygon.

image_crs = rgb_composite.rio.crs
city_gdf_reprojected = city_gdf.to_crs(image_crs)
rgb_composite_clipped = rgb_composite.rio.clip(city_gdf_reprojected.geometry)

Plot the results.

preview = rgb_composite_clipped.rio.reproject(
    rgb_composite_clipped.rio.crs, resolution=100
)

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
preview.sel(band=['red', 'green', 'blue']).plot.imshow(
    ax=ax,
    robust=True)
ax.set_title(f'Sentinel-2 Composite {year}')
ax.set_axis_off()
ax.set_aspect('equal')
plt.show()

We finally save the results as a local Cloud-Optimized GeoTIFF file.

Rather than saving it to the temporary machine where Colab is running, we can save it to our own Google Drive. This will ensure the image remains available to us even after exiting Google Colab.

Run the following cell to authenticate and mount the Google Drive.

if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
if 'google.colab' in str(get_ipython()):
  drive_folder_root = 'MyDrive'
  output_folder = 'data'
  output_folder_path = os.path.join(
      '/content/drive', drive_folder_root, output_folder)

  # Check if Google Drive is mounted
  if not os.path.exists('/content/drive'):
      print("Google Drive is not mounted. Please run the cell above to mount your drive.")
  else:
      if not os.path.exists(output_folder_path):
          os.makedirs(output_folder_path)
else:
  # Use the local folder
  output_folder_path = output_folder
output_file = f'composite_{time_range}.tif'
output_path = os.path.join(output_folder_path, output_file)
rgb_composite_clipped.rio.to_raster(output_path, driver='COG')
print(f'Wrote {output_file}')

Assignment 1

Open In Colab

Microsoft’s Planetary Computer Data Catalog hosts a large collection of remote sensing datasets and is served via a STAC API.

Your task is to access the Landsat Collection 2 Level-2 collection and create an RGB median composite from Landsat-8 images for your chosen area of interest and year.

Notes:

  • Data access from Planetary Computer is largely similar to other STAC APIs but requires obtaining signed URLs. This notebook provides the code snippets below to show the access pattern.
  • The landsat-c2-l2 collection contains images from all Landsat satellites. You will need to add a filter in your search query to select images from the landsat-8 platform; a sketch of such a query is shown after the snippet below.
  • Landsat image bands have different scale and offset values. Look at the metadata of a STAC item to find the appropriate values.

Install and use the planetary_computer Python package.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install planetary_computer
import planetary_computer as pc

Accessing data from Planetary Computer is free but requires getting a Shared Access Signature (SAS) token and signing the URLs. The planetary_computer Python package provides a simple mechanism for signing the URLs via its sign() function. Pass it using the patch_url parameter of the odc.stac.load() function.

ds = stac.load(
    items,
    bands=['red', 'green', 'blue'],
    bbox=bbox,
    resolution=30,
    crs='utm',
    chunks={},
    patch_url=pc.sign,
    groupby='solar_day',
)
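
For reference, below is a minimal sketch of a search that restricts results to Landsat-8 scenes. It assumes pystac_client is imported and bbox and year are defined as in the earlier sections, and it filters on the platform property of the items along with cloud cover:

catalog = pystac_client.Client.open(
    'https://planetarycomputer.microsoft.com/api/stac/v1')

search = catalog.search(
    collections=['landsat-c2-l2'],
    bbox=bbox,
    datetime=f'{year}',
    # Keep only Landsat-8 scenes with low cloud cover
    query={'platform': {'eq': 'landsat-8'}, 'eo:cloud_cover': {'lt': 30}},
)
items = search.item_collection()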

Module 2: Remote Sensing Fundamentals

2.1 Calculating Spectral Indices

Open In Colab

Overview

Spectral indices are core to many remote sensing analyses. In this section, we will learn how we can perform band calculations using XArray.

We will take a single Sentinel-2 scene and calculate spectral indices like NDVI, MNDWI and SAVI.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask['distributed'] jupyter-server-proxy
import matplotlib.pyplot as plt
import os
import pandas as pd
import pystac_client
import rioxarray as rxr
import xarray as xr
from odc import stac
from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))

Get a Sentinel-2 Scene

We define a location and time of interest to get some satellite imagery.

latitude = 27.163
longitude = 82.608
year = 2023
# Define a small bounding box around the chosen point
km2deg = 1.0 / 111
x, y = (longitude, latitude)
r = 1 * km2deg  # radius in degrees
bbox = (x - r, y - r, x + r, y + r)

Search the catalog for matching items.

# Query the STAC Catalog
catalog = pystac_client.Client.open(
    'https://earth-search.aws.element84.com/v1')

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={'eo:cloud_cover': {'lt': 30}, 's2:nodata_pixel_percentage': {'lt': 10}},
    sortby=[{'field': 'properties.eo:cloud_cover', 'direction': 'asc'}]

)
items = search.item_collection()

# Load to XArray
ds = stac.load(
    items,
    bands=['red', 'green', 'blue', 'nir', 'swir16'],
    resolution=10,
    crs='utm',
    chunks={},  # <-- use Dask
    groupby='solar_day',
    preserve_original_order=True
)

# Select the first scene
timestamp = pd.to_datetime(items[0].properties['datetime']).tz_convert(None)
scene = ds.sel(time=timestamp)
# Mask nodata values
scene = scene.where(scene != 0)
# Apply scale/offset
scale = 0.0001
offset = -0.1
scene = scene*scale + offset

Visualize the Scene

To visualize our Dataset, we first convert it to a DataArray using the to_array() method. All the variables will be converted to a new dimension. Since our variables are image bands, we name the new dimension band.

scene_da = scene.to_array('band')

Rather than loading the entire scene into memory, we resample it to a lower resolution preview and render it.

preview = scene_da.rio.reproject(
    scene.rio.crs, resolution=300
)

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
preview.sel(band=['red', 'green', 'blue']).plot.imshow(
    ax=ax,
    robust=True)
ax.set_title('RGB Visualization')
ax.set_axis_off()
ax.set_aspect('equal')
plt.show()

We can also view a False Color Composite (FCC) with a different combination of spectral bands.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
preview.sel(band=['nir', 'red', 'green']).plot.imshow(
    ax=ax,
    robust=True)
ax.set_title('NRG Visualization')
ax.set_axis_off()
plt.show()

Calculate Spectral Indices

The Normalized Difference Vegetation Index (NDVI) is calculated using the following formula:

NDVI = (NIR - Red)/(NIR + Red)

Where:
  • NIR = Near-Infrared band reflectance
  • Red = Red band reflectance

red = scene_da.sel(band='red')
nir = scene_da.sel(band='nir')

ndvi = (nir - red)/(nir + red)
ndvi

Let’s visualize the results.

ndvi_preview = ndvi.rio.reproject(
    ndvi.rio.crs, resolution=300
)
cbar_kwargs = {
    'orientation':'horizontal',
    'fraction': 0.025,
    'pad': 0.05,
    'extend':'neither'
}
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
ndvi_preview.plot.imshow(
    ax=ax,
    cmap='Greens',
    robust=True,
    cbar_kwargs=cbar_kwargs)
ax.set_title('NDVI')
ax.set_axis_off()
plt.show()

The Modified Normalized Difference Water Index (MNDWI) is calculated using the following formula:

MNDWI = (Green - SWIR1)/(Green + SWIR1)

Where:
  • Green = Green band reflectance
  • SWIR1 = Short-wave infrared band 1 reflectance

green = scene_da.sel(band='green')
swir16 = scene_da.sel(band='swir16')
mndwi = (green - swir16)/(green + swir16)

The Soil Adjusted Vegetation Index (SAVI) is calculated using the following formula:

SAVI = (1 + L) * ((NIR - Red)/(NIR + Red + L))

Where:
  • NIR = Near-Infrared band reflectance
  • Red = Red band reflectance
  • L = Soil brightness correction factor (typically 0.5 for moderate vegetation)

savi = 1.5 * ((nir - red) / (nir + red + 0.5))

Save the Computed Indices

Rather than saving them to the temporary machine where Colab is running, we can save them to our own Google Drive. This will ensure the images remain available to us even after exiting Google Colab.

Run the following cell to authenticate and mount the Google Drive.

from google.colab import drive
drive.mount('/content/drive')
drive_folder_root = 'MyDrive'
output_folder = 'data'
output_folder_path = os.path.join(
    '/content/drive', drive_folder_root, output_folder)

# Check if Google Drive is mounted
if not os.path.exists('/content/drive'):
    print("Google Drive is not mounted. Please run the cell above to mount your drive.")
else:
    if not os.path.exists(output_folder_path):
        os.makedirs(output_folder_path)
files = {
    'ndvi.tif': ndvi,
    'mndwi.tif': mndwi,
    'savi.tif': savi
}

for file in files:
  output_path = os.path.join(output_folder_path, file)
  files[file].rio.to_raster(output_path, driver='COG')
  print(f'Saved {file} to {output_path}')

Exercise

Apply a threshold to the NDVI values to create a binary raster.

Hint: Use the xarray.where function that allows you to set both matching and non-matching values.

threshold = 0.5
# Create a new array 'vegetation' where all NDVI values
#  greater than the threshold is 1 and others are 0

2.2 Masking Clouds

Open In Colab

Overview

When working with optical satellite imagery, we need to ensure that cloudy pixels are removed from the analysis. Most providers supply QA bands detailing the locations of cloudy pixels. There are also third-party cloud-masking packages that can be used to locate and mask cloudy pixels.

In this section, we will use the Scene Classification (SCL) band supplied with Sentinel-2 Level-2A images to remove clouds and cloud shadows from a scene.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask['distributed'] jupyter-server-proxy
import matplotlib.pyplot as plt
import os
import pandas as pd
import pystac_client
import rioxarray as rxr
import xarray as xr
from matplotlib.colors import ListedColormap
from odc import stac
from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))

Get a Sentinel-2 Scene

We define a location and time of interest to get some satellite imagery.

latitude = 27.163
longitude = 82.608
year = 2023
# Define a small bounding box around the chosen point
km2deg = 1.0 / 111
x, y = (longitude, latitude)
r = 1 * km2deg  # radius in degrees
bbox = (x - r, y - r, x + r, y + r)

Search the catalog for matching items. This time we use 'direction': 'desc' in the sortby parameter so the first result is the scene with the highest cloud cover.

# Query the STAC Catalog
catalog = pystac_client.Client.open(
    'https://earth-search.aws.element84.com/v1')

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={
        'eo:cloud_cover': {'lt': 50},
        's2:nodata_pixel_percentage': {'lt': 10}},
    sortby=[
        {'field': 'properties.eo:cloud_cover',
         'direction': 'desc'}
        ]

)
items = search.item_collection()

# Load to XArray
ds = stac.load(
    items,
    bands=['red', 'green', 'blue', 'scl'],
    resolution=10,
    crs='utm',
    chunks={},  # <-- use Dask
    groupby='solar_day',
    preserve_original_order=True
)

# Select the first scene
timestamp = pd.to_datetime(items[0].properties['datetime']).tz_convert(None)
scene = ds.sel(time=timestamp)
# Mask nodata values
scene = scene.where(scene != 0)
# Apply scale/offset
scale = 0.0001
offset = -0.1
# Select spectral bands (all except 'scl')
data_bands = [band for band in scene.data_vars if band != 'scl']
for band in data_bands:
  scene[band] = scene[band] * scale + offset

Visualize the Scene

The clouds have a much higher reflectance, so robust=True will not give us an appropriate visualization. Instead, we supply hardcoded min/max values of 0 and 0.3, which is the typical range of reflectance values for earth surface targets.

scene_da = scene.to_array('band')

preview = scene_da.rio.reproject(
    scene_da.rio.crs, resolution=300
)

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
preview.sel(band=['red', 'green', 'blue']).plot.imshow(
    ax=ax,
    vmin=0, vmax=0.3)
ax.set_title('RGB Visualization')
ax.set_axis_off()
ax.set_aspect('equal')
plt.show()

Create a Cloud Mask

The Scene Classification (SCL) band has each pixel classified into one of the following classes.

Value Description
0 No Data
1 Saturated or defective pixel
2 Dark area pixels
3 Cloud shadows
4 Vegetation
5 Not vegetated
6 Water
7 Clouds Low Probability / Unclassified
8 Cloud medium probability
9 Cloud high probability
10 Thin cirrus
11 Snow / Ice

We select the types of pixels we want to mask. Let’s create a mask that removes all pixels marked Cloud Shadows (3), Cloud Medium Probability (8), Cloud High Probability (9) and Thin Cirrus (10).

mask = scene.scl.isin([3,8,9,10])

Visualize the mask alongside the scene.

mask_preview = mask.astype('uint8').rio.reproject(
    mask.rio.crs, resolution=300
)

fig, (ax0, ax1) = plt.subplots(1, 2)
fig.set_size_inches(10,5)
preview.sel(band=['red', 'green', 'blue']).plot.imshow(
    ax=ax0,
    vmin=0, vmax=0.3)
ax0.set_title('RGB Visualization')

# RGBA: Transparent, Red
mask_colormap = ListedColormap(['#00000000', '#FF0000FF'])
mask_preview.plot.imshow(
    ax=ax1,
    cmap=mask_colormap,
    add_colorbar=False)

ax1.set_title('Cloud Mask')
for ax in (ax0, ax1):
  ax.set_axis_off()
  ax.set_aspect('equal')
plt.show()

Once we are satisfied that the mask looks good, we go ahead and apply the mask on the scene.

# Apply the mask to all the data bands
scene_masked = scene[data_bands].where(~mask)
scene_masked

Exercise

Save the masked scene to disk. The code snippet below mounts your Google Drive and creates a data folder. Save the scene_masked to the data folder.

Hint: You will need to convert the scene_masked to a DataArray before saving it.

from google.colab import drive
drive.mount('/content/drive')
drive_folder_root = 'MyDrive'
output_folder = 'data'
output_folder_path = os.path.join(
    '/content/drive', drive_folder_root, output_folder)

# Check if Google Drive is mounted
if not os.path.exists('/content/drive'):
    print("Google Drive is not mounted. Please run the cell above to mount your drive.")
else:
    if not os.path.exists(output_folder_path):
        os.makedirs(output_folder_path)

2.3 Extracting and Processing Time-Series

Open In Colab

Overview

We are now ready to scale our analysis. Having learned how to calculate spectral indices and do cloud masking for a single scene, we can easily apply these operations to the entire data cube and extract the results at one or more locations. Cloud-optimized data formats and Dask ensure that we fetch and process only the small amount of data required to compute the results at the pixels of interest.

In this section, we will get all Sentinel-2 scenes collected over our region of interest, apply a cloud mask, calculate NDVI and extract a time-series of NDVI at a single location. We will also use XArray’s built-in time-series processing functions to interpolate and smooth the results.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray \
      dask['distributed'] jupyter-server-proxy xrscipy
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import os
import pyproj
import pystac_client
import rioxarray as rxr
import xarray as xr
import xrscipy.signal as xrs
from odc import stac
from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Get Satellite Imagery using STAC API

We define a location and time of interest to get some satellite imagery.

latitude = 27.163
longitude = 82.608
year = 2023
# Define a small bounding box around the chosen point
km2deg = 1.0 / 111
x, y = (longitude, latitude)
r = 1 * km2deg  # radius in degrees
bbox = (x - r, y - r, x + r, y + r)

Let’s use the Element84 search endpoint to look for items from the sentinel-2-c1-l2a collection on AWS and load the matching images as an XArray Dataset.

# Query the STAC Catalog
catalog = pystac_client.Client.open(
    'https://earth-search.aws.element84.com/v1')

search = catalog.search(
    collections=['sentinel-2-c1-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={
        'eo:cloud_cover': {'lt': 30},
    }
)
items = search.item_collection()

# Load to XArray
ds = stac.load(
    items,
    bands=['red', 'green', 'blue', 'nir', 'scl'],
    bbox=bbox, # <-- load data only for the bbox
    resolution=10,
    crs='utm',
    chunks={},  # <-- use Dask
    groupby='solar_day',
    preserve_original_order=True
)
ds

Processing Data

We have a data cube of multiple scenes collected through the year. As XArray supports vectorized operations, we can work with the entire Dataset the same way we would process a single scene.

The Sentinel-2 scenes come with a NoData value of 0, so we mask these pixels before further processing.

# Mask nodata values
ds = ds.where(ds != 0)

Apply scale and offset to all spectral bands.

# Apply scale/offset
scale = 0.0001
offset = -0.1
# Select spectral bands (all except 'scl')
data_bands = [band for band in ds.data_vars if band != 'scl']
for band in data_bands:
  ds[band] = ds[band] * scale + offset

Apply the cloud mask

ds = ds[data_bands].where(~ds.scl.isin([3,8,9,10]))
ds

Calculate NDVI and add it as a data variable.

red = ds['red']
nir = ds['nir']

ndvi = (nir - red)/(nir + red)
ds['ndvi'] = ndvi
ds

Extracting Time-Series

We have a dataset with cloud-masked NDVI values at each pixel of each scene. Remember that none of these values have been computed yet; Dask holds a graph of all the operations required to calculate the results.

We can now query these results for values at our chosen location. Once we run compute(), Dask will fetch the required tiles from the source data and run the operations to give us the results.

Our location coordinates are in EPSG:4326 (Lat/Lon). Convert them to the CRS of the dataset so we can query it.

crs = ds.rio.crs
transformer = pyproj.Transformer.from_crs('EPSG:4326', crs, always_xy=True)
x, y = transformer.transform(longitude, latitude)
x,y

Query NDVI values at the coordinates.

time_series = ds.ndvi \
  .interp(y=y, x=x, method='nearest')

Run the calculation and load the results into memory.

%%time
time_series = time_series.compute()

Plot the time-series.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)

time_series.plot.line(
    ax=ax, x='time',
    marker='o', color='#238b45',
    linestyle='-', linewidth=1, markersize=4)

# Format the x-axis to display dates as YYYY-MM
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))

ax.set_title('NDVI Time-Series')
plt.show()

Interpolate and Smooth the time-series

We use XArray’s excellent time-series processing functionality to smooth the time-series and remove noise.

# As we are processing the time-series,
# it needs to be in a single chunk along the time dimension
time_series = time_series.chunk(dict(time=-1))

First, we resample the time-series to have a value every 5 days and fill the missing values with linear interpolation. Then we apply a moving-window smoothing to remove noise.

time_series_resampled = time_series\
  .resample(time='5d').mean(dim='time').chunk(dict(time=-1))
time_series_interpolated = time_series_resampled \
  .interpolate_na('time', use_coordinate=False)
time_series_smoothed = time_series_interpolated \
  .rolling(time=3, min_periods=1, center=True).mean()
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
time_series.plot.line(
    ax=ax, x='time',
    marker='^', color='#66c2a4',
    linestyle='--', linewidth=1, markersize=2)
time_series_smoothed.plot.line(
    ax=ax, x='time',
    marker='o', color='#238b45',
    linestyle='-', linewidth=1, markersize=4)

# Format the x-axis to display dates as YYYY-MM
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))

ax.set_title('Original vs. Smoothed NDVI Time-Series')

plt.show()

Save the Time-Series.

Convert the extracted time-series to a Pandas DataFrame.

df = time_series_smoothed.to_pandas().reset_index()
df.head()

Save the DataFrame as a CSV file.

output_filename = 'ndvi_time_series.csv'
output_filepath = os.path.join(output_folder, output_filename)
df.to_csv(output_filepath, index=False)

Exercise

The Scipy for Xarray (xrscipy) package wraps the popular SciPy package for XArray and provides many useful time-series processing functions. The code snippet below uses the xrscipy.signal.savgol_filter function to apply a Savitzky-Golay filter to our gap-filled NDVI time-series.

Try SG-Filter with different values of window_length and polyorder and plot the results on a chart.

# Use the equally spaced interpolated time-series
time_series_interpolated = time_series_interpolated.compute()

# savgol_filter() requires integers as time index
# We save the original time index values and
# overwrite it with sequential integers
timestamps = time_series_interpolated.time
time_series_interpolated.coords['time'] = np.arange(len(timestamps))

# Apply the SG filter
window_length = 5 # Size of filter window
polyorder = 2 # Order of the polynomial used in the filtering

time_series_sg = xrs.savgol_filter(
    time_series_interpolated,
    window_length = window_length,
    polyorder = polyorder,
    mode='nearest',
    dim = 'time'
)

# Write back the original timestamps
time_series_sg.coords['time'] = timestamps
time_series_interpolated.coords['time'] = timestamps

Module 3: Computation and Data Processing

3.1 Calculating Area

Open In Colab

Overview

We will learn how to work with landcover data and calculate the area of different landcover classes in a region. This section also shows how you can scale your analysis to large regions without running into memory limits using Dask.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask[distributed] \
      jupyter-server-proxy planetary_computer
import dask.array as da
import geopandas as gpd
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import planetary_computer as pc
import pyproj
import pystac_client
import rioxarray as rxr
import xarray as xr
from matplotlib import cm
from odc import stac
from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Define a Region of Interest

We define a location and choose a circular buffer area as our region of interest.

latitude = 27.163
longitude = 82.608

Create a GeoDataFrame. This helps us easily convert between CRSs, extract bounds etc.

point_gdf = gpd.GeoDataFrame(
    data = {'name': ['aoi']},
    geometry=gpd.points_from_xy([longitude], [latitude]),
    crs='EPSG:4326')
point_gdf

To create an accurate buffer, we must first convert the coordinates into a projected CRS. UTM is a good choice for a projected CRS that offers high accuracy with minimal distortion. We can automatically find the appropriate UTM zone for the chosen location using pyproj.

# Small bounding box around the chosen coordinates
bbox = (longitude - 0.001, latitude - 0.0001,
        longitude + 0.001, latitude + 0.0001)
aoi = pyproj.aoi.AreaOfInterest(*bbox)

# Query for matching UTM CRS
utm_crs_list = pyproj.database.query_utm_crs_info(
    datum_name='WGS 84',
    area_of_interest=aoi,
)
utm_crs = pyproj.CRS.from_epsg(utm_crs_list[0].code)
print(f'Selected CRS {utm_crs.name}')

Buffer the point.

buffer_distance_meters = 500
buffered = point_gdf.to_crs(utm_crs).buffer(buffer_distance_meters)
buffered

Get ESA WorldCover Data

Let’s use the Planetary Computer STAC API search endpoint to look for items from the ESA WorldCover collection on Azure Blob Storage.

ESA WorldCover has data for the years 2020 and 2021. We will use the 2021 data.

catalog = pystac_client.Client.open(
    'https://planetarycomputer.microsoft.com/api/stac/v1')

# STAC search requires bounding box in EPSG:4326
bounds = buffered.to_crs('EPSG:4326').total_bounds

search = catalog.search(
    collections=['esa-worldcover'],
    bbox=bounds,
    datetime='2021',
)
items = search.item_collection()
items

Each STAC item has metadata containing information about the class names, legend colors and pixel values. Let’s extract it so we can use it later to construct a meaningful legend.

class_list = items[0].assets['map'].extra_fields['classification:classes']
class_dict = {
    c['value']: {'description': c['description'], 'hex': c['color_hint']}
    for c in class_list
}
class_dict

Load the matching images as an XArray Dataset. Accessing data from Planetary Computer is free but requires getting a Shared Access Signature (SAS) token and signing the URLs. The planetary_computer Python package provides a simple mechanism for signing the URLs using its sign() function.

# Load to XArray
ds = stac.load(
    items,
    bbox=bounds, # <-- load data only for the bbox
    resolution=10,
    crs=utm_crs,
    chunks={},  # <-- use Dask
    patch_url=pc.sign,
    groupby='solar_day',
    preserve_original_order=True
)
ds

The landcover classification data is in the map variable. Select it and remove the empty time dimension.

map_data = ds['map'].squeeze()
map_data

Clip the data to the buffered region.

map_data_clipped = map_data.rio.clip(buffered)

Visualize the Landcover

To create a meaningful legend, we use the class names and colors from the class_dict created earlier.

colors = ['#000000' for r in range(256)]
for key, value in class_dict.items():
    colors[int(key)] = f'#{value["hex"]}'

# Set color for value 0 to transparent
colors[0] = (0, 0, 0, 0)
cmap = matplotlib.colors.ListedColormap(colors)

# Data range is 8-bit (0-255)
normalizer = matplotlib.colors.Normalize(vmin=0, vmax=255)

# Set tick labels
values = [key for key in class_dict]
boundaries = [(values[i + 1] + values[i]) / 2 for i in range(len(values) - 1)]
boundaries = [0] + boundaries + [255]
ticks = [
    (boundaries[i + 1] + boundaries[i]) / 2
    for i in range(len(boundaries) - 1)
]
tick_labels = [
    f'{value["description"]} ({key})'
    for key, value in class_dict.items()
]
tick_labels
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(12, 10)

map_data_clipped.plot(
    ax=ax, cmap=cmap, norm=normalizer
)

colorbar = fig.colorbar(
    cm.ScalarMappable(norm=normalizer, cmap=cmap),
    boundaries=boundaries,
    values=values,
    cax=fig.axes[1].axes,
)
colorbar.set_ticks(ticks, labels=tick_labels)

ax.set_axis_off()
ax.set_title('Landcover Classes from ESA WorldCover');

Exercise

Select a single landcover class and plot it.

Hint: Use the where() function.
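
For example, a minimal sketch that selects and plots the ‘Tree cover’ class (pixel value 10 in ESA WorldCover) could look like this; any other class value from class_dict works the same way.

# Keep only pixels of a single class; all other pixels become NaN
selected_class = 10  # Tree cover
selected = map_data_clipped.where(map_data_clipped == selected_class)

selected.plot(cmap='Greens', add_colorbar=False)
plt.title(class_dict[selected_class]['description']);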

Calculate Area

We can now calculate the area of each class in our region of interest. As our data is in a projected CRS, each pixel covers a fixed area; at 10m resolution, each pixel is 10m x 10m = 100 m². We can count the total number of pixels for each class and multiply the count by the area of a single pixel to get the area of that class.

Let’s get the underlying array of pixel values.

data = map_data_clipped.values

To efficiently get the pixel counts of all unique pixel values in the data, we can use the histogram() function provided by NumPy. The function takes an array of values and returns the count of values falling into each bin. We need to specify the bins to be used; since we need counts for each class, we use the class values as bin edges.

# Get unique class values to define histogram bins
unique_classes = sorted(class_dict.keys())
# right edge for last bin
bins = unique_classes + [unique_classes[-1] + 1]

counts, _ = np.histogram(data, bins=bins)
counts

We now have pixel counts for each class. We multiply them by the area of a single pixel to get the area. We can also filter out the nodata class (pixel value 0) and classes with a count of 0 (not present in the region) to get a clean table.

pixel_area_m2 = 100.0

area_df = pd.DataFrame({
    'class_value': unique_classes,
    'area_m2': counts * pixel_area_m2,
})

area_df['class_name'] = area_df['class_value'].map(
    lambda x: class_dict[x]['description'])

# Drop nodata class (0) and classes with no pixels
area_df = area_df[
    (area_df['class_value'] != 0) & (area_df['area_m2'] > 0)]

area_df

Save the results as a CSV file.

output_filename = f'area_{buffer_distance_meters}.csv'
output_filepath = os.path.join(output_folder, output_filename)
area_df.to_csv(output_filepath, index=False)

Calculating Area for Large Regions

Our previous approach required loading the entire array of landcover classes into memory using .values. If the dataset is very large, this will cause us to run out of memory and crash. We can instead use Dask to chunk the data, so that each chunk is lazily loaded into memory only when it is processed.

Let’s read a shapefile with the boundary of a large state in India.

state_filepath = ('https://storage.googleapis.com/spatialthoughts-public-data/' +
  'kgis/karnataka.zip')
state = gpd.read_file(state_filepath)
state

Automatically select a suitable UTM projection.

bbox = state.geometry.total_bounds
aoi = pyproj.aoi.AreaOfInterest(*bbox)

# Query for matching UTM CRS
utm_crs_list = pyproj.database.query_utm_crs_info(
    datum_name='WGS 84',
    area_of_interest=aoi,
)
utm_crs = pyproj.CRS.from_epsg(utm_crs_list[0].code)
print(f'Selected CRS {utm_crs.name}')

We update the code from earlier to use the state boundary as the region of interest and explicitly specify a chunk size of 5000 x 5000 pixels, which is small enough to fit into memory.

catalog = pystac_client.Client.open(
    'https://planetarycomputer.microsoft.com/api/stac/v1')

# STAC search requires bounding box in EPSG:4326
bounds = state.geometry.total_bounds

search = catalog.search(
    collections=['esa-worldcover'],
    bbox=bounds,
    datetime='2021',
)
items = search.item_collection()

ds = stac.load(
    items,
    bbox=bounds, # <-- load data only for the bbox
    resolution=10,
    crs=utm_crs,
    chunks={'x':5000, 'y':5000},  # <-- use Dask
    patch_url=pc.sign,
    groupby='solar_day',
    preserve_original_order=True
)
map_data = ds['map'].squeeze()
map_data

We get the data as a Dask Array using .data. This does not trigger any computation.

dask_array = map_data.data
large_area_counts, _ = da.histogram(dask_array, bins=bins)
large_area_counts

Now we call .compute() to start the computation which uses distributed computing to process each chunk in parallel using all available resources.

%%time
large_area_counts = large_area_counts.compute()

We can now process the results into a nice table. Here we divide the area by 1e6 to get the results in sq. km.

pixel_area_m2 = 100.0

large_area_df = pd.DataFrame({
    'class_value': unique_classes,
    'area_km2': large_area_counts * pixel_area_m2 / 1e6,
})

large_area_df['class_name'] = large_area_df['class_value'].map(
    lambda x: class_dict[x]['description'])

# Drop nodata class (0) and classes with no pixels
large_area_df = large_area_df[
    (large_area_df['class_value'] != 0) & (large_area_df['area_km2'] > 0)]

large_area_df

Save the results as a CSV file.

output_filename = 'area_karnataka.csv'
output_filepath = os.path.join(output_folder, output_filename)
large_area_df.to_csv(output_filepath, index=False)

Exercise

  • Upload a vector layer (shapefile, GeoPackage, GeoJSON etc.) of your area of interest.
  • Calculate the area of all landcover classes (a minimal sketch is given below).

Tip: If you do not have a data layer handy, you may use geojson.io to draw a shape and Export it as a GeoJSON.
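
A minimal sketch for this exercise is shown below. It assumes you have uploaded a file named aoi.geojson (the filename is hypothetical) to the Colab filesystem; the rest of the workflow from this section stays the same.

# Read the uploaded layer (hypothetical filename) and get its bounding box
aoi_gdf = gpd.read_file('aoi.geojson')
bounds = aoi_gdf.to_crs('EPSG:4326').total_bounds

# Re-use the earlier workflow: search the STAC catalog with this bbox,
# load the 'map' band with Dask chunks, compute the histogram with
# da.histogram() and build the area table as before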

Module 4: Machine Learning and AI

Module 5: Scaling Analysis

Scaling Analysis with Coiled

Coiled is a platform that allows you to easily set up cloud infrastructure for distributed processing. It provides a Python package and a notebook service that make it very easy to scale your Xarray + Dask workflows to process large volumes of data. In this section, you will learn how to set up Coiled and run a notebook on a cloud-hosted machine to create a median cloud-free annual composite from Sentinel-2 imagery.

Installation and Setting up the Environment

You need to install the coiled package along with the other required packages in your local Python environment and configure your account. Visit the Coiled Documentation for detailed setup instructions.

Create an Account

Sign up for a free Coiled account.

Install Coiled

We will be using Anaconda to install and manage the packages. Please review the Anaconda Installation Guide for step-by-step instructions.

  1. Once you have installed Anaconda, open Anaconda Prompt or a Terminal and run the following commands to create a fresh environment and activate it.
conda create --name coiled -y
conda activate coiled
  2. Now your environment is ready and we can install the required packages. First install coiled and dask, then the remaining packages.
conda install -c conda-forge coiled "dask[complete]" -y
conda install -c conda-forge jupyterlab odc-stac rioxarray \
  matplotlib jupyter-server-proxy geopandas pystac-client -y
  3. Log in to your account.
coiled login

Connect your Cloud Account

Next, you will need to configure Coiled with your own cloud account. All popular cloud providers (GCP, AWS, Azure) are supported. Below is the command required to configure it with your GCP account.

coiled setup gcp

Once set up, you can visit the Coiled Dashboard to verify the configuration.

Setup Mutagen (Optional)

A great feature of the Coiled notebook service is the ability to sync your local Python environment and files with the cloud machine. This requires installing the Mutagen utility. On macOS and Linux, this is straightforward using Homebrew.

brew install mutagen-io/mutagen/mutagen

Start a Notebook on a Cloud Machine

  1. Change your local directory to a folder where you have your code.
cd Desktop/coiled
  2. Start a notebook on a cloud machine. The default machine you get on GCP is the e2-standard-4 VM with 4 vCPUs and 16 GB memory. This is a decent and cheap option for small to medium sized workloads. You can always get a bigger machine with more memory or more vCPUs, as sketched after this step. See how to specify VM Size and Type.
coiled notebook start --sync
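
If you need a larger machine, you can request a different VM type when starting the notebook. The command below is a sketch; it assumes the --vm-type option and the e2-standard-8 machine name, so check the Coiled documentation for the exact options supported by your version.

coiled notebook start --sync --vm-type e2-standard-8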

Running Your Notebook

  1. Download the example notebook to your machine. Visit coiled_s2_composite.ipynb and click the Download raw file button. Copy the file to the directory from which you started the coiled notebook. The file will appear in the JupyterLab instance that was launched by Coiled. Double-click to open the notebook.

  2. Run the notebook to start the data processing. The notebook is running on a cloud machine and you can monitor the progress on the Dask Dashboard.

  3. Once the processing finishes, the resulting composite rgb_composite_2024.tif will be available in the output folder, which will be automatically synced to your machine.

  4. The resulting Sentinel-2 RGB composite is now available on your machine and can be viewed using QGIS.

  5. Once you are done with processing, remember to stop the notebook server. This will stop the cloud instance.

If you forget to stop the server, you will continue getting charged for the running instance in the cloud. You can always visit the Coiled Dashboard to verify that there are no running clusters.

Supplement

Using Planetary Computer Data Catalog

Open In Colab

This notebook shows how to access Sentinel-2 Level-2A data by querying Microsoft’s Planetary Computer Data Catalog. This catalog is served via a STAC API, with the datasets hosted on Azure Blob Storage. As we are using open standards for data access, the process is largely similar to accessing data from other STAC API endpoints, with a few minor differences based on how different providers have chosen to pre-process the data.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask jupyter-server-proxy \
      planetary_computer
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import os
import planetary_computer as pc
import pyproj
import pystac_client
import rioxarray as rxr
import xarray as xr
from odc import stac
from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Get Satellite Imagery using STAC API

We define a location and time of interest to get some satellite imagery.

latitude = 27.163
longitude = 82.608
year = 2023
# Define a small bounding box around the chosen point
km2deg = 1.0 / 111
x, y = (longitude, latitude)
r = 1 * km2deg  # radius in degrees
bbox = (x - r, y - r, x + r, y + r)

Let’s use the Planetary Computer STAC API search endpoint to look for items from the sentinel-2-l2a collection on Azure Blob Storage.

catalog = pystac_client.Client.open(
    'https://planetarycomputer.microsoft.com/api/stac/v1')

search = catalog.search(
    collections=['sentinel-2-l2a'],
    bbox=bbox,
    datetime=f'{year}',
    query={'eo:cloud_cover': {'lt': 30}},
)
items = search.item_collection()
items

Load the matching images as an XArray Dataset. Accessing data from Planetary Computer is free but requires getting a Shared Access Signature (SAS) token and signing the URLs. The planetary_computer Python package provides a simple mechanism for signing the URLs using its sign() function.

# Load to XArray
ds = stac.load(
    items,
    bands=['red', 'green', 'blue', 'nir', 'SCL'],
    bbox=bbox, # <-- load data only for the bbox
    resolution=10,
    chunks={},  # <-- use Dask
    patch_url=pc.sign,
    groupby='solar_day',
    preserve_original_order=True
)
ds

Processing Data

We have a data cube of multiple scenes collected through the year. As XArray supports vectorized operations, we can work with the entire Dataset the same way we would process a single scene.

The Sentinel-2 scenes come with a NoData value of 0, so we mask these pixels before further processing.

# Mask nodata values
ds = ds.where(ds != 0)

Apply the scale and offset to all spectral bands.

# Apply scale/offset
scale = 0.0001
offset = -0.1
# Select spectral bands (all except the 'SCL' classification band)
data_bands = [band for band in ds.data_vars if band != 'SCL']
for band in data_bands:
  ds[band] = ds[band] * scale + offset

Apply the cloud mask using the Scene Classification Layer (SCL) band. We mask pixels classified as cloud shadow (3), cloud medium probability (8), cloud high probability (9) and thin cirrus (10).

ds = ds[data_bands].where(~ds.SCL.isin([3,8,9,10]))
ds

Calculate NDVI and add it as a data variable.

red = ds['red']
nir = ds['nir']

ndvi = (nir - red)/(nir + red)
ds['ndvi'] = ndvi
ds

Extracting Time-Series

We have a dataset with cloud-masked NDVI values at each pixel of each scene. Remember that none of these values have been computed yet. Dask has built a graph of all the operations required to calculate the results.

We can now query these results for values at our chosen location. Once we run compute(), Dask will fetch the required tiles from the source data and run the operations to give us the results.

Our location coordinates are in EPSG:4326 (Lat/Lon). Convert them to the CRS of the dataset so we can query it.

crs = ds.rio.crs
transformer = pyproj.Transformer.from_crs('EPSG:4326', crs, always_xy=True)
x, y = transformer.transform(longitude, latitude)
x,y

Query NDVI values at the coordinates.

time_series = ds.ndvi \
  .interp(y=y, x=x, method='nearest')

Run the calculation and load the results into memory.

%%time
time_series = time_series.compute()

Plot the time-series.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)

time_series.plot.line(
    ax=ax, x='time',
    marker='o', color='#238b45',
    linestyle='-', linewidth=1, markersize=4)

# Format the x-axis to display dates as YYYY-MM
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))

ax.set_title('NDVI Time-Series')
plt.show()

Interpolate and Smooth the time-series

We use XArray’s excellent time-series processing functionality to smooth the time-series and remove noise.

# As we are processing the time-series,
# it needs to be in a single chunk along the time dimension
time_series = time_series.chunk(dict(time=-1))

First, we resample the time-series to have a value every 5 days and fill the missing values with linear interpolation. Then we apply a moving-window smoothing to remove noise.

time_series_resampled = time_series\
  .resample(time='5d').mean(dim='time').chunk(dict(time=-1))
time_series_interpolated = time_series_resampled \
  .interpolate_na('time', use_coordinate=False)
time_series_smoothed = time_series_interpolated \
  .rolling(time=3, min_periods=1, center=True).mean()
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
time_series.plot.line(
    ax=ax, x='time',
    marker='^', color='#66c2a4',
    linestyle='--', linewidth=1, markersize=2)
time_series_smoothed.plot.line(
    ax=ax, x='time',
    marker='o', color='#238b45',
    linestyle='-', linewidth=1, markersize=4)

# Format the x-axis to display dates as YYYY-MM
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))

ax.set_title('Original vs. Smoothed NDVI Time-Series')

plt.show()

Save the Time-Series.

Convert the extracted time-series to a Pandas DataFrame.

df = time_series_smoothed.to_pandas().reset_index()
df.head()

Save the DataFrame as a CSV file.

output_filename = 'ndvi_time_series.csv'
output_filepath = os.path.join(output_folder, output_filename)
df.to_csv(output_filepath, index=False)

Downloading Sentinel-2 Cloud Free Mosaics

Open In Colab

This notebook shows how to access, process and save a large subset of the Sentinel-2 Cloud Free Temporal Mosaics provided by Earth Genome. The workflow involves accessing the annual cloud-free composites, stacking and scaling the required bands, tiling the region, and saving each tile as a Cloud Optimized GeoTIFF (COG) in a remote Google Cloud Storage (GCS) bucket.

Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install pystac-client odc-stac rioxarray dask[distributed] \
     jupyter-server-proxy
import dask
import duckdb
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pystac_client
import rioxarray as rxr
import tempfile
import xarray as xr
from datetime import datetime
from odc import stac
from osgeo import gdal
from shapely.geometry import box
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Set up a local Dask cluster. This distributes the computation across multiple workers on your computer.

from dask.distributed import Client
client = Client()  # set up local cluster on the machine
client

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))

Select a Region of Interest

We use the FieldMaps GeoParquet file to select a large state in India.

parquet_url = 'https://data.fieldmaps.io/edge-matched/open/intl/adm1_polygons.parquet'
con = duckdb.connect()
con.install_extension('spatial')
con.load_extension('spatial')
country = 'IND'
adm1_name = 'Karnātaka'

query = f'''
SELECT adm1_name, adm1_id, ST_AsText(geometry) AS geometry
FROM read_parquet('{parquet_url}')
WHERE
  adm0_src = '{country}' and
  adm1_name = '{adm1_name}'
'''

admin1_df = con.sql(query).df()
admin1_gdf = gpd.GeoDataFrame(
    admin1_df, geometry=gpd.GeoSeries.from_wkt(admin1_df.geometry), crs='EPSG:4326'
)
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)

admin1_gdf.plot(
    ax=ax,
    facecolor='none',
    edgecolor='#969696')
ax.set_axis_off()
plt.show()
geometry = admin1_gdf.geometry

Earth Genome STAC

Earth Genome provides ready-to-use annual and semi-annual Sentinel-2 mosaics created from L2A scenes. We can use the Earth Genome STAC API Catalog to query for the matching tiles for our year and region of interest.

catalog = pystac_client.Client.open(
    'https://stac.earthgenome.org/')

We define a location and time of interest to get some satellite imagery.

bbox = geometry.total_bounds
year = 2023

Search the catalog for matching tiles for the selected year.

search = catalog.search(
    collections=['sentinel2-temporal-mosaics'],
    bbox=bbox,
    datetime=f'{year}'
)
items = search.item_collection()

# The annual mosaics have a date range that goes from
# 1-Jan-{year} to 1-Jan-{year+1}
# This matches items for 2 years in the above query
# We filter these using the start_datetime and end_datetime
# properties in the item metadata

# Create a start and end datetime strings
start_date = datetime(year, 1, 1).strftime('%Y-%m-%dT%H:%M:%SZ')
end_date = datetime(year + 1, 1, 1).strftime('%Y-%m-%dT%H:%M:%SZ')

items_filtered = [
    item for item in items
    if item.properties['start_datetime'] == start_date
    and item.properties['end_datetime'] == end_date
]

Load STAC Images to XArray

Load the matching images as a XArray Dataset.

ds = stac.load(
    items_filtered,
    bands=['B04', 'B03', 'B02'],
    resolution=10,
    crs='utm',
    chunks={'x': 1000, 'y':1000},  # <-- use Dask
    groupby='solar_day',
    preserve_original_order=True
)
ds

Since all the tiles are for the same year, we can remove the time dimension.

scene = ds.isel(time=0)
scene

Each band of the scene is saved with integer pixel values (data type uint16). This helps save storage costs, as storing the reflectance values as floating point numbers (data type float64) would require much more space. We need to convert the raw pixel values to reflectance by applying the scale factor of 0.0001.

scale = 0.0001
scene = scene*scale

Convert to a DataArray.

scene_da = scene.to_array('band')
scene_da

Create a Grid

Instead of writing a single large file, we can tile the outputs into smaller tiles for ease of access.

We start by creating a grid and specifying the size of each tile in pixels.

TILE_SIZE = 10000

Reproject the geometry to match the CRS of the DataArray.

gdf_proj = admin1_gdf.to_crs(scene_da.rio.crs)
geometry_crs = gdf_proj.geometry
left, bottom, right, top = scene_da.rio.bounds()
minx, miny, maxx, maxy = gdf_proj.total_bounds
res_x = (right - left) / scene_da.sizes['x']
res_y = (top - bottom) / scene_da.sizes['y']
TILE_SIZE_M = TILE_SIZE * res_x  # e.g. 10000 * 10.0 = 100000m

xs = np.arange(minx, maxx, TILE_SIZE_M)
ys = np.arange(miny, maxy, TILE_SIZE_M)

tiles = [
    box(x, y, x + TILE_SIZE_M, y + TILE_SIZE_M)
    for y in sorted(ys, reverse=True)
    for x in xs
]

grid = gpd.GeoDataFrame(geometry=tiles, crs=scene_da.rio.crs)
grid = grid[grid.intersects(gdf_proj.union_all())].reset_index(drop=True)
grid['tile_id'] = [
    f'tile_{(i // len(ys)) + 1:02d}_{(i % len(ys)) + 1:02d}'
    for i in grid.index
]
print(f'{len(grid)} tiles intersect geometry')
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)

gdf_proj.plot(
    ax=ax,
    facecolor='none',
    edgecolor='#969696')

grid.plot(
    ax=ax,
    facecolor='none',
    edgecolor='blue',
    linewidth=0.5)

ax.set_axis_off()
plt.show()

Save the Tiles

We can save the tiles to a Google Cloud Storage (GCS) bucket. You may also just save them locally if you wish.

# Specify your project ID and Bucket name
project_id = 'python-363014'
bucket_name = 'spatialthoughts-public-data'
sub_folder = 'sentinel-2/cloud-free-mosaics'
if 'google.colab' in str(get_ipython()):
  from google.colab import auth
  from google.cloud import storage

  auth.authenticate_user()
  client = storage.Client(project=project_id)
  bucket = client.get_bucket(bucket_name)

  print(f'Google Cloud Storage client initialized for project: {project_id}')
  print(f'Bucket {bucket_name} selected.')

We now fetch the data for each grid tile, clip it and save it as a COG.

geometry_crs = gdf_proj.geometry

gcs_tile_paths = []

for _, row in grid.iterrows():
    tile_id = row['tile_id']
    print(f'Processing {tile_id}')
    tile_bounds = row.geometry.bounds  # (minx, miny, maxx, maxy)

    # Construct the GCS blob path relative to the bucket
    gcs_blob_path = f'{sub_folder}/{tile_id}.tif'
    gcs_tile_paths.append(f'/vsigs/{bucket.name}/{gcs_blob_path}')

    if bucket.blob(gcs_blob_path).exists():
        print(f'Tile {tile_id} already exists in GCS. Skipping.')
        continue

    # Pull only this tile into RAM
    tile = scene_da.rio.clip_box(*tile_bounds).compute()

    try:
        tile_clipped = tile.rio.clip(geometry_crs)
    except Exception:
        # Skip tiles that do not overlap the geometry
        continue

    # Save to a temporary local file first
    with tempfile.NamedTemporaryFile(suffix='.tif', delete=False) as tmp_file:
        local_tile_path = tmp_file.name

    tile_clipped.rio.to_raster(local_tile_path, driver='COG')
    blob = bucket.blob(gcs_blob_path)

    print(f'Uploading {local_tile_path} to gs://{bucket.name}/{gcs_blob_path}')
    blob.upload_from_filename(local_tile_path)

    os.remove(local_tile_path) # Clean up the local temporary file
    print(f'Successfully uploaded {tile_id}.tif')

print('All tiles processed and uploaded to GCS.')
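
If you prefer to keep the tiles on your local filesystem instead of uploading them to GCS, a minimal variation of the same loop might look like the sketch below, writing each COG into the output folder created earlier.

# A sketch: clip each tile and save it locally as a COG
for _, row in grid.iterrows():
    tile_id = row['tile_id']
    tile_bounds = row.geometry.bounds
    local_tile_path = os.path.join(output_folder, f'{tile_id}.tif')
    if os.path.exists(local_tile_path):
        continue  # skip tiles that were already written
    # Pull only this tile into RAM
    tile = scene_da.rio.clip_box(*tile_bounds).compute()
    try:
        tile_clipped = tile.rio.clip(geometry_crs)
    except Exception:
        continue  # tile does not overlap the geometry
    tile_clipped.rio.to_raster(local_tile_path, driver='COG')
    print(f'Saved {local_tile_path}')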

To enable viewing and processing all the tiles as a single mosaic, we create a Virtual Dataset (VRT) file. This VRT references all the individual tiles, and loading it in QGIS or rioxarray gives you the merged mosaic.

with tempfile.NamedTemporaryFile(suffix='.vrt', delete=False) as tmp_file:
    local_vrt_path = tmp_file.name

vrt = gdal.BuildVRT(local_vrt_path, gcs_tile_paths)
vrt.FlushCache()
vrt = None

vrt_blob_path = f'{sub_folder}/mosaic.vrt'
bucket.blob(vrt_blob_path).upload_from_filename(local_vrt_path)
os.remove(local_vrt_path)
print(f'VRT uploaded to gs://{bucket.name}/{vrt_blob_path}')
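
To check the result, you can open the mosaic VRT with rioxarray. The sketch below reads it straight from the bucket via GDAL’s /vsigs/ virtual filesystem, which assumes GDAL has credentials for the bucket; a locally saved VRT opens the same way with a plain file path.

# Open the VRT as a lazily-loaded mosaic
mosaic = rxr.open_rasterio(
    f'/vsigs/{bucket_name}/{vrt_blob_path}',
    chunks={'x': 1000, 'y': 1000})
mosaic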

Data Credits

License

This workshop material is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to re-use and adapt the material but are required to give appropriate credit to the original author as below:

Scalable Remote Sensing Workflows with Xarray workshop by Ujaval Gandhi www.spatialthoughts.com


© 2025 Spatial Thoughts www.spatialthoughts.com

