Note: This course is under active development and not complete yet!
This is an advanced-level course that teaches you how to use open-source Python libraries to process earth observation datasets using cloud and parallel-computing technologies. We have designed this course to help you build the required skills in a structured way and to offer flexibility in where your data is stored and how it is processed.
XArray has emerged as one of the key Python libraries to work with gridded raster datasets. It can natively handle time-series data, making it ideal for working with remote sensing datasets. It builds on NumPy/Pandas for fast arrays/indexing and is orders of magnitude faster than other Python libraries like rasterio. It has a growing ecosystem of extensions (rioxarray, xarray-spatial, XEE and more) allowing it to be used for geospatial analysis. In this section, we will learn the fundamentals of XArray with a focus on earth observation data processing. XArray offers the flexibility to seamlessly work with local datasets along with cloud-hosted datasets in a variety of optimized data formats.
In this section, we will take a single Sentinel-2 L1C scene downloaded from the Copernicus Browser and learn how to read it using XArray, visualize it and compute spectral indices.
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!pip install rioxarray
!pip install jupyter-server-proxy
import os
import matplotlib.pyplot as plt
import xarray as xr
import rioxarray as rxr
import zipfile
import geopandas as gpd
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
s2_scene = 'S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE.zip'
data_url = 'https://storage.googleapis.com/spatialthoughts-public-data/'
download(data_url + 's2/' + s2_scene)
aoi = 'bangalore.geojson'
download(data_url + aoi)
We first unzip the archive and create an XArray Dataset from the individual band images.
s2_filepath = os.path.join(data_folder, s2_scene)
with zipfile.ZipFile(s2_filepath) as zf:
zf.extractall(data_folder)
Sentinel-2 images come as individual JPEG2000 rasters for each band. The image files are located in the GRANULE/{SCENE_ID}/IMG_DATA/ subfolder. We find the files and read them using rioxarray.
import glob
s2_folder = s2_filepath[:-4]
band_files = {}
for filepath in glob.glob(
os.path.join(s2_folder, 'GRANULE', '*', 'IMG_DATA', '*B*.jp2')):
filename = os.path.basename(filepath)
# Extract the part of the filename containing band name such as 'B01'
band_name = os.path.splitext(filename)[0].split('_')[-1]
band_files[band_name] = filepath
band_files
{'B05': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B05.jp2',
'B8A': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B8A.jp2',
'B07': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B07.jp2',
'B01': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B01.jp2',
'B12': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B12.jp2',
'B10': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B10.jp2',
'B08': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B08.jp2',
'B06': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B06.jp2',
'B04': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B04.jp2',
'B09': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B09.jp2',
'B02': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B02.jp2',
'B11': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B11.jp2',
'B03': 'data/S2A_MSIL1C_20230212T050931_N0509_R019_T43PGQ_20230212T065641.SAFE/GRANULE/L1C_T43PGQ_A039913_20230212T051514/IMG_DATA/T43PGQ_20230212T050931_B03.jp2'}
Different Sentinel-2 bands have different spatial resolutions. To put them in the same array, their dimensions must match. So we combine bands of similar resolutions and resample others to match them.
b4 = rxr.open_rasterio(band_files['B04'], chunks=True)
b3 = rxr.open_rasterio(band_files['B03'], chunks=True)
b2 = rxr.open_rasterio(band_files['B02'], chunks=True)
b8 = rxr.open_rasterio(band_files['B08'], chunks=True)
stack1 = xr.concat([b4, b3, b2, b8], dim='band').assign_coords(
    band=['B04', 'B03', 'B02', 'B08'])
b5 = rxr.open_rasterio(band_files['B05'], chunks=True)
b6 = rxr.open_rasterio(band_files['B06'], chunks=True)
b7 = rxr.open_rasterio(band_files['B07'], chunks=True)
b8a = rxr.open_rasterio(band_files['B8A'], chunks=True)
b11 = rxr.open_rasterio(band_files['B11'], chunks=True)
b12 = rxr.open_rasterio(band_files['B12'], chunks=True)
stack2 = xr.concat([b5, b6, b7, b8a, b11, b12], dim='band').assign_coords(
band=['B05', 'B06', 'B07', 'B8A', 'B11', 'B12'])
Now we reproject the 20m bands to match the resolution of the 10m bands using reproject_match().
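The reprojection code is not shown in this extract; a minimal sketch using rioxarray's reproject_match(), assuming the stack1 and stack2 arrays created above:

# Resample the 20m stack to the grid of the 10m stack
stack2 = stack2.rio.reproject_match(stack1)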
Both band stacks now have the same resolution and hence the same X and Y dimensions. We can now combine them across the band dimension. Sentinel-2 scenes come with a NoData value of 0, so we set the correct NoData value before further processing.
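The combine and NoData steps are likewise not shown here; a sketch of one way to do it:

# Combine both stacks along the band dimension
scene = xr.concat([stack1, stack2], dim='band')
# Sentinel-2 L1C uses 0 as NoData; mask those pixels
scene = scene.where(scene != 0)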
We will clip the scene to the geometry of the GeoJSON file. We must ensure that the projection of the GeoDataFrame and the XArray DataArray match.
aoi_path = os.path.join(data_folder, aoi)
aoi_gdf = gpd.read_file(aoi_path)
geometry = aoi_gdf.to_crs(scene.rio.crs).geometry
clipped = scene.rio.clip(geometry)
clipped
Sentinel-2 pixel values need to be converted to reflectances using the following formula:

ρ = (L1C_DN + RADIO_ADD_OFFSET) / QUANTIFICATION_VALUE

QUANTIFICATION_VALUE for all S2 scenes is 10000 and, starting from January 22, 2023, all new scenes have a RADIO_ADD_OFFSET of -1000.
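The scaling code is not included above; applying the formula to the clipped scene might look like this (a sketch, assuming the clipped array from the previous step):

QUANTIFICATION_VALUE = 10000
RADIO_ADD_OFFSET = -1000
scaled = (clipped + RADIO_ADD_OFFSET) / QUANTIFICATION_VALUE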
We can visualize a 3-band composite image.
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
scaled.sel(band=['B04', 'B03', 'B02']).plot.imshow(
ax=ax,
robust=True)
ax.set_title('RGB Visualization')
ax.set_axis_off()
plt.show()
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
scaled.sel(band=['B08', 'B04', 'B03']).plot.imshow(
ax=ax,
robust=True)
ax.set_title('NRG Visualization')
ax.set_axis_off()
plt.show()
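The NDVI computation itself does not appear in this extract; a minimal sketch, assuming the scaled reflectance array from above (the cbar_kwargs dictionary used by the index plots below is also one plausible definition):

red = scaled.sel(band='B04')
nir = scaled.sel(band='B08')
ndvi = (nir - red) / (nir + red)

# Shared colorbar settings for the index plots
cbar_kwargs = {
    'orientation': 'horizontal',
    'fraction': 0.025,
    'pad': 0.05,
}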
Let’s visualize the results.
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
ndvi.plot.imshow(
ax=ax,
cmap='Greens',
robust=True,
cbar_kwargs=cbar_kwargs)
ax.set_title('NDVI')
ax.set_axis_off()
plt.show()
green = scaled.sel(band='B03')
swir1 = scaled.sel(band='B11')
mndwi = (green - swir1)/(green + swir1)
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
mndwi.plot.imshow(
ax=ax,
cmap='Blues',
robust=True,
cbar_kwargs=cbar_kwargs)
ax.set_title('MNDWI')
ax.set_axis_off()
plt.show()
red = scaled.sel(band='B04')
nir = scaled.sel(band='B08')
savi = 1.5 * ((nir - red) / (nir + red + 0.5))
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
savi.plot.imshow(
ax=ax,
cmap='Greens',
robust=True,
cbar_kwargs=cbar_kwargs)
ax.set_title('SAVI')
ax.set_axis_off()
plt.show()
Calculate the Normalized Difference Built-Up Index (NDBI) for the image and visualize the built-up area using a ‘red’ palette.
Hint: NDBI = (SWIR1 - NIR) / (SWIR1 + NIR)
In this section, we will take a single Landsat 8 scene downloaded from the USGS EarthExplorer and learn how to read it using XArray, visualize it and mask clouds.
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!pip install rioxarray
!pip install jupyter-server-proxy
import os
import matplotlib.pyplot as plt
import xarray as xr
import rioxarray as rxr
import zipfile
import numpy as np
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
landsat_scene = 'LC08_L2SP_144052_20190209_20200829_02_T1.zip'
data_url = 'https://storage.googleapis.com/spatialthoughts-public-data/'
download(data_url + 'landsat/' + landsat_scene)
We first unzip the archive and create an XArray Dataset from the individual band images.
landsat_filepath = os.path.join(data_folder, landsat_scene)
with zipfile.ZipFile(landsat_filepath) as zf:
zf.extractall(data_folder)
Landsat images come as individual GeoTIFF rasters for each band. The image band files containing data from the OLI sensor are named such as *_SR_B1.TIF, *_SR_B2.TIF etc. We find the files and read them using rioxarray.
import glob
band_files = {}
for filepath in glob.glob(
os.path.join(data_folder, '*SR_B*.TIF')):
filename = os.path.basename(filepath)
# Extract the part of the filename containing the band name such as 'B1'
band_name = os.path.splitext(filename)[0].split('_')[-1]
band_files[band_name] = filepath
band_files
We read the bands using rioxarray and combine them into a single DataArray.
bands = []
for band_name, band_path in band_files.items():
ds = rxr.open_rasterio(band_path, chunks=True, masked=True)
ds.name = band_name
bands.append(ds)
scene = xr.concat(bands, dim='band').assign_coords(
    band=list(band_files.keys()))
scene
Landsat Collection 2 surface reflectance has a scale factor of 0.0000275 and an additional offset of -0.2 per pixel. We apply the scale and offset to get the reflectance values.
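The scaling step is not shown here; a minimal sketch, assuming the scene array created above:

# Apply the Collection 2 scale factor and offset
scaled = scene * 0.0000275 - 0.2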
We can visualize a 3-band composite image.
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
scaled.sel(band=['B4', 'B3', 'B2']).plot.imshow(
ax=ax,
robust=True)
ax.set_title('RGB Visualization')
ax.set_axis_off()
plt.show()
Most satellite providers store pixel quality information as bitmasks. This allows the providers to store a lot more information in an efficient way. Bitmasks can be hard to understand. Please read our post on Working with QA Bands and Bitmasks in Google Earth Engine to understand the concepts. Here we apply the same concept using a Python implementation of bitmasking adapted from Google Earth Engine.
Read the QA band which comes with a filename ending with *QA_PIXEL.TIF.
for qa_filepath in glob.glob(
os.path.join(data_folder, '*QA_PIXEL.TIF')):
break
pixel_qa = rxr.open_rasterio(qa_filepath, chunks=True, masked=True)
pixel_qa.name = 'pixel_qa'
pixel_qa
The 16-bit pixel value is determined by setting each of the bits to 0 or 1.
Since we are interested in clouds, the first 5 bits are relevant:
Bit | Description |
---|---|
0 | Fill |
1 | Dilated Cloud |
2 | Cirrus |
3 | Cloud |
4 | Cloud Shadow |
We create a mask using the np.bitwise_and() function against the binary number 11111. The result will be 0 only if all the bits are set to 0, indicating clear conditions.
mask = xr.full_like(pixel_qa, int('11111', 2))
qa_mask = np.bitwise_and(pixel_qa, mask) == (0)
qa_mask
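The step that applies this mask to the scene is not shown; one possible sketch, assuming the scaled array from earlier (the extra band dimension of the QA raster is dropped so the mask broadcasts over all bands):

# Keep only pixels where the QA mask indicates clear conditions
scene_masked = scaled.where(qa_mask.squeeze('band', drop=True))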
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
scene_masked.sel(band=['B4', 'B3', 'B2']).plot.imshow(
ax=ax,
robust=True)
ax.set_title('RGB Visualization')
ax.set_axis_off()
plt.show()
Let’s write the resulting masked scene as a multi-band GeoTIFF file using rioxarray. We convert the DataArray to a Dataset with each band as a separate variable. This results in a much nicer GeoTIFF that is compatible with desktop software.
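A sketch of this export, assuming the scene_masked array and a hypothetical output filename:

# Convert each band to a separate Dataset variable
dataset = scene_masked.to_dataset(dim='band')
output_file = 'landsat_masked.tif'  # hypothetical filename
output_path = os.path.join(output_folder, output_file)
dataset.rio.to_raster(output_path, compress='deflate')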
In this section, we will work with MODIS Vegetation Indices Version 6.1 data, which are generated every 16 days at 250 meter (m) spatial resolution. This dataset provides the Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) processed from the MODIS data. The source data was downloaded and processed via AppEEARS, which allows for pre-processing of the data into user-specified AOIs.
Here we will be working with GeoTIFF files in EPSG:4326 projection, clipped to the state boundary of Karnataka, India. We will learn how to load all the files into a single XArray Dataset, apply a cloud-mask and extract a time-series of NDVI and EVI values.
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!apt install libspatialindex-dev
!pip install fiona shapely pyproj rtree
!pip install geopandas
!pip install rioxarray
import datetime
import glob
import os
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import xarray as xr
import rioxarray as rxr
import zipfile
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
filename = 'modis_vegetation_indices_2020.zip'
url = 'https://storage.googleapis.com/spatialthoughts-public-data/' + filename
download(url)
Downloaded data/modis_vegetation_indices_2020.zip
First we unzip and extract the images to a folder.
zipfile_path = os.path.join(data_folder, filename)
with zipfile.ZipFile(zipfile_path) as zf:
zf.extractall(data_folder)
The timestamp for the image is part of the filename. We write a function to parse the filename and extract the timestamps.
def path_to_datetimeindex(filepath):
filename = os.path.basename(filepath)
pattern = r'doy(\d+)'
match = re.search(pattern, filepath)
if match:
doy_value = match.group(1)
timestamp = datetime.datetime.strptime(doy_value, '%Y%j')
return timestamp
else:
print('Could not extract DOY from filename', filename)
timestamps = []
filepaths = []
files = os.path.join(data_folder, 'modis_vegetation_indices_2020', '*.tif')
for filepath in glob.glob(files):
timestamp = path_to_datetimeindex(filepath)
filepaths.append(filepath)
timestamps.append(timestamp)
unique_timestamps = sorted(set(timestamps))
Iterate through each file and read it using XArray. There are 3 files for each time-step: one file each for NDVI, EVI and DetailedQA information. We extract the first 2 bits of information from the DetailedQA band into a new band named SummaryQA. These 4 bands are combined into a single array for each time-step, and then all the arrays are combined into a single array with a time dimension.
scenes = []
for timestamp in unique_timestamps:
ndvi_filepattern = r'NDVI_doy{}'.format(timestamp.strftime('%Y%j'))
evi_filepattern = r'EVI_doy{}'.format(timestamp.strftime('%Y%j'))
qa_filepattern = r'VI_Quality_doy{}'.format(timestamp.strftime('%Y%j'))
ndvi_filepath = [filepath for filepath in filepaths if re.search(ndvi_filepattern, filepath)][0]
evi_filepath = [filepath for filepath in filepaths if re.search(evi_filepattern, filepath)][0]
qa_filepath = [filepath for filepath in filepaths if re.search(qa_filepattern, filepath)][0]
ndvi_band = rxr.open_rasterio(ndvi_filepath, chunks={'x':512, 'y':512})
ndvi_band.name = 'NDVI'
evi_band = rxr.open_rasterio(evi_filepath, chunks={'x':512, 'y':512})
evi_band.name = 'EVI'
qa_band = rxr.open_rasterio(qa_filepath, chunks={'x':512, 'y':512})
qa_band.name = 'DetailedQA'
# First 2 bits are MODLAND QA Bits with summary information
# Extract the value of the 2-bits using bitwise AND operation
summary_qa = qa_band & 0b11
summary_qa.name = 'SummaryQA'
# Pixel values need to be scaled with the scale_factor
scale_factor = 0.0001
scaled_bands = [ndvi_band * scale_factor, evi_band * scale_factor, qa_band, summary_qa]
scene = xr.concat(scaled_bands, dim='band')
scenes.append(scene)
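The final concatenation along the time dimension is not shown; a minimal sketch that also names the bands (the band order follows the scaled_bands list above):

time_series_data = xr.concat(
    scenes, dim=pd.Index(unique_timestamps, name='time'))
time_series_data = time_series_data.assign_coords(
    band=['NDVI', 'EVI', 'DetailedQA', 'SummaryQA'])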
As described in the MODIS VI User Guide, the MOD13Q1/A1 QA band contains pixel values representing overall pixel quality.
Value | Summary QA | Description |
---|---|---|
-1 | Fill/No Data | Not Processed |
0 | Good Data | Use with confidence |
1 | Marginal data | Useful, but look at other QA information |
2 | Snow/Ice | Target covered with snow/ice |
3 | Cloudy | Target not visible, covered with cloud |
We select all pixels with values 0 or 1 and mask all pixels with snow or cloud.
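A sketch of this masking, assuming the time_series_data array from the previous step:

summary_qa = time_series_data.sel(band='SummaryQA')
# Keep only pixels flagged as Good (0) or Marginal (1)
time_series_data_masked = time_series_data.sel(
    band=['NDVI', 'EVI']).where(summary_qa <= 1)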
We extract the time-series at a specific X,Y coordinate.
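The extraction code is not included here; a sketch using a hypothetical coordinate within Karnataka:

# Hypothetical location (longitude, latitude in EPSG:4326)
x, y = 77.58, 12.97
time_series = time_series_data.sel(
    band=['NDVI', 'EVI']).interp(x=x, y=y)
time_series_masked = time_series_data_masked.interp(x=x, y=y)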
This time-series has cloudy pixels which show up as anomalous NDVI and EVI values.
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
time_series.plot.line(ax=ax, x='time', marker='o', linestyle='-', linewidth=1)
plt.show()
Let’s plot the cloud-masked time-series.
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
time_series_masked.plot.line(ax=ax, x='time', marker='o', linestyle='-', linewidth=1)
plt.show()
This looks much better, but we have gaps in the time-series due to masked cloudy-pixels. Replace them with linearly interpolated values.
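A minimal sketch of the gap-filling, using XArray's interpolate_na():

time_series_masked_interpolated = time_series_masked.interpolate_na(
    dim='time', method='linear')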
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
time_series_masked_interpolated.plot.line(ax=ax, x='time', marker='o', linestyle='-', linewidth=1)
plt.show()
We can set the series colors to our choice of colors using Matplotlib Cycler.
from cycler import cycler
import matplotlib as mpl
colorlist = ['#006837', '#78c679']
custom_cycler = cycler(color=colorlist)
fig, ax = plt.subplots(1, 1)
ax.set_prop_cycle(custom_cycler)
fig.set_size_inches(15, 7)
time_series_masked_interpolated.plot.line(ax=ax, x='time', marker='o', linestyle='-', linewidth=1)
plt.show()
We can use the to_pandas() method to convert the DataArray into a Pandas DataFrame. The masked pixels are NaN values that we can fill with a NoData value of -9999.
Save the DataFrame as a CSV file.
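A sketch of these two steps, with a hypothetical output filename:

df = time_series_masked.to_pandas()
# Replace NaNs from masked pixels with a NoData value
df = df.fillna(-9999)
output_path = os.path.join(output_folder, 'time_series.csv')  # hypothetical name
df.to_csv(output_path)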
Create another CSV file interpolated_time_series.csv containing the linearly interpolated NDVI and EVI values.
Google Earth Engine (GEE) is a cloud-based platform that has a large public data catalog and computing infrastructure. The XEE extension makes it possible to obtain a pre-processed data cube directly from GEE as an XArray Dataset. In this section, we will learn how to process GEE data using XArray and Dask on local compute infrastructure, using the time-series processing capabilities of XArray.
We will obtain a datacube of cloud-masked Sentinel-2 images for a year over a chosen region and apply temporal aggregation to obtain mean monthly NDVI images.
Note: You must have a Google Earth Engine account to complete this section. If you do not have one, follow our guide to sign up.
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!pip install --upgrade xee
!pip install rioxarray
import ee
import xarray
import rioxarray as rxr
import matplotlib.pyplot as plt
import pandas as pd
import os
import datetime
import numpy as np
We initialize the Google Earth Engine API. You must have a Google Cloud Project associated with your GEE account. Replace the cloud_project with your own project from the Google Cloud Console. Here we are using the High Volume Endpoint, which is recommended when working with XEE.
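A sketch of the initialization, with a placeholder project id:

cloud_project = 'your-project-id'  # replace with your Google Cloud project

try:
    ee.Initialize(
        project=cloud_project,
        opt_url='https://earthengine-highvolume.googleapis.com')
except Exception:
    # Authenticate first if no stored credentials exist
    ee.Authenticate()
    ee.Initialize(
        project=cloud_project,
        opt_url='https://earthengine-highvolume.googleapis.com')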
We start with the Sentinel-2 collection, apply a cloud mask and compute NDVI using the Google Earth Engine API.
s2 = ee.ImageCollection('COPERNICUS/S2_HARMONIZED')
geometry = ee.Geometry.Polygon([[
[82.60642647743225, 27.16350437805251],
[82.60984897613525, 27.1618529901377],
[82.61088967323303, 27.163695288375266],
[82.60757446289062, 27.16517483230927]
]])
filtered = s2 \
.filter(ee.Filter.date('2019-01-01', '2020-01-01')) \
.filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 30)) \
.filter(ee.Filter.bounds(geometry))
# Write a function for Cloud masking
def maskS2clouds(image):
qa = image.select('QA60')
cloudBitMask = 1 << 10
cirrusBitMask = 1 << 11
mask = qa.bitwiseAnd(cloudBitMask).eq(0).And(
qa.bitwiseAnd(cirrusBitMask).eq(0))
return image.updateMask(mask).multiply(0.0001) \
.select('B.*') \
.copyProperties(image, ['system:time_start'])
filtered = filtered.map(maskS2clouds)
# Write a function that computes NDVI for an image and adds it as a band
def addNDVI(image):
ndvi = image.normalizedDifference(['B8', 'B4']).rename('ndvi')
return image.addBands(ndvi)
# Map the function over the collection
withNdvi = filtered.map(addNDVI)
Now we have an ImageCollection that we want to get as an XArray Dataset. We define the region of interest and extract the ImageCollection using the ‘ee’ engine.
scale in XEE behaves differently than in the GEE API. The value of scale is in the units of the CRS. So to get the data at 10 meter resolution, we must also specify a CRS that has meters as its unit.
ds = xarray.open_dataset(
withNdvi.select('ndvi'),
engine='ee',
crs='EPSG:3857',
scale=10,
geometry=geometry,
)
Since we want to work with coordinates in latitude and longitude, we reproject the raster to EPSG:4326. This requires renaming and reordering the dimensions to those expected by rioxarray.
ds_reprojected = ds\
.rename({'Y': 'y', 'X': 'x'}) \
.transpose('time', 'y', 'x') \
.rio.reproject('EPSG:4326')
ds_reprojected
Select the ndvi variable. Split the datacube into ‘chunks’ to allow parallel processing using Dask.
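A minimal sketch of this step (the chunk sizes are an assumption):

original_time_series = ds_reprojected['ndvi'].chunk(
    {'time': 1, 'x': 256, 'y': 256})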
We have an irregular time-series with a lot of noise. Let’s create a monthly NDVI time-series by computing the monthly average NDVI. We can use XArray’s resample() method to aggregate the time-series by month and compute the mean. We specify time as MS to indicate aggregating by month start and label as left to indicate we want the start of the month as our series labels. Reference: Pandas Offset Aliases
time_series_aggregated = original_time_series.resample(time='MS', label='left').mean()
time_series_aggregated
Plot and Extract the Time-Series at a Single Location
original_ts = original_time_series.interp(x=82.607376, y=27.164335).compute()
aggregated_ts = time_series_aggregated.interp(x=82.607376, y=27.164335).compute()
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
original_ts.plot.line(
ax=ax, x='time',
marker='^', color='#66c2a4', linestyle='--', linewidth=1, markersize=2)
aggregated_ts.plot.line(
ax=ax, x='time',
marker='o', color='#238b45', linestyle='-', linewidth=1, markersize=4)
ax.set_title('Original vs. Monthly NDVI Time-Series', size = 18)
plt.show()
Save the processed monthly NDVI images using rioxarray as GeoTIFF files.
for time in time_series_aggregated.time.values:
image = time_series_aggregated.sel(time=time)
date = np.datetime_as_string(time, unit='M')
output_file = f'{date}.tif'
output_path = os.path.join(output_folder, output_file)
image.rio.to_raster(output_path, driver='COG')
print(f'Created file at {output_path}')
Aggregate the time-series to a 15-day interval. Apply gap-filling to fill missing values.
Hint: Refer to the Pandas Offset Aliases reference for the appropriate string value for the time parameter.
Dask is a Python library to run your computation in parallel across many machines. Dask has built-in support for key geospatial packages like XArray and Pandas, allowing you to scale your computation easily. You can choose to run your code in parallel on your laptop, a machine in the cloud, or a local or cloud cluster of machines.
Zonal statistics are used to summarize the values of a raster dataset within the zones of a vector dataset. Here we select all Admin1 units of a country and calculate the sum of nighttime light pixel intensities over multiple years. This is a large computation that is enabled by cloud-hosted NightTime Lights data in the Cloud-Optimized GeoTIFF (COG) format and parallel computing on a local Dask cluster.
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!apt install libspatialindex-dev
!pip install fiona shapely pyproj rtree
!pip install geopandas
!pip install rioxarray
!pip install regionmask
import os
import glob
import pandas as pd
import geopandas as gpd
import numpy as np
import xarray as xr
import rioxarray as rxr
import matplotlib.pyplot as plt
from datetime import datetime
import dask
import regionmask
import warnings
import rasterio
warnings.filterwarnings("ignore", category=rasterio.RasterioDeprecationWarning)
from dask.distributed import Client, progress
client = Client() # set up local cluster on the machine
client
dask.config.set({"array.slicing.split_large_chunks": False})
dask.config.set({'array.chunk-size': '32MiB'})
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
admin1_zipfile = 'ne_10m_admin_1_states_provinces.zip'
admin1_url = 'https://naciscdn.org/naturalearth/10m/cultural/'
download(admin1_url + admin1_zipfile)
First we will read the Natural Earth Admin1 shapefile and filter it to a country. The ‘adm1_code’ column contains a unique id for each admin1 region, but it is of object type. We need to convert it to int type to be used in xarray.
country_code = 'LK'
admin1_file_path = os.path.join(data_folder, admin1_zipfile)
admin1_df = gpd.read_file(admin1_file_path)
zones = admin1_df[admin1_df['iso_a2'] == country_code][['adm1_code', 'name', 'iso_a2', 'geometry']].copy()
zones['id'] = zones.reset_index().index + 1
zones
Next we read the NTL files and create an XArray object. These files were downloaded from Harvard Dataverse and converted to Cloud-Optimized GeoTIFFs using GDAL. Example command:
gdalwarp -of COG 2021_HasMask/LongNTL_2021.tif 2021.tif \
-te -180 -90 180 90 -dstnodata 0 \
-co COMPRESS=DEFLATE -co PREDICTOR=2 -co NUM_THREADS=ALL_CPUS
The resulting files are now hosted on a Google Cloud Storage bucket.
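The loop below assumes a year range and a bounding box have been defined; a minimal sketch (the names start_year, end_year and bbox, and the year range itself, are assumptions):

start_year, end_year = 2015, 2021  # assumed year range
# Bounding box (minx, miny, maxx, maxy) of the selected zones for clipping
bbox = zones.total_bounds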
ntl_folder = 'https://storage.googleapis.com/spatialthoughts-public-data/ntl/npp_viirs_ntl'
da_list = []
for year in range(start_year, end_year + 1):
cog_url = f'{ntl_folder}/{year}.tif'
da = rxr.open_rasterio(
cog_url,
chunks=True).rio.clip_box(*bbox)
dt = pd.to_datetime(year, format='%Y')
da = da.assign_coords(time = dt)
da = da.expand_dims(dim="time")
da_list.append(da)
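The concatenation into a single datacube is not shown; a minimal sketch:

ntl_datacube = xr.concat(da_list, dim='time')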
Now we will extract the sum of the raster pixel values for every admin1 region in the selected country.
First, we need to convert the GeoDataFrame to an XArray object. We will be using the regionmask module for that. It takes the GeoDataFrame and its column of unique integer ids, and converts the geometries into an xarray mask with the same dimensions and coordinates as the given input xarray dataset.
# Create mask of multiple regions from shapefile
mask = regionmask.mask_3D_geopandas(
zones,
ntl_datacube.x,
ntl_datacube.y,
drop=True,
numbers="id",
overlap=True
)
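The aggregation step itself is not shown; a sketch that applies the 3D mask and sums over the spatial dimensions:

# Masked sum per region; the 'region' dimension comes from the 3D mask
ntl_masked = ntl_datacube.where(mask)
sums = ntl_masked.sum(dim=['x', 'y'])
result = sums.compute()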
Finally, we save the result to disk.
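A sketch of the export, with a hypothetical output filename:

df = result.squeeze().to_dataframe(name='sum').reset_index()
output_path = os.path.join(output_folder, 'ntl_zonal_stats.csv')  # hypothetical name
df.to_csv(output_path, index=False)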
Change the code to calculate the Zonal Statistics for all admin1 units in your chosen country.
We have many high-quality global climate datasets that have historic data on various climatic variables. Using cloud-hosted datasets and XArray, we can compute pixel-wise long-term trends. This analysis helps us identify hotspots experiencing extreme climate change.
We will be using the TerraClimate gridded dataset of monthly climate and climatic water balance at high spatial resolution (~4km grid size) with a long time-series (1958-current). This is a large dataset that is hosted on a THREDDS Data Server (TDS) and served using the OPeNDAP (Open Data Access Protocol) protocol. XArray has built-in support to efficiently read and process OPeNDAP data, where we can stream and process only the required pixels without downloading the entire dataset.
This notebook shows how to extract a time-series of monthly maximum temperatures and compute the per-pixel linear trend. The results are processed in a distributed manner using Dask and saved as a GeoTIFF file.
%%capture
if 'google.colab' in str(get_ipython()):
!pip install geopandas rioxarray cartopy dask[distributed] netCDF4
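The import cell is not included in this extract; the code below assumes imports and folders along these lines:

import os
import geopandas as gpd
import matplotlib.pyplot as plt
import xarray as xr
import rioxarray as rxr  # registers the .rio accessor
import cartopy.crs as ccrs
from dask.distributed import Client

data_folder = 'data'
output_folder = 'output'

for folder in [data_folder, output_folder]:
    if not os.path.exists(folder):
        os.mkdir(folder)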
geoBoundaries is an open database of political administrative boundaries at different administrative levels. We download the Admin1 boundaries.
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print(f'Downloaded {filename}')
else:
print(f'File {filename} exists.')
admin_boundaries = 'geoBoundariesCGAZ_ADM1.gpkg'
geoboundaries_url = 'https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/'
download(geoboundaries_url + admin_boundaries)
admin_boundaries_filepath = os.path.join(data_folder, admin_boundaries)
admin_gdf = gpd.read_file(admin_boundaries_filepath, encoding='utf-8')
Select the country by its 3-letter ISO code. Here we use the code GHA for Ghana.
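The selection code is not shown; a sketch that assumes the geoBoundaries CGAZ column names:

country = 'GHA'  # 3-letter ISO code for Ghana
country_gdf = admin_gdf[admin_gdf['shapeGroup'] == country]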
Select a region. We select the Greater Accra Region.
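Similarly, a sketch of the region selection (the exact shapeName value is an assumption):

region = 'Greater Accra Region'  # assumed shapeName value
region_gdf = country_gdf[country_gdf['shapeName'] == region]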
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
region_gdf.plot(
ax=ax,
edgecolor='#969696',
facecolor='none',
alpha=0.5)
ax.set_axis_off()
plt.show()
Get the bounding box to filter the XArray dataset.
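A minimal sketch using the GeoDataFrame's total bounds:

lon_min, lat_min, lon_max, lat_max = region_gdf.total_bounds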
Set up a local Dask cluster.
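A sketch of the cluster setup (Client is imported in the assumed import cell above):

client = Client()  # set up local cluster on the machine
client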
We want to analyze the long-term yearly trend of monthly maximum temperatures. Since the input data comes as monthly values, we first aggregate the data to yearly time steps.
terraclimate_url = 'http://thredds.northwestknowledge.net:8080/thredds/dodsC/'
variable = 'tmax'
filename = f'agg_terraclimate_{variable}_1958_CurrentYear_GLOBE.nc'
remote_file_path = os.path.join(terraclimate_url, filename)
ds = xr.open_dataset(
remote_file_path,
chunks='auto',
engine='netcdf4',
)
ds
# Select the variable
da = ds.tmax
# Select data within the bounding box
# Make sure the data is sorted for slice() to work correctly
da = da.sortby([da.lon, da.lat])
# Add 0.1 degree buffer
da_subset = da.sel(
lon=slice(lon_min - 0.1, lon_max + 0.1),
lat=slice(lat_min - 0.1, lat_max + 0.1)
)
da_subset
Aggregate the data to yearly means.
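The aggregation code is not shown; a minimal sketch using groupby:

da_yearly = da_subset.groupby('time.year').mean(dim='time')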
Process and load the data into memory. This may take a few minutes. Check the Dask dashboard to see the progress.
Note: The Dask dashboard is not available on Colab.
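A sketch of the compute step:

da_yearly = da_yearly.compute()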
Plot a time-series at a single location to see the trend.
# Choose a location within the region for plots
location = (5.65, -0.20) # Accra
time_series = da_yearly.interp(lat=location[0], lon=location[1])
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)
time_series.plot.line(ax=ax, x='year', marker='o', linestyle='-', linewidth=1)
ax.set_title('Annual Mean Monthly Maximum Temperature')
plt.show()
We can fit a polynomial at each pixel of the DataArray using the xarray.DataArray.polyfit method. Here we fit a linear trendline. As our input dataset has yearly time steps, the slope of the fitted trendline will be in degrees per year. To make the trend more interpretable, we multiply it by 100 to get the results in the unit degrees per century.
trend = da_yearly.polyfit('year', 1) # fit polynomial of degree 1
slope = trend.polyfit_coefficients[0,...] * 100 # per year -> per century
Visualize the slope of the trendline.
projection = ccrs.PlateCarree()
cbar_kwargs = {
'orientation':'horizontal',
'fraction': 0.025,
'pad': 0.05,
'extend':'neither',
'label': '°C per century'
}
fig, ax = plt.subplots(1, 1, subplot_kw={'projection': projection})
fig.set_size_inches(8, 8)
slope.plot.imshow(
ax=ax,
cmap='YlOrBr',
transform=ccrs.PlateCarree(),
add_labels=False,
cbar_kwargs=cbar_kwargs)
region_gdf.plot(
ax=ax,
edgecolor='#000000',
facecolor='none',
alpha=0.5)
ax.set_extent((lon_min,lon_max,lat_min,lat_max), crs = ccrs.PlateCarree())
plt.title(f'Slope of Monthly Maximum Temperature Trend', fontsize = 12)
plt.show()
Save the resulting slope raster as a GeoTIFF.
# Assign a CRS to the DataArray
slope.rio.write_crs('EPSG:4326', inplace=True)
# Clip to the GeoDataFrame boundary
clipped = slope.rio.clip(region_gdf.geometry.values)
# Write the file
output_slope_file = f'{variable}_slope.tif'
output_slope_path = os.path.join(output_folder, output_slope_file)
if not os.path.exists(output_slope_path):
clipped.rio.to_raster(output_slope_path)
print('Saved the file at ', output_slope_path)
We can also save the yearly aggregated subset as a NetCDF for other downstream analysis.
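A sketch of the NetCDF export, with a hypothetical filename:

output_nc_file = f'{variable}_yearly.nc'  # hypothetical name
output_nc_path = os.path.join(output_folder, output_nc_file)
da_yearly.to_netcdf(output_nc_path)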
Carry out the trend analysis for your region of interest. Replace the country and region with a chosen Admin-2 region.
This course material is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to re-use and adapt the material but are required to give appropriate credit to the original author as below:
Cloud-based Remote Sensing with Python Course by Ujaval Gandhi www.spatialthoughts.com
This course is offered as an instructor-led online class. Visit Spatial Thoughts to know details of upcoming sessions.
© 2023 Spatial Thoughts www.spatialthoughts.com