Data integration with ICESat-2 - Part IĀ¶

Credits

  • Zach Fair

  • Ian Joughin

  • Tasha Snow

Learning Objectives

Goals

  • Identify and locate non-ICESat-2 data sets

  • Acquire data from the cloud or via download

  • Open data in pandas and xarray and basic functioning of DataFrames

Key Takeaway

By the end of this tutorial, you will be able to visualize Cloud Optimized Geotiffs with ICESat-2 and ATM data.

For this tutorial, feel free to run the code along with us as we live code by downsizing the zoom window and splitting your screen (or using two screens). Or you can simply watch the zoom walkthrough. Donā€™t worry if you fall behind on the code. The notebook is standalone and you can easily run the code at your own pace another time to catch anything you missed.

We will have one exercise you can type into a notebook, or figure out in a separate document. We will also ask some questions that you can responsd to in the tutorial Slack channel.

Computing environmentĀ¶

Weā€™ll be using the following open source Python libraries in this notebook:

# Suppress library deprecation warnings
import logging
logging.captureWarnings(True)
import ipyleaflet
from ipyleaflet import Map, GeoData, LayersControl,Rectangle, basemaps, basemap_to_tiles, TileLayer, SplitMapControl, Polygon

import ipywidgets
import datetime
import re
# %matplotlib widget
import satsearch
from satsearch import Search
import geopandas as gpd
import ast
import pandas as pd
import geoviews as gv
import hvplot.pandas
from ipywidgets import interact
from IPython.display import display, Image
import intake # if you've installed intake-STAC, it will automatically import alongside intake
import xarray as xr
import matplotlib.pyplot as plt
import boto3
import rasterio as rio
from rasterio.session import AWSSession
from rasterio.plot import show
import rioxarray as rxr
from dask.utils import SerializableLock
import os
import hvplot.xarray
import numpy as np
from pyproj import Proj, transform

1. Identify and acquire the ICESat2 product(s) of interestĀ¶

  • What is the application of this product?

  • What region and resolution is needed?

Download ICESat-2 ATL06 data from desired regionĀ¶

Remember icepyx? We are going to use that again to download some ICESat-2 ATL06 data over our region of interest.

import icepyx as ipx
# Specifying the necessary icepyx parameters
short_name = 'ATL06'
spatial_extent = 'hackweek_kml_jakobshavan.kml' # KML polygon centered on Jakobshavan
date_range = ['2019-04-01', '2019-04-30']
rgts = ['338'] # IS-2 RGT of interest

You may notice that we specified a RGT track. As seen below, a large number of ICESat-2 overpasses occur for Jakobshavan. In the interest of time (and computer memory), we are going to look at only one of these tracks.

# Show image of area of interest (data viz tutorial will get in deeper so don't explain much):
center = [69.2, -50]
zoom = 7

# Open KML file for visualizing
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
jk = gpd.read_file(spatial_extent, driver='KML')

m = Map(basemap=basemap_to_tiles(basemaps.NASAGIBS.ModisAquaTrueColorCR, '2020-07-18'),center=center,zoom=zoom)
geo_data = GeoData(geo_dataframe = jk)

m.add_layer(geo_data)
m.add_control(LayersControl())
m
# Setup the Query object
region = ipx.Query(short_name, spatial_extent, date_range, tracks=rgts)

# Show the available granules
region.avail_granules(ids=True)
[['ATL06_20190420093051_03380303_005_01.h5']]

Looks like we have an ICESat-2 track! Letā€™s quickly visualize the data to ensure that there are no clouds.

# Request information from OpenAltimetry
cyclemap, rgtmap = region.visualize_elevation()

rgtmap
Generating urls
Sending request to OpenAltimetry, please wait...
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00,  1.04it/s]
Plot elevation, please wait...

Looks good! Now itā€™s time to download the data.

# Set Earthdata credentials
uid = 'uwhackweek'
email = 'hackweekadmin@gmail.com'
region.earthdata_login(uid, email)

# Order the granules
region.order_granules()
Total number of data order requests is  1  for  1  granules.
Data request  1  of  1  is submitting to NSIDC
order ID:  5000002845921
Initial status of your order request at NSIDC is:  pending
Your order status is still  pending  at NSIDC. Please continue waiting... this may take a few moments.
Your order status is still  pending  at NSIDC. Please continue waiting... this may take a few moments.
Your order status is still  pending  at NSIDC. Please continue waiting... this may take a few moments.
Your order status is still  processing  at NSIDC. Please continue waiting... this may take a few moments.
Your order is: complete
# Download the data
path = '/tmp/DataIntegration/'
region.download_granules(path)
Beginning download of zipped output...
Data request 5000002845921 of  1  order(s) is downloaded.
Download complete
import h5py

# Load the ICESat-2 data. We will just look at the central beams (GT2R/L)
is2_file = 'processed_ATL06_20190420093051_03380303_005_01_full.h5'
with h5py.File(is2_file, 'r') as f:
    is2_gt2r = pd.DataFrame(data={'lat': f['gt2r/land_ice_segments/latitude'][:],
                                  'lon': f['gt2r/land_ice_segments/longitude'][:],
                                  'elev': f['gt2r/land_ice_segments/h_li'][:]})
    is2_gt2l = pd.DataFrame(data={'lat': f['gt2l/land_ice_segments/latitude'][:],
                                  'lon': f['gt2l/land_ice_segments/longitude'][:],
                                  'elev': f['gt2l/land_ice_segments/h_li'][:]})

2. Identify other products of interestĀ¶

Question - Respond in SlackĀ¶

What research problems have you wanted to address that require more than one dataset?

Where are other data sets stored?Ā¶

More on this in the Cloud Computing Tools tutorial

Today, we will show ATM (non-AWS) and Landsat (AWS)

3. Acquire non-cloud data and open: ATM data accessĀ¶

Why did we choose April 2019 and RGT 338? In Spring 2019, an airborne campaign called Operation IceBridge was flown across Jakobshavan as validation for ICESat-2. Onboard was the Airborne Topographic Mapper, a lidar that works at both 532 nm (like ICESat-2) and 1064 nm (near-infrared). More information about Operation IceBridge and ATM may be found here: https://nsidc.org/data/icebridge

Here, we are going to try and co-register ATM spot measurements with ICESat-2. Because both data sets are rather large, this can be computationally expensive, so we will only consider one flight track with the ATM 532 nm beam.

Operation IceBridge data is not available on the cloud, so this data was downloaded directly from NSIDC. If you are interested in using IceBridge data, NSIDC has a useful data portal here: https://nsidc.org/icebridge/portal/map

# Load the ATM data into a DataFrame
atm_file = 'ILATM2_20190506_151600_smooth_nadir3seg_50pt.csv'
atm_l2 = pd.read_csv(atm_file)

atm_l2.head()
UTC_Seconds_Of_Day Latitude(deg) Longitude(deg) WGS84_Ellipsoid_Height(m) South-to-North_Slope West-to-East_Slope RMS_Fit(cm) Number_Of_ATM_Measurments_Used Number_Of_ATM_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft(m) Track_Identifier
0 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
1 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
2 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
3 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0
4 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1

We opened this data into a Pandas DataFrame, which is a handy tool for Earth data exploration and analysis. The column names derive automatically from the first row of the csv and each row corresponds to an ATM measurement.

Opening and manipulating data in PandasĀ¶

Pandas excels at helping you explore, clean, and process tabular data, such as data stored in spreadsheets or databases. In pandas, a Series is a 1-D data table and a DataFrame is the 2-D data table, which we just saw above.

read_csv, the function we used above, is the easiest way to open a csv data file into a pandas DataFrame. We can specify formatting, data selection, indexing, and much more when reading any data into a pandas DataFrame. Below we read in the data again, specifying different headers, assigning the first column as the index, and assigning the first column as the header (even though we are renaming it, so that it doesnā€™t mistake the first row in the csv as data).

# Load data with specific headers
headers = ['UTC', 'Lat', 'Lon',
       'Height', 'South-to-North_Slope',
       'West-to-East_Slope', 'RMS_Fit', 'Number_Measurements',
       'Number_Measurements_Removed',
       'Distance_Of_Block_To_The_Right_Of_Aircraft', 'Track_Identifier']
atm_l2 = pd.read_csv(atm_file, names=headers, index_col=0, header=0)
atm_l2.head()
Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier
UTC
54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0
54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1

Now we can explore the data and DataFrame functionsā€¦

# Find out the names of the columns
atm_l2.columns
Index(['Lat', 'Lon', 'Height', 'South-to-North_Slope', 'West-to-East_Slope',
       'RMS_Fit', 'Number_Measurements', 'Number_Measurements_Removed',
       'Distance_Of_Block_To_The_Right_Of_Aircraft', 'Track_Identifier'],
      dtype='object')
# Show data in only one of those columns
atm_l2['Lat'].head()
UTC
54969.50    69.262002
54969.50    69.262065
54969.50    69.262128
54969.50    69.262079
54969.75    69.261648
Name: Lat, dtype: float64
# Same thing, but another way
atm_l2.Lat.head()
UTC
54969.50    69.262002
54969.50    69.262065
54969.50    69.262128
54969.50    69.262079
54969.75    69.261648
Name: Lat, dtype: float64

If we want something more intuitive for our index, we can create a column of datetime objects that uses the date from the ATM file name and seconds_of_day column as date and time information.

# Reset the index
atm_l2 = atm_l2.reset_index()

# Use the date in the string of the file name to create a datetime of the date
date = pd.to_datetime(atm_file[7:15])

# Use UTC seconds_of_day column to calculate time of day that we will use to add time to the datetime
time = pd.to_timedelta(atm_l2.UTC, unit='s')

# Add the time to date and set as index
atm_l2['DateTime'] = date + time
atm_l2 = atm_l2.set_index('DateTime')
atm_l2.head()
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier
DateTime
2019-05-06 15:16:09.500 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
2019-05-06 15:16:09.500 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
2019-05-06 15:16:09.500 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
2019-05-06 15:16:09.500 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0
2019-05-06 15:16:09.750 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1

Now we can easily slice data by date, month, day, etc.

atm_l2['2019-05-06'].head()
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier
DateTime
2019-05-06 15:16:09.500 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
2019-05-06 15:16:09.500 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
2019-05-06 15:16:09.500 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
2019-05-06 15:16:09.500 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0
2019-05-06 15:16:09.750 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1

Or by a range of dates

atm_l2['2019-05-06':'2019-05-07']
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier
DateTime
2019-05-06 15:16:09.500 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
2019-05-06 15:16:09.500 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
2019-05-06 15:16:09.500 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
2019-05-06 15:16:09.500 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0
2019-05-06 15:16:09.750 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1
... ... ... ... ... ... ... ... ... ... ... ...
2019-05-06 15:16:57.250 55017.25 69.198390 310.375657 213.3556 0.070125 0.036402 446.26 2120 11 0 0
2019-05-06 15:16:57.500 55017.50 69.197981 310.373687 212.8147 0.080339 0.024375 462.73 3759 31 83 1
2019-05-06 15:16:57.500 55017.50 69.198079 310.375722 211.2819 0.059184 0.013959 465.01 2112 12 2 2
2019-05-06 15:16:57.500 55017.50 69.198177 310.377757 212.7269 0.049178 -0.010615 391.37 3488 36 -79 3
2019-05-06 15:16:57.500 55017.50 69.198081 310.375774 211.2959 0.061987 0.022402 453.97 2071 16 0 0

771 rows Ɨ 11 columns

SlicingĀ¶

Two methods for slicing:

  • .loc for label-based indexing

  • .iloc for positional indexing

Use loc to slice specific indices

atm_l2.index
DatetimeIndex(['2019-05-06 15:16:09.500000', '2019-05-06 15:16:09.500000',
               '2019-05-06 15:16:09.500000', '2019-05-06 15:16:09.500000',
               '2019-05-06 15:16:09.750000', '2019-05-06 15:16:09.750000',
               '2019-05-06 15:16:09.750000', '2019-05-06 15:16:09.750000',
                      '2019-05-06 15:16:10',        '2019-05-06 15:16:10',
               ...
                      '2019-05-06 15:16:57',        '2019-05-06 15:16:57',
               '2019-05-06 15:16:57.250000', '2019-05-06 15:16:57.250000',
               '2019-05-06 15:16:57.250000', '2019-05-06 15:16:57.250000',
               '2019-05-06 15:16:57.500000', '2019-05-06 15:16:57.500000',
               '2019-05-06 15:16:57.500000', '2019-05-06 15:16:57.500000'],
              dtype='datetime64[ns]', name='DateTime', length=771, freq=None)
atm_l2.loc['2019-05-06 15:16:09.500']
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier
DateTime
2019-05-06 15:16:09.500 54969.5 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
2019-05-06 15:16:09.500 54969.5 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
2019-05-06 15:16:09.500 54969.5 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
2019-05-06 15:16:09.500 54969.5 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0

Select indices and columns

atm_l2.loc['2019-05-06 15:16:09.500', ['Lat', 'Lon', 'Height']]
Lat Lon Height
DateTime
2019-05-06 15:16:09.500 69.262002 310.351764 490.3974
2019-05-06 15:16:09.500 69.262065 310.353395 500.2330
2019-05-06 15:16:09.500 69.262128 310.355026 500.3090
2019-05-06 15:16:09.500 69.262079 310.353741 498.9152

Use iloc to select by row and column number

atm_l2.iloc[[0,1],[2,3,4]]
Lon Height South-to-North_Slope
DateTime
2019-05-06 15:16:09.500 310.351764 490.3974 0.077354
2019-05-06 15:16:09.500 310.353395 500.2330 -0.048777

Statistical manipulationsĀ¶

Say we want to know the mean ATM height for the dataā€¦

atm_l2['Height'].mean()
270.60600557717254
atm_l2['Height'].describe()
count    771.000000
mean     270.606006
std      115.970329
min       82.582400
25%      182.211950
50%      231.822500
75%      346.061800
max      506.778400
Name: Height, dtype: float64
atm_l2['Number_Measurements'].sum()
2178525

If we want to know the mean of each column grouped for each Track Identifierā€¦

# Group rows together by a column and take the mean of each group
atm_l2.groupby('Track_Identifier').mean()
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft
Track_Identifier
0 54993.50000 69.229326 310.364939 273.940858 0.052332 0.112684 528.128653 2166.176166 21.186528 0.000000
1 54993.50000 69.229204 310.362120 257.611426 0.061328 0.191712 526.484093 3515.626943 28.321244 111.932642
2 54993.40625 69.229421 310.364227 271.031019 0.053800 0.107376 523.098750 2088.322917 22.244792 26.671875
3 54993.50000 69.229390 310.366413 279.842922 0.050348 0.088295 537.783316 3528.388601 21.300518 -58.575130

Groupby is pretty complex. You can dive deeper here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

You can also resample your data to get the mean of all measurements at 30 second intervalsā€¦

atm_l2.resample('30S').mean()
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier
DateTime
2019-05-06 15:16:00 54979.625000 69.247859 310.358014 372.339821 0.121164 0.140131 525.717805 2918.50000 28.612805 20.448171 1.500000
2019-05-06 15:16:30 55003.732506 69.215620 310.369172 195.281646 0.005060 0.113866 531.223386 2756.78781 19.304740 19.665914 1.498871

To use your own functions, one might first try to use a for loop to iterate over rows or columns of data. Pandas has made an easy and fast alternative, apply(). This function acts as a map() function in Python. It takes a function as an input and applies the function to an entire DataFrame.

Something easy could be to take the median of each column of the data. We specify the np.median function and axis=0 to pass in all rows and iterate over each column.

atm_l2.apply(np.median, axis=0)
UTC                                           54993.500000
Lat                                              69.228910
Lon                                             310.365003
Height                                          231.822500
South-to-North_Slope                              0.044511
West-to-East_Slope                                0.108179
RMS_Fit                                         513.780000
Number_Measurements                            3121.000000
Number_Measurements_Removed                      14.000000
Distance_Of_Block_To_The_Right_Of_Aircraft        0.000000
Track_Identifier                                  1.000000
dtype: float64

Say you want to use only specific rows or columns in your function. For instance, you want to know the total number of measurements (i.e. Number_Measurements + Number_Measurements_Removed). We already have a function that takes two columns and adds them together. Now we want to apply it to the data.

First, we call the .apply() method on the atm_l2 dataframe. Then use the lambda function to iterate over the rows (or columns) of the dataframe. For every row, we grab the Number_Measurements and Number_Measurements_Removed columns and pass them to the calc_total function. Finally, we specify the axis=1 to tell the .apply() method that we are passing in columns to apply the function to each row.

def calc_total(a,b):
    total = a + b
    return total

atm_l2['Total_Measurements'] = atm_l2.apply(lambda row: calc_total(row['Number_Measurements'], row['Number_Measurements_Removed']), axis=1)
atm_l2.head()
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier Total_Measurements
DateTime
2019-05-06 15:16:09.500 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1 3728.0
2019-05-06 15:16:09.500 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2 2206.0
2019-05-06 15:16:09.500 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3 3648.0
2019-05-06 15:16:09.500 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0 2833.0
2019-05-06 15:16:09.750 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1 3786.0

ExerciseĀ¶

Try making .apply work with this new function to create a new column, Distance_to_Jakobshavn, using the Lat and Lon columns as inputs. Weā€™ve already supplied the function, which requires a latitude and longitude input to calculate the distance between the ATM measurement and a specified point on the Jakobshavn Glacier ice front (jlat/jlon). Complete the line that applies the function to those columns in the data to get the Distance_to_Jakobshavn.

from math import sin, cos, sqrt, atan2, radians

def distance(a,b):
    '''
    Calculate distance between a set point and a lat lon pair from the data
    a = lat
    b = lon
    '''
    
    jlat,jlon = 69.2330, -49.2434
    
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(jlat)
    lon1 = radians(jlon)
    lat2 = radians(a)
    lon2 = radians(b)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    return distance

atm_l2['Distance_to_Jakobshavn']  = .....
atm_l2.head()
  Input In [31]
    atm_l2['Distance_to_Jakobshavn']  = .....
                                            ^
SyntaxError: invalid syntax

Great work! Now letā€™s reset the index to start fresh for multi-indexing.

atm_l2 = atm_l2.reset_index()
atm_l2.head()
DateTime UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Track_Identifier Total_Measurements
0 2019-05-06 15:16:09.500 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1 3728.0
1 2019-05-06 15:16:09.500 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2 2206.0
2 2019-05-06 15:16:09.500 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3 3648.0
3 2019-05-06 15:16:09.500 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0 2833.0
4 2019-05-06 15:16:09.750 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1 3786.0

Multi-indexingĀ¶

Multi-level indexing opens the door to more sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

Here, we will demonstrate a few basic things you can do with MultiIndexing. If we wanted to think about our data by DateTime and then by Track Identifier, we would set both columns as indicesā€¦

atm_l2 = atm_l2.set_index(['DateTime','Track_Identifier']).sort_index()
atm_l2
UTC Lat Lon Height South-to-North_Slope West-to-East_Slope RMS_Fit Number_Measurements Number_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft Total_Measurements
DateTime Track_Identifier
2019-05-06 15:16:09.500 0 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 2833.0
1 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 3728.0
2 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2206.0
3 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3648.0
2019-05-06 15:16:09.750 0 54969.75 69.261725 310.353850 500.6826 -0.027762 -0.056691 446.38 2732 30 0 2762.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2019-05-06 15:16:57.250 3 55017.25 69.198486 310.377647 215.1597 0.056693 -0.040250 448.01 3462 8 -79 3470.0
2019-05-06 15:16:57.500 0 55017.50 69.198081 310.375774 211.2959 0.061987 0.022402 453.97 2071 16 0 2087.0
1 55017.50 69.197981 310.373687 212.8147 0.080339 0.024375 462.73 3759 31 83 3790.0
2 55017.50 69.198079 310.375722 211.2819 0.059184 0.013959 465.01 2112 12 2 2124.0
3 55017.50 69.198177 310.377757 212.7269 0.049178 -0.010615 391.37 3488 36 -79 3524.0

771 rows Ɨ 11 columns

For some terminology, the levels of a MultiIndex are the former column names (UTC, Track_Identifier). The labels are the actual values in a level, (2019-05-06 15:16:09.500, 0-3, ā€¦). Levels can be referred to by name or position, with 0 being the outermost level.

Slicing the outermost index level is pretty easy, we just use our regular .loc[row_indexer, column_indexer] to grab a couple datetimes we want. Weā€™ll select the columns Lat and Lon where the UTC was 2019-05-06 15:16:09.500 and 2019-05-06 15:16:09.750.

atm_l2.loc['2019-05-06 15:16:09.500',['Lat','Lon']]
Lat Lon
Track_Identifier
0 69.262079 310.353741
1 69.262002 310.351764
2 69.262065 310.353395
3 69.262128 310.355026

If you wanted to select the rows whose track was 0 or 1, .loc wants [row_indexer, column_indexer] so letā€™s wrap the two elements of our row indexer (the list of UTCs and the Tracks) in a tuple to make it a single unit:

atm_l2.loc[('2019-05-06 15:16:09.500',[0,1]),['Lat','Lon']]
Lat Lon
DateTime Track_Identifier
2019-05-06 15:16:09.500 0 69.262079 310.353741
1 69.262002 310.351764

Now you want info from any UTC but for the specific tracks again (0,1). Below the : indicates all labels in this level.

atm_l2.loc[pd.IndexSlice[:,[0,1]], ['Lat','Lon']]
Lat Lon
DateTime Track_Identifier
2019-05-06 15:16:09.500 0 69.262079 310.353741
1 69.262002 310.351764
2019-05-06 15:16:09.750 0 69.261725 310.353850
1 69.261648 310.351873
2019-05-06 15:16:10.000 0 69.261371 310.353959
... ... ... ...
2019-05-06 15:16:57.000 1 69.198599 310.373444
2019-05-06 15:16:57.250 0 69.198390 310.375657
1 69.198292 310.373612
2019-05-06 15:16:57.500 0 69.198081 310.375774
1 69.197981 310.373687

386 rows Ɨ 2 columns

You can do a lot with groupby, pivoting, and reshaping, but I wonā€™t dive into that in this tutorial. You can check out more here: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

Now that we are oriented with the pandas DataFrame, letā€™s get back to the ATM data we have opened and co-register it with the ICESat-2 lines.

Co-register ICESat-2 with ATM dataĀ¶

# Load the ATM data into a DataFrame
atm_file = 'ILATM2_20190506_151600_smooth_nadir3seg_50pt.csv'
atm_l2 = pd.read_csv(atm_file)

atm_l2.head()
UTC_Seconds_Of_Day Latitude(deg) Longitude(deg) WGS84_Ellipsoid_Height(m) South-to-North_Slope West-to-East_Slope RMS_Fit(cm) Number_Of_ATM_Measurments_Used Number_Of_ATM_Measurements_Removed Distance_Of_Block_To_The_Right_Of_Aircraft(m) Track_Identifier
0 54969.50 69.262002 310.351764 490.3974 0.077354 -0.069179 589.57 3723 5 78 1
1 54969.50 69.262065 310.353395 500.2330 -0.048777 0.006024 434.12 2185 21 14 2
2 54969.50 69.262128 310.355026 500.3090 0.068798 0.077559 777.80 3640 8 -51 3
3 54969.50 69.262079 310.353741 498.9152 -0.085600 -0.111001 472.64 2818 15 0 0
4 54969.75 69.261648 310.351873 487.1317 0.108085 -0.078827 520.83 3753 33 78 1

The ATM L2 file contains plenty of information, including surface height estimates and slope of the local topography. It also contains a track identifier - ATM takes measurements from multiple parts of the aircraft, namely starboard, port, and nadir. To keep things simple, we will filter the DataFrame to only look at the nadir track (Track_Identifier = 0).

atm_l2 = atm_l2[atm_l2['Track_Identifier']==0]

# Change the longitudes to be consistent with ICESat-2
atm_l2['Longitude(deg)'] -= 360

print(atm_l2.size)
2123

Letā€™s take a quick look at where ATM is relative to ICESat-2ā€¦

# Subset the ICESat-2 data to the ATM latitudes
is2_gt2r = is2_gt2r[(is2_gt2r['lat']<atm_l2['Latitude(deg)'].max()) & (is2_gt2r['lat']>atm_l2['Latitude(deg)'].min())]
is2_gt2l = is2_gt2l[(is2_gt2l['lat']<atm_l2['Latitude(deg)'].max()) & (is2_gt2l['lat']>atm_l2['Latitude(deg)'].min())]


# Set up a map with the flight tracks as overlays
from ipyleaflet import Map, basemaps, basemap_to_tiles, Polyline

m = Map(
    basemap=basemap_to_tiles(basemaps.Esri.WorldImagery),
    center=(69.25, 310.35-360),
    zoom=10
)

gt2r_line = Polyline(
    locations=[
        [is2_gt2r['lat'].min(), is2_gt2r['lon'].max()],
        [is2_gt2r['lat'].max(), is2_gt2r['lon'].min()]
    ],
    color="green" ,
    fill=False
)
m.add_layer(gt2r_line)

gt2l_line = Polyline(
    locations=[
        [is2_gt2l['lat'].min(), is2_gt2l['lon'].max()],
        [is2_gt2l['lat'].max(), is2_gt2l['lon'].min()]
    ],
    color="green" ,
    fill=False
)
m.add_layer(gt2l_line)

atm_line = Polyline(
    locations=[
        [atm_l2['Latitude(deg)'].min(), atm_l2['Longitude(deg)'].max()],
        [atm_l2['Latitude(deg)'].max(), atm_l2['Longitude(deg)'].min()]
    ],
    color="orange" ,
    fill=False
)
m.add_layer(atm_line)

m

Looks like ATM aligns very closely with the left beam (GT2L), so hopefully the two beams will agree. The terrain over this region is quite rough, so we may expect some differences between ATM and GT2R. ICESat-2 also flew over Jakobshavan 16 days before ATM, so there might be slight differences due to ice movement.

We have looked at how we can quickly download ICESat-2 and airborne lidar data, and process them using pandas. We will engage in a more thorough analysis in the Data Integration session tomorrow.

3. Search and open (Landsat) raster imagery from the cloudĀ¶

Letā€™s now talk about cloud-optimized approaches that require no downloading to search and access the data.

Cloud optimized approachesĀ¶

  • Organize data as an aggregation of small, independently retrievable objects (e.g., zarr, HDF2, Geotiff in the Cloud)

  • Allow access to pieces of large objects (e.g., Cloud-Optimized GeoTIFF3, OPeNDAP4 in the Cloud)

We will be working with Cloud Optimized GeoTIFF (COG). A COG is a GeoTIFF file with an internal organization that enables more efficient workflows by allowing us to ask for just the parts of a file we need, instead of having to open the entire image or data set (see more at https://www.cogeo.org/).

Example url for accessing a band in the cloud from an S3 bucket:
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2020/202/025/LC08_L1TP_202025_20190420_20190507_01_T1/LC08_L1TP_202025_20190420_20190507_01_T1_B4.TIF

User Manual for more information about accessing Landsat S3:
https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/atoms/files/LSDS-2032-Landsat-Commercial-Cloud-Direct-Access-Users-Guide-v2.pdf.pdf

Search for Landsat imageryĀ¶

To explore and access COGā€™s easily we will use a SpatioTemporal Asset Catalog (STAC). The STAC provides a common metadata format to make it easier to index and querrying S3 buckets for geospatial data. Learn more here: https://github.com/radiantearth/stac-spec.

# Sets up credentials for acquiring images through dask/xarray
os.environ["AWS_REQUEST_PAYER"] = "requester"

# Sets up proper credentials for acquiring data through rasterio
aws_session = AWSSession(boto3.Session(), requester_pays=True)

Extract geometry bounds from the ICESat-2 KML file used above so that we can perform the Landsat spatial search.

# Extract geometry bounds
geom = jk.geometry[0]
print(geom.bounds)
(-51.3229009069365, 68.84029223511094, -48.20366423696812, 69.61656633135274)

We will search for imagery in STAC catalog using satsearch: https://github.com/sat-utils/sat-search

# Search STAC API for Landsat images based on a bounding box, date and other metadata if desired

bbox = (geom.bounds[0], geom.bounds[1], geom.bounds[2], geom.bounds[3]) #(west, south, east, north) 

timeRange = '2019-05-06/2019-05-07'
    
results = Search.search(url='https://ibhoyw8md9.execute-api.us-west-2.amazonaws.com/prod',
                        collection='usgs-landsat/collection02/',
                        datetime=timeRange,
                        bbox=bbox,    
                        # properties=properties,
                        sort=['<datetime'])

print('%s items' % results.found())
items = results.items()

# Save search to geojson file
gjson_outfile = f'/home/jovyan/website2022/book/tutorials/DataIntegration/Landsat.geojson'
items.save(gjson_outfile)
6 items

We can include property searches, such as path, row, cloud-cover, as well with the properties flag.

We are given a satsearch collection of items (images)

items
<satstac.itemcollection.ItemCollection at 0x7f119f27fe20>

Load the geojson file into geopandas and inspect the items we want to collect

# Load the geojson file
gf = gpd.read_file(gjson_outfile)
gf.head(2)
id datetime eo:cloud_cover view:sun_azimuth view:sun_elevation platform view:off_nadir landsat:cloud_cover_land landsat:wrs_type landsat:wrs_path landsat:wrs_row landsat:scene_id landsat:collection_category landsat:collection_number landsat:correction proj:epsg created updated geometry
0 LC08_L1TP_008012_20190507_20200829_02_T1 2019-05-07T14:54:18.865854+00:00 0.18 173.852645 38.463606 LANDSAT_8 0 0.0 2 008 012 LC80080122019127LGN00 T1 02 L1TP 32622 2021-07-24T04:02:59.258999+00:00 2021-07-24T04:02:59.258999+00:00 POLYGON ((-50.65493 69.41549, -52.45035 67.796...
1 LC08_L2SP_008012_20190507_20200829_02_T1_SR 2019-05-07T14:54:18.865854+00:00 0.18 173.852645 38.463606 LANDSAT_8 0 0.0 2 008 012 LC80080122019127LGN00 T1 02 L2SP 32622 2021-07-24T04:03:48.757000+00:00 2021-07-24T04:03:48.757000+00:00 POLYGON ((-50.65493 69.41549, -52.45035 67.796...
# Plot search area of interest and frames on a map using Holoviz Libraries (more on these later)
cols = gf.loc[:,('id','landsat:wrs_path','landsat:wrs_row','geometry')]
footprints = cols.hvplot(geo=True, line_color='k', hover_cols=['landsat:wrs_path','landsat:wrs_row'], alpha=0.2, title='Landsat 8 T1',tiles='ESRI')
tiles = gv.tile_sources.CartoEco.options(width=700, height=500) 
labels = gv.tile_sources.StamenLabels.options(level='annotation')
tiles * footprints * labels

Intake all scenes using the intake-STAC libraryĀ¶

Intake-STAC facilitates discovering, exploring, and loading spatio-temporal datasets by providing Intake Drivers for STAC catalogs. This provides a simple toolkit for working with STAC catalogs and for loading STAC assets as xarray objects.

catalog = intake.open_stac_item_collection(items)
list(catalog)
['LC08_L1TP_008012_20190507_20200829_02_T1',
 'LC08_L2SP_008012_20190507_20200829_02_T1_SR',
 'LC08_L2SP_008012_20190507_20200829_02_T1_ST',
 'LC08_L1TP_008011_20190507_20200828_02_T1',
 'LC08_L2SP_008011_20190507_20200828_02_T1_SR',
 'LC08_L2SP_008011_20190507_20200828_02_T1_ST']

Letā€™s explore the metadata and keys for the first scene

sceneids = list(catalog)
item3 = catalog[sceneids[3]]
# item3.metadata
for keys in item3.keys():
    print (keys)
thumbnail
reduced_resolution_browse
index
coastal
blue
green
red
nir08
swir16
swir22
pan
cirrus
lwir11
lwir12
ANG.txt
MTL.txt
MTL.xml
MTL.json
qa_pixel
qa_radsat
VAA
VZA
SAA
SZA

We can explore the metadata for any of these:

item3['blue'].metadata
{'title': 'Blue Band (B2)',
 'description': 'Collection 2 Level-1 Blue Band (B2) Top of Atmosphere Radiance',
 'type': 'image/vnd.stac.geotiff; cloud-optimized=true',
 'roles': ['data'],
 'eo:bands': [{'name': 'B2',
   'common_name': 'blue',
   'gsd': 30,
   'center_wavelength': 0.48}],
 'href': 'https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2019/008/011/LC08_L1TP_008011_20190507_20200828_02_T1/LC08_L1TP_008011_20190507_20200828_02_T1_B2.TIF',
 'alternate': {'s3': {'storage:platform': 'AWS',
   'storage:requester_pays': True,
   'href': 's3://usgs-landsat/collection02/level-1/standard/oli-tirs/2019/008/011/LC08_L1TP_008011_20190507_20200828_02_T1/LC08_L1TP_008011_20190507_20200828_02_T1_B2.TIF'}},
 'file:checksum': '1340f0a602019909fbb3e6bd909443ef7f9adfe7efabc86586f6e5f85e941aab30f03adbde671e000852b535ce627d864c1e6c3b33011a907b9a67fa1b7e595d9f42',
 'plots': {'geotiff': {'kind': 'image',
   'x': 'x',
   'y': 'y',
   'frame_width': 500,
   'data_aspect': 1,
   'rasterize': True,
   'dynamic': True,
   'cmap': 'viridis'}},
 'catalog_dir': ''}
# This is the url needed to grab data from the S3 bucket
# From the satsearch catalog:
items[3].asset('blue')['alternate']['s3']['href'] # can use item asset name or title (blue or B2)

# From the intake-STAC catalog
item3.blue.metadata['alternate']['s3']['href'] # must use item asset name (blue)
's3://usgs-landsat/collection02/level-1/standard/oli-tirs/2019/008/011/LC08_L1TP_008011_20190507_20200828_02_T1/LC08_L1TP_008011_20190507_20200828_02_T1_B2.TIF'

Open and visualize each image using RasterIOĀ¶

import rasterio as rio

# Retrieve first scene using rio
item_url = item3.blue.metadata['alternate']['s3']['href']

# Read and plot with grid coordinates 
with rio.Env(aws_session):
    with rio.open(item_url) as src:
        fig, ax = plt.subplots(figsize=(9,8))
        
        # To plot
        show(src,1)
        
        # To open data into a numpy array
        profile = src.profile
        arr = src.read(1)
../../_images/dataintegration-1_113_0.png

We can open directly into xarray using rasterIOā€¦

Manipulating data in XarrayĀ¶

Pandas and xarray have very similar structures and ways of manipulating data, but pandas excels with 2-D data and xarray is ideal for higher dimension data. Xarray introduces labels in the form of dimensions, coordinates and attributes on top of Pandas-like DataFrames.

Xarray has 2 fundamental data structures:

  • DataArray, which holds single multi-dimensional variables and its coordinates

  • Dataset, which holds multiple variables that potentially share the same coordinates

https://xarray-contrib.github.io/xarray-tutorial/_images/xarray-data-structures.png

We will only scratch the surface here on what xarray can do. To learn more, there are great xarray tutorials here: https://xarray-contrib.github.io/xarray-tutorial/online-tutorial-series/01_xarray_fundamentals.html

RasterIO and RioXarrayĀ¶

We can use rasterIO to easily open into an xarray DataArray:

rastxr = xr.open_dataarray(item_url,engine='rasterio')
rastxr
<xarray.DataArray 'band_data' (band: 1, y: 8771, x: 8741)>
[76667311 values with dtype=float32]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 2.613e+05 2.613e+05 ... 5.235e+05 5.235e+05
  * y            (y) float64 7.856e+06 7.855e+06 ... 7.592e+06 7.592e+06
    spatial_ref  int64 0

Or a DataSet:

rastxr = xr.open_dataset(item_url,engine='rasterio')
rastxr
<xarray.Dataset>
Dimensions:      (band: 1, x: 8741, y: 8771)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 2.613e+05 2.613e+05 ... 5.235e+05 5.235e+05
  * y            (y) float64 7.856e+06 7.855e+06 ... 7.592e+06 7.592e+06
    spatial_ref  int64 0
Data variables:
    band_data    (band, y, x) float32 ...

We can open using rioxarray, which integrates rasterIO and xarray and is the most efficient way of opening using rasterIO:

import rioxarray as rxr

rastrxr = rxr.open_rasterio(item_url)
rastrxr
<xarray.DataArray (band: 1, y: 8771, x: 8741)>
[76667311 values with dtype=uint16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 2.613e+05 2.613e+05 ... 5.235e+05 5.235e+05
  * y            (y) float64 7.856e+06 7.855e+06 ... 7.592e+06 7.592e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    0.0
    scale_factor:  1.0
    add_offset:    0.0

We can see Attributes have been added to the Xarray using the same url.

Beyond what xarray and rasterIO provide, rioxarray has these added benefits (plus others):

  • Supports multidimensional datasets such as netCDF

  • Loads in the CRS, transform, and nodata metadata in standard CF & GDAL locations

  • Supports masking and scaling data

  • Loads raster metadata into the attributes

For more info: https://corteva.github.io/rioxarray/stable/index.html

DaskĀ¶

Another convenient means for opening a lot of raster data into xarray is using dask. Xarray integrates with Dask to support parallel computations and streaming computation on datasets that donā€™t fit into memory. So this is perfect when you want to process a lot of data.

Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.

Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.

More on Dask in the Cloud Computing Tools tutorial.

To expand our xarray toolbox for working with larger data sets that we donā€™t necessarily want entirely in memory, we will start by reading in 3 bands of a Landsat scene to xarray using dask.

sceneid = catalog[sceneids[0]]
print (sceneid.name)

band_names = ['red','green','blue']

bands = []

# Construct xarray for scene
for band_name in band_names:
    # Specify chunk size (x,y), Landsat COG is natively in 512 chunks so is best to use this or a multiple
    band = sceneid[band_name](chunks=dict(band=1, x=2048, y=2048),urlpath=sceneid[band_name].metadata['alternate']['s3']['href']).to_dask()
    band['band'] = [band_name]
    bands.append(band)
scene = xr.concat(bands, dim='band')
scene
LC08_L1TP_008012_20190507_20200829_02_T1
<xarray.DataArray (band: 3, y: 8491, x: 8431)>
dask.array<concatenate, shape=(3, 8491, 8431), dtype=uint16, chunksize=(1, 2048, 2048), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U5 'red' 'green' 'blue'
  * y        (y) float64 7.702e+06 7.702e+06 7.702e+06 ... 7.447e+06 7.447e+06
  * x        (x) float64 4.371e+05 4.371e+05 4.372e+05 ... 6.9e+05 6.9e+05
Attributes:
    transform:      (30.0, 0.0, 437085.0, 0.0, -30.0, 7701615.0)
    crs:            +init=epsg:32622
    res:            (30.0, 30.0)
    is_tiled:       1
    nodatavals:     (0.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Point

Typically, itā€™s best to align dask chunks with the way image chunks (typically called ā€œtilesā€) are stored on disk or cloud storage buckets. The landsat data is stored on AWS S3 in a tiled Geotiff format where tiles are 512x512, so we should pick some multiple of that, and typically aim for chunk sizes of ~100Mb (although this is subjective). You can read more about dask chunks here: https://docs.dask.org/en/latest/array-best-practices.html.

Similarly as with pandas, we can explore variables easily in xarray. This time we can work with coordinates (equivalent to indices in pandas). Here x might often be the longitude (it can be renamed to this actually):

scene.x
<xarray.DataArray 'x' (x: 8431)>
array([437100., 437130., 437160., ..., 689940., 689970., 690000.])
Coordinates:
  * x        (x) float64 4.371e+05 4.371e+05 4.372e+05 ... 6.9e+05 6.9e+05

We can also keep track of arbitrary metadata (called attributes) in the form of a Python dictionary:

scene.attrs
{'transform': (30.0, 0.0, 437085.0, 0.0, -30.0, 7701615.0),
 'crs': '+init=epsg:32622',
 'res': (30.0, 30.0),
 'is_tiled': 1,
 'nodatavals': (0.0,),
 'scales': (1.0,),
 'offsets': (0.0,),
 'AREA_OR_POINT': 'Point'}
scene.crs
'+init=epsg:32622'

We can apply operations over dimensions by name. Here, if we want to slice the data to only have the blue band:

scene.sel(band='blue')
<xarray.DataArray (y: 8491, x: 8431)>
dask.array<getitem, shape=(8491, 8431), dtype=uint16, chunksize=(2048, 2048), chunktype=numpy.ndarray>
Coordinates:
    band     <U5 'blue'
  * y        (y) float64 7.702e+06 7.702e+06 7.702e+06 ... 7.447e+06 7.447e+06
  * x        (x) float64 4.371e+05 4.371e+05 4.372e+05 ... 6.9e+05 6.9e+05
Attributes:
    transform:      (30.0, 0.0, 437085.0, 0.0, -30.0, 7701615.0)
    crs:            +init=epsg:32622
    res:            (30.0, 30.0)
    is_tiled:       1
    nodatavals:     (0.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Point

Notice that instead of loc (from pandas) we are using sel, but they function synonymously.

Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names. Letā€™s determine the mean reflectance for the blue band:

scene.sel(band='blue').mean()#.values 
<xarray.DataArray ()>
dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
Coordinates:
    band     <U5 'blue'

And you can easily use the split-apply-combine paradigm with groupby, which we wonā€™t show here.

Advanced multi-dimensional read-ins, manipulation and visualizationĀ¶

Now letā€™s open all the bands and multiple days together into an xarray. This will be a more complex xarray with an extra 'time' dimension. Since the catalog we have has a combination of Level 1 and 2 data, letā€™s keep only the scene IDs for Level 1 data.

sceneids = list(catalog)
sceneids = [sceneid for sceneid in sceneids if sceneid.endswith('_T1')]
sceneids
['LC08_L1TP_008012_20190507_20200829_02_T1',
 'LC08_L1TP_008011_20190507_20200828_02_T1']

Letā€™s create the time variable first for the xarray time dimension. This finds the desired scene IDs in the geopandas dataframe we have above and extracts their ā€˜datetimeā€™ information. These datetimes get recorded into an xarray variable object for ā€˜timeā€™.

# Create time variable for time dim
time_var = xr.Variable('time',gf.loc[gf.id.isin(sceneids)]['datetime'])
time_var
<xarray.Variable (time: 2)>
array(['2019-05-07T14:54:18.865854+00:00',
       '2019-05-07T14:53:54.970581+00:00'], dtype=object)

Now we will search and collect band names for grabbing each desired band. We will just grab the bands that have 30 m pixels. This provides and example of how you can search data in the STAC catalog.

band_names = []

# Get band names
sceneid = catalog[sceneids[1]]
for k in sceneid.keys():
    M = getattr(sceneid, k).metadata
    if 'eo:bands' in M:
        resol = M['eo:bands'][0]['gsd']
        print(k, resol)
        if resol == 30: 
            band_names.append(k)
            
# Add qa band
band_names.append('qa_pixel')
coastal 30
blue 30
green 30
red 30
nir08 30
swir16 30
swir22 30
pan 15
cirrus 30
lwir11 100
lwir12 100

And now open all of itā€¦

# Import to xarray
# In xarray dataframe nans are in locations where concat of multiple scenes has expanded the grid (i.e. different path/rows).
scenes = []

for sceneid in sceneids:
    sceneid = catalog[sceneid]
    print (sceneid.name)

    bands = []

    # Construct xarray for scene, open each band, append and concatenate together to create a scene, 
    # then append and concatenate each scene to create the full dataframe 
    for band_name in band_names:
        band = sceneid[band_name](chunks=dict(band=1, x=2048, y=2048),urlpath=sceneid[band_name].metadata['alternate']['s3']['href']).to_dask()
        band['band'] = [band_name]
        bands.append(band)
    scene = xr.concat(bands, dim='band')
    scenes.append(scene)

# Concatenate scenes with time variable
ls_scenes = xr.concat(scenes, dim=time_var)

ls_scenes
LC08_L1TP_008012_20190507_20200829_02_T1
LC08_L1TP_008011_20190507_20200828_02_T1
<xarray.DataArray (time: 2, band: 9, y: 13621, x: 14291)>
dask.array<concatenate, shape=(2, 9, 13621, 14291), dtype=float64, chunksize=(1, 1, 2048, 2048), chunktype=numpy.ndarray>
Coordinates:
  * y        (y) float64 7.447e+06 7.447e+06 7.447e+06 ... 7.855e+06 7.856e+06
  * x        (x) float64 2.613e+05 2.613e+05 2.614e+05 ... 6.9e+05 6.9e+05
  * band     (band) <U8 'coastal' 'blue' 'green' ... 'cirrus' 'qa_pixel'
  * time     (time) object '2019-05-07T14:54:18.865854+00:00' '2019-05-07T14:...
Attributes:
    transform:      (30.0, 0.0, 437085.0, 0.0, -30.0, 7701615.0)
    crs:            +init=epsg:32622
    res:            (30.0, 30.0)
    is_tiled:       1
    nodatavals:     (0.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Point

We now have 2 Landsat scenes with all of the bands we are interested in stored in an xarray, but you can imagine that this exact code can scale to years worth of data and bands.

From here, we easily subset one image at a time or the entire xarray:

sbands = ['blue', 'nir08', 'swir16']

# Select the first datetime
t = ls_scenes.time.values[1]
print (t)

# Upper left and lower right coordinates for subsetting to Jakobshavn area
ulx = 300000
uly = 7695000
lrx = 330000
lry = 7670000

image = ls_scenes.sel(time=t,band=sbands,y=slice(lry,uly),x=slice(ulx,lrx))
2019-05-07T14:53:54.970581+00:00
fig, ax = plt.subplots(figsize=(8,6))
image.sel(band='blue').plot()
<matplotlib.collections.QuadMesh at 0x7f11ac2ea5b0>
../../_images/dataintegration-1_153_1.png

Since this data is in UTM 22N, we can reproject to the standard lat/lon coordinate system (WGS-84) and map with the ICESat-2 and ATM lines.

image = image.rio.reproject(4326)

ISlats = [is2_gt2r['lat'].min(), is2_gt2r['lat'].max()]
# ISlons = (is2_gt2r['lon'].max(), is2_gt2r['lon'].min())
ISlons = [-55.624,-55.646]

ATMlats = [atm_l2['Latitude(deg)'].min(), atm_l2['Latitude(deg)'].max()]
# ATMlons = [atm_l2['Longitude(deg)'].max(), atm_l2['Longitude(deg)'].min()]
ATMlons = [-55.624,-55.646]

fig, ax = plt.subplots(figsize=(8,6))
image.sel(band='blue').plot()
plt.plot(ISlons,ISlats,color = 'green')
plt.plot(ATMlons,ATMlats,color = 'orange')
Warning 1: +init=epsg:XXXX syntax is deprecated. It might return a CRS with a non-EPSG compliant axis order.
[<matplotlib.lines.Line2D at 0x7f119d509a60>]
../../_images/dataintegration-1_155_2.png

The reprojection to WGS-84 calculated the longitudes to be 6 degrees off, so we shifted the IS2 and ATM data for ease of visualization. This issue only seems to arise with reprojections from some UTM projections and should not be an issue with most data.

4. SummaryĀ¶

Congratulations! Youā€™ve completed the tutorial. In this tutorial you have gained the skills to:

  • Search for non-ICESat-2 datasets

  • Open data into pandas and xarray dataframes/arrays, and

  • Manipulate, visualize, and explore the data

We have concluded by mapping the three data sets together. We will revisit the ICESat-2 and ATM data at Jakobshavn and integrate it with processed velocity data in Part II of this tutorial tomorrow!