Cloud Computing Tutorial

For ICESat-2 UW Hackweek 2022


Icebreaker questions

Enter your answers in chat or in your own text editor of choice.

  • Is everyone who wants to follow along logged into their jupyterhub with the notebook open?

  • When you hear the term “cloud computing” what’s the first thing that comes to mind?

  • What concepts or tools are you hoping to learn more about in this tutorial?


Learning Objectives

  1. The difference between code running on your local machine vs a remote environment.

  2. The difference between data hosted by cloud providers, like AWS, and data hosted in on-premise data centers

  3. How to access data hosted by NASA DAACs, both on-premise and in the cloud (AWS S3)

  4. Cloud computing tools for scaling your science

Key Takeaways

  • At least one tutorial (or tool) to try


Sections

  1. Local vs Remote Resources

  2. Data on the Cloud vs On-premise

  3. How to access NASA data

  4. Tools for cloud computing: Brief introduction to Dask


Setup: Getting prepared for what’s coming!!!

  • Each section includes some key learning points and at least 1 exercise.

  • Configure your screens so you can see both the tutorial and your jupyterhub to follow along.

  • Let’s all log in to https://urs.earthdata.nasa.gov/home before we get started.

  • The bottom of this page lists many other references for revisiting later.

  • Let’s install some libraries for this tutorial.
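
In case any of them are missing from your environment, a minimal install cell might look like the one below. The exact package list is an assumption based on the imports used later in this tutorial (earthdata, h5py, and dask); your jupyterhub image may already provide them.

# Install the Python libraries used later in this tutorial
# (package list assumed from the imports below; skip if already installed)
%pip install --quiet earthdata h5py "dask[distributed]"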


1. Local vs Remote Resources

❓🤔❓ Question for the group:

What’s the difference between running code on your local machine and on this remote jupyterhub?

As you are probably aware, this code is running on a machine somewhere in Amazon Web Services (AWS) land.


What types of resources are available on this machine?

CPUs

The central processing unit (CPU) or processor, is the unit which performs most of the processing inside a computer. It processes all instructions received by software running on the PC and by other hardware components, and acts as a powerful calculator. Source: techopedia.com

# How many CPUs are running on this machine?
!lscpu | grep ^CPU\(s\):
CPU(s):                          2

Memory

Computer random access memory (RAM) is one of the most important components in determining your system’s performance. RAM gives applications a place to store and access data on a short-term basis. It stores the information your computer is actively using so that it can be accessed quickly. Source: crucial.com

# How much memory is available?
!free -h
              total        used        free      shared  buff/cache   available
Mem:          6.8Gi       851Mi       4.0Gi       9.0Mi       2.0Gi       5.6Gi
Swap:         4.0Gi        11Mi       4.0Gi

If you’re curious about the difference between free and available memory: https://haydenjames.io/free-vs-available-memory-in-linux

🏋️ Exercise: How many CPUs does your machine have?

Unless you are using a Linux machine, the above commands probably won’t give you what you need.

  1. For Mac users: sysctl -a | grep cpu | grep hw or https://www.linux.com/training-tutorials/5-commands-checking-memory-usage-linux/

  2. For Windows users: https://www.top-password.com/blog/find-number-of-cores-in-your-cpu-on-windows-10/ (Not tested)
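
If you prefer to stay in Python, here is a small cross-platform sketch (not part of the original exercise; psutil is a third-party package that may need to be installed first):

import os

# Number of logical CPUs visible to Python
print("CPU count:", os.cpu_count())

# psutil is a third-party package; install it first if it is missing
import psutil
print("Total RAM (GB):", round(psutil.virtual_memory().total / 1e9, 1))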

❓🤔❓Question for the group

When might you want to use a remote machine and when might you want to use your local machine?


2. Data on the Cloud vs On-premise

What’s the difference between data hosted on the cloud and on-prem data centers?

NASA Distributed Active Archive Centers (DAACs)

NASA DAACs are in the process of migrating their collections to the “Earthdata Cloud”. At this time, most datasets are still located at and accessible “on-premise” from NASA DAACs, while high-priority and new datasets are being stored on AWS Simple Storage Service (S3). Given different use cases, you will need to access datasets both on-premise from NASA DAACs and on NASA’s Earthdata Cloud (AWS S3).

  • Datasets are still managed by the DAAC, but the DAAC stores files on AWS S3.

  • The DAACs’ services will be co-located in the cloud with the data.

  • Users are encouraged to access the data co-located in the cloud through AWS compute services (like this jupyterhub!)

🏋️ Exercise

Navigate to https://search.earthdata.nasa.gov, search for ICESat-2, and answer the following questions:

  1. Which DAAC hosts ICESat-2 datasets?

  2. Which ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?

What did we learn?

NASA has a new cloud paradigm that includes data stored both on-premise and on the cloud. NASA DAACs also provide services on AWS.

PO.DAAC has a great diagram of this new paradigm (source: https://podaac.jpl.nasa.gov/cloud-datasets/about).


Final thought: Other cloud data providers

AWS is of course not the only cloud provider, and Earth data can be found on other popular cloud providers as well.


3. How to access NASA data

How do we access data hosted on-prem and on the cloud? What are some tools we can use?

Earthdata Login for Access

NASA uses Earthdata Login (EDL) to authenticate users requesting data and to track usage. You must supply EDL credentials to access any data hosted by NASA. Some data is available to all EDL users and some is restricted.

You can access NASA data using a local ~/.netrc file, which stores your EDL credentials.
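
For reference, an Earthdata Login entry in ~/.netrc follows the pattern sketched below (placeholders only; the Openscapes tutorial linked in the next exercise walks through this properly):

# Append an Earthdata Login entry to ~/.netrc (replace the placeholders with your EDL credentials)
!echo "machine urs.earthdata.nasa.gov login <your_username> password <your_password>" >> ~/.netrc
# Restrict permissions so other users on a shared system cannot read your credentials
!chmod 600 ~/.netrc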

🏋️ Exercise 1: Access via Earthdata Login using ~/.netrc

The Openscapes Tutorial 04. Authentication for NASA Earthdata offers an excellent quick tutorial on how to create a ~/.netrc file.

  • Exercise: Review the tutorial and answer the following question: Why might you want to be careful running this code in a shared jupyterhub environment?

  • Takehome exercise: Run through the code on your local machine

🏋️ Exercise 2: Use the earthdata library to access ICESat-2 data “on-premise” at NSIDC

Programmatic access to NSIDC data can happen in two ways:

  1. Search -> Download -> Process -> Research (diagram: https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/download-model.png)

  2. Search -> Process in the cloud -> Research (diagram: https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/cloud-model.png)

Credit: Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science.

For this exercise, we are going to use NSIDC’s earthdata python library to find and download ATL08 files from NSIDC DAAC via HTTPS.

# Login using earthdata
from earthdata import Auth, DataGranules, DataCollections, Store
import os.path

auth = Auth()

# For GitHub CI, we can use ~/.netrc
if os.path.isfile(os.path.expanduser('~/.netrc')):
    auth.login(strategy='netrc')
else:
    auth.login(strategy='interactive')
You're now authenticated with NASA Earthdata Login

The earthdata library uses a session, so credentials are not stored in files.

auth._session
<earthdata.auth.SessionWithHeaderRedirection at 0x7f86e0811550>
# Find some ICESat-2 ATL08 granules and display them
granules = DataGranules().short_name('ATL08').bounding_box(-10,20,10,50).get(5)
[display(g) for g in granules[0:5]]

Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014034354_02370106_004_01.h5

Size: 114.3358402252 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}

Data Preview

Data: https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2018.10.14/ATL08_20181014034354_02370106_005_01.h5

Size: 118.2040967941 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}

Data Preview

Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014035224_02370107_004_01.h5

Size: 100.266831398 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 27.0, 'StartDirection': 'D', 'EndLatitude': 0.0, 'EndDirection': 'D'}}}

Data Preview

Data: https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2018.10.14/ATL08_20181014035224_02370107_005_01.h5

Size: 103.6830654144 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 27.0, 'StartDirection': 'D', 'EndLatitude': 0.0, 'EndDirection': 'D'}}}

Data Preview

Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014154641_02450101_004_01.h5

Size: 131.1225614548 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -3.243598064031987, 'StartLatitude': 0.0, 'StartDirection': 'A', 'EndLatitude': 27.0, 'EndDirection': 'A'}}}

Data Preview
[None, None, None, None, None]
## Check if these are hosted on the cloud
granules[0].cloud_hosted
False
import glob

## Download some files
atl08_dir = '/tmp/demo-atl08'
store = Store(auth)
store.get(granules[0:3], atl08_dir)
 Getting 3 granules, approx download size: 0.33 GB
import h5py

## Open one of them
files = glob.glob(f'{atl08_dir}/*.h5')
ds = h5py.File(files[0], 'r')
ds
<HDF5 file "ATL08_20181014034354_02370106_004_01.h5" (mode r)>

❓🤔❓ Question for the group

Which NSIDC data access paradigm does the above code fit into?

What did we learn?

  • How to use the earthdata library to access datasets

  • How to use earthdata search to generate credentials

❓🤔❓ Question for the Group

When making a request to Earthdata file URLs, how do we know if the data are coming from Earthdata Cloud or on-premise?
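
One way to start answering this (an illustrative sketch, not part of the original tutorial) is to look at the hostname in the file URL. The granules listed above are served from an on-premise NSIDC host, whereas cloud-hosted granules expose s3:// links for direct access or a cloud distribution hostname for HTTPS access.

from urllib.parse import urlparse

# URL copied from the on-premise granule listing above
url = "https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014034354_02370106_004_01.h5"
print(urlparse(url).netloc)  # n5eil01u.ecs.nsidc.org -> an on-premise NSIDC server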

Final Thoughts


4. Tools for cloud computing

We’ve spent a lot of time on how to access the data because that’s the first step to taking advantage of cloud compute resources. Once you have developed a way to access the data on the cloud or on-premise, you may be ready to take the next step and scale your workload using a tool like Dask.

Dask is a flexible library for parallel computing in Python. Dask is composed of two parts: dynamic task scheduling optimized for computation, and “Big Data” collections (like parallel arrays, dataframes, and lists) that extend common interfaces such as NumPy and pandas to larger-than-memory or distributed environments.

https://docs.dask.org/en/stable/


Dask is composed of three parts. “Collections” create “Task Graphs” which are then sent to the “Scheduler” for execution.
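
As a tiny illustration of that flow (not part of the original tutorial), a Dask array is a “collection” whose operations build a task graph; the scheduler only executes it when you ask for the result:

import dask.array as da

# A chunked collection: a 4000x4000 array split into 16 chunks of 1000x1000
x = da.ones((4000, 4000), chunks=(1000, 1000))

# Building the expression only constructs a task graph...
total = (x + x.T).sum()

# ...the default (local) scheduler executes it when we call .compute()
print(total.compute())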

🏋️ Exercise - Using Dask to parallelize computing

from dask.distributed import Client

client = Client(n_workers=4,
                local_directory="/tmp/dask" # local scratch disk space
               )
client

Client: Client-85b9ad26-bc4e-11ec-8cc7-000d3a5532a0
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Basics

First let’s make some toy functions, inc and add, that sleep for a while to simulate work. We’ll then time running these functions normally.

In the next section we’ll parallelize this code.

from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y
%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other

x = inc(1)
y = inc(2)
z = add(x, y)
z
CPU times: user 109 ms, sys: 1.16 ms, total: 110 ms
Wall time: 3 s
5

Parallelize with the dask.delayed decorator

Those two increment calls could be called in parallel, because they are totally independent of one another.

We’ll transform the inc and add functions using the dask.delayed function. When we call the delayed version by passing the arguments, exactly as before, the original function isn’t actually called yet - which is why the cell execution finishes very quickly. Instead, a delayed object is made, which keeps track of the function to call and the arguments to pass to it.

from dask import delayed
%%time
# This runs immediately, all it does is build a graph

x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)
CPU times: user 1.04 ms, sys: 0 ns, total: 1.04 ms
Wall time: 893 µs

This ran immediately, since nothing has really happened yet.

To get the result, call compute. Notice that this runs faster than the original code.

%%time
# This actually runs our computation using a local thread pool

z.compute()
CPU times: user 216 ms, sys: 26.7 ms, total: 242 ms
Wall time: 2.18 s
5

What just happened?

The z object is a lazy Delayed object. This object holds everything we need to compute the final result, including references to all of the functions that are required, along with their inputs and their relationships to one another. We can evaluate the result with .compute() as above or we can visualize the task graph for this value with .visualize().

z
Delayed('add-3477208d-8b6d-4636-93e2-11d0aa7ca6c9')
z.visualize()
[Output of z.visualize(): the task graph for z]

Some questions to consider:

  • Why did we go from 3s to 2s? Why weren’t we able to parallelize down to 1s?

  • What would have happened if the inc and add functions didn’t include the sleep(1)? Would Dask still be able to speed up this code?

  • What if we have multiple outputs or also want to get access to x or y?
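
For the last question, one option (a sketch, not from the original tutorial) is dask.compute, which evaluates several delayed objects in a single pass and shares intermediate work between them:

import dask

# x, y, and z are the delayed objects built above
x_result, y_result, z_result = dask.compute(x, y, z)
print(x_result, y_result, z_result)  # 2 3 5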

Demonstration - Dask

We just used local threads to parallelize this operation. What are some other options for running our code?

Dask Gateway provides a way to connect to more than one machine running dask workers.

Dask Gateway provides a secure, multi-tenant server for managing Dask clusters. It allows users to launch and use Dask clusters in a shared, centrally managed cluster environment, without requiring users to have direct access to the underlying cluster backend (e.g. Kubernetes, Hadoop/YARN, HPC Job queues, etc…).

We can see this in action using the ESIP QHub deployment or a Pangeo Hub.
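
A minimal connection sketch, assuming the hub has dask-gateway installed and configured with working defaults, might look like this:

from dask_gateway import Gateway

gateway = Gateway()              # connect to the hub's default gateway
cluster = gateway.new_cluster()  # request a new Dask cluster from the gateway
cluster.scale(4)                 # ask for 4 workers
client = cluster.get_client()    # use this client just like the local one above
client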


❓🤔❓ Question for the group

What makes running code on a dask cluster different from running it in this notebook?

🎁 Bonus

If we have time, go through the Dask Tutorial from the Pangeo Tutorial Gallery.


Want to learn more?

Join the Pangeo and ESIP Cloud Computing Cluster communities

  • Pangeo is first and foremost a community promoting open, reproducible, and scalable science. This community provides documentation, develops and maintains software, and deploys computing infrastructure to make scientific research and programming easier.

  • Join the ESIP Cloud Computing Cluster for our next Knowledge Sharing Session on Apache Beam for Geospatial data