Cloud Computing Tutorial

For ICESat-2 UW Hackweek 2022

Icebreaker questions

Enter your answers in chat or in your own text editor of choice.

  • Is everyone who wants to follow along logged into their jupyterhub with the notebook open?

  • When you hear the term “cloud computing” what’s the first thing that comes to mind?

  • What concepts or tools are you hoping to learn more about in this tutorial?


Learning Objectives

  1. The difference between code running on your local machine vs a remote environment.

  2. The difference between data hosted by cloud providers, like AWS, and on-prem data centers

  3. How to access data hosted by NASA DAACs, both on-prem and in the cloud (S3)

  4. Cloud computing tools for scaling your science

Key Takeaways

  • At least one tutorial (or tool) to try


Sections

  1. Local vs Remote Resources

  2. Data on the Cloud vs On-premise

  3. How to access NASA data

  4. Tools for cloud computing: Brief introduction to Dask


Setup: Getting prepared for what’s coming!!!

  • Each section includes some key learning points and at least 1 exercise.

  • Configure your screens so you can see both the tutorial and your jupyterhub to follow along.

  • Let’s all log in to https://urs.earthdata.nasa.gov/home before we get started.

  • The bottom of the tutorial lists many other references to revisit later.

  • Let’s install some libraries for this tutorial.


1. Local vs Remote Resources

❓🤔❓ Question for the group:

What’s the difference between running code on your local machine and running it on this remote jupyterhub?

As you are probably aware, this code is running on a machine somewhere in Amazon Web Services (AWS) land.

Figure: AWS data centers

What types of resources are available on this machine?

CPUs

The central processing unit (CPU) or processor, is the unit which performs most of the processing inside a computer. It processes all instructions received by software running on the PC and by other hardware components, and acts as a powerful calculator. Source: techopedia.com

# How many CPUs are running on this machine?
!lscpu | grep ^CPU\(s\):
CPU(s):                          2

Memory

Computer random access memory (RAM) is one of the most important components in determining your system’s performance. RAM gives applications a place to store and access data on a short-term basis. It stores the information your computer is actively using so that it can be accessed quickly. Source: crucial.com

# How much memory is available?
!free -h
              total        used        free      shared  buff/cache   available
Mem:          6.8Gi       851Mi       4.0Gi       9.0Mi       2.0Gi       5.6Gi
Swap:         4.0Gi        11Mi       4.0Gi

If you’re curious about the difference between free and available memory: https://haydenjames.io/free-vs-available-memory-in-linux
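The same numbers can be read from Python. On Linux, free takes its figures from /proc/meminfo, where the "free" column corresponds to MemFree and "available" to MemAvailable. The following is a minimal sketch, assuming a Linux machine (like this jupyterhub); parse_meminfo is a helper name made up for this example.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: size in kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only size lines shaped like "MemFree:   4194304 kB";
        # plain counters such as HugePages_Total have no unit and are skipped.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


meminfo_path = Path("/proc/meminfo")  # exists on Linux only
if meminfo_path.exists():
    info = parse_meminfo(meminfo_path.read_text())
    for field in ("MemTotal", "MemFree", "MemAvailable"):
        if field in info:
            # /proc/meminfo reports kB; dividing by 1024**2 gives GiB
            print(f"{field}: {info[field] / 1024**2:.1f} GiB")
else:
    print("No /proc/meminfo here; this sketch only applies to Linux.")
```

MemAvailable is usually larger than MemFree because it also counts memory that the kernel can reclaim from caches.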

🏋️ Exercise: How many CPUs does your machine have?

Unless you are using a linux machine, the above commands probably won’t give you what you need.

  1. For Mac users: sysctl -a | grep cpu | grep hw or https://www.linux.com/training-tutorials/5-commands-checking-memory-usage-linux/

  2. For Windows users: https://www.top-password.com/blog/find-number-of-cores-in-your-cpu-on-windows-10/ (Not tested)
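If you would rather not depend on operating-system commands, Python's standard library offers a portable option. A small sketch: os.cpu_count() reports logical CPUs (which may be more than the number of physical cores) and can return None if the count cannot be determined.

```python
import os
import platform

# Number of logical CPUs visible to the operating system (None if unknown)
cpu_count = os.cpu_count()

print(f"Operating system: {platform.system()}")
print(f"Logical CPUs: {cpu_count}")
```

Run this both on the jupyterhub and on your own laptop to compare the two machines.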

❓🤔❓Question for the group

When might you want to use a remote machine and when might you want to use your local machine?


2. Data on the Cloud vs On-premise

What’s the difference between data hosted on the cloud and on-prem data centers?

NASA Distributed Active Archive Centers (DAACs)

NASA DAACs are in the process of migrating their collections to the “Earthdata Cloud”. At this time, most datasets are still located and accessible “on-premise” at NASA DAACs, while high-priority and new datasets are being stored on AWS Simple Storage Service (S3). Depending on your use case, you may need to access datasets both from the NASA DAACs and from NASA’s Earthdata Cloud (AWS S3).

  • Datasets are still managed by the DAAC, but the DAAC stores files on AWS S3.

  • The DAACs’ services will be collocated in the cloud with the data.

  • Users are encouraged to access the data collocated in the cloud through AWS compute services (like this jupyterhub!)
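One place where the difference shows up in practice is the address of a file. On-premise files are served from DAAC web servers over HTTPS, while objects in S3 are addressed by bucket and key. The sketch below uses only the standard library. The HTTPS URL is an NSIDC granule link that appears in this tutorial, while the S3 bucket and key are hypothetical placeholders. Cloud-hosted data can also be served over HTTPS, so treat the URL scheme as a hint rather than proof of where a file lives.

```python
from urllib.parse import urlparse


def describe_location(url):
    """Return a short description of where a data URL points."""
    parts = urlparse(url)
    if parts.scheme == "s3":
        # For s3:// URLs the "host" part is the bucket and the path is the key
        return f"S3 object: bucket={parts.netloc}, key={parts.path.lstrip('/')}"
    if parts.scheme in ("http", "https"):
        return f"HTTP(S) file: host={parts.netloc}, path={parts.path}"
    return f"Unrecognized scheme: {parts.scheme!r}"


on_prem_url = (
    "https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/"
    "ATL08_20181014034354_02370106_004_01.h5"
)
# Hypothetical bucket and key, for illustration only
cloud_url = "s3://example-daac-bucket/ATLAS/ATL08/example_granule.h5"

for url in (on_prem_url, cloud_url):
    print(describe_location(url))
```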

🏋️ Exercise

Navigate to search.earthdata.nasa.gov, search for ICESat-2, and answer the following questions:

  1. Which DAAC hosts ICESat-2 datasets?

  2. Which ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?

What did we learn?

NASA has a new cloud paradigm, which includes data stored both on-premise and on the cloud. NASA DAACs are also providing services on AWS.

PO.DAAC has a great diagram for this new paradigm (source: https://podaac.jpl.nasa.gov/cloud-datasets/about).

Figure: PO.DAAC cloud ecosystem diagram

Final thought: Other cloud data providers

AWS is of course not the only cloud provider and Earthdata can be found on other popular cloud providers.


3. How to access NASA data

How do we access data hosted on-prem and on the cloud? What are some tools we can use?

Earthdata Login for Access

NASA uses Earthdata Login (EDL) to authenticate users requesting data and to track usage. You must supply EDL credentials to access all data hosted by NASA. Some data is available to all EDL users and some is restricted.

You can access NASA data from your local machine using a ~/.netrc file, which stores your EDL credentials.
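To make the file format concrete, here is a minimal sketch of what a .netrc entry looks like and how Python's standard netrc module reads it. It writes a throwaway file in a temporary directory instead of touching your real ~/.netrc, and my_username and my_password are placeholders, not real credentials.

```python
import netrc
import os
import tempfile

EDL_HOST = "urs.earthdata.nasa.gov"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.netrc")

    # One entry per host: machine <host> login <user> password <password>
    with open(path, "w") as f:
        f.write(f"machine {EDL_HOST} login my_username password my_password\n")

    # The file holds a plain-text password, so restrict it to the owner
    os.chmod(path, 0o600)

    # authenticators() returns a (login, account, password) tuple for the host
    login, _account, password = netrc.netrc(path).authenticators(EDL_HOST)

print(f"Found credentials for {EDL_HOST}: login={login}")
```

Tools that support .netrc look up the entry for the host they are contacting, so you do not have to type your credentials each time.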

🏋️ Exercise 1: Access via Earthdata Login using ~/.netrc

The Openscapes Tutorial 04. Authentication for NASA Earthdata offers an excellent quick tutorial on how to create a ~/.netrc file.

  • Exercise: Review the tutorial and answer the following question: Why might you want to be careful running this code in a shared jupyterhub environment?

  • Takehome exercise: Run through the code on your local machine

🏋️ Exercise 2: Use the earthdata library to access ICESat-2 data “on-premise” at NSIDC

Programmatic access to NSIDC data can happen in 2 ways:

  1. Search -> Download -> Process -> Research
     Diagram: https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/download-model.png

  2. Search -> Process in the cloud -> Research
     Diagram: https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/cloud-model.png

Credit: Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science.

For this exercise, we are going to use NSIDC’s earthdata Python library to find and download ATL08 files from the NSIDC DAAC via HTTPS.

# Login using earthdata
from earthdata import Auth, DataGranules, DataCollections, Store
import os.path

auth = Auth()

# For GitHub CI, we can use ~/.netrc
if os.path.isfile(os.path.expanduser('~/.netrc')):
    auth.login(strategy='netrc')
else:
    auth.login(strategy='interactive')
You're now authenticated with NASA Earthdata Login

The earthdata library uses a session, so credentials are not stored in files.

auth._session
<earthdata.auth.SessionWithHeaderRedirection at 0x7f86e0811550>

# Find some ICESat-2 ATL08 granules and display them.
# bounding_box takes (lower-left lon, lower-left lat, upper-right lon, upper-right lat).
granules = DataGranules().short_name('ATL08').bounding_box(-10, 20, 10, 50).get(5)
for g in granules:
    display(g)

Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014034354_02370106_004_01.h5

Size: 114.3358402252 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}

Data: https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2018.10.14/ATL08_20181014034354_02370106_005_01.h5

Size: 118.2040967941 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}

Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014035224_02370107_004_01.h5

Size: 100.266831398 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 27.0, 'StartDirection': 'D', 'EndLatitude': 0.0, 'EndDirection': 'D'}}}

Data: https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2018.10.14/ATL08_20181014035224_02370107_005_01.h5

Size: 103.6830654144 MB

Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 27.0, 'StartDirection': 'D', 'EndLatitude': 0.0, 'EndDirection': 'D'}}}

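Notice that the results above contain the same two granules twice, once under ATL08.004 and once under ATL08.005: the search did not pin a product version. ICESat-2 file names encode this information as PRODUCT_yyyymmddhhmmss_ttttccss_vvv_rr.h5, where tttt is the reference ground track, cc the cycle, ss the region (segment) number, vvv the version, and rr the revision. The sketch below decodes the four URLs listed above with a regular expression; parse_granule_url is a helper written for this example, not part of the earthdata library.

```python
import re
from datetime import datetime

# PRODUCT_yyyymmddhhmmss_ttttccss_vvv_rr.h5
GRANULE_PATTERN = re.compile(
    r"(?P<product>ATL\d{2})_(?P<timestamp>\d{14})_"
    r"(?P<rgt>\d{4})(?P<cycle>\d{2})(?P<region>\d{2})_"
    r"(?P<version>\d{3})_(?P<revision>\d{2})\.h5$"
)


def parse_granule_url(url):
    """Split an ICESat-2 granule URL or file name into its named fields."""
    match = GRANULE_PATTERN.search(url)
    if match is None:
        raise ValueError(f"Not an ICESat-2 granule name: {url}")
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y%m%d%H%M%S")
    return fields


base = "https://n5eil01u.ecs.nsidc.org"
urls = [
    f"{base}/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014034354_02370106_004_01.h5",
    f"{base}/DP7/ATLAS/ATL08.005/2018.10.14/ATL08_20181014034354_02370106_005_01.h5",
    f"{base}/DP9/ATLAS/ATL08.004/2018.10.14/ATL08_20181014035224_02370107_004_01.h5",
    f"{base}/DP7/ATLAS/ATL08.005/2018.10.14/ATL08_20181014035224_02370107_005_01.h5",
]

for url in urls:
    f = parse_granule_url(url)
    print(
        f"{f['product']} v{f['version']} rev {f['revision']}: "
        f"{f['timestamp']:%Y-%m-%d %H:%M:%S}, "
        f"RGT {f['rgt']}, cycle {f['cycle']}, region {f['region']}"
    )
```

Grouping on the timestamp, track, cycle, and region fields makes it easy to keep only one version of each granule before downloading.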