Cloud Computing Tutorial
Contents
Icebreaker questions¶
Enter your answers in chat or in your own text editor of choice.
Is everyone who wants to follow along logged into their jupyterhub with the notebook open?
When you hear the term “cloud computing” what’s the first thing that comes to mind?
What concepts or tools are you hoping to learn more about in this tutorial?
Learning Objectives¶
The difference between code running on your local machine vs a remote environment.
The difference between data hosted by cloud providers, like AWS, and data hosted in on-prem data centers
How to access data hosted by NASA DAACs, whether on-premise or on the cloud (AWS S3)
Cloud computing tools for scaling your science
Key Takeaways¶
At least one tutorial (or tool) to try
Sections¶
Local vs Remote Resources
Data on the Cloud vs On-premise
How to access NASA data
Tools for cloud computing: Brief introduction to Dask
Setup: Getting prepared for what’s coming!¶
Each section includes some key learning points and at least 1 exercise.
Configure your screens so you can see both the tutorial and your jupyterhub to follow along.
Let’s all log in to https://urs.earthdata.nasa.gov/home before we get started.
The bottom of this tutorial lists many other references for revisiting later.
Let’s install some libraries for this tutorial.
1. Local vs Remote Resources¶
❓🤔❓ Question for the group:¶
What’s the difference between running code on your local machine and on this remote jupyterhub?
As you are probably aware, this code is running on a machine somewhere in Amazon Web Services (AWS) land.
![aws data centers](../../_images/aws-data-centers.png)
What types of resources are available on this machine?¶
CPUs¶
The central processing unit (CPU), or processor, is the unit that performs most of the processing inside a computer. It processes all instructions received by software running on the machine and by other hardware components, and acts as a powerful calculator. Source: techopedia.com
# How many CPUs are running on this machine?
!lscpu | grep ^CPU\(s\):
CPU(s): 2
Memory¶
Computer random access memory (RAM) is one of the most important components in determining your system’s performance. RAM gives applications a place to store and access data on a short-term basis. It stores the information your computer is actively using so that it can be accessed quickly. Source: crucial.com
# How much memory is available?
!free -h
total used free shared buff/cache available
Mem: 6.8Gi 851Mi 4.0Gi 9.0Mi 2.0Gi 5.6Gi
Swap: 4.0Gi 11Mi 4.0Gi
If you’re curious about the difference between free and available memory: https://haydenjames.io/free-vs-available-memory-in-linux
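Since this jupyterhub runs Linux, you can also read these numbers directly from /proc/meminfo, which is where free gets them. A minimal sketch (guarded so it only runs on Linux):

```python
import os

# `free` reads its numbers from /proc/meminfo. "MemFree" is memory nobody is
# using at all, while "MemAvailable" estimates how much memory applications
# could still claim without swapping (it includes reclaimable page cache),
# which is why it is usually the more useful number.
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    for key in ("MemTotal", "MemFree", "MemAvailable"):
        print(f"{key}: {meminfo[key].strip()}")
```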
🏋️ Exercise: How many CPUs does your machine have¶
Unless you are using a Linux machine, the above commands probably won’t give you what you need.
For macOS users:
sysctl -a | grep cpu | grep hw
or https://www.linux.com/training-tutorials/5-commands-checking-memory-usage-linux/
For Windows users: https://www.top-password.com/blog/find-number-of-cores-in-your-cpu-on-windows-10/ (not tested)
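A cross-platform alternative that works on Linux, macOS, and Windows is Python’s standard library:

```python
import os

# os.cpu_count() returns the number of logical CPUs on any platform
# (it may return None on exotic systems, hence the fallback).
n_cpus = os.cpu_count() or 1
print(f"This machine reports {n_cpus} logical CPU(s)")
```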
❓🤔❓Question for the group¶
When might you want to use a remote machine and when might you want to use your local machine?
2. Data on the Cloud vs On-premise¶
What’s the difference between data hosted on the cloud and on-prem data centers?¶
NASA DAACs are in the process of migrating their collections to the “Earthdata Cloud”. At this time, most datasets are still located and accessible “on-premise” at NASA DAACs, while high-priority and new datasets are being stored on AWS Simple Storage Service (S3). Depending on your use case, you will need to access datasets both from NASA DAACs and from NASA’s Earthdata Cloud (AWS S3).
Datasets are still managed by the DAAC, but the DAAC stores files on AWS S3.
The DAACs’ services will be collocated in the cloud with the data.
Users are encouraged to access the data collocated in the cloud through AWS compute services (like this jupyterhub!)
🏋️ Exercise¶
Navigate to search.earthdata.nasa.gov, search for ICESat-2, and answer the following questions:
Which DAAC hosts ICESat-2 datasets?
Which ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?
What did we learn?¶
NASA has a new cloud paradigm, which includes data stored both on-premise and on the cloud. NASA DAACs are also providing services on AWS.
PO.DAAC has a great diagram of this new paradigm, source: https://podaac.jpl.nasa.gov/cloud-datasets/about
Final thought: Other cloud data providers¶
AWS is of course not the only cloud provider, and Earth science data can be found on other popular cloud providers as well.
AWS also hosts a public data registry, Open Data on AWS, and a sustainability-focused subset, the Registry of Open Data on AWS: Sustainability Data Initiative.
3. How to access NASA data¶
How do we access data hosted on-prem and on the cloud? What are some tools we can use?
Earthdata Login for Access¶
NASA uses Earthdata Login (EDL) to authenticate users requesting data and to track usage. You must supply EDL credentials to access data hosted by NASA; some data is available to all EDL users and some is restricted.
You can access data from NASA by storing your EDL credentials locally in a ~/.netrc file.
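For reference, an EDL entry in ~/.netrc looks like the snippet below (placeholder credentials, not real ones), and Python’s standard-library netrc module can parse it. To avoid touching your real ~/.netrc, this sketch parses a temporary copy instead:

```python
import netrc
import os
import tempfile

# A minimal .netrc entry for Earthdata Login (placeholder credentials):
entry = (
    "machine urs.earthdata.nasa.gov\n"
    "    login YOUR_EDL_USERNAME\n"
    "    password YOUR_EDL_PASSWORD\n"
)

# Write it to a temporary file and parse it to verify the format
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(entry)
    path = f.name

login, account, password = netrc.netrc(path).authenticators("urs.earthdata.nasa.gov")
print(login)  # YOUR_EDL_USERNAME
os.remove(path)
```

A real ~/.netrc holds your password in plain text, so keep its permissions restricted (e.g. chmod 600) — especially on a shared machine.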
🏋️ Exercise 1: Access via Earthdata Login using ~/.netrc¶
The Openscapes Tutorial 04. Authentication for NASA Earthdata offers an excellent quick tutorial on how to create a ~/.netrc file.
Exercise: Review the tutorial and answer the following question: Why might you want to be careful running this code in a shared jupyterhub environment?
Take-home exercise: Run through the code on your local machine.
🏋️ Exercise 2: Use the earthdata library to access ICESat-2 data “on-premise” at NSIDC¶
Programmatic access to NSIDC data can happen in two ways:
Search -> Download -> Process -> Research
![https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/download-model.png](https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/download-model.png)
Search -> Process in the cloud -> Research
![https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/cloud-model.png](https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/cloud-model.png)
Credit: Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science.
For this exercise, we are going to use NSIDC’s earthdata Python library to find and download ATL08 files from the NSIDC DAAC via HTTPS.
# Login using earthdata
from earthdata import Auth, DataGranules, DataCollections, Store
import os.path
auth = Auth()
# For GitHub CI, we can use ~/.netrc
if os.path.isfile(os.path.expanduser('~/.netrc')):
auth.login(strategy='netrc')
else:
auth.login(strategy='interactive')
You're now authenticated with NASA Earthdata Login
The earthdata library uses a session, so credentials are not stored in files.
auth._session
<earthdata.auth.SessionWithHeaderRedirection at 0x7f86e0811550>
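That SessionWithHeaderRedirection object follows a well-known NASA recipe: a requests.Session subclass that keeps your Authorization header when EDL redirects you between trusted hosts, but drops it when a redirect leaves them. A minimal sketch of the pattern (not the library’s exact implementation):

```python
import requests

class SessionWithHeaderRedirection(requests.Session):
    """Keep Basic Auth headers across Earthdata Login redirects."""

    AUTH_HOST = "urs.earthdata.nasa.gov"

    def __init__(self, username, password):
        super().__init__()
        self.auth = (username, password)

    def rebuild_auth(self, prepared_request, response):
        # requests calls this on every redirect; by default it strips auth
        # headers whenever the host changes. Keep them for the EDL host,
        # drop them only when leaving the trusted hosts entirely.
        headers = prepared_request.headers
        url = prepared_request.url
        if "Authorization" in headers:
            original = requests.utils.urlparse(response.request.url).hostname
            redirect = requests.utils.urlparse(url).hostname
            if (original != redirect
                    and redirect != self.AUTH_HOST
                    and original != self.AUTH_HOST):
                del headers["Authorization"]
```

This is why the session can authenticate you against data servers without ever writing credentials to disk.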
# Find some ICESat-2 ATL08 granules and display them
granules = DataGranules().short_name('ATL08').bounding_box(-10,20,10,50).get(5)
[display(g) for g in granules[0:5]]
Size: 114.3358402252 MB
Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}
Size: 118.2040967941 MB
Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}
Size: 100.266831398 MB
Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 27.0, 'StartDirection': 'D', 'EndLatitude': 0.0, 'EndDirection': 'D'}}}
Size: 103.6830654144 MB
Spatial: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -174.28814360881728, 'StartLatitude': 27.0, 'StartDirection': 'D', 'EndLatitude': 0.0, 'EndDirection': 'D'}}}