<img src='./img/nsidc_logo.png'/>

# Data Discovery and Access via **earthdata** library


**Credits**
* Notebook by: Luis Lopez and Mikala Beig
* Source material: [earthdata demo notebook](https://github.com/nsidc/earthdata) by Luis Lopez

## Objective

* Use programmatic data access to discover and access NASA DAAC data using the **earthdata** library.

---

## Motivation and Background
TL;DR: **earthdata** uses NASA APIs to search, preview, and access NASA datasets on-prem and in the cloud (with 4 lines of Python!).
___

There are many ways to access NASA datasets. We can use the Earthdata search portal, DAAC-specific portals or tools, or Open Altimetry.
These web portals are powerful, but they are not designed for programmatic access or reproducible workflows,
both of which are extremely important in the age of the cloud and open science.

The good news is that NASA also exposes APIs that allow us to search, transform, and access data programmatically.
There are already some very useful client libraries for these APIs:

* icepyx
* python-cmr
* eo-metadata-tools
* harmony-py
* Hyrax (OpenDAP)
* cmr-stac
* others

Each of these libraries has amazing features and some similarities. 

In this context, **earthdata** aims to be a simple library that handles the important parts of the metadata so we can access or download data without having to worry about whether a given dataset is on-premises (DAAC server) or in the cloud. **earthdata** is a work in progress and improves often. You are encouraged to contribute to this [open-source library](https://github.com/nsidc/earthdata).

Some important strengths of the earthdata library:
* Discovery and access to on-prem and cloud-hosted data.
* Access to data across all NASA DAACs.
* Easy handling of S3 credentials for direct access to cloud-hosted data.

## Key Steps for Programmatic Data Access

There are a few key steps for accessing data from the NASA DAAC APIs:
1. Authenticate with NASA Earthdata Login (and, for cloud-hosted data, with AWS access keys and a token).
2. Query CMR to find data using filters, such as spatial extent and temporal range.
3. Order and download your data by interacting with DAAC APIs.


We'll go through each of these steps during this tutorial, and at the end we'll summarize how `earthdata` streamlines this process into a minimal number of lines of code.
___

### **Step 0. Import classes**

In [None]:
# Import classes from earthdata

from earthdata import Auth, DataCollections, DataGranules, Store

### **Step 1. Earthdata login**

To access data using the earthdata library, it is necessary to log into [Earthdata Login](https://urs.earthdata.nasa.gov/). To do this, execute the following code cell and enter your NASA Earthdata credentials when prompted.

**Note**: If you don't have NASA Earthdata credentials, register first at the link above. You don't need to be a NASA employee to register with NASA Earthdata! If you did not enter your Earthdata Login username and email into the form in the pre-Hackweek email, you will not be on the ICESat-2 cloud data early access list and will not have access to ICESat-2 data in the cloud. You will still have access to all publicly available data sets.


In [None]:
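# Log in with our Earthdata Login credentials: try a .netrc file first,
# then fall back to an interactive prompt if that does not authenticate us.

auth = Auth().login(strategy='netrc')
if not auth.authenticated:
    auth = Auth().login(strategy='interactive')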
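
To make the `netrc` strategy less of a black box, here is a minimal sketch of how credentials can be looked up in a `.netrc`-style file with Python's standard `netrc` module. This is not the library's code: the helper `read_edl_credentials` is hypothetical, and we are assuming the entry is keyed by the Earthdata Login host, `urs.earthdata.nasa.gov`. The demo writes a throwaway file with made-up credentials so it runs anywhere.

```python
import netrc
import os
import tempfile

# Assumption: the .netrc entry is keyed by the Earthdata Login host.
EDL_HOST = "urs.earthdata.nasa.gov"


def read_edl_credentials(netrc_path):
    """Return (username, password) for Earthdata Login, or None if not found.

    Hypothetical helper for illustration; not part of earthdata.
    """
    try:
        entry = netrc.netrc(netrc_path).authenticators(EDL_HOST)
    except (FileNotFoundError, netrc.NetrcParseError):
        return None
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


# Demo with a throwaway file and made-up credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "netrc")
    with open(demo_path, "w") as fh:
        fh.write(f"machine {EDL_HOST} login demo_user password demo_pass\n")
    found = read_edl_credentials(demo_path)
    missing = read_edl_credentials(os.path.join(tmp, "does-not-exist"))

print(found)
print(missing)
```

When the lookup comes back empty, falling back to an interactive prompt (as the cell above does) is the natural next step.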

### **Step 2. Query the Common Metadata Repository (CMR)**

#### Query CMR for Data Collections

The DataCollections class can query CMR for any collection (collection = data set) using all of CMR's query parameters, and it has built-in accessors for the common ones.
This makes it ideal for one-liners and keeps the notation simple.

This means we can narrow our search in CMR by filtering on keyword, temporal range, area of interest, and data provider, e.g.:
- temporal("2020-03-01", "2020-03-30")
- keyword('land ice')
- bounding_box(-134.7,58.9,-133.9,59.2)
- provider("NSIDC_ECS")
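
Under the hood, filters like these end up as query parameters on a CMR search request. The sketch below shows roughly what such a request URL could look like, using only the standard library and sending nothing over the network. The helper `build_collection_search_url` is hypothetical, and padding plain dates to whole days in UTC is our assumption, not necessarily what earthdata does.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# CMR collection search endpoint returning UMM JSON.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def build_collection_search_url(keyword=None, bounding_box=None,
                                temporal=None, provider=None, page_size=10):
    """Assemble a CMR collection search URL (hypothetical helper; no request is sent)."""
    params = {"page_size": page_size}
    if keyword:
        params["keyword"] = keyword
    if bounding_box:
        # CMR expects west, south, east, north (lower-left, then upper-right corner).
        params["bounding_box"] = ",".join(str(v) for v in bounding_box)
    if temporal:
        start, end = temporal
        # Assumption: pad plain dates so the range covers whole days in UTC.
        params["temporal"] = f"{start}T00:00:00Z,{end}T23:59:59Z"
    if provider:
        params["provider"] = provider
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


url = build_collection_search_url(
    keyword="land ice",
    bounding_box=(-134.7, 58.9, -133.9, 59.2),
    temporal=("2020-03-01", "2020-03-30"),
    provider="NSIDC_ECS",
)
print(url)

# Decode the query string again to check what would be sent.
query = parse_qs(urlsplit(url).query)
print(query["bounding_box"])
```

Seeing the raw parameters also makes it easier to cross-check a query against the CMR API documentation linked at the end of this notebook.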


We're going to go through a couple of examples of querying CMR and accessing data: the first for on-prem data and the second for cloud-hosted data.

The first thing we'll do is set up our query object.

In [None]:
Query = DataCollections().keyword('land ice').bounding_box(-134.7,58.9,-133.9,59.2).provider("NSIDC_ECS")

# Query = DataCollections().keyword('land ice').bounding_box(-134.7,58.9,-133.9,59.2).daac("NSIDC")
# Query = DataCollections().keyword('land ice').bounding_box(-134.7,58.9,-133.9,59.2).data_center("NSIDC")

print(f'Collections found: {Query.hits()}')


Then we'll create a collections object from our query.

In [None]:

collections = Query.get(10)

# Inspect 1st result.

print(collections[0:1])

To reduce the number of metadata fields displayed, we can select which fields to print when creating our collections object.

In [None]:
collections = Query.fields(['ShortName','Abstract']).get(5)

# Inspect 5 results printing just the ShortName and Abstract

print(collections[0:5])

The results from DataCollections are enhanced Python dict objects, so we can pick out individual CMR metadata fields by key.

The concept ID is an important parameter in CMR.  It's a unique identifier for a data collection (collection = data set).  We'll use the concept ID when querying for data granules (granules = files) below.

In [None]:
collections[0]["meta.concept-id"]

In [None]:
collections[0]["umm.ShortName"]

In [None]:
collections[0]["umm.RelatedUrls"]
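
The dotted keys above (`"meta.concept-id"`, `"umm.ShortName"`) reach into nested metadata. One way such an "enhanced dict" could work is sketched below. `DottedDict` is a hypothetical class written for illustration, not the library's implementation, and the record is a hand-written sample rather than real CMR output.

```python
class DottedDict(dict):
    """Hypothetical dict subclass that resolves 'a.b.c' keys through nested dicts."""

    def __getitem__(self, key):
        # Plain keys behave exactly as in a normal dict.
        if dict.__contains__(self, key):
            return dict.__getitem__(self, key)
        # Otherwise walk the nested dicts one path segment at a time.
        value = self
        for part in str(key).split("."):
            if not isinstance(value, dict):
                raise KeyError(key)
            value = dict.__getitem__(value, part)
        return value


# Hand-written sample record shaped like the fields used above.
record = DottedDict({
    "meta": {"concept-id": "C2144439155-NSIDC_ECS"},
    "umm": {"ShortName": "ATL06", "Version": "005"},
})

print(record["meta.concept-id"])
print(record["umm.ShortName"])
print(record["umm"]["Version"])  # ordinary nested access still works
```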

#### Query CMR for Data Granules

The DataGranules class provides functionality similar to the DataCollections class. As mentioned above, concept IDs are unique identifiers for data sets (collections). To query for granules from the exact data set and version you are interested in, query granules by concept ID.
You can also search for data granules by short name, but that will most likely return granules from different versions of the same data set. Even when you specify both short name and version number, a query won't distinguish between on-prem and cloud-hosted granules.
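
The concept IDs used in this notebook follow a visible pattern: a type letter (`C` for collection), a number, and the provider ID after the hyphen. That trailing provider is why a concept ID can tell on-prem and cloud-hosted copies apart. A small sketch of pulling those pieces out, with a hypothetical `parse_concept_id` helper and a regular expression that is our own guess at the pattern:

```python
import re
from typing import NamedTuple

# Assumed pattern: <type letter(s)><number>-<provider>, e.g. C2144439155-NSIDC_ECS
CONCEPT_ID_PATTERN = re.compile(r"^(?P<kind>[A-Z]+)(?P<number>\d+)-(?P<provider>\w+)$")


class ConceptId(NamedTuple):
    kind: str
    number: int
    provider: str


def parse_concept_id(concept_id):
    """Split a concept ID into type letter, number, and provider (hypothetical helper)."""
    match = CONCEPT_ID_PATTERN.match(concept_id)
    if match is None:
        raise ValueError(f"not a recognizable concept ID: {concept_id!r}")
    return ConceptId(match["kind"], int(match["number"]), match["provider"])


# The two collection concept IDs used in this notebook.
on_prem = parse_concept_id("C2144439155-NSIDC_ECS")
cloud = parse_concept_id("C2153572614-NSIDC_CPRD")

print(on_prem.provider)
print(cloud.provider)
```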

In this example we're querying for data granules from the ICESat-2 [ATL06](https://nsidc.org/data/ATL06) version `005` dataset.

In [None]:
# Generally speaking we won't need the auth instance for *queries* to collections and granules, unless the data set is under restricted access (like NSIDC_CPRD).

Query = DataGranules().concept_id('C2144439155-NSIDC_ECS').bounding_box(-134.7,58.9,-133.9,59.2).temporal("2020-03-01", "2020-03-30")
print(f'Granules found: {Query.hits()}')


In [None]:
granules = Query.get()
print(granules[0:4])

#### Pretty printing data granules

Since we are in a notebook, we can use the built-in function `display` to see a more user-friendly version of the granules.
This renders the browse image for a granule if one is available, and it will eventually look similar to the representation in the Earthdata search portal.

In [None]:
# Print 2 granules using display
for granule in granules[0:2]:
    display(granule)
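
For the curious: in Jupyter, `display` shows a rich view of any object that defines a `_repr_html_` method. The sketch below illustrates that hook with a hypothetical `GranulePreview` class and a placeholder browse URL. It is a stand-in for illustration, not the class earthdata returns.

```python
from html import escape


class GranulePreview:
    """Hypothetical stand-in for a granule result; not the library's class."""

    def __init__(self, name, size_mb, browse_url=None):
        self.name = name
        self.size_mb = size_mb
        self.browse_url = browse_url

    def _repr_html_(self):
        # Jupyter calls this method to build the rich HTML view of the object.
        parts = [f"<p><b>{escape(self.name)}</b> ({self.size_mb:.1f} MB)</p>"]
        if self.browse_url:
            parts.append(f'<img src="{escape(self.browse_url, quote=True)}" width="300"/>')
        return "".join(parts)


preview = GranulePreview("example_granule.h5", 42.0, "https://example.com/browse.png")
html = preview._repr_html_()
print(html)
```

In a notebook, `display(preview)` would render the name, the size, and the browse image instead of printing raw text.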

### **Step 3. Accessing the data**

#### On-prem access 📡

DAAC hosted data

In [None]:
%%time
# Accessing on-prem data means downloading it if we are in a local environment, or "uploading" it if we are in the cloud.
access = Store(auth)
files = access.get(granules[1:2], local_path = "/tmp/demo-atl06")

In a terminal, "ls /tmp" to see where the files are going.
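
If we would rather stay in the notebook, a few lines of standard-library Python do the same job as `ls`. The helper `list_downloads` below is a hypothetical convenience function. The demo runs against a throwaway directory so it works anywhere.

```python
import tempfile
from pathlib import Path


def list_downloads(directory):
    """Return sorted (file name, size in bytes) pairs for the files in a directory."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())


# Demo against a throwaway directory with one fake 1 KiB "granule".
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_granule.h5").write_bytes(b"\x00" * 1024)
    listing = list_downloads(tmp)

print(listing)
```

After the download cell above has run, `list_downloads("/tmp/demo-atl06")` should show the files that were fetched.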

#### Cloud access ☁️

Same API, just a different place

The cloud is not magic, but on-demand infrastructure is quite handy for many scientific workflows, especially if the data already lives in "the cloud".
As for NASA, data migration started in 2020 and will continue into the foreseeable future. Not all, but most, NASA data will be available in Amazon Web Services Simple Storage Service (AWS S3).

To work with this data, we first need the proper credentials for the S3 buckets that hold it. These credentials are issued on a per-DAAC basis and last only 1 hour. In the near future, the Auth class will keep track of this and regenerate the credentials as needed.
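
To illustrate what "keeping track" of short-lived credentials could involve, here is a minimal sketch of a per-DAAC cache that refetches credentials shortly before the one-hour lifetime runs out. Everything in it is hypothetical (the `S3CredentialCache` class, the 5-minute safety margin, the fake fetcher). It is not how the Auth class works today.

```python
import time

ONE_HOUR_SECONDS = 3600


class S3CredentialCache:
    """Hypothetical per-DAAC cache that refetches credentials shortly before expiry."""

    def __init__(self, fetch, lifetime=ONE_HOUR_SECONDS, margin=300, clock=time.monotonic):
        self._fetch = fetch        # callable: DAAC name -> credentials dict
        self._lifetime = lifetime  # how long credentials stay valid, in seconds
        self._margin = margin      # refetch this many seconds before expiry
        self._clock = clock        # injectable clock, handy for testing
        self._cache = {}           # DAAC name -> (credentials, fetched_at)

    def get(self, daac):
        now = self._clock()
        cached = self._cache.get(daac)
        if cached is not None:
            credentials, fetched_at = cached
            if now - fetched_at < self._lifetime - self._margin:
                return credentials
        credentials = self._fetch(daac)
        self._cache[daac] = (credentials, now)
        return credentials


# Demo with a fake fetcher and a clock we control by hand.
calls = []
fake_now = [0.0]


def fake_fetch(daac):
    calls.append(daac)
    return {"daac": daac, "token": f"token-{len(calls)}"}


cache = S3CredentialCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get("NSIDC")
second = cache.get("NSIDC")  # still fresh, so the cached credentials are reused
fake_now[0] = 3400.0         # inside the 5-minute margin before expiry
third = cache.get("NSIDC")   # stale, so new credentials are fetched

print(len(calls))
```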

With `earthdata`, a researcher can get the files with the same API call regardless of whether they are on-prem or cloud-based. An important consideration, though, is that if we want direct access to data in the cloud, we must run the code in the cloud. This is because some S3 buckets are configured to allow direct access (s3:// links) only if the requester is in the same AWS region, `us-west-2`.
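
The region rule suggests a simple decision: use the `s3://` link when the code runs in `us-west-2`, and fall back to an `https://` link otherwise. A minimal sketch of that choice, assuming a granule exposes both kinds of link. The helper `choose_access_url` and the example URLs are hypothetical.

```python
CLOUD_REGION = "us-west-2"


def choose_access_url(urls, current_region):
    """Prefer an s3:// link in-region; otherwise fall back to https:// (hypothetical helper)."""
    s3_links = [u for u in urls if u.startswith("s3://")]
    https_links = [u for u in urls if u.startswith("https://")]
    if current_region == CLOUD_REGION and s3_links:
        return s3_links[0]
    if https_links:
        return https_links[0]
    raise ValueError("no link usable from this location")


# Placeholder links for illustration only.
links = [
    "s3://example-bucket/path/granule.h5",
    "https://example.com/path/granule.h5",
]

in_cloud = choose_access_url(links, "us-west-2")
on_laptop = choose_access_url(links, None)

print(in_cloud)
print(on_laptop)
```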


In [None]:
Query = DataCollections(auth).keyword('land ice').bounding_box(-134.7,58.9,-133.9,59.2).provider("NSIDC_CPRD")

print(f'Collections found: {Query.hits()}') 

Oh no!  What!?  Zero hits? :(   

The `hits` method above reports the number of query hits, but only for publicly available data sets.
Because cloud-hosted ICESat-2 data are not yet publicly available, CMR returns 0 hits if you filter DataCollections by provider = NSIDC_CPRD.
For now we need an alternative way to see how many cloud data sets are available at NSIDC. This is only temporary, until cloud-hosted ICESat-2 data become publicly available.
We can create a collections object (we're going to want one of these soon anyhow) and print its len() to see the true number of hits.

Create a collections object

In [None]:
# We can create a collections object from our query.

collections = Query.fields(['ShortName','Abstract']).get()

print(len(collections))


In [None]:
# Inspect the first 5 results.

print(collections[0:5])

In [None]:
Query = DataGranules(auth).concept_id("C2153572614-NSIDC_CPRD").bounding_box(-134.7,58.9,-133.9,59.2).temporal("2020-03-01", "2020-03-30")
print(f"Granule hits: {Query.hits()}")
cloud_granules = Query.get(4)
print(len(cloud_granules))
# is this a cloud hosted data granule?
cloud_granules[0].cloud_hosted

In [None]:
%%time

# If we get an error here, it is most likely because we are running this code outside the us-west-2 region.
try:
    files = access.get(cloud_granules[0:2], "/tmp/demo-NSIDC_CPRD/")
except Exception as e:
    # Show the underlying error so we can tell a region problem from anything else.
    print(f"Direct access failed; maybe we are not in us-west-2: {e}")

## Recap

```python
from earthdata import Auth, DataGranules, DataCollections, Store
auth = Auth().login()
access = Store(auth)

Query = DataGranules(auth).concept_id("C2144439155-NSIDC_ECS").bounding_box(-134.7,58.9,-133.9,59.2).temporal("2020-03-01", "2020-03-30")
granules = Query.get()
files = access.get(granules)
```

**Wait, we said 4 lines of Python**

```python
from earthdata import Auth, DataGranules, Store
auth = Auth().login()
granules = DataGranules(auth).concept_id("C2144439155-NSIDC_ECS").bounding_box(-134.7,58.9,-133.9,59.2).temporal("2020-03-01", "2020-03-30").get_all()
files = Store(auth).get(granules, '/tmp')
```

The demo notebook in the [earthdata library GitHub repo](https://github.com/nsidc/earthdata) showcases much more of earthdata's capabilities, including many handy methods for querying CMR for collections and granules. Please take a look on your own when you are ready to start using the earthdata library. You are invited to contribute!

Data provider ID cheat sheet.

<img src='./img/data_provider_cheat_sheet.png'/>

### Related links

**CMR** API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

**EDL** API documentation: https://urs.earthdata.nasa.gov/

NASA OpenScapes: https://nasa-openscapes.github.io/earthdata-cloud-cookbook/

NSIDC: https://nsidc.org