Rename, fix, and extend NAWQA (NWQN) demo #153

thodson-usgs · 2024-08-20T21:31:52Z

The National Water Quality Network (NWQN) demo uses AWS serverless to search and pull all NWQN data into an S3 bucket. This PR makes some fixes to the demo and incorporates streamflow.

For context, this is an advanced usage example, which does not currently appear on the doc page. Nevertheless, I host it in the repo both to instruct others and to help us scope development of dataretrieval more generally. These pipelines stress several endpoints and help us expose failure modes that appear when we scale up our workflows.

demos/nwqn_data_pull/lithops.yaml

ehinman

Cool stuff! I was able to run the scripts by setting lithops to the LocalhostExecutor, and I used the small "testing" site list. I successfully downloaded several parquet files, though to be honest I haven't gotten to re-opening them and understanding how they're structured.

Overall, it would be nice to see more documentation of the code and the different functions. Though I was able to (mostly) figure out what the code does, this is my first exposure to some of these functions, and I could've gotten to the point faster if there were more narrative on what was going on (for example, I've never seen the exponential_backoff method to improve API call handling, and I'd like to know more about the mapping functionality in lithops...is it any more complicated than an "apply" function?).

Do you plan to "fill in" water quality values in some way, similar to streamflow?

Overall, very cool example of using dataretrieval-python with larger data calls.

ehinman · 2024-08-22T13:20:13Z

demos/nwqn_data_pull/README.md


-This examples walks through using lithops to retrieve data from every NAWQA
+This examples walks through using lithops to retrieve data from every NWQN


Suggested change

This examples walks through using lithops to retrieve data from every NWQN

This example walks through using lithops to retrieve data from every NWQN

ehinman · 2024-08-22T13:20:54Z

demos/nwqn_data_pull/README.md


-This examples walks through using lithops to retrieve data from every NAWQA
+This examples walks through using lithops to retrieve data from every NWQN
 monitoring site, then writes the results to a parquet files on s3. Each


Suggested change

monitoring site, then writes the results to a parquet files on s3. Each

monitoring site, then writes the results to a parquet file on s3. Each

ehinman · 2024-08-22T13:29:14Z

demos/nwqn_data_pull/README.md

-python retrieve_nawqa_with_lithops.py
+python retrieve_nwqn_samples.py
+
+python retrieve_nwqn_streamflow.py
 ```

 ## Cleaning up


Small typo: lithops

ehinman · 2024-08-27T15:29:44Z

demos/nwqn_data_pull/README.md

@@ -32,9 +34,11 @@ wget https://www.sciencebase.gov/catalog/file/get/655d2063d34ee4b6e05cc9e6?f=__d
 export DESTINATION_BUCKET=<path/to/bucket>
 ```

-1. Run the script


I can't seem to comment on unchanged lines, but this refers to line 27: I didn't know I needed to download wget (either in bash or pip install via python) before downloading the sciencebase data using that method. Add a note about it, perhaps.

Hmm. That will be system-dependent, but I noted that alternatively you can navigate to the url to download the file.

ehinman · 2024-08-27T16:14:41Z

demos/nwqn_data_pull/retrieve_nwqn_samples.py

+                    attempts += 1
+                    if attempts > max_retries:
+                        raise e
+                    wait_time = base_delay * (2 ** attempts)


I think I follow to this point: are you making it so that with every failed attempt, the wait time increases exponentially between attempts (until max_retries is satisfied)? Might be helpful to add a comment here.
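For reference, here is a minimal sketch of the retry pattern the snippet above appears to implement, with the kind of comments that would help. It is not the code from `retrieve_nwqn_samples.py`: only `attempts`, `max_retries`, `base_delay`, and `wait_time` come from the diff; the decorator shape, the default values, and the injectable `sleep` argument are assumptions made for illustration.

```python
import functools
import time


def exponential_backoff(max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Hypothetical decorator: retry a call, doubling the wait after each failure.

    `sleep` is injectable (an assumption, not in the original) so the
    behavior can be checked without actually waiting.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    attempts += 1
                    # Give up once the retry budget is exhausted.
                    if attempts > max_retries:
                        raise e
                    # Wait grows as base_delay * 2, 4, 8, ... so repeated
                    # failures put progressively less load on the API.
                    wait_time = base_delay * (2 ** attempts)
                    sleep(wait_time)

        return wrapper

    return decorator
```

With `base_delay=1.0`, the first retry waits 2 seconds, the second 4, the third 8, and so on; after `max_retries` failed retries the last exception is re-raised to the caller.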

ehinman · 2024-08-27T21:49:57Z