Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename, fix, and extend NAWQA (NWQN) demo #153

Merged
merged 7 commits into from
Aug 30, 2024

Conversation

thodson-usgs
Copy link
Collaborator

The National Watern Quality Network (NWQN) demo uses AWS serverless to search and pull all NWQN data into an S3 bucket. This PR makes some fixes to the demo and incorporates streamflow.

For context, this is an advanced usage example, which does not currently appear in the doc page. Nevertheless, I host it in the repo for instructing others, but also for helping us to scope development of dataretrieval more generally. These pipelines stress several endpoints and help us expose failure modes that appear when we scale up our workflows.

@thodson-usgs thodson-usgs changed the title Rename, fix, and extend NAQWA (NWQN) demo Rename, fix, and extend NAWQA (NWQN) demo Aug 20, 2024
Copy link
Collaborator

@ehinman ehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool stuff! I was able to run the scripts by setting lithops to the LocalhostExecutor, and I used the small "testing" site list. I successfully downloaded several parquet files, though to be honest I haven't gotten to re-opening them and understanding how they're structured.

It would be nice to overall see more documentation of the code lines and different functions. Though I was able to (mostly) figure out what the code does, this is my first exposure to some of these functions, and I could've gotten to the point faster if there was more narrative on what was going on (for example, I've never seen the exponential_backoff method to improve API call handling, and I'd like to know more about the mapping functionality in lithops...is it any more complicated than an "apply" function?).

Do you plan to "fill in" water quality values in some way, similar to streamflow?

Overall, very cool example of using dataretrieval-python with larger data calls.


This examples walks through using lithops to retrieve data from every NAWQA
This examples walks through using lithops to retrieve data from every NWQN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This examples walks through using lithops to retrieve data from every NWQN
This example walks through using lithops to retrieve data from every NWQN


This examples walks through using lithops to retrieve data from every NAWQA
This examples walks through using lithops to retrieve data from every NWQN
monitoring site, then writes the results to a parquet files on s3. Each
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
monitoring site, then writes the results to a parquet files on s3. Each
monitoring site, then writes the results to a parquet file on s3. Each

python retrieve_nawqa_with_lithops.py
python retrieve_nwqn_samples.py

python retrieve_nwqn_streamflow.py
```

## Cleaning up
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo: lithops

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☝️

@@ -32,9 +34,11 @@ wget https://www.sciencebase.gov/catalog/file/get/655d2063d34ee4b6e05cc9e6?f=__d
export DESTINATION_BUCKET=<path/to/bucket>
```

1. Run the script
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to comment on unchanged lines, but this refers to line 27: I didn't know I needed to download wget (either in bash or pip install via python) before downloading the sciencebase data using that method. Add a note about it, perhaps.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. That will be system-dependent, but I noted that alternatively you can navigate to the url to download the file.

attempts += 1
if attempts > max_retries:
raise e
wait_time = base_delay * (2 ** attempts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I follow to this point: are you making it so that with every failed attempt, the wait time increases exponentially between attempts (until max_retries is satisified)? Might be helpful to add a comment here.

# compute fraction of drainage area
site_info = site_info[["drain_area_va"]].copy()
site_info["drain_fraction"] = site_info["drain_area_va"] / main_site_drainage_area
site_info["fraction_diff"] = np.abs(1 - site_info["drain_fraction"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any cutoff(s) for drain fraction?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that imposed by the neighborhood search.


output = pd.DataFrame()

# loop through sites, updating the best data first
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"best data" meaning the site with the most similar drainage area?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, revised

output = update_dataframe(output, fill_data)

output = output.drop(columns=["drain_area_va", "drain_fraction", "fraction_diff"])
output["site_no"] = site
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should output include metadata indicating when data are from a different basin?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary for the demo.


output = output.drop(columns=["drain_area_va", "drain_fraction", "fraction_diff"])
output["site_no"] = site

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running this function line-by-line for site '01170095', which had 3 nearby sites: ['01170070', '01170095', '01170100', '01170103']. It appears that the only site with data was '01170070', so that site was wholly used to fill in for site '01170095'. Is this the desired outcome?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

# loop through sites, updating the best data first
for fill_site in fill_order:
fill_data = df.loc[fill_site]
output = update_dataframe(output, fill_data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snazzy!!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing a bug in update_dataframe(). Bots wrote most of this code, so it might take a few iterations to find the bugs.

@thodson-usgs
Copy link
Collaborator Author

Do you plan to "fill in" water quality values in some way, similar to streamflow?

That process is similar, except we don't rescale by drainage area. For the moment, I think the demo is a successful proof-of-concept for using the NLDI to aggregate data. In practice, we would want several more steps, but I have no plans to implement them at the moment.

Otherwise, this is more or less working as intended and I've addressed the review comments.

Copy link
Collaborator

@ehinman ehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@thodson-usgs thodson-usgs merged commit aa48ad6 into DOI-USGS:master Aug 30, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants