Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First crack at changing the url for one function #152

Merged
merged 20 commits into from
Sep 17, 2024

Conversation

ehinman
Copy link
Collaborator

@ehinman ehinman commented Aug 16, 2024

Created a new wqx3_url that is used if legacy=False, which is the default. I also added dataProfile as a kwarg (for this function at least, may not be needed for others). Thoughts on this approach? R dataRetrieval has a service argument where the user specifies if they want the old or the new profiles by name: https://github.com/DOI-USGS/dataRetrieval/blob/main/R/readWQPdata.R, but the functions in python do not let the user pick the service, and are instead specific to them (e.g. Results, Activity, Station, etc.).

Also removed zip since it's not a part of the new WQX3 url, but perhaps should leave it in for the legacy profiles and ensure it is switched to 'no'.

Copy link
Collaborator

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that seems reasonable


else:
url = wqx3_url('Result')
if 'dataProfile' not in kwargs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check whether dataProfile is among a list of valid profiles and throw an error if not. Examples in nwis.

dataretrieval/wqp.py Show resolved Hide resolved
@ehinman
Copy link
Collaborator Author

ehinman commented Aug 20, 2024

Well ok, this might be worse before it gets better. I don't really want to introduce a new function input just for dealing with legacy services. I think as long as the dataProfiles remain distinct between legacy and WQX3, users can specify which URL they want to grab from (and technically they should be adapting their workflows to the new format). However, what_sites is tricky because in this iteration, I've basically removed users' ability to get the legacy "Station" service. What's more, some of the other functions only access the legacy services at this point in time.

I would like to create a nice table in the documentation to show how users can grab different legacy and WQX3 profiles.

I'll also look into a column name crosswalk...I think Laura might have one.

Thoughts on all this? Eventually, we'd hopefully be able to simply remove all of the legacy kwarg options, but keep the functions essentially the same.

@thodson-usgs
Copy link
Collaborator

thodson-usgs commented Aug 21, 2024

I'll also look into a column name crosswalk...I think Laura might have one.

Might be unnecessary? wqp returns the raw column names and I don't think you want to crosswalk back to legacy, right?

I might try something like:

  1. create a new function for calling legacy, which issues a deprecation warning.
  2. within the standard function, set the default as legacy=True, that determines whether or not to call the legacy or the current profiles.
  3. eventually, add some functions to validate whether the user is passing valid params for a given profile.
  4. once the migration of all services is complete, set legacy=False, thereby changing the default behavior, at which point we would issue a major release.

@ehinman
Copy link
Collaborator Author

ehinman commented Aug 21, 2024

I like the idea of having wqp_url or a wrapper for it producing a warning. I'll add that in.

I'm trying to avoid adding a new input legacy to all of these functions (since someday it'll be useless), but rather leveraging the kwarg dataProfile, which can change with time but will hopefully remain a steady input. As long as a dataProfile name within a service is unique to legacy/wqx3, I think this will work....but see this table I made:

dr-py-ideas.xlsx

Regarding the column names, I'm not sure I follow. I'm thinking about users who have to change their coding workflow and want to see the new column names associated with WQX3, and how they map onto the legacy names. Showing them how to leverage this csv might be sufficient: https://www.epa.gov/system/files/other-files/2024-07/schema_outbound_wqx3.0.csv

@ehinman
Copy link
Collaborator Author

ehinman commented Aug 21, 2024

OK, after talking to @lstanish-usgs, I have added a new legacy input to the wqp functions. For a majority of functions, the only profile available is legacy right now. I need to add some new tests, but figured I'd get some reviews on the structure, first. After this PR is merged, we can then focus on a new function for the Samples Data API.

@ehinman ehinman marked this pull request as ready for review August 22, 2024 23:16
@ehinman ehinman self-assigned this Aug 27, 2024
@ehinman ehinman requested a review from jzemmels August 27, 2024 17:22
@ehinman
Copy link
Collaborator Author

ehinman commented Aug 30, 2024

@jzemmels Thanks for reviewing! A few things to check in your review:

  • Is the documentation clear on how the updated functions work? What else would be helpful to add?
  • Please try running these functions, pulling both legacy and wqx3 datasets using kwargs you can find in the web services guide. Do you run into any errors? Please note the error message.
  • Are the warning messages clear/sufficient on "stale" data pulls?

Let me know if you have additional questions. I would say bullet #2 is most important to review most thoroughly.

@thodson-usgs
Copy link
Collaborator

OK, after talking to @lstanish-usgs, I have added a new legacy input to the wqp functions.

Sounds good. Recommend setting legacy=True for now with deprecation warning. Next major release we will set to False or else create new functions.

@jzemmels
Copy link
Collaborator

jzemmels commented Aug 30, 2024

  • Is the documentation clear on how the updated functions work? What else would be helpful to add?

I'm satisfied with the documentation updates to indicate the new legacy parameter. When a user incorrectly specifies an argument like dataProfile that must be from a small list of options, it'd be great to print the list the user must choose from in the error message. For example, "dataProfile is not a legacy profile name. Select from ['resultPhysChem', 'biological', 'narrowResult']

Also, I'm not sure if I missed this, but a short definition of each dataProfile would be helpful in the package's documentation. The definitions are a bit tricky to find in the waterqualitydata.us documentation.

@jzemmels
Copy link
Collaborator

jzemmels commented Aug 30, 2024

  • Are the warning messages clear/sufficient on "stale" data pulls?

I think the message could start from the second sentence. The new legacy parameter implicitly communicates that some change occurred, so I don't think mentioning a change in the warning message is necessary.

Perhaps the warning message could also provide an alternative for users who want new data. E.g., "To fetch new data, set legacy = False and dataProfile to one of the following: ['fullPhysChem', 'narrow', 'basicPhysChem']"

Other than those two things, I think the warning message is good.

(Edit) Oh, one other thing: is there some way to automatically wrap the message so that it doesn't appear on one long line? Maybe this is just a VSCode thing...

@jzemmels
Copy link
Collaborator

  • Please try running these functions, pulling both legacy and wqx3 datasets using kwargs you can find in the web services guide. Do you run into any errors? Please note the error message.

I've checked a couple requests against what the WQP website returns and they work. I'm running into an issue where some requests that run quickly on the WQP website take a really long time to run in Python. Not sure what's causing it. I'm not done checking different combinations of kwargs, but I can check more once I return on Sept. 9.

@thodson-usgs
Copy link
Collaborator

thodson-usgs commented Sep 3, 2024

@jzemmels, near the top of the page, you should see a big green button labeled "Add your review". I recommend you use that process to create your review and suggest changes, rather than adding comments in the issue.

Also,

a short definition of each dataProfile would be helpful in the package's documentation. The definitions are a bit tricky to find in the waterqualitydata.us documentation.

Keep the function documentation concise. The doc should note that a full description of the profiles can be found at https://www.waterqualitydata.us/beta/portal_userguide/#data-profiles. The doc should also provide a list of valid profiles, either by writing them out or by referring the user to a convenience function.

...but let's continue this discussion using the normal review process.

@ehinman
Copy link
Collaborator Author

ehinman commented Sep 4, 2024

  • Are the warning messages clear/sufficient on "stale" data pulls?

I think the message could start from the second sentence. The new legacy parameter implicitly communicates that some change occurred, so I don't think mentioning a change in the warning message is necessary.

Perhaps the warning message could also provide an alternative for users who want new data. E.g., "To fetch new data, set legacy = False and dataProfile to one of the following: ['fullPhysChem', 'narrow', 'basicPhysChem']"

Other than those two things, I think the warning message is good.

(Edit) Oh, one other thing: is there some way to automatically wrap the message so that it doesn't appear on one long line? Maybe this is just a VSCode thing...

Thanks for these suggestions. I think I'm going to keep the warning as-is for the legacy services (because they're not specific to the Result service), but I added another sentence about setting legacy=False to get the latest profiles, if available.

Copy link
Collaborator

@jzemmels jzemmels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now tested out each function with multiple argument combinations and am pleased that nothing threw unexpected errors. The comments I've left are general usability/documentationcomments that can either be addressed here or filed under future considerations.

if legacy is True:
if 'dataProfile' in kwargs:
if kwargs['dataProfile'] not in result_profiles_legacy:
raise TypeError(f"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition!

zip: string
Parameter to stream compressed data, if 'yes', or uncompressed data
if 'no'. Default is 'no'.
be 'geojson'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeoJSON is not yet supported, so this documentation should reflect that mimeType will automatically be changed to csv.

On a related note, is there a reason why the other possible WQP file outputs (tsv, xlsx) aren't supported here? I tested out passing mimeType="tsv" and "xlsx" to the query function and it returned the expected output. Seems like it would be a simple addition to expand the functionality.

Copy link
Collaborator Author

@ehinman ehinman Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Legacy supports tsv and xlsx, but it looks like beta does not yet support anything other than csv. I might add this as an issue to follow up on, but leave as-is for this PR. #162

Please choose from "fullPhysChem", "narrow", or "basicPhysChem"
""")
else:
kwargs['dataProfile'] = 'fullPhysChem'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genuine question: is this a reasonable default for dataProfile -- would the "typical" user want this to be set for them or should the function throw an error if the user doesn't set the dataProfile argument?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow whatever @ldecicco-USGS did

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R dataRetrieval uses the fullPhysChem profile. That is generally based on what the EPA TADA package uses and what provides the most comprehensive amount of USGS metadata.

if legacy is True:
url = wqp_url('Organization')
else:
print('No WQX3.0 profile currently available, returning legacy profile.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should throw an error if the user requests WQX3.0 data that aren't available rather than defaulting to the legacy profile (same point applies to the other endpoints that aren't available in WQX3.0).

At the very least, these warning messages could be more specific to the endpoint - for example, "The Organization data profile is not currently available in WQX3.0."

Describes the type of columns to return with the result dataset.
Most recent WQX3 profiles include 'fullPhysChem', 'narrow', and
'basicPhysChem'. Legacy profiles include 'resultPhysChem',
'biological', and 'narrowResult'.
siteid: string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these parameters be added to the documentation of the other WQP functions? Seems like it would be annoying to look at the get_results documentation to know how to use the what_* functions. Plus, the claim in the other functions that they "accept the same parameters" as get_results is not true; for example, none of them need the dataProfile argument. Copying + pasting the relevant parameters to the other functions' doc strings is easy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye. I'm not sure the convention with kwargs documentation, but I can look around (see bottom of this Sphinx doc page: https://pythonhosted.org/an_example_pypi_project/sphinx.html#full-code-exampl). @thodson-usgs, what do you think? I appreciate linking to the webservices guide, which means we don't need to be checking our documentation against the WQP's, but for the uninitiated, wading through the webservices guide (which covers building urls but isn't exactly specific to using a python function) might be a little confusing. Again, I think I'm going to add this as an issue and let this PR only mess with the wqx3 machinery stuff.

'ResultDetectionQuantitationLimit',
'BiologicalMetric']

def get_results(ssl_check=True, legacy=True, **kwargs):
"""Query the WQP for results.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the documentation here and in the other what_* functions could be clearer. For example, what are the "results" that one would expect to obtain? Or what does it mean to search for sites/projects/organizations/activities/etc. "within a region with specific data?" For what purpose would someone use these functions? A simple description of these WQP endpoints would be helpful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the definitions of the functions are a bit murky. We can use dataRetrieval as a good example: https://doi-usgs.github.io/dataRetrieval/reference/wqpSpecials.html, though also quite light on specifics.

Copy link
Collaborator

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When attempting to use the new profile, I got a looooong delay then an error message. Let's table the other discussions until the basic functionality is working.

@jzemmels
Copy link
Collaborator

When attempting to use the new profile, I got a looooong delay then an error message. Let's table the other discussions until the basic functionality is working.

Interesting, the WQX 3.0 profiles seemed to be working for me earlier but aren't now. I just tried to perform the same query on the WQP beta webpage and also ran into issues. It might be a timeout problem?

@thodson-usgs
Copy link
Collaborator

Yeah, seems to be an issue on the service side. Let's add another warning for "legacy=False"; otherwise, I'm fine with this as a first crack PR.

@lstanish-usgs
Copy link
Collaborator

@jzemmels can you share the tests you ran that gave you errors? I want to share them with the services development teams.

@jzemmels
Copy link
Collaborator

@jzemmels can you share the tests you ran that gave you errors? I want to share them with the services development teams.

Sure, this call threw an error.
wqp.get_results(countrycode="US", statecode="US:21", mimeType="csv", providers=["STORET","NWIS"], dataProfile="fullPhysChem", startDateLo="01-01-2024", startDateHi="09-01-2024", legacy=False)

And this call returned an empty data frame after about a minute.
wqp.get_results(countrycode="US", statecode="US:21", mimeType="csv", providers=["STORET","NWIS"], dataProfile="narrow", startDateLo="01-01-2024", startDateHi="09-01-2024", legacy=False)

The equivalent URL couldn't complete the download, so I assume its somehow related to the empty data frame.
https://www.waterqualitydata.us/wqx3/Result/search?countrycode=US&statecode=US%3A21&mimeType=csv&providers=STORET%3BNWIS&dataProfile=narrow&startDateLo=01-01-2024&startDateHi=09-01-2024

Copy link
Collaborator

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve, though I added some suggestions in a PR to your fork, which you might want to merge first.
Nice work.

@thodson-usgs thodson-usgs merged commit 9f09525 into DOI-USGS:master Sep 17, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants