Client: graceful way to pause/continue and deal with runtime errors? #1900
Comments
Hi @mkhorton, glad you're finding it useful -- and all good questions! To lump the first two questions together, callbacks are the only way of really doing this right now; my most robust workflow uses callbacks to rewrite the `next` link based on what has already been stored. Just found the code for the restartable Mongo callback, and it's not pretty (note that in this case there is no filter; we could potentially consider baked-in features for making per-filter databases/output files):

```python
if __name__ == "__main__":
    import pymongo as pm
    from optimade.client import OptimadeClient
    from httpx import URL
    import urllib.parse

    client = pm.MongoClient("mongodb://localhost:27017/optimade_example")
    collection = client.optimade_example.structures
    collection.create_index("immutable_id", unique=True)
    collection.create_index("prefix")

    def insert_into_mongo(url, results):
        """Insert each page of results into a MongoDB collection and, when
        duplicates are detected, rewrite the `next` link so the query resumes
        from the number of documents already stored for this provider."""
        prefix = results["meta"].get("provider", {}).get("prefix", None)
        url = URL(url)
        next_url = None
        duplicates = False
        # start = time.monotonic_ns()
        for entry in results["data"]:
            formula = entry.pop("attributes")["chemical_formula_reduced"]
            entry["chemical_formula_reduced"] = formula
            entry["prefix"] = prefix
            entry["immutable_id"] = f"{url.scheme}://{url.host}{url.path}/{entry['id']}"
            try:
                collection.insert_one(entry)
            except pm.errors.DuplicateKeyError:
                duplicates = True

        if duplicates:
            number_of_results_for_prefix = collection.count_documents(
                {"prefix": prefix}
            )
            suggested_page_offset = number_of_results_for_prefix - 1
            _next_url = results.get("links", {}).get("next")
            if isinstance(_next_url, dict):
                _next_url = _next_url.get("href")
            # If we have already reset the page offset once, don't do it again
            page_offset = urllib.parse.parse_qs(
                urllib.parse.urlparse(_next_url).query
            ).get("page_offset", [None])[0]
            if page_offset is None:
                return
            page_offset = int(page_offset)
            if _next_url and page_offset < 110:
                # Change the page offset to the suggested value using httpx's URL helpers
                next_url = str(
                    URL(_next_url).copy_set_param("page_offset", suggested_page_offset)
                )

        if next_url:
            print(
                f"Overwriting next_url to {next_url}, existing results {suggested_page_offset + 1}"
            )
            return {"next": next_url, "advance_results": number_of_results_for_prefix}

        # elapsed = (time.monotonic_ns() - start) / 1e9
        # print(f"Callback ran in {elapsed:.2f} s")
        return None

    download_structures = False

    client = OptimadeClient(
        max_results_per_provider=-1,
        # include_providers=["mpds", "omdb", "aflow"],
        callbacks=[insert_into_mongo],
    )

    all_formulae = client.get(response_fields=["chemical_formula_reduced"])
```

For the final 2 questions:
Thanks @ml-evs, regarding:

If you do have code on hand, that'd be super helpful! Otherwise I can muddle through.

I think it's a mix. Some providers just need some extra time/backoff period, while other providers have genuine issues. I think a good test might be to use some filter that returns a large number of documents, and then arbitrarily request a page with a very high page number (see the sketch below). If it works, great; if it doesn't, that probably suggests some underlying server issue.

Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that.
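A minimal sketch of the deep-page probe suggested above, assuming a hypothetical provider base URL; the filter, page limit, and offset are placeholders, and the endpoint/query parameters follow the OPTIMADE spec:

```python
# Hypothetical deep-page probe: ask a provider for a page far into a large
# result set and see whether it responds cleanly or errors out / times out.
import httpx

BASE_URL = "https://example.org/optimade/v1"  # placeholder provider URL

params = {
    "filter": 'elements HAS "Si"',  # a filter expected to match many entries
    "page_limit": 10,
    "page_offset": 10_000,          # arbitrarily deep page
}

response = httpx.get(f"{BASE_URL}/structures", params=params, timeout=30.0)
print(response.status_code)
if response.status_code == 200:
    print(len(response.json().get("data", [])), "entries returned")
```

A 200 with data back suggests the provider can serve deep pages; a 5xx response or a timeout points to the kind of underlying server issue mentioned above.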
Sorry, I hid it somewhat in my comment above -- see the snippet after the sentence "Just found the code...".
Large queries are still a bit of an issue; until recently we still had a whole COLSCAN going on in the reference implementation, as we needed to get the number of returned entries, but we have now made this optional (and obey a MongoDB internal timeout). Mostly we have been getting away with this by just running sufficiently small databases with enough memory to make this access fast, as I really don't want to have to mess around with cursor pagination and such in MongoDB (the OPTIMADE spec is designed such that this should be possible, though). I know that the Elasticsearch-based implementations also struggle with more than 10,000 results by default unless you implement the Scroll API, which I do not have the bandwidth or expertise to do in optimade-python-tools (see #1291). We can definitely try to be more robust to this, though.
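For illustration, a hedged sketch (not the reference implementation itself) of the counting trade-off described above, using pymongo's `count_documents` with a `maxTimeMS` bound; the collection and filter are placeholders:

```python
# Exact counts can force a full collection scan (COLSCAN) for unindexed
# filters; bounding the count with maxTimeMS keeps the response fast at the
# cost of sometimes not knowing the total number of matches.
import pymongo

collection = (
    pymongo.MongoClient("mongodb://localhost:27017/optimade_example")
    .optimade_example.structures
)
mongo_filter = {"nelements": 3}  # placeholder translated OPTIMADE filter

try:
    # Give up on the count after 1 second instead of blocking the response.
    total = collection.count_documents(mongo_filter, maxTimeMS=1000)
except pymongo.errors.ExecutionTimeout:
    total = None  # total unknown; the response can simply omit it
```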
I'll remind myself of our current approach and consider adding this; it should be straightforward.
Hi there! Thanks for the great work on the OPTIMADE client; it's really pleasant to use and looks very well designed. I especially appreciate it having written the (much poorer) interface in pymatgen, which perhaps one day we can deprecate :)
While using the interface, I frequently encounter timeout errors with some databases (example tracebacks omitted here).
This raises a few questions, and I'm not sure of the best resolution:

- When using `--output-file`, the output file will be empty when an error is encountered, even if many structures were successfully retrieved. I understand using a callback is probably the best option here.
- Could this be raised as a `TimeoutError`, and not a `RuntimeError`?
- Could the client use the `retry` library or similar, and have an automatically increasing sleep time after each retry?

Apologies if there is an existing issue for this. I did have a look but couldn't find one. If I'm mis-using the library and this is already supported somehow, I'd be glad to know! Thanks again!