feat: add endpoint for querying multiple metadata #311
Conversation
Thanks a lot for the fast implementation @DenizYil!

**General notes**

*New methods in driver*

I am not a fan of introducing additional driver code for this feature. The way I envisioned it was something like this in the handler:

```python
out = []
with driver.connect():
    for keys in requested_datasets:
        metadata = driver.get_metadata(keys)
        metadata = {k: metadata[k] for k in requested_columns}
        out.append(metadata)
```

Sure, it's not getting optimal performance out of the SQL queries, but these are rarely a bottleneck anyway. On the flip side, we don't need to introduce hundreds of additional lines of code that will come back to haunt us if we ever implement a non-SQL metadata driver. I could be convinced otherwise with hard evidence (benchmarks that show this can be a real-world bottleneck), but in the absence of that I would strongly prefer the simpler option.

*API design*

We should keep consistency between the GET and POST endpoints wherever we can. This keeps the symmetry between both endpoints.

**Specific questions**

- Sharing a file is fine.
- Since the GET endpoint is called
- I don't think the added complexity is worth it for this feature, both in terms of actual code and a more complex API. IMO, doing
- If the user explicitly requests data then they should get it. Maybe emit a warning to the logger when using the POST request with lazy loading. |
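The handler loop proposed in this comment can be exercised end-to-end against a stubbed driver. The sketch below is illustrative only: `FakeDriver` is a hypothetical stand-in providing just the two methods the comment relies on (`connect`, `get_metadata`), not Terracotta's actual driver interface.

```python
from contextlib import contextmanager


# Hypothetical in-memory stand-in for a metadata driver; only the two
# methods referenced in the comment (connect, get_metadata) are stubbed.
class FakeDriver:
    def __init__(self, db):
        self.db = db

    @contextmanager
    def connect(self):
        # A real driver would open and close a DB connection here.
        yield

    def get_metadata(self, keys):
        return self.db[tuple(keys)]


driver = FakeDriver({
    ("01012012", "IMD"): {"bounds": [0, 0, 1, 1], "range": [0, 255], "metadata": {}},
    ("01012012", "TCD"): {"bounds": [0, 0, 1, 1], "range": [0, 100], "metadata": {}},
})

requested_datasets = [["01012012", "IMD"], ["01012012", "TCD"]]
requested_columns = ["bounds", "range"]

# The loop exactly as sketched in the comment above: one query per
# dataset, then a dict comprehension to keep only the requested columns.
out = []
with driver.connect():
    for keys in requested_datasets:
        metadata = driver.get_metadata(keys)
        metadata = {k: metadata[k] for k in requested_columns}
        out.append(metadata)
```

One query per dataset trades raw SQL throughput for driver-interface simplicity, which is the point of the comment.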
Thanks a lot for the fast feedback @dionhaefner - it's truly appreciated. I believe I have now addressed all of your comments. Some notes:
Thanks a lot again, and I'm looking forward to your feedback! 😄 |
That seems very slow to me. Can you share the SQLite database? (No need to include the rasters.) |
Agreed. Apologies though, perhaps I should have clarified. I'm using a PostgreSQL database which is hosted on Azure - so the DB is not running locally. I've hosted a Terracotta instance using Azure Functions, which only contains 36 datasets. This is the URL: https://100ktrees-tc.azurewebsites.net/datasets |
Before we get into code review, there should be a feature to limit the amount of data that can be requested (as a protection against [accidental] DoS). At first I thought we should use pagination, but then again that's probably not needed here because the number of datasets is known a priori. How about we introduce a runtime setting for this, like |
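A limit like the one requested here could be enforced with a simple guard in the handler before any queries are issued. This is a minimal sketch under one assumption: the setting name `MAX_POST_METADATA_KEYS` is hypothetical, since the actual name is not settled in this thread.

```python
# Hypothetical runtime setting; the real setting name was left open above.
MAX_POST_METADATA_KEYS = 100


def validate_request(requested_datasets):
    """Reject requests asking for more datasets than the configured limit."""
    if len(requested_datasets) > MAX_POST_METADATA_KEYS:
        raise ValueError(
            f"Cannot query more than {MAX_POST_METADATA_KEYS} datasets "
            f"in one request (got {len(requested_datasets)})"
        )
```

Because the client supplies the full key list up front, a hard cap like this is simpler than pagination and fails fast before touching the database.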
Can you say more about how you tested the performance of the new endpoint, then? |
I simply ran this script, and the results are consistent:

```python
import requests
import time

start = time.time()
data = requests.post(
    "http://localhost:5000/metadata?columns=[metadata]",
    json={
        "keys": [
            ["01012012", "HRL", "annually", "IMD"],
            ["01012012", "HRL", "annually", "TCD"],
            ["01012015", "HRL", "annually", "GRA"],
            ["01012015", "HRL", "annually", "IMD"],
            ["01012015", "HRL", "annually", "TCD"],
            ["01012015", "HRL", "annually", "WAW"],
            ["01012018", "HRL", "annually", "GRA"],
            ["01012018", "HRL", "annually", "IMD"],
            ["01012018", "HRL", "annually", "TCD"],
            ["01012018", "HRL", "annually", "WAW"],
            ["202206241029443", "NED", "1", "B01"],
            ["202206241029443", "NED", "1", "B02"],
            ["202206241029443", "NED", "1", "B03"],
            ["202206241029443", "NED", "1", "B04"],
            ["202206241029443", "RGB", "1", "B01"],
            ["202206241029443", "RGB", "1", "B02"],
            ["202206241029443", "RGB", "1", "B03"],
            ["202206241029443", "RGB", "1", "B04"],
            ["202206241029444", "1", "F", "P"],
            ["202206241029444", "NED", "1", "B01"],
            ["202206241029444", "NED", "1", "B02"],
            ["202206241029444", "NED", "1", "B03"],
            ["202206241029444", "NED", "1", "B04"],
            ["202206241029444", "RGB", "1", "B01"],
            ["202206241029444", "RGB", "1", "B02"],
            ["202206241029444", "RGB", "1", "B03"],
            ["202206241029444", "RGB", "1", "B04"],
            ["202304081018382", "1", "F", "P"],
            ["202304081018382", "NED", "1", "B01"],
            ["202304081018382", "NED", "1", "B02"],
            ["202304081018382", "NED", "1", "B03"],
            ["202304081018382", "NED", "1", "B04"],
            ["202304081018382", "RGB", "1", "B01"],
            ["202304081018382", "RGB", "1", "B02"],
            ["202304081018382", "RGB", "1", "B03"],
            ["202304081018382", "RGB", "1", "B04"],
        ]
    },
)
end = time.time()
print(end - start)
```
Sure! I can add that to the PR. If we're going with one SQL query per dataset, then 100 seems reasonable, though it does seem a bit on the low end. When this has been used in legitimate cases (like querying the |
So you are running a local Terracotta server that connects to the remote PostgreSQL database? |
Yes. I'll do some testing tomorrow with an Azure Function that connects to an Azure PostgreSQL database to see performance in a real production environment. I'll also do some testing with a locally running database (and server). I'll keep you updated :) |
Codecov Report

```diff
@@           Coverage Diff            @@
##             main     #311    +/-   ##
========================================
+ Coverage   98.07%   98.11%   +0.04%
========================================
  Files          52       52
  Lines        2280     2335     +55
  Branches      320      327      +7
========================================
+ Hits         2236     2291     +55
  Misses         29       29
  Partials       15       15
```
Starting to look good! Let's get that runtime config implemented and test coverage for the |
Some last nits and we're good to merge. Thanks again @DenizYil
Should be good now @dionhaefner! Thank you so much for your thorough feedback and your patience :-) I know I made some silly mistakes, but I've learned a good amount from this, so thanks a lot 😄 This will be a very nice feature to have 🥳 |
Looking great, thanks!
I didn't see any silly mistakes here. I think this PR was a display of healthy OSS dynamics at work. Thanks for contributing.
This PR closes #309

This PR provides the ability for users to send a `POST` request to the `/metadata` endpoint, where they can filter the columns as path parameters, e.g. `POST /metadata/bounds`, which will only return the bounds (and the keys used, which is default on all).

Notes:

- Should this be a `GET` request?
- Naming: `post_metadata` or `get_multiple_metadata` or `get_metadatas` or something else? For now, I named the endpoint `post_metadata` and every other method `get_multiple_metadata`.
- Should users be able to specify `bounds_south` as a column, or should we only allow them to specify `bounds`? (Same with `range` & `min`/`max`.) For now, I have added this functionality.
- The `_decode_specific_data` function.

Please let me know if you have any feedback, as that is very appreciated. I'm excited to see this feature be used.
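The column filtering the PR describes (return only the requested columns, but always include the dataset keys) can be sketched as below. This is a hypothetical helper for illustration: the name `filter_metadata` and the flat record layout are assumptions, not the PR's actual implementation.

```python
def filter_metadata(metadata, keys, columns):
    """Keep only the requested top-level metadata columns.

    The dataset's keys are always included, mirroring the PR's note
    that keys are returned by default on all responses.
    """
    filtered = {"keys": keys}
    for col in columns:
        filtered[col] = metadata[col]
    return filtered


# Example: a request like POST /metadata/bounds keeps only "bounds"
# (plus the keys); "range" and "metadata" are dropped from the response.
record = {"bounds": [0.0, 0.0, 1.0, 1.0], "range": [0, 255], "metadata": {}}
result = filter_metadata(record, ["01012012", "HRL", "annually", "IMD"], ["bounds"])
```

Filtering at the top level like this sidesteps the open question about sub-fields such as `bounds_south`: the path parameters would only ever name whole columns.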