feat: add endpoint for querying multiple metadata #311
Conversation
Thanks a lot for the fast implementation @DenizYil!

**General notes**

*New methods in driver*

I am not a fan of introducing additional driver code for this feature. The way I envisioned it was something like this in the handler:

```python
out = []
with driver.connect():
    for keys in requested_datasets:
        metadata = driver.get_metadata(keys)
        metadata = {k: metadata[k] for k in requested_columns}
        out.append(metadata)
```

Sure, it's not getting optimal performance out of the SQL queries, but these are rarely a bottleneck anyway. On the flip side, we don't need to introduce hundreds of additional lines of code that will come back to haunt us if we ever implement a non-SQL metadata driver. I could be convinced otherwise with hard evidence (benchmarks that show this can be a real-world bottleneck), but in the absence of that I would strongly prefer the simpler option.

*API design*

We should keep consistency between the GET and POST endpoints wherever we can. This keeps the symmetry between both endpoints.

**Specific questions**

- Sharing a file is fine.
- Since the GET endpoint is called
- I don't think the added complexity is worth it for this feature, both in terms of actual code and a more complex API. IMO, doing
- If the user explicitly requests data then they should get it. Maybe emit a warning to the logger when using the POST request with lazy loading. |
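The handler loop proposed in this comment can be exercised end-to-end against a stubbed driver. The sketch below is illustrative only: `FakeDriver` is a hypothetical stand-in providing just the two methods the comment relies on (`connect`, `get_metadata`), not Terracotta's actual driver interface.

```python
from contextlib import contextmanager


# Hypothetical in-memory stand-in for a metadata driver; only the two
# methods referenced in the comment (connect, get_metadata) are stubbed.
class FakeDriver:
    def __init__(self, db):
        self.db = db

    @contextmanager
    def connect(self):
        # A real driver would open and close a DB connection here.
        yield

    def get_metadata(self, keys):
        return self.db[tuple(keys)]


driver = FakeDriver({
    ("01012012", "IMD"): {"bounds": [0, 0, 1, 1], "range": [0, 255], "metadata": {}},
    ("01012012", "TCD"): {"bounds": [0, 0, 1, 1], "range": [0, 100], "metadata": {}},
})

requested_datasets = [["01012012", "IMD"], ["01012012", "TCD"]]
requested_columns = ["bounds", "range"]

# The loop exactly as sketched in the comment above: one query per
# dataset, then a dict comprehension to keep only the requested columns.
out = []
with driver.connect():
    for keys in requested_datasets:
        metadata = driver.get_metadata(keys)
        metadata = {k: metadata[k] for k in requested_columns}
        out.append(metadata)
```

One query per dataset trades raw SQL throughput for driver-interface simplicity, which is the point of the comment.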
Thanks a lot for the fast feedback @dionhaefner - it's truly appreciated. I believe I have now addressed all of your comments. Some notes:
Thanks a lot again, and I'm looking forward to your feedback! 😄 |
That seems very slow to me. Can you share the SQLite database? (No need to include the rasters.) |
Agreed. Apologies though, perhaps I should have clarified. I'm using a PostgreSQL database which is hosted on Azure - so the DB is not running locally. I've hosted a Terracotta instance using Azure Functions, which only contains 36 datasets. This is the URL: https://100ktrees-tc.azurewebsites.net/datasets |
Before we get into code review, there should be a feature to limit the amount of data that can be requested (as a protection against [accidental] DoS). At first I thought we should use pagination, but then again that's probably not needed here because the number of datasets is known a priori. How about we introduce a runtime setting for this, like |
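A limit like the one requested here could be enforced with a simple guard in the handler before any queries are issued. This is a minimal sketch under one assumption: the setting name `MAX_POST_METADATA_KEYS` is hypothetical, since the actual name is not settled in this thread.

```python
# Hypothetical runtime setting; the real setting name was left open above.
MAX_POST_METADATA_KEYS = 100


def validate_request(requested_datasets):
    """Reject requests asking for more datasets than the configured limit."""
    if len(requested_datasets) > MAX_POST_METADATA_KEYS:
        raise ValueError(
            f"Cannot query more than {MAX_POST_METADATA_KEYS} datasets "
            f"in one request (got {len(requested_datasets)})"
        )
```

Because the client supplies the full key list up front, a hard cap like this is simpler than pagination and fails fast before touching the database.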
Can you say more about how you tested the performance of the new endpoint, then? |
I simply ran this script, and the results are consistent:

```python
import requests
import time

start = time.time()
data = requests.post(
    "http://localhost:5000/metadata?columns=[metadata]",
    json={
        "keys": [
            ["01012012", "HRL", "annually", "IMD"],
            ["01012012", "HRL", "annually", "TCD"],
            ["01012015", "HRL", "annually", "GRA"],
            ["01012015", "HRL", "annually", "IMD"],
            ["01012015", "HRL", "annually", "TCD"],
            ["01012015", "HRL", "annually", "WAW"],
            ["01012018", "HRL", "annually", "GRA"],
            ["01012018", "HRL", "annually", "IMD"],
            ["01012018", "HRL", "annually", "TCD"],
            ["01012018", "HRL", "annually", "WAW"],
            ["202206241029443", "NED", "1", "B01"],
            ["202206241029443", "NED", "1", "B02"],
            ["202206241029443", "NED", "1", "B03"],
            ["202206241029443", "NED", "1", "B04"],
            ["202206241029443", "RGB", "1", "B01"],
            ["202206241029443", "RGB", "1", "B02"],
            ["202206241029443", "RGB", "1", "B03"],
            ["202206241029443", "RGB", "1", "B04"],
            ["202206241029444", "1", "F", "P"],
            ["202206241029444", "NED", "1", "B01"],
            ["202206241029444", "NED", "1", "B02"],
            ["202206241029444", "NED", "1", "B03"],
            ["202206241029444", "NED", "1", "B04"],
            ["202206241029444", "RGB", "1", "B01"],
            ["202206241029444", "RGB", "1", "B02"],
            ["202206241029444", "RGB", "1", "B03"],
            ["202206241029444", "RGB", "1", "B04"],
            ["202304081018382", "1", "F", "P"],
            ["202304081018382", "NED", "1", "B01"],
            ["202304081018382", "NED", "1", "B02"],
            ["202304081018382", "NED", "1", "B03"],
            ["202304081018382", "NED", "1", "B04"],
            ["202304081018382", "RGB", "1", "B01"],
            ["202304081018382", "RGB", "1", "B02"],
            ["202304081018382", "RGB", "1", "B03"],
            ["202304081018382", "RGB", "1", "B04"],
        ]
    },
)
end = time.time()
print(end - start)
```
Sure! I can add that to the PR. If we're going with one SQL query per dataset, then 100 seems reasonable, though it does seem a bit on the low end. When this has been used in legitimate cases (like querying the |
So you are running a local Terracotta server that connects to the remote PostgreSQL database? |
Yes. I'll do some testing tomorrow with an Azure Function that connects to an Azure PostgreSQL database to see performance in a real production environment. I'll also do some testing with a locally running database (and server). I'll keep you updated :) |
Codecov Report

```diff
@@           Coverage Diff            @@
##             main     #311    +/-   ##
========================================
+ Coverage   98.07%   98.11%   +0.04%
========================================
  Files          52       52
  Lines        2280     2335     +55
  Branches      320      327      +7
========================================
+ Hits         2236     2291     +55
  Misses         29       29
  Partials       15       15
```
Starting to look good! Let's get that runtime config implemented and test coverage for the |
Some last nits and we're good to merge. Thanks again @DenizYil
Should be good now @dionhaefner! Thank you so much for your thorough feedback and your patience :-) I know I made some silly mistakes, but I've learned a good amount from this, so thanks a lot 😄 This will be a very nice feature to have 🥳 |
Looking great, thanks!
I didn't see any silly mistakes here. I think this PR was a display of healthy OSS dynamics at work. Thanks for contributing.
This PR closes #309

This PR provides the ability for users to send a `POST` request to the `/metadata` endpoint, where they can filter the columns as path parameters, e.g. `POST /metadata/bounds`, which will only return the bounds (and the keys used, which is default on all).

Notes:

- Should this be a `GET` request?
- Naming: `post_metadata` or `get_multiple_metadata` or `get_metadatas` or something else? For now, I named the endpoint `post_metadata` and every other method `get_multiple_metadata`.
- Should users be able to specify `bounds_south` as a column, or should we only allow them to specify `bounds`? (Same with `range` & `min`/`max`.) For now, I have added this functionality.
- The `_decode_specific_data` function.

Please let me know if you have any feedback, as that is very appreciated. I'm excited to see this feature be used.
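The column filtering the PR describes (return only the requested columns, but always include the dataset keys) can be sketched as below. This is a hypothetical helper for illustration: the name `filter_metadata` and the flat record layout are assumptions, not the PR's actual implementation.

```python
def filter_metadata(metadata, keys, columns):
    """Keep only the requested top-level metadata columns.

    The dataset's keys are always included, mirroring the PR's note
    that keys are returned by default on all responses.
    """
    filtered = {"keys": keys}
    for col in columns:
        filtered[col] = metadata[col]
    return filtered


# Example: a request like POST /metadata/bounds keeps only "bounds"
# (plus the keys); "range" and "metadata" are dropped from the response.
record = {"bounds": [0.0, 0.0, 1.0, 1.0], "range": [0, 255], "metadata": {}}
result = filter_metadata(record, ["01012012", "HRL", "annually", "IMD"], ["bounds"])
```

Filtering at the top level like this sidesteps the open question about sub-fields such as `bounds_south`: the path parameters would only ever name whole columns.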