Pagination & Sorting #66
Concerning pagination, I have found the following StackExchange post, which sheds some light on possible solutions and issues/pitfalls for this problem. E.g., now the results are updated by having the database's
Pros:
Cons:
What do you think @csadorf, @giovannipizzi or @ml-evs? |
One of my original ideas was for each new gateway to initialize by always doing a single request to each database, storing the |
Another solution found while doing a bit of digging: https://github.com/chrisvxd/combine-pagination#framed-range-intersecting |
After perusing some more sites and solutions, I think perhaps there are two options:
I guess I'll try to implement the latter, to provide the gateway with some sort of pagination. |
I think when we originally discussed this, I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail. |
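To make the "merge until the page is full or all sources are dry" idea concrete, here is a minimal sketch. It assumes each data source is exposed as a plain Python iterator of entry dicts; the name `fill_page` and the added `source` key are hypothetical and not part of the gateway.

```python
from typing import Dict, Iterator, List, Tuple


def fill_page(
    sources: Dict[str, Iterator[dict]], page_limit: int
) -> Tuple[List[dict], bool]:
    """Fill one gateway page by drawing entries from the sources in turn.

    Returns the page and a flag telling whether any source may hold more data.
    """
    page: List[dict] = []
    active = dict(sources)  # sources not yet known to be dry
    while active and len(page) < page_limit:
        for name in list(active):
            if len(page) >= page_limit:
                break
            try:
                entry = next(active[name])
            except StopIteration:
                del active[name]  # this source is dry
                continue
            # Tag each entry with its origin (hypothetical "source" key).
            page.append({"source": name, **entry})
    return page, bool(active)
```

Because the iterators keep their own position, calling `fill_page` again with the same `sources` mapping yields the next page, so the per-source pagination stays hidden from the caller.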
I will give just some general comments, as I don't know the current implementation well enough (happy to discuss the design in person; that is probably faster).

It then becomes a question of the GUI (which is very important for the design, so it would be good to have a clear idea of how this is supposed to work): either results are grouped by provider (you ask each of them for the same page length and show results as they arrive), or you mix them as they arrive, in which case you need to indicate to the user that more results might still be coming, e.g., with 1) a progress indicator showing the percentage of providers that have replied, and 2) some indication that "intermediate" results might be missing. It would be good to check how providers like Kayak or Skyscanner present this in their web GUIs, as they have thought hard about how to make this efficient (even if I think they cache results) and easy to understand.

I guess a good approach (for sorting) would be, say, to ask for the first 10 results from each DB and show them, with some graphical indicator that there might be more in between, and to allow clicking a button to fetch the next page of 10 from the DB(s) that might have additional intermediate results. I don't think it's critical to have an exact pagination of the very final results (though of course you can, e.g., show the first 50 of the 10*(num providers) you fetched). But if the order can change, it's better not to "promise" a final GUI page length and instead just show it dynamically.

Of course, all of this depends on the technology used to transfer results to the GUI. To do what I suggest, there should be a way to ask in "real time" whether there are more results (e.g., create a "session" or "token" per request; the gateway keeps retrieving data in the backend, and the client can check often, via AJAX or similar, whether there are new results in that session/token).
On the other hand, if the reply can be compliant with OPTIMADE that would be great (and I wouldn't worry if this requires minor changes to the API, we can discuss and extend it if important). |
That would still result in queries where, e.g., the query parameter Edit: But if we don't promise that (as also seems to be suggested by @giovannipizzi), then it should be fine, of course. |
I think this gateway cannot handle this in any way, as this is usually an implementation done in JS or with CORS, i.e., directly in the browser. It might be able to mock it, but the design for the query-result delivery would have to change (drastically), I think.
Right. So this is the idea of implementing something similar to https://github.com/chrisvxd/combine-pagination#framed-range-intersecting or DynamoDB.
|
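For the sorted case, the core constraint behind such range-based approaches can be sketched as follows. This is only an illustration of the idea, not the algorithm used by combine-pagination or DynamoDB: after fetching one sorted page per source, only keys up to the smallest "last key" among sources that may still return more are safe to show, and everything above that bound is held back until further pages arrive. The function name and the use of bare integer sort keys are assumptions made for brevity.

```python
import heapq
from typing import Dict, List, Tuple


def safe_sorted_prefix(
    pages: Dict[str, List[int]], has_more: Dict[str, bool]
) -> Tuple[List[int], List[int]]:
    """Split fetched (ascending) sort keys into (safe to show, held back)."""
    # Each per-source page is assumed to be sorted already.
    merged = list(heapq.merge(*pages.values()))
    # A source with more data can only return keys >= its last fetched key,
    # so the smallest such last key bounds what can be shown safely.
    bounds = [page[-1] for name, page in pages.items() if has_more[name] and page]
    if not bounds:
        return merged, []  # every source is exhausted: all keys are final
    bound = min(bounds)
    safe = [key for key in merged if key <= bound]
    held = [key for key in merged if key > bound]
    return safe, held
```

The held-back keys are the "intermediate results might be missing" region mentioned earlier in the thread; fetching the next page from the source that set the bound moves the bound forward.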
I don't understand this answer. The gateway acts as OPTIMADE provider. I tell the gateway I want to get a maximum of 50 results per page. As a user I at first don't care where those results are coming from, I just want the first 50 results. What is not aligning here? |
Because the results from each database come in bulk, not one-by-one. So one would still have to wait for all results to come in, then apply a mix-sorting, and then finally cut off any unwanted entries. |
Why are they coming in bulk? We can request paginated results internally, can we not? |
Sure. But this would result in many small external calls, each retrieving a single entry or a couple of entries, which is quite inefficient. And how would you divide up the results between the various databases? Do you only return the 50 fastest results, while the user is expecting a mix of all databases when possible? If the requested limit does not divide evenly by the number of databases, how do you determine which entries to include or exclude? I think the consequence of your first comment and @giovannipizzi's comment is rather that the gateway shouldn't "promise" a returned number of entries. |
As @giovannipizzi pointed out, it is important that we return results as they arrive (first come, first served), so that our response is not slowed down to the pace of the slowest data source in the mix or potentially even stalled completely. The user expects that the results come from all sources, but unless they specify a specific ordering, there can be no expectation that they are in some sense well-mixed. Eventually the complete result set should contain all valid results from all sources, but a single page of a paginated response carries no such expectation. This means that we should indeed make smaller requests to the individual sources and then start filling our pages immediately. I am not sure why that would be inefficient? It just means that we make multiple smaller requests at once. We can even specify a minimum page limit that we divide our page into, meaning with a page size of N and M data sources, the individual page limit for each data source is |
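A minimal sketch of the first-come-first-served filling described in this comment, using `asyncio`. The per-source limit is deliberately left as a caller-supplied parameter rather than computed here, and the `Fetch` callables, the function name, and the added `source` key are all hypothetical stand-ins for real requests to the databases.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical fetcher type: takes a page limit, returns a list of entries.
Fetch = Callable[[int], Awaitable[List[dict]]]


async def first_come_page(
    fetchers: Dict[str, Fetch], per_source_limit: int, page_limit: int
) -> List[dict]:
    """Query all sources concurrently and fill the page as responses arrive."""

    async def tagged(name: str, fetch: Fetch) -> List[dict]:
        entries = await fetch(per_source_limit)
        return [{"source": name, **entry} for entry in entries]

    tasks = [asyncio.ensure_future(tagged(n, f)) for n, f in fetchers.items()]
    page: List[dict] = []
    try:
        # as_completed yields results in arrival order, fastest source first.
        for finished in asyncio.as_completed(tasks):
            page.extend(await finished)  # a failing source raises here
            if len(page) >= page_limit:
                break  # page is full: do not wait for slower sources
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished
    return page[:page_limit]
```

In this sketch a slow source simply does not contribute to the current page; a real gateway would presumably keep its pending response for the next page instead of cancelling it.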
As described in the README here, pagination and sorting are issues that need attention.
Pagination
Priority: MUST be solved
This should be solved, either via response caching or a query algorithm that ensures entries from any of the gateway's databases are not overlooked, even when the databases hold different numbers of available entries.
Sorting
Priority: MAY be solved
This I consider secondary and not a MUST, since sorting is not a required property of OPTIMADE implementations. However, since it is a very desirable property when using this package as a backend for a client, it still has a high priority.
However, this cannot be implemented until the pagination issue has been solved.
Furthermore, I think sorting should be disabled for attributes fields if just a single gateway database does not support sorting.
A possible implementation pathway could be to query the relevant /info/<entry> endpoint to collect all sortable fields and only offer sorting on the sortable fields common to all the gateway's databases. This should then be reflected in the gateway's own /info/<entry> endpoint.
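The "sortable fields common to all databases" step could look roughly like the sketch below. It assumes each /info/<entry> response has already been fetched and parsed into a dict, and that sortable properties are marked under `data.properties.<name>.sortable`; that layout is my reading of the OPTIMADE entry info response and should be checked against the specification. The function name is hypothetical.

```python
from typing import Dict, List, Optional, Set


def common_sortable_fields(info_responses: List[dict]) -> Set[str]:
    """Return the property names that every database marks as sortable."""
    common: Optional[Set[str]] = None
    for response in info_responses:
        # Assumed layout: {"data": {"properties": {name: {"sortable": bool}}}}
        properties: Dict[str, dict] = response.get("data", {}).get("properties", {})
        # Be conservative: a missing "sortable" key counts as not sortable.
        sortable = {
            name for name, spec in properties.items() if spec.get("sortable") is True
        }
        common = sortable if common is None else common & sortable
    return common or set()
```

A database that reports no sortable properties empties the intersection, which matches the suggestion to disable sorting as soon as a single gateway database does not support it.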