Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

for yyyy_mm_dd metrics, query by fmt_time instead of local_dt #970

Merged
merged 5 commits into from
Jun 25, 2024

Conversation

JGreenlee
Copy link
Contributor

I recently rewrote the local dt queries used by TimeComponentQuery (#968) to make them behave as datetime range queries. But this results in some pretty complex queries which could have nested $and and $or conditions. It felt overengineered but that was the logic required if we were to query date ranges by looking at local dt objects where year, month, day, etc are all recorded separately. Unfortunately, those queries run super slowly on production. I think this is because the $and / $or conditions prevented MongoDB from optimizing efficiently via indexing

Looked for a different solution, I found something better and simpler. At first I didn't think using fmt_time fields with $lt and $gt comparisons would work becuase they're ISO strings, not numbers. But luckily MongoDB can compare strings this way. it performs a lexicographical comparison where 0-9 < A-Z < a-z (like ASCII). ISO has the standard format 0000-00-00T00:00:00
The dashes and the T will always be in the same places, so effectively, only the numbers will be compared. And the timezone info is at the end, so it doesn't get considered as long as we don't include it in the start and end inputs. The only thing that specifically needs to be handled is making the end range inclusive. If I query from "2024-06-01" to "2024-06-03", an entry like "2024-06-03T23:59:59-04:00" should match. But lexicographically, that comes after "2024-06-03". Appending a "Z" to the end range solves this.

The result of all this is that the TimeQuery class (timequery.py) is now able to handle either timestamp or ISO dates, and this is what metrics.py will use for summarize_by_yyyy_mm_dd.

I recently rewrote the local dt queries used by TimeComponentQuery (e-mission#968) to make them behave as datetime range queries. But this results in some pretty complex queries which could have nested $and and $or conditions. It felt overengineered but that was the logic required if we were to query date ranges by looking at local dt objects where year, month, day, etc are all recorded separately.
Unfortunately, those queries run super slowly on production. I think this is because the $and / $or conditions prevented MongoDB from optimizing efficiently via indexing

Looked for a different solution, I found something better and simpler.
At first I didn't think using fmt_time fields with $lt and $gt comparisons would work becuase they're ISO strings, not numbers.
But luckily MongoDB can compare strings this way. it performs a lexicographical comparison where 0-9 < A-Z < a-z (like ASCII).
ISO has the standard format 0000-00-00T00:00:00
The dashes and the T will always be in the same places, so effectively, only the numbers will be compared. And the timezone info is at the end, so it doesn't get considered as long as we don't include it in the start and end inputs.
The only thing that specifically needs to be handled is making the end range inclusive. If I query from "2024-06-01" to "2024-06-03", an entry like "2024-06-03T23:59:59-04:00" should match. But lexicographically, that comes after "2024-06-03".
Appending a "Z" to the end range solves this.

The result of all this is that the TimeQuery class (timequery.py) is now able to handle either timestamp or ISO dates, and this is what metrics.py will use for summarize_by_yyyy_mm_dd.
@JGreenlee
Copy link
Contributor Author

@shankari
Since this method will work for querying datetime ranges and is probably more performant, should I revert #968 back to the old "set of filters" implementation? You mentioned it was useful for certain things like weekday/weekend filtering.
Maybe go back to the old implementation but rename it to "filter query" instead of "range query" ?

@shankari
Copy link
Contributor

shankari commented Jun 7, 2024

Maybe go back to the old implementation but rename it to "filter query" instead of "range query" ?

I vote for "filter query" instead of "range query"

@JGreenlee
Copy link
Contributor Author

I realized that the TimeQuery on e-mission-server (which works w/ MongoDB) mirrors TimeQuery in the native code plugins (which work with SQL dbs).

For that reason, I don't want to rename anything in TimeQuery right now. So instead of extending the functionality of TimeQuery to work with either ts or fmt_time, I'm going to make a new class called FmtTimeQuery (or similar)

I briefly read up on SQL queries to make sure they work the same way when comparing strings. They do – so later on, we could implement FmtTimeQuery on the native code UserCache plugin too

e-mission#970 (comment)
> I realized that the TimeQuery on `e-mission-server` (which works w/ MongoDB) mirrors TimeQuery in the native code plugins (which work with SQL dbs).
>
> For that reason, I don't want to rename anything in `TimeQuery` right now. So instead of extending the functionality of `TimeQuery` to work with either ts or fmt_time, I'm going to make a new class called `FmtTimeQuery` (or similar)
>
> I briefly read up on SQL queries to make sure they work the same way when comparing strings. They do – so later on, we could implement `FmtTimeQuery` on the native code UserCache plugin too
e-mission#970 (comment)

reverts e-mission#968 back to the original implementation since we now have a faster way to do "range" queries via FmtTimeQuery.

To make it clearer that FmtTimeQuery is for ranges and TimeComponentQuery is for filters, I renamed some functions and added more descriptive comments/docstrings
@JGreenlee JGreenlee marked this pull request as ready for review June 14, 2024 16:31
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Glad to see that this worked out; the ISO folks designed their format correctly!
  • I am merging this for now, but we should add tests for the new FmtTimeQuery, similar to the current filter query
  • I assume that the changes to the local date queries are just reverting the previous changes. I spot checked and that does seem to be true, but please flag locations that were not a simple revert so that I can take a closer look later.


def get_query(self) -> dict:
time_key = self.timeType
ret_query = {time_key: {"$lte": self.endIso}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we expect that the end will always be present?
An alternate approach would be to assume that the start will always be present, and set the end to today

@shankari shankari merged commit f55f223 into e-mission:master Jun 25, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Address code reviews with "future"
Development

Successfully merging this pull request may close these issues.

2 participants