⚡ Do nothing well (2023 server edition) #931
Merged
shankari
merged 11 commits into
e-mission:master
from
shankari:scalability_improvements
Aug 15, 2023
Conversation
Without this change, even if we have no new entries, we run through the rest of the pipeline. Some of the queries in the rest of the pipeline can be fairly slow, taking ~30 secs to complete. With ~100 people per program, even with no new data, we would take 100 * 30 = 3000 secs = 50 minutes to process all the data. Instead, let's return early if we have no new data, which should cut that runtime down significantly.

This is a partial fix for e-mission/e-mission-docs#942 (comment)

Testing done:
- Added a new test case which:
  - runs the pipeline on an empty dataset and checks that none of the states are updated
  - runs the pipeline on a non-empty dataset with no new data and checks that none of the states are updated
- Test passes

```
----------------------------------------------------------------------
Ran 1 test in 2.361s

OK
```
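The early-return check described above can be sketched as follows. This is a hypothetical simplification: the function name, the `new_entry_count` parameter, and the stage list are illustrative stand-ins, not the actual pipeline internals.

```python
def run_intake_pipeline_for_user(uuid, new_entry_count):
    """Illustrative sketch of the early-return optimization.

    `new_entry_count` stands in for the number of entries that were
    moved to long-term storage for this user in the current run.
    Returns True if the remaining pipeline stages were run.
    """
    if new_entry_count == 0:
        # Skip the remaining (potentially ~30 sec/user) stages entirely
        print(f"{uuid}: no new entries, skipping the rest of the pipeline")
        return False
    # ... run segmentation, smoothing, ..., OUTPUT_GEN stages here ...
    return True
```

With 100 inactive users, this turns 100 slow pipeline runs into 100 cheap count queries.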
…s no new data

In 7e5aacb, we configured the pipeline to return early if there is no new data. That is good for performance, but it has a limitation if we modify the pipeline to fix errors: the modified stage will not run until there is new data. Just to be on the safe side, add a command line option to control this feature, and plumb it through the scripts (`intake_multiprocess` -> `scheduler` -> `intake`) with a default of `False` to retain the original behavior.

This is a partial fix for e-mission/e-mission-docs#942 (comment)

We were then able to change the test to run the pipeline with `False` instead of running the test pipeline. This added the `OUTPUT_GEN` step to the set of valid states. After adding that, the test passes:

```
----------------------------------------------------------------------
Ran 1 test in 3.825s

OK
```
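The flag plumbing might look like the following `argparse` sketch. The real option lives in `bin/intake_multiprocess.py` and is threaded down through the scheduler; the parser here is a hypothetical reconstruction, not the actual script.

```python
import argparse

def make_parser():
    # Hypothetical sketch of the CLI option described above.
    parser = argparse.ArgumentParser()
    parser.add_argument("n_workers", type=int,
                        help="number of worker processes")
    parser.add_argument("--skip_if_no_new_data", action="store_true",
                        default=False,  # default False retains the original behavior
                        help="return early from the pipeline when a user has no new entries")
    return parser

# The flag would then be passed down: intake_multiprocess -> scheduler -> intake
args = make_parser().parse_args(["1", "--skip_if_no_new_data"])
```

Defaulting to `False` means a modified stage still runs for every user unless the operator explicitly opts in to skipping.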
…g trips

We currently build the trip model for every single user, regardless of whether they are actively contributing data. But if they have stopped contributing data, their model does not change, so we don't need to rebuild it. We therefore skip the model building if their last trip's end is too old.

This is a partial fix for: e-mission/e-mission-docs#942 (comment)

Testing done:
- Commented out the actual model building
- Loaded the Denver CASR data
- Spot checked both decisions

```
2023-08-13 19:23:11,719:INFO:building model for user U1
2023-08-13 19:23:11,722:DEBUG:last trip was at 2022-10-22T15:51:49.993526+00:00, three weeks ago was 2023-07-24T02:23:11.722757+00:00, gap = 9 months ago, skipping model building...
2023-08-13 19:23:11,722:INFO:building model for user U2
2023-08-13 19:23:11,726:DEBUG:last trip was at 2023-08-01T01:46:10.019082+00:00, three weeks ago was 2023-07-24T02:23:11.726590+00:00, gap is in a week, building model...
```
Add a new optional argument for `ignore_older_than_weeks`. Wrap the skip functionality so that it is called only if `ignore_older_than_weeks` is specified.

Testing done:
- similar to 3cfda25, turned off the actual model building
- `./e-mission-py.bash bin/build_label_model.py -a`: all models were built
- `./e-mission-py.bash bin/build_label_model.py -a -i 3`: some models were skipped

```
2023-08-14 09:27:04,394:INFO:building model for user ...
2023-08-14 09:27:04,402:DEBUG:last trip was at 2022-10-23T17:18:28.999999+00:00, three weeks ago was 2023-07-24T16:27:04.402425+00:00, gap = 9 months ago, skipping model building...
2023-08-14 09:27:04,402:INFO:building model for user ...
2023-08-14 09:27:04,425:DEBUG:last trip was at 2023-08-04T00:50:24.999804+00:00, three weeks ago was 2023-07-24T16:27:04.425763+00:00, gap is in a week, building model...
```
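The skip decision in the two commits above reduces to a cutoff comparison, which can be sketched as below. The function name and signature are illustrative; the actual code lives in the label model building path.

```python
from datetime import datetime, timedelta, timezone

def should_skip_model_build(last_trip_end, ignore_older_than_weeks=None, now=None):
    """Hypothetical sketch: rebuild the model only if the user's last trip
    ended within the last `ignore_older_than_weeks` weeks.

    If the option is not specified (None), never skip, matching the
    original behavior of always building the model.
    """
    if ignore_older_than_weeks is None:
        return False
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(weeks=ignore_older_than_weeks)
    return last_trip_end < cutoff

# Mirroring the spot checks in the log output above (dates simplified):
now = datetime(2023, 8, 14, tzinfo=timezone.utc)
# last trip ~9 months before the 3-week cutoff -> skip
should_skip_model_build(datetime(2022, 10, 23, tzinfo=timezone.utc), 3, now)
# last trip within the past week -> build
should_skip_model_build(datetime(2023, 8, 4, tzinfo=timezone.utc), 3, now)
```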
When we match the user inputs to existing confirmed trips, we "fill in" the inputs and update the confirmed trip. However, now that we retrieve the composite trips, we need to propagate those changes downstream as well.

We used to find the matching composite trip for a confirmed trip by searching for the entry whose `data.confirmed_trip` field was the id of the confirmed trip. However, the `data.confirmed_trip` field is not indexed in the database, so on larger data collection efforts, such as Denver CASR, the composite trip to confirmed trip mapping was taking 10 **minutes**. Since the confirmed and composite trips will always have the same start and end timestamps, we can instead find the entry that has the same timestamp. This drops the access time from 10 minutes to ~2 minutes. e-mission/e-mission-docs#942 (comment) Once we resolve the slow indexing on DocumentDB, this has the potential to drop even further.

Side changes (also documented in e-mission/e-mission-docs#942 (comment)):
- we don't override the start and end timestamps and locations anymore (since we don't anticipate filling them in anyway)
- changed the unit tests, which were modifying `start_ts`, to change other fields instead, since we now look up by `start_ts`

Testing done: `TestCompositeTripCreation.py` passes with the changes
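The lookup change can be illustrated with the sketch below. The `FakeCollection` and the exact query keys are hypothetical stand-ins for the analysis timeseries collection and its document layout; only the old-field-vs-timestamp idea comes from the commit above.

```python
class FakeCollection:
    """Minimal stand-in for a MongoDB collection, for illustration only."""
    def __init__(self, docs):
        self.docs = docs

    def find_one(self, query):
        for doc in self.docs:
            if all(self._get(doc, k) == v for k, v in query.items()):
                return doc
        return None

    @staticmethod
    def _get(doc, dotted_key):
        # Resolve "data.start_ts"-style dotted keys
        for part in dotted_key.split("."):
            doc = doc.get(part) if isinstance(doc, dict) else None
        return doc

def find_matching_composite_trip(analysis_db, confirmed_trip):
    # Old approach (slow: `data.confirmed_trip` has no index on DocumentDB):
    #   analysis_db.find_one({"metadata.key": "analysis/composite_trip",
    #                         "data.confirmed_trip": confirmed_trip["_id"]})
    # New approach: confirmed and composite trips always share start/end
    # timestamps, so match on the start timestamp instead.
    return analysis_db.find_one({
        "metadata.key": "analysis/composite_trip",
        "user_id": confirmed_trip["user_id"],
        "data.start_ts": confirmed_trip["data"]["start_ts"],
    })
```

The same query shape works against a real `pymongo` collection; the win comes from matching on a field that can use an index rather than scanning on `data.confirmed_trip`.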
…the fly

Before this, we used to compute the pipeline range while retrieving it to display the label screen. However, as we can see from e-mission/e-mission-docs#942 (comment), the call to get the oldest trip can take up to 10 seconds (!!) on certain deployments, while retrieving data from the pipeline or profile databases is very fast. So we compute the range at the end of the pipeline run and store it in the user profile. We can then retrieve the data directly from the user profile when requested. e-mission/e-mission-docs#942 (comment)

Additional changes:
- remove the fallback to the cleaned and confirmed trips, since we can't display them in the UI anyway, and the fallback can lead to a _hole_ in the displayed trips e-mission/e-mission-docs#942 (comment)
- remove the check against the complete_ts, since it was a backwards compat check and we haven't run into it even once over 1.5 years of `open_access` and `nrel_commute` e-mission/e-mission-docs#942 (comment)

Testing done:
- After making the changes, tried to load trips from a test user with real data loaded - range was None, None, no trips were returned

```
2023-08-14 14:59:04,432:DEBUG:123145673592832:START POST /pipeline/get_range_ts
2023-08-14 14:59:04,433:DEBUG:123145673592832:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>
2023-08-14 14:59:04,434:DEBUG:123145673592832:Using the skip method to verify id token nrelop_dev-emulator-program_default_testuser of length 44
2023-08-14 14:59:04,437:DEBUG:123145673592832:retUUID = ....
2023-08-14 14:59:04,440:DEBUG:123145673592832:Returning range (None, None)
2023-08-14 14:59:04,440:DEBUG:123145673592832:END POST /pipeline/get_range_ts ....
0.008764982223510742
```

- Reset and re-ran the pipeline - range was set and trips were returned and displayed correctly

```
2023-08-14 15:19:22,953:DEBUG:123145807908864:START POST /pipeline/get_range_ts
2023-08-14 15:19:22,954:DEBUG:123145807908864:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>
2023-08-14 15:19:22,955:DEBUG:123145807908864:Using the skip method to verify id token nrelop_dev-emulator-program_default_testuser of length 44
2023-08-14 15:19:22,962:DEBUG:123145807908864:retUUID = ...
2023-08-14 15:19:22,967:DEBUG:123145807908864:Returning range (1688845113.167196, 1691156058.2945251)
2023-08-14 15:19:22,968:DEBUG:123145807908864:END POST /pipeline/get_range_ts ...
0.016937971115112305
```
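The store-then-read pattern above can be sketched as follows. Here the profile "database" is a plain dict and both function names are hypothetical; the real code stores into the user profile document and serves `/pipeline/get_range_ts` from it.

```python
def store_pipeline_range(profile_db, user_id, start_ts, end_ts):
    """Called once at the end of a pipeline run: cache the trip range
    in the user's profile so reads don't need a slow oldest-trip query."""
    profile_db.setdefault(user_id, {})["pipeline_range"] = {
        "start_ts": start_ts, "end_ts": end_ts}

def get_range_ts(profile_db, user_id):
    """Cheap profile read for the label screen.
    Returns (None, None) until the pipeline has run at least once."""
    rng = profile_db.get(user_id, {}).get("pipeline_range")
    if rng is None:
        return (None, None)
    return (rng["start_ts"], rng["end_ts"])
```

This matches the two test logs above: `(None, None)` before the pipeline run, and the cached range afterwards, with the endpoint dropping to a few milliseconds.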
Since the section summary functions also make calls using fields with no index (`data.trip_id`), they may have performance issues. e-mission/e-mission-docs#942 (comment) These logs will help us determine whether they actually have performance issues, and how much we can improve by changing the implementation.

Testing done:
- Ran the pipeline with the change
- Saw logs

```
2023-08-14 21:38:08,150:DEBUG:140704350807680:get_section_summary(64db00a82032f67e94165626, analysis/cleaned_section) called
2023-08-14 21:38:08,488:DEBUG:140704350807680:get_section_summary(64db00a82032f67e94165626, analysis/cleaned_s
2023-08-14 21:38:09,679:DEBUG:140704350807680:get_section_summary(64db00aa2032f67e94165644, analysis/cleaned_section) called
2023-08-14 21:38:09,983:DEBUG:140704350807680:get_section_summary(64db00aa2032f67e94165644, analysis/cleaned_section) returning {'distance': {'BICYCLING': 5789.554643480443}, 'duration': {'BICYCLING': 1226.4553627967834}, 'count': {'BICYCLING': 1}}
...
```
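The "called"/"returning" log pairs above bracket each invocation so that elapsed time can be read off the timestamps. A decorator-based sketch of that pattern is below; the actual change uses plain `logging.debug` calls at the top and bottom of `get_section_summary`, and the dummy body here is a placeholder, not the real summary computation.

```python
import functools
import logging
import time

def log_call(fn):
    """Illustrative entry/exit logging, as described in the commit above."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logging.debug("%s%s called", fn.__name__, args)
        start = time.time()
        result = fn(*args, **kwargs)
        logging.debug("%s%s returning %s after %.3f s",
                      fn.__name__, args, result, time.time() - start)
        return result
    return wrapper

@log_call
def get_section_summary(cleaned_trip_id, section_key):
    # Placeholder body; the real function aggregates per-mode
    # distance/duration/count over the trip's sections.
    return {"distance": {}, "duration": {}, "count": {}}
```

Because the log lines carry timestamps, the gap between "called" and "returning" directly measures how long the unindexed `data.trip_id` queries take.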
Fix errors in e-mission#931 to fix e-mission/e-mission-docs#942

Testing done:
- Ran

```
./e-mission-py.bash bin/intake_multiprocess.py 1 --skip_if_no_new_data
```

- Several users were skipped

```
2023-08-14T22:23:23.331295-07:00**********UUID : moving to long term**********
No new entries, and skip_if_no_new_data = True, skipping the rest of the pipeline
2023-08-14T22:23:23.525582-07:00**********UUID : moving to long term**********
No new entries, and skip_if_no_new_data = True, skipping the rest of the pipeline
2023-08-14T22:23:23.748683-07:00**********UUID : moving to long term**********
No new entries, and skip_if_no_new_data = True, skipping the rest of the pipeline
```

- And several other users were not

```
2023-08-14T22:24:30.000684-07:00**********UUID : moving to long term**********
New entry count == 207 and skip_if_no_new_data = True, continuing
2023-08-14T22:24:30.830030-07:00**********UUID : updating incoming user inputs**********
2023-08-14T22:24:31.705677-07:00**********UUID : filter accuracy if needed**********
```
We can remove this later if we don't need this level of verbosity
In 73414b9, we added logging to track how long the section summaries take. The log prints out the trip id in the first line of the function:

```
+ logging.debug(f"get_section_summary({cleaned_trip['_id']}, {section_key}) called")
```

So when we pass in a fake trip, the error is now that the `_id` doesn't exist, not that the `user_id` doesn't exist. Fixed by changing the expected exception.

Testing done:

```
----------------------------------------------------------------------
Ran 1 test in 0.247s

OK
```
In 0403ae0, we start precomputing the pipeline range and storing it in the user profile. This means that we need the user to have a profile to begin with. In the real world, users will always have profiles, since we register them during `profile/create`.

In this test, we set the `testEmail` so that we register the users as part of the test as well. This implies that we also:
- need to delete the `profile_db()` as part of cleanup
- use the version of the pipeline in `epi` so that it will run the pre-population code

Once we do that, the tests pass. However, while they were failing, we realized that we can't compare two tuples with `assertAlmostEqual`, since the assertion can't subtract tuples to get the difference. Instead, we separate out the start and end and compare them independently with `assertAlmostEqual`.

Testing done: both tests pass

```
----------------------------------------------------------------------
Ran 2 tests in 33.884s

OK
```
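The `assertAlmostEqual` limitation mentioned above is easy to reproduce: it rounds the difference `first - second`, which raises `TypeError` for tuples. The minimal test case below (with made-up timestamps in the same shape as the pipeline range) shows the element-wise workaround.

```python
import unittest

class TestRangeComparison(unittest.TestCase):
    def test_compare_range_elementwise(self):
        expected = (1688845113.167196, 1691156058.2945251)
        actual = (1688845113.167199, 1691156058.2945260)
        # self.assertAlmostEqual(expected, actual) would raise
        # TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
        # so compare the start and end timestamps independently instead:
        self.assertAlmostEqual(expected[0], actual[0], places=3)
        self.assertAlmostEqual(expected[1], actual[1], places=3)

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestRangeComparison))
```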
This includes the following changes:

Pending:
- `get_pipeline_range_ts`

This is a partial fix for e-mission/e-mission-docs#942