Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚡ Do nothing well (2023 server edition) #931

Merged
merged 11 commits into from
Aug 15, 2023

Conversation

shankari
Copy link
Contributor

@shankari shankari commented Aug 14, 2023

This includes the following changes:

  • skip pipeline if no new data
  • skip model building if no processed trips in three weeks
  • improve the performance of the user matching code

Pending:

  • Improve the performance of get_pipeline_range_ts

This is a partial fix for e-mission/e-mission-docs#942

Without this change, even if we have no new entries, we run through the rest of
the pipeline. Some of the queries for the rest of the pipeline can be fairly
slow - taking ~ 30 secs to complete. With ~ 100 people per program, even if we
had no new data, we will take 100 * 30 = 3000 secs = 50 minutes to process all
the data.

Instead, let's do an early return if we have no new data.
This allows us to hopefully cut that runtime down significantly.
This is a partial fix for
e-mission/e-mission-docs#942 (comment)

Testing done:
- Added new test case which:
    - runs the pipeline on an empty dataset and checks that none of the states are updated
    - runs the pipeline on a non-empty dataset with no new data and checks that none of the states are updated
- Test passes
    ```
    ----------------------------------------------------------------------
    Ran 1 test in 2.361s

    OK
    ```
…s no new data

In 7e5aacb, we configured the pipeline to
return early if there is no new data. That is good for performance, but can
have some limitations if we make changes to the pipeline to address errors -
the modified stage will not run until there is new data.

Just to be on the safe side, add a command line option to control this feature,
and plumb it through the script ('intake_multiprocess' -> 'scheduler' ->
'intake') with a default of 'False' to retain the original behavior.

This is a partial fix of
e-mission/e-mission-docs#942 (comment)

We were then able to change the test to run the pipeline with 'False' instead
of running the test pipeline. This added the 'OUTPUT_GEN' step to the set of valid states.
After adding that, the test passes

'''
----------------------------------------------------------------------
Ran 1 test in 3.825s

OK
'''
…g trips

We currently build the trip model for every single user, regardless of whether
they are actively contributing data or not. But if they have stopped
contributing data, their model is the same, we don't need to rebuild it.

So we skip the model building if their last trip's end is too old

This is a partial fix for:
e-mission/e-mission-docs#942 (comment)

Testing done:
- Commented out the actual model
- Loaded the Denver CASR data
- Spot checked both decisions

```
2023-08-13 19:23:11,719:INFO:building model for user U1
2023-08-13 19:23:11,722:DEBUG:last trip was at 2022-10-22T15:51:49.993526+00:00, three weeks ago was 2023-07-24T02:23:11.722757+00:00, gap = 9 months ago, skipping model building...

2023-08-13 19:23:11,722:INFO:building model for user U2
2023-08-13 19:23:11,726:DEBUG:last trip was at 2023-08-01T01:46:10.019082+00:00, three weeks ago was 2023-07-24T02:23:11.726590+00:00, gap is in a week, building model...
```
Add a new optional argument for `ignore_older_than_weeks`
Wrap the skip functionality so that it is called only if `ignore_older_than_weeks` is specified.

Testing done:
- similar to 3cfda25, turned off the actual model building
- `./e-mission-py.bash bin/build_label_model.py -a`, all models were run
- `./e-mission-py.bash bin/build_label_model.py -a -i 3`, some models were skipped

```
2023-08-14 09:27:04,394:INFO:building model for user ...
2023-08-14 09:27:04,402:DEBUG:last trip was at 2022-10-23T17:18:28.999999+00:00, three weeks ago was 2023-07-24T16:27:04.402425+00:00, gap = 9 months ago, skipping model building...
2023-08-14 09:27:04,402:INFO:building model for user ...
2023-08-14 09:27:04,425:DEBUG:last trip was at 2023-08-04T00:50:24.999804+00:00, three weeks ago was 2023-07-24T16:27:04.425763+00:00, gap is in a week, building model...
```
When we match the user inputs to existing confirmed trips, we "fill in" the
inputs and update the confirmed trip. Hwoever, now that we retrieve the
composite trips, we need to propagate those changes downstream as well.

We used to find the matching composite trip for a confirmed trip by searching
for the entry whose `data.confirmed_trip` field was the id of the confirmed trip.
However, the `data.confirmed_trip` field is not indexed in the database.

So on larger data collection efforts, such as Denver CASR, composite trip to
confirmed trip mapping was taking 10 **minutes**. Since the confirmed and
composite trips will always have the same start and end timestamps we can find
the entry that has the same timestamp instead to find the match.

This drops the access time from 10 minutes to ~ 2 minutes
e-mission/e-mission-docs#942 (comment)

Once we resolve the slow indexing on DocumentDB, this has the potential to drop
even further.

Side changes (also documented in
e-mission/e-mission-docs#942 (comment))
- we don't override the start and end timestamps and locations anymore (since
  we don't anticipate filling them in anyway)
- changed the unit tests, which were changing the `start_ts` to change other
  fields instead since we now lookup by `start_ts`

Testing done:
`TestCompositeTripCreation.py` passes with the changes
…the fly

Before this, we used to compute the pipeline range while retrieving it
to display the label screen.

However, as we can see from
e-mission/e-mission-docs#942 (comment)
the call to get the oldest trip can take up to 10 seconds (!!) on certain deployments.

However, retrieving data from the pipeline or profile databases is very fast

So we compute the range at the end of the pipeline run and store it in the user profile.
We can then retrieve the data directly from the user profile when requested.
e-mission/e-mission-docs#942 (comment)

Additional changes:
- remove the fallback to the cleaned and confirmed trips since we can't display
  them in the UI anyway, and the fallback can lead to a _hole_ in the displayed trips
  e-mission/e-mission-docs#942 (comment)

- remove the check against the complete_ts since we have not encountered it in
    also removed the check against the complete_ts since that was also a
    backwards compat check, and we haven't run into it even once over 1.5 years
    of `open_access` and `nrel_commute`
    e-mission/e-mission-docs#942 (comment)

Testing done:
- After making the changes, tried to load trips from a test user with real data loaded
    - range was None, None, no trips were returned
```
2023-08-14 14:59:04,432:DEBUG:123145673592832:START POST /pipeline/get_range_ts
2023-08-14 14:59:04,433:DEBUG:123145673592832:methodName = skip, returning <class 'emission.net.auth.skip.Skip
Method'>
2023-08-14 14:59:04,434:DEBUG:123145673592832:Using the skip method to verify id token nrelop_dev-emulator-pro
gram_default_testuser of length 44
2023-08-14 14:59:04,437:DEBUG:123145673592832:retUUID = ....
2023-08-14 14:59:04,440:DEBUG:123145673592832:Returning range (None, None)
2023-08-14 14:59:04,440:DEBUG:123145673592832:END POST /pipeline/get_range_ts ....  0.008764982223510742
```
- Reset and re-ran the pipeline
    - range was set and trips were returned and displayed correctly

```
2023-08-14 15:19:22,953:DEBUG:123145807908864:START POST /pipeline/get_range_ts
2023-08-14 15:19:22,954:DEBUG:123145807908864:methodName = skip, returning <class 'emission.net.auth.skip.Skip
Method'>
2023-08-14 15:19:22,955:DEBUG:123145807908864:Using the skip method to verify id token nrelop_dev-emulator-pro
gram_default_testuser of length 44
2023-08-14 15:19:22,962:DEBUG:123145807908864:retUUID = ...
2023-08-14 15:19:22,967:DEBUG:123145807908864:Returning range (1688845113.167196, 1691156058.2945251)
2023-08-14 15:19:22,968:DEBUG:123145807908864:END POST /pipeline/get_range_ts ...  0.016937971115112305
```
Since they also make calls using fields with no index (`data.trip_id`), and may
have performance issues.
e-mission/e-mission-docs#942 (comment)

These will help us determine whether they do actually have performance issues
and how much we can improve by changing the implementation

Testing done:
- Ran the pipeline with the change
- Saw logs

```
2023-08-14 21:38:08,150:DEBUG:140704350807680:get_section_summary(64db00a82032f67e94165626, analysis/cleaned_s
ection) called
2023-08-14 21:38:08,488:DEBUG:140704350807680:get_section_summary(64db00a82032f67e94165626, analysis/cleaned_s

2023-08-14 21:38:09,679:DEBUG:140704350807680:get_section_summary(64db00aa2032f67e94165644, analysis/cleaned_section) called
2023-08-14 21:38:09,983:DEBUG:140704350807680:get_section_summary(64db00aa2032f67e94165644, analysis/cleaned_section) returning {'distance': {'BICYCLING': 5789.554643480443}, 'duration': {'BICYCLING': 1226.4553627967834}, 'count': {'BICYCLING': 1}}

...
```
@shankari shankari changed the title ⚡ Do nothing well ⚡ Do nothing well (2023 server edition) Aug 15, 2023
Fix errors in
e-mission#931
to fix
e-mission/e-mission-docs#942

Testing done:
- Ran

```
./e-mission-py.bash bin/intake_multiprocess.py 1 --skip_if_no_new_data
```

- Several users were skipped

```
2023-08-14T22:23:23.331295-07:00**********UUID : moving to long term**********
No new entries, and skip_if_no_new_data = True, skipping the rest of the pipeline
2023-08-14T22:23:23.525582-07:00**********UUID : moving to long term**********
No new entries, and skip_if_no_new_data = True, skipping the rest of the pipeline
2023-08-14T22:23:23.748683-07:00**********UUID : moving to long term**********
No new entries, and skip_if_no_new_data = True, skipping the rest of the pipeline
```

- And several other users were not

```
2023-08-14T22:24:30.000684-07:00**********UUID : moving to long term**********
New entry count == 207 and skip_if_no_new_data = True, continuing
2023-08-14T22:24:30.830030-07:00**********UUID : updating incoming user inputs**********
2023-08-14T22:24:31.705677-07:00**********UUID : filter accuracy if needed**********
```
We can remove this later if we don't need this level of verbosity
In 73414b9, we added logging to track how long
the section summaries take. The log prints out the trip id as the first line of
the function

```
+    logging.debug(f"get_section_summary({cleaned_trip['_id']}, {section_key}) called")
```

So the error when we pass in a fake trip is now that the `_id` doesn't exist,
not that the `user_id` doesn't exist.

Fixed by changing the expected exception

Testing done:

```
----------------------------------------------------------------------
Ran 1 test in 0.247s

OK
```
In 0403ae0, we start precomputing the pipeline
stage and storing it in this profile. This means that we need the user to have
a profile to begin with.

In the real world, users will always have profiles since we register them
during `profile/create`

In this test, we set the `testEmail` so that we register the users as part of
the test as well. This implies that we also:
- need to delete the profile_db() as part of cleanup
- use the version of the pipeline in `epi` so that it will run the pre-population code

Once we do that, the tests pass. However, when they were failing, we realized
that we can't compare two tuples with `assertAlmostEqual` since the assertion
can't subtract tuples to get the difference. Instead, we separate out the start
and end and compare them independently with `assertAlmostEqual`

Testing done:
both tests pass

```
----------------------------------------------------------------------
Ran 2 tests in 33.884s

OK
```
@shankari shankari merged commit f92b00a into e-mission:master Aug 15, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant