Have the same lock for upload + preprocess task #502

Open · wants to merge 3 commits into main

Conversation

adrian-codecov (Contributor) commented:

This is an alternative to the solution proposed in #497; it attempts to correct the problem at the source rather than react to it.

The idea is to share the same lock between both of these tasks, so that they can't both start without first initializing and saving the report.
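
A minimal sketch of the idea (illustrative names and numbers only; the real tasks live in tasks/upload.py and tasks/preprocess_upload.py, and only the fact that both call UPLOAD_LOCK_NAME is taken from this PR's diff):

from redis import Redis

redis_connection = Redis()  # illustrative; the worker obtains its connection elsewhere

def UPLOAD_LOCK_NAME(repoid: int, commitid: str) -> str:
    # Hypothetical key format; the point is that both tasks derive the SAME lock name.
    return f"upload_lock_{repoid}_{commitid}"

def preprocess_upload(repoid: int, commitid: str) -> None:
    lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
    with redis_connection.lock(lock_name, timeout=300, blocking_timeout=5):
        ...  # initialize and save the empty report

def upload(repoid: int, commitid: str) -> None:
    lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
    with redis_connection.lock(lock_name, timeout=300, blocking_timeout=5):
        ...  # process the uploaded coverage reports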

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

codecov-notifications bot commented Jun 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #502   +/-   ##
=======================================
  Coverage   97.45%   97.45%           
=======================================
  Files         416      416           
  Lines       34759    34761    +2     
=======================================
+ Hits        33876    33878    +2     
  Misses        883      883           
Flag                      Coverage Δ
integration               97.45% <100.00%> (+<0.01%) ⬆️
latest-uploader-overall   97.45% <100.00%> (+<0.01%) ⬆️
unit                      97.45% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Components     Coverage Δ
NonTestCode    94.46% <100.00%> (+<0.01%) ⬆️
OutsideTasks   97.73% <ø> (ø)

Files                        Coverage Δ
tasks/preprocess_upload.py   91.35% <100.00%> (+0.10%) ⬆️
tasks/upload.py              91.22% <100.00%> (+0.02%) ⬆️

codecov bot commented Jun 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.48%. Comparing base (4840ca4) to head (7d65468).

Changes have been made to critical files, which contain lines commonly executed in production.

✅ All tests successful. No failed tests found.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #502   +/-   ##
=======================================
  Coverage   97.48%   97.48%           
=======================================
  Files         447      447           
  Lines       35488    35490    +2     
=======================================
+ Hits        34595    34597    +2     
  Misses        893      893           
Flag                      Coverage Δ
integration               97.45% <100.00%> (+<0.01%) ⬆️
latest-uploader-overall   97.45% <100.00%> (+<0.01%) ⬆️
unit                      97.45% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Components     Coverage Δ
NonTestCode    94.51% <100.00%> (+<0.01%) ⬆️
OutsideTasks   97.73% <ø> (ø)

Files                        Coverage Δ
tasks/preprocess_upload.py   91.46% <100.00%> (+0.10%) ⬆️
tasks/upload.py (Critical)   91.27% <100.00%> (+0.02%) ⬆️
Related Entrypoints
run/app.tasks.upload.Upload
run/app.tasks.upload.PreProcessUpload

giovanni-guidini (Contributor) left a comment:

Thinking through the problem again, I'm realizing that this solution can lead us into more problems than before 😅

Please read the comments below for an explanation.

@@ -48,7 +49,8 @@ def run_impl(
         "Received preprocess upload task",
         extra=dict(repoid=repoid, commit=commitid, report_code=report_code),
     )
-    lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
+    lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
+    # lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
Contributor:

Why keep it as a comment?

adrian-codecov (Contributor, Author):

caaaause I forgot 🙃

@@ -48,7 +49,8 @@ def run_impl(
         "Received preprocess upload task",
         extra=dict(repoid=repoid, commit=commitid, report_code=report_code),
     )
-    lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
+    lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
Contributor:

I think it might be useful to add a comment explaining why this task needs the same lock as the upload task

blocking_timeout=None,
# This is the timeout that this task will wait to acquire the lock. This should
# be non-zero as otherwise it waits indefinitely to get the lock, and ideally smaller than
# the blocking timeout for the upload task so this one goes first, although it can go second
Contributor:

It's not the order that matters. The reason the blocking timeout should be smaller is to avoid the scenario in which the PreProcess tasks wait forever and grab the locks. This can drain Upload tasks (because they wait a limited time, and retry a limited number of times).

If only PreProcess tasks run for a commit, that's a problem. If only Upload tasks run, it's OK.
We know that the number of Upload tasks is always >= the number of PreProcess tasks for a given commit.
We know that the Upload tasks retry if they can't get the lock, but the PreProcess tasks don't.
By forcing the PreProcess tasks to collectively wait less time for the lock, we can guarantee that an Upload task will run and everything will be OK.
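
A hedged sketch of that asymmetry (hypothetical task bodies and numbers, not the actual worker code):

from redis import Redis
from redis.exceptions import LockError

redis_connection = Redis()  # illustrative

def upload_run_impl(self, lock_name: str) -> None:
    try:
        with redis_connection.lock(lock_name, timeout=300, blocking_timeout=5):
            ...  # process uploads
    except LockError:
        # Upload retries with an exponential countdown, so across attempts it can
        # outlast a PreProcessUpload task that is currently holding the lock.
        self.retry(countdown=20 * 2**self.request.retries, max_retries=3)

def preprocess_run_impl(lock_name: str) -> None:
    try:
        # Keep this bounded and small: PreProcessUpload does not retry, so a long
        # (or infinite) wait here starves the Upload tasks of the shared lock.
        with redis_connection.lock(lock_name, timeout=60, blocking_timeout=3):
            ...  # create/save the initial report
    except LockError:
        return  # give up; an Upload task will create the report anyway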

Contributor:

This is also the reason why nuking the PreProcess task makes sense.
It avoids this problem completely

Contributor:

I'm kinda realizing now that the current setup could lead us into trouble...

Consider this scenario:
One PreProcess task and one Upload task execute at roughly the same time. PreProcess gets the lock. It can hold the lock for a maximum of 5 minutes. Let's say it keeps the lock for the full 5 minutes.

Upload waits 5s and doesn't get the lock. It retries (after 20s), then waits 5 more seconds for the lock and fails. It retries again (after 40s) and waits 5 more seconds. Then it does not retry again.
In total the task waited 15s + 20s + 40s = 1m15s.

In this scenario no Upload task would run for the commit and nothing would be processed.

adrian-codecov (Contributor, Author) commented Jun 14, 2024:

That makes sense, so if we just ensure the upload task waits at least 5 minutes in the worst-case scenario, that would fix this, right?

Every unsuccessful run goes through blocking_timeout (5s) + retry_countdown (20 * 2**self.request.retries), so it goes like:

0 retries: 5 + 20  = 25
1 retry:   5 + 40  = 45  (total 70)
2 retries: 5 + 80  = 85  (total 155)
3 retries: 5 + 160 = 165 (total 320)

So if we allowed this to retry up to 3 times, that could solve it, or we could tweak the specific retry values. That's one way.

Another is to shorten the time the preprocess task holds the lock and fit the upload retries around that.

There's also eliminating the preprocess task altogether, which I can ask other people/engs about, but I suspect that will be met with more friction + unforeseen issues because we'd suddenly be dropping support for a command. Although if we fundamentally think this task is more harmful than not, it's worth pushing to get rid of it.

Let me know what you think about this
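
For the record, a tiny script reproducing the arithmetic above (the 5s blocking timeout and the 20 * 2**retries countdown are the values discussed in this thread; whether they match the current defaults is an assumption):

BLOCKING_TIMEOUT = 5  # seconds an Upload attempt waits for the lock

def cumulative_wait(max_retries: int) -> int:
    # Each attempt waits BLOCKING_TIMEOUT, then (if it fails) sleeps through a
    # countdown of 20 * 2**retries seconds before the next attempt.
    return sum(BLOCKING_TIMEOUT + 20 * 2**r for r in range(max_retries + 1))

print(cumulative_wait(2))  # 155s -- well under the 5-minute preprocess lock timeout
print(cumulative_wait(3))  # 320s -- just past the 5-minute preprocess lock timeout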

Contributor:

I would push for eliminating the preprocess task altogether. It would definitely be a good idea to get other opinions from the team.

It is not true that "we're suddenly unsupporting a command" if we drop that task. The API would still create an empty report.

In the meantime we can go with the proposal in the other PR to alleviate customers' problems (although for the record I don't particularly like that solution, but that's just one opinion).
