Standardize the binary job timeouts at 150 minutes. #311

Merged
merged 1 commit into from
Aug 9, 2024

Conversation

clalancette

Currently, Ubuntu amd64 and RHEL amd64 binary jobs are at 120 minutes, while Ubuntu aarch64 binary jobs are at 720 minutes. That latter timeout is probably a legacy of when the aarch64 jobs used qemu to build.

This controversial PR changes all of them to 150 minutes. To be clear, that is a 30 minute increase in the case of amd64, and a 570 minute decrease in the case of aarch64.

My reasoning here is as follows:

  1. We know we have jobs on the buildfarm that "sometimes" complete in their current 120 minute timeslot. When they run out of time, they usually would have finished in the next 10 minutes.
  2. aarch64 jobs have been set to 720 minutes for a while, but we don't see a huge number of binary jobs taking longer than 150 minutes there. So I think our risk of lots of packages suddenly increasing their time is limited.
  3. This should materially improve the experience of the ROS Bosses, as there will be fewer jobs they have to run "by hand" to have regression-free syncs.
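
The arithmetic behind the change can be checked with a short sketch. The dictionary keys and the helper name `timeout_deltas` are hypothetical labels chosen for illustration, not the buildfarm's actual configuration keys; only the minute values come from the description above.

```python
# Old per-platform binary job timeouts in minutes, as stated in this PR.
# The key names are illustrative assumptions, not real config keys.
OLD_TIMEOUTS_MIN = {
    "ubuntu_amd64": 120,
    "rhel_amd64": 120,
    "ubuntu_aarch64": 720,
}

# The single standardized timeout proposed here.
NEW_TIMEOUT_MIN = 150


def timeout_deltas(old, new):
    """Return the change in minutes per job type (positive means an increase)."""
    return {job: new - minutes for job, minutes in old.items()}


deltas = timeout_deltas(OLD_TIMEOUTS_MIN, NEW_TIMEOUT_MIN)
for job, delta in deltas.items():
    print(f"{job}: {delta:+d} minutes")
```

A positive delta is an increase (amd64) and a negative delta is a decrease (aarch64).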

Signed-off-by: Chris Lalancette <[email protected]>
@clalancette
Author

FYI @audrow @Yadunund @marcoag. This has the potential to fix some issues with the distributions, but there is also some potential for problems (particularly on arm64).

Member

@cottsay cottsay left a comment

I support this. For aarch64, I've often observed those jobs finish faster than the amd64 jobs since we moved off qemu. While bumping the timeout on the rest of the jobs could be viewed as a "slippery slope", I think that we're probably wasting far more compute time retrying these jobs today than we would waste if even a handful of them were to start consistently succeeding within the 30 minute extension.

The hard limit doesn't save us very much compute if we don't take action on a package that consistently hits it and retries the next day only to waste another 120 min of compute.
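
To make that trade-off concrete, here is a rough sketch under simplifying assumptions: the job needs a fixed runtime on every attempt, it is retried once per day, and a run that hits the limit is charged the full timeout. The function name `compute_minutes` and the 130 minute runtime are hypothetical, picked to match the "would have finished in the next 10 minutes" case from the description.

```python
def compute_minutes(runtime_min, timeout_min, attempts):
    """Return (total minutes spent, whether the job ever succeeded).

    Assumes a fixed runtime per attempt; a job that fits within the
    timeout succeeds on its first attempt and is not retried.
    """
    succeeded = runtime_min <= timeout_min
    if succeeded:
        return runtime_min, True
    # Every attempt is killed at the timeout, so each one costs the full limit.
    return timeout_min * attempts, False


# Hypothetical job that needs 130 minutes, retried on three consecutive days.
old_cost = compute_minutes(130, 120, attempts=3)
new_cost = compute_minutes(130, 150, attempts=3)
print(old_cost, new_cost)
```

Under these assumptions the 120 minute limit spends 360 minutes without ever producing a binary, while the 150 minute limit spends 130 minutes once.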

@nuclearsandwich nuclearsandwich merged commit 2e5343f into ros2 Aug 9, 2024
3 checks passed
@clalancette clalancette deleted the clalancette/rationalize-binary-job-timeouts branch August 9, 2024 18:02