Standardize the binary job timeouts at 150 minutes. #311

clalancette · 2024-08-02T18:01:02Z

Currently, Ubuntu amd64 and RHEL amd64 binary jobs are at 120 minutes, while Ubuntu aarch64 binary jobs are at 720 minutes. That latter timeout is probably a legacy of when the aarch64 jobs used qemu to build.

This controversial PR changes all of them to 150 minutes. To be clear, that is a 30 minute increase in the case of amd64, and a 570 minute decrease in the case of aarch64.
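
As a quick check of those numbers, here is a minimal sketch. The dictionary keys are illustrative labels only, not names taken from the buildfarm configuration; the minute values are the ones stated above.

```python
# Timeouts in minutes, as stated in this PR description.
# The keys are hypothetical labels, not real configuration identifiers.
OLD_TIMEOUTS = {
    "ubuntu_amd64": 120,
    "rhel_amd64": 120,
    "ubuntu_aarch64": 720,
}
NEW_TIMEOUT = 150


def timeout_deltas(old, new):
    """Return the change in minutes for each job type (positive = increase)."""
    return {name: new - minutes for name, minutes in old.items()}


if __name__ == "__main__":
    for name, delta in timeout_deltas(OLD_TIMEOUTS, NEW_TIMEOUT).items():
        print(f"{name}: {delta:+d} minutes")
```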

My reasoning here is as follows:

1. We know we have jobs on the buildfarm that "sometimes" complete in their current 120 minute timeslot. When they run out of time, they usually would have finished in the next 10 minutes.
2. aarch64 jobs have been set to 720 minutes for a while, but we don't see a huge number of binary jobs taking longer than 150 minutes there. So I think our risk of lots of packages suddenly increasing their time is limited.
3. This should materially improve the experience of the ROS Bosses, as there will be fewer jobs they have to run "by hand" to have regression-free syncs.

Signed-off-by: Chris Lalancette

clalancette · 2024-08-02T18:01:39Z

FYI @audrow @Yadunund @marcoag . This has the potential to fix some issues with the distributions, but there is also some potential for problems (particularly on arm64).

cottsay

I support this. For aarch64, I've often observed those jobs finish FASTER than the amd64 jobs since we moved off qemu. While bumping the timeout on the rest of the jobs could be viewed as a "slippery slope", I think that we're probably wasting far more compute time retrying these jobs today than we would waste if even a handful of them were to start consistently succeeding within the 30 minute extension.

The hard limit doesn't save us very much compute if we don't take action on a package that consistently hits it and retries the next day only to waste another 120 min of compute.
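
To make that trade-off concrete, here is a rough sketch of the compute spent on one package under each limit. It assumes every attempt takes the same time and that a timed-out attempt burns the full timeout. The 130 minute duration and the five daily retries are hypothetical numbers chosen for illustration, not measurements from the buildfarm.

```python
def compute_minutes(job_duration, timeout, attempts):
    """Return (minutes_spent, succeeded) for a job retried up to `attempts` times.

    Simplifying assumptions: each attempt needs `job_duration` minutes, and an
    attempt that hits the timeout consumes the whole timeout before it is killed.
    """
    if job_duration <= timeout:
        # The first attempt finishes, so no retries are needed.
        return job_duration, True
    # Every attempt times out and is retried (e.g. once per day).
    return timeout * attempts, False


if __name__ == "__main__":
    # Hypothetical package that needs 130 minutes, retried on 5 consecutive days.
    print(compute_minutes(130, timeout=120, attempts=5))
    print(compute_minutes(130, timeout=150, attempts=5))
```

Under these assumptions the 120 minute limit spends 600 minutes without ever producing a binary, while the 150 minute limit spends 130 minutes and succeeds on the first attempt.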

clalancette requested a review from cottsay as a code owner August 2, 2024 18:01

cottsay approved these changes Aug 2, 2024

Yadunund approved these changes Aug 5, 2024

marcoag approved these changes Aug 6, 2024

nuclearsandwich approved these changes Aug 9, 2024

nuclearsandwich merged commit 2e5343f into ros2 Aug 9, 2024
3 checks passed

clalancette deleted the clalancette/rationalize-binary-job-timeouts branch August 9, 2024 18:02
