Unify retry behaviour of interruptible tasks #3793
-
Really appreciate the coherent breakdown of this issue and I am excited about simplifying this. Was there conversation at the most recent contributors' meetup? It would be great to have a quick synopsis for those of us who missed it. A few thoughts:
-
For reference, flyteorg/flytekit#1732 addresses the retry behaviour of shell tasks:
-
A quick update from our discussion today:
An open question is for the
-
Closing in favour of RFC #3902
-
Authors:
1 Executive Summary
Flyte implements a retry mechanism to make workflows robust against failure. This retry mechanism has two different budgets: one for which the user defines the maximum number of retries in the `@task` decorator, and one for system failures, which is defined on the platform side. Especially when it comes to interruptions/node terminations, the details of the retry behaviour (which budget a retry counts against and how many retry attempts remain) are opaque and difficult to understand. The behaviour is unfortunately also not consistent between plugins, or even within the Pod plugin.
2 Motivation
We use interruptible tasks for most of our workloads as we implemented robust checkpointing and saw great cost savings.
For regular `PythonFunctionTasks`, the interruptible behaviour (in case of a preemption) is as follows: the Pod plugin tries to demystify failures and, in case a preemption is detected, returns a system retriable failure.
(Unintuitively, in case of a `SIGKILL` received during node termination, the failure is counted towards the user-defined retry budget, in contrast to a preemption.) When using, for instance, the kubeflow operators plugin (e.g. a PyTorch task), all preemptions are counted towards the user retry budget. The last attempt is not performed on a non-preemptible node.
Preempted attempts are shown as follows in the console:
The incoherent behaviour is opaque and counterintuitive for platform users. As you can see in the previous screenshots, the user can't distinguish which retry budget an attempt counted against. We as platform engineers have been approached multiple times with questions such as: "Why did my task retry only x times when I specified y times?"
3 Proposed Implementation
Below we contrast different approaches to simplify and unify the behaviour:
3.1 Counting preemptions towards user retry budget
Several of our platform users were surprised to learn that the `retries` parameter in the `@task` decorator has no effect on how many times a task can get preempted - at least in the case of Python tasks (not in the case of e.g. PyTorch tasks). The intended use case for `retries` (retrying failures raised as a `FlyteRecoverableException`) is never exercised by our engineers. For longer trainings, a larger number of preemptions can be acceptable to a user. Some of our users remarked that they would like to have control over the maximum number of preemptions.
We propose to count preemptions towards the user-defined retry budget for tasks with `interruptible=True` here by returning a `PhaseInfoRetryableFailure` instead of a `PhaseInfoSystemRetryableFailure`.
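As a concrete illustration, here is a minimal, self-contained sketch of the decision this change boils down to. The types and the `phaseInfoForPreemption` function are stand-ins invented for this example; the real code would use the `pluginsCore.PhaseInfo*` constructors mentioned above, whose signatures differ.

```go
package main

import "fmt"

// Stand-ins for the real phase constructors (PhaseInfoRetryableFailure /
// PhaseInfoSystemRetryableFailure); not the actual flyteplugins types.
type RetryBudget string

const (
	UserBudget   RetryBudget = "USER"   // counts against @task retries
	SystemBudget RetryBudget = "SYSTEM" // counts against the platform-side budget
)

type PhaseInfo struct {
	Budget RetryBudget
	Reason string
}

// phaseInfoForPreemption sketches proposal 3.1: a preemption of an
// interruptible task is charged to the user-defined retry budget,
// everything else stays on the system budget.
func phaseInfoForPreemption(interruptible bool, reason string) PhaseInfo {
	if interruptible {
		return PhaseInfo{Budget: UserBudget, Reason: reason}
	}
	return PhaseInfo{Budget: SystemBudget, Reason: reason}
}

func main() {
	fmt.Println(phaseInfoForPreemption(true, "node preempted"))  // {USER node preempted}
	fmt.Println(phaseInfoForPreemption(false, "node preempted")) // {SYSTEM node preempted}
}
```

Note that non-interruptible tasks would keep today's behaviour, so the change would be limited to tasks that explicitly opted into preemptible capacity.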
3.2 Deprecating system retry budget
This raises the question of what additional value the system retry budget still offers, beyond a non-zero default value configured by the platform admins.
Unifying the two budgets, setting a default on the platform side, and transparently allowing the user to override the retry budget per task would make the mechanism a lot easier to understand.
We still see the need for a non-zero default value for the retry budget to cover possible infrastructure issues currently covered by the system retry budget (e.g. failed upload of outputs). Not retrying them by default would not be a good choice.
This might be a new platform configuration or could be handled by a default value in flytekit. Flytekit currently has a default value of `0` for `retries`, a potential issue for backwards compatibility.
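As a rough illustration of what a unified budget could look like (all names below are hypothetical; this is not an existing Flyte configuration or API):

```go
package retrybudget // hypothetical sketch, not existing Flyte code

// UnifiedRetryBudget sketches proposal 3.2: a single budget with a non-zero
// platform default that a task can transparently override.
type UnifiedRetryBudget struct {
	PlatformDefault int  // set by platform admins, covers infrastructure failures
	TaskOverride    *int // optional per-task value, e.g. from the @task decorator
}

// MaxAttempts returns the number of attempts a task gets: the retry count
// plus the initial attempt.
func (b UnifiedRetryBudget) MaxAttempts() int {
	retries := b.PlatformDefault
	if b.TaskOverride != nil {
		retries = *b.TaskOverride
	}
	return retries + 1
}
```

The backwards-compatibility concern is visible in this sketch: with flytekit's current default of `0` for `retries`, a naive override would effectively disable the platform default unless the absence of a user-provided value can be distinguished from an explicit `0`.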
3.3 Demystify failure of Pods belonging to plugins/CRDs
The following two approaches don't tackle the problems that 1) users cannot adapt the number of accepted preemptions to the length of their training or 2) in the Pod plugin a preemption is a system failure but a `SIGKILL` during node termination is not. However, they do tackle the problem that preemptions are counted towards different budgets in the Pod plugin and e.g. the kubeflow plugin.
3.3.1 Demystify Pod failure in each plugin
In the kubeflow plugin, any job failure is counted towards the user-defined retry budget, as the status of the kubeflow job CR does not give any insight into whether the respective job failed due to a preemption or not.
To demystify such a failure, one would have to look at the status of the underlying Pods belonging to the kubeflow job. The statuses of these Pods could be retrieved in case of job failure here and passed to `flytek8s.DemystifyFailure`, as is done in the Pod plugin here. If any of its Pods got preempted, the entire job would be seen as preempted.
This approach has two downsides:
Currently, in the `GetTaskPhase` function of a plugin, no requests to the Kubernetes API are required; retrieving the Pod statuses there would change that. `GetTaskPhase` is also expected to be relatively fast, and "any operations that might take a long time [...] should be offloaded to the background". That being said, we would only have to do this a single time after the job failed.
3.3.2 Demystify plugin Pod failure centrally
If we want to have the ability to demystify the failure of a job in a plugin by taking the underlying Pods into account 1) without making requests to the Kubernetes API in the `GetTaskPhase` function of the plugin and 2) without reimplementing the same logic in multiple plugins, we could take the following approach: extend the k8s plugin interface with an optional new function that returns a selector/identity resource for the Pods underlying the job CR used by the plugin:
In the case of the kubeflow plugin, the new `BuildIdentityPod` function would return a Pod with a label such as `training.kubeflow.org/job-name: atk2w95vq7mdrnsq4j2z-n1-0`. (In case the underlying resources are not always Pods, the function could simply have another name.)
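For the kubeflow plugin, the implementation could be as small as returning a Pod carrying that job-name label. The function and package below are hypothetical; only the label key and example value come from the discussion above.

```go
package kfexample // hypothetical package, not the actual flyteplugins layout

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildIdentityPod returns a Pod carrying the label that the kubeflow
// training operator stamps on the worker Pods of a job
// (e.g. training.kubeflow.org/job-name: atk2w95vq7mdrnsq4j2z-n1-0),
// so that the same labels can later be used as a selector.
func buildIdentityPod(jobName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Labels: map[string]string{
				"training.kubeflow.org/job-name": jobName,
			},
		},
	}
}
```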
Here, in case of job failure in e.g. the kubeflow plugin, we would not return a `pluginsCore.PhaseInfoRetryableFailure` but a new type of `pluginsCore.PhaseInfoDemystifyableFailure`. At some central location in the k8s plugin (tbd), we would then demystify the failure by looking at the underlying Pods and return either a `pluginsCore.PhaseInfoRetryableFailure` or a `pluginsCore.PhaseInfoSystemRetryableFailure`.
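To make the central step concrete, here is a hedged sketch. The function, the `looksPreempted` callback, and the `RetryBudget` result are illustrative stand-ins for the existing Pod-level demystify helpers and the `pluginsCore.PhaseInfo*` constructors; only the flow (list the Pods matching the identity Pod's labels once after failure, then decide which budget to charge) follows the proposal.

```go
package k8splugin // hypothetical sketch, same caveats as above

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// RetryBudget stands in for returning either PhaseInfoSystemRetryableFailure
// (SYSTEM) or PhaseInfoRetryableFailure (USER).
type RetryBudget string

const (
	UserBudget   RetryBudget = "USER"
	SystemBudget RetryBudget = "SYSTEM"
)

// demystifyJobFailure would run once, centrally, after a plugin reported a
// demystifyable failure: it lists the Pods matching the identity Pod's labels
// and decides which retry budget the failure counts against.
// looksPreempted is a placeholder for the existing Pod-level heuristics.
func demystifyJobFailure(ctx context.Context, kubeClient kubernetes.Interface, namespace string,
	identity *v1.Pod, looksPreempted func(v1.Pod) bool) (RetryBudget, error) {

	selector := labels.Set(identity.Labels).String()
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return SystemBudget, err
	}
	for _, pod := range pods.Items {
		if looksPreempted(pod) {
			// At least one worker Pod was preempted: treat the whole job as
			// preempted, i.e. a system retryable failure, matching today's
			// Pod plugin convention.
			return SystemBudget, nil
		}
	}
	// No evidence of preemption: count the failure against the user-defined
	// retry budget.
	return UserBudget, nil
}
```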
This would avoid the need for potentially slow requests to the Kubernetes API in the plugin's `GetTaskPhase` function, and the problem could be solved once for all current and future plugins requiring this.
3.4 Inform the user which retry budget an attempt counted against
This only improves transparency and does not solve the different behaviour between plugins. (This section also becomes moot if approach 3.2 is chosen.)
5 Drawbacks
Currently the behaviour is opaque and confusing for platform users. We as platform engineers have to look deep into the code to understand which budget a certain failure counts against. Users cannot increase the number of accepted preemptions when their training runs longer.
We should not leave these quirks unaddressed.
6 Alternatives
Currently, several alternatives are discussed under section 3. Once it becomes clear which approach is favoured, the other approaches will be moved here.
7 Potential Impact and Dependencies
8 Unresolved questions
9 Conclusion
The approaches discussed under 3.3 would unify the behaviour of the plugins but would add complexity. Our preferred approach would be 3.1/3.2, as this way the overall complexity would be reduced and the user would get control over the number of acceptable preemptions.