Unify retry behaviour of interruptible tasks #3793
-
Really appreciate the coherent breakdown of this issue and I am excited about simplifying this. Was there conversation at the most recent contributors' meetup? It would be great to have a quick synopsis for those of us who missed it. A few thoughts:
-
For reference, flyteorg/flytekit#1732 addresses the retry behaviour of shell tasks:
-
A quick update from our discussion today:
An open question is for the
-
Closing in favour of RFC #3902
-
Authors:
1 Executive Summary
Flyte implements a retry mechanism to make workflows robust against failure. This retry mechanism has two different budgets: one for which the user defines the maximum number of retries in the `@task` decorator, and one for system failures, which is defined on the platform side. Especially when it comes to interruptions/node terminations, the details of the retry behaviour (which budget a retry counts against and how many retry attempts remain) are opaque and difficult to understand. The behaviour is unfortunately also not consistent between plugins, or even within the Pod plugin.
2 Motivation
We use interruptible tasks for most of our workloads as we implemented robust checkpointing and saw great cost savings.
For regular `PythonFunctionTasks`, the interruptible behaviour (in case of a preemption) is as follows: the Pod plugin tries to demystify failures and, in case a preemption is detected, returns a system retriable failure.
(Unintuitively, in case of a `SIGKILL` received during node termination, the failure is counted towards the user-defined retry budget, in contrast to a preemption.) When using, for instance, the kubeflow operators plugin (e.g. a PyTorch task), all preemptions are counted towards the user retry budget. The last attempt is not performed on a non-preemptible node.
Preempted attempts are shown as follows in the console:
The incoherent behaviour is opaque and counterintuitive for platform users. As you can see in the previous screenshots, the user can't distinguish which retry budget an attempt counted against. We as platform engineers have been approached multiple times with questions such as: "Why did my task retry only x times when I specified y times?"
3 Proposed Implementation
Below we contrast different approaches to simplify and unify the behaviour:
3.1 Counting preemptions towards user retry budget
Several of our platform users were surprised to learn that the `retries` parameter in the `@task` decorator has no effect on how many times a task can get preempted - at least in the case of Python tasks (not in the case of e.g. PyTorch tasks). The intended use case for `retries` (retrying failures raised as a `FlyteRecoverableException`) is never exercised by our engineers. For longer trainings, a larger number of preemptions can be acceptable to a user. Some of our users remarked that they would like to have control over the maximum number of preemptions.
We propose to count preemptions towards the user-defined retry budget for tasks with `interruptible=True` here by returning a `PhaseInfoRetryableFailure` instead of a `PhaseInfoSystemRetryableFailure`.
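As a concrete illustration, here is a minimal, self-contained sketch of the decision this change boils down to. The types and the `phaseInfoForPreemption` function are stand-ins invented for this example; the real code would use the `pluginsCore.PhaseInfo*` constructors mentioned above, whose signatures differ.

```go
package main

import "fmt"

// Stand-ins for the real phase constructors (PhaseInfoRetryableFailure /
// PhaseInfoSystemRetryableFailure); not the actual flyteplugins types.
type RetryBudget string

const (
	UserBudget   RetryBudget = "USER"   // counts against @task retries
	SystemBudget RetryBudget = "SYSTEM" // counts against the platform-side budget
)

type PhaseInfo struct {
	Budget RetryBudget
	Reason string
}

// phaseInfoForPreemption sketches proposal 3.1: a preemption of an
// interruptible task is charged to the user-defined retry budget,
// everything else stays on the system budget.
func phaseInfoForPreemption(interruptible bool, reason string) PhaseInfo {
	if interruptible {
		return PhaseInfo{Budget: UserBudget, Reason: reason}
	}
	return PhaseInfo{Budget: SystemBudget, Reason: reason}
}

func main() {
	fmt.Println(phaseInfoForPreemption(true, "node preempted"))  // {USER node preempted}
	fmt.Println(phaseInfoForPreemption(false, "node preempted")) // {SYSTEM node preempted}
}
```

Note that non-interruptible tasks would keep today's behaviour, so the change would be limited to tasks that explicitly opted into preemptible capacity.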
3.2 Deprecating system retry budget
This raises the question of what additional value the system retry budget still offers, beyond a non-zero default value configured by the platform admins.
Unifying the two budgets, setting a default on the platform side, and transparently allowing the user to override the retry budget per task would make the mechanism a lot easier to understand.
We still see the need for a non-zero default value for the retry budget to cover possible infrastructure issues currently covered by the system retry budget (e.g. failed upload of outputs). Not retrying them by default would not be a good choice.
This might be a new platform configuration or could be handled by a default value in flytekit. Flytekit currently has a default value of `0` for `retries`, a potential issue for backwards compatibility.
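As a rough illustration of what a unified budget could look like (all names below are hypothetical; this is not an existing Flyte configuration or API):

```go
package retrybudget // hypothetical sketch, not existing Flyte code

// UnifiedRetryBudget sketches proposal 3.2: a single budget with a non-zero
// platform default that a task can transparently override.
type UnifiedRetryBudget struct {
	PlatformDefault int  // set by platform admins, covers infrastructure failures
	TaskOverride    *int // optional per-task value, e.g. from the @task decorator
}

// MaxAttempts returns the number of attempts a task gets: the retry count
// plus the initial attempt.
func (b UnifiedRetryBudget) MaxAttempts() int {
	retries := b.PlatformDefault
	if b.TaskOverride != nil {
		retries = *b.TaskOverride
	}
	return retries + 1
}
```

The backwards-compatibility concern is visible in this sketch: with flytekit's current default of `0` for `retries`, a naive override would effectively disable the platform default unless the absence of a user-provided value can be distinguished from an explicit `0`.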
3.3 Demystify failure of Pods belonging to plugins/CRDs
The following two approaches don't tackle the problems that 1) users cannot adapt the number of accepted preemptions to the length of their training or 2) in the Pod plugin a preemption is a system failure but a `SIGKILL` during node termination is not. However, they do tackle the problem that preemptions are counted towards different budgets in the Pod plugin and e.g. the kubeflow plugin.
3.3.1 Demystify Pod failure in each plugin
In the kubeflow plugin, any job failure is counted towards the user-defined retry budget, as the status of the kubeflow job CR does not give any insight into whether the respective job failed due to a preemption or not.
To demystify such a failure, one would have to look at the status of the underlying Pods belonging to the kubeflow job. The statuses of these Pods could be retrieved in case of job failure here and passed to `flytek8s.DemystifyFailure`, as is done in the Pod plugin here. If any of its Pods got preempted, the entire job would be seen as preempted.
This approach has two downsides:
Currently, in the `GetTaskPhase` function of a plugin, no requests to the Kubernetes API are required; retrieving the Pod statuses there would change that. `GetTaskPhase` is also expected to be relatively fast, and "any operations that might take a long time [...] should be offloaded to the background". That being said, we would only have to do this a single time after the job failed.
3.3.2 Demystify plugin Pod failure centrally
If we want to have the ability to demystify the failure of a job in a plugin by taking the underlying Pods into account 1) without making requests to the Kubernetes API in the `GetTaskPhase` function of the plugin and 2) without reimplementing the same logic in multiple plugins, we could take the following approach: extend the k8s plugin interface with an optional new function that returns a selector/identity resource for the Pods underlying the job CR used by the plugin:
In the case of the kubeflow plugin, the new `BuildIdentityPod` function would return a Pod with a label such as `training.kubeflow.org/job-name: atk2w95vq7mdrnsq4j2z-n1-0`. (In case the underlying resources are not always Pods, the function could simply have another name.)
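For the kubeflow plugin, the implementation could be as small as returning a Pod carrying that job-name label. The function and package below are hypothetical; only the label key and example value come from the discussion above.

```go
package kfexample // hypothetical package, not the actual flyteplugins layout

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildIdentityPod returns a Pod carrying the label that the kubeflow
// training operator stamps on the worker Pods of a job
// (e.g. training.kubeflow.org/job-name: atk2w95vq7mdrnsq4j2z-n1-0),
// so that the same labels can later be used as a selector.
func buildIdentityPod(jobName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Labels: map[string]string{
				"training.kubeflow.org/job-name": jobName,
			},
		},
	}
}
```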
Here, in case of job failure in e.g. the kubeflow plugin, we would not return a `pluginsCore.PhaseInfoRetryableFailure` but a new type of `pluginsCore.PhaseInfoDemystifyableFailure`. At some central location in the k8s plugin (tbd), we would then demystify the failure by looking at the underlying Pods and return either a `pluginsCore.PhaseInfoRetryableFailure` or a `pluginsCore.PhaseInfoSystemRetryableFailure`.
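To make the central step concrete, here is a hedged sketch. The function, the `looksPreempted` callback, and the `RetryBudget` result are illustrative stand-ins for the existing Pod-level demystify helpers and the `pluginsCore.PhaseInfo*` constructors; only the flow (list the Pods matching the identity Pod's labels once after failure, then decide which budget to charge) follows the proposal.

```go
package k8splugin // hypothetical sketch, same caveats as above

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// RetryBudget stands in for returning either PhaseInfoSystemRetryableFailure
// (SYSTEM) or PhaseInfoRetryableFailure (USER).
type RetryBudget string

const (
	UserBudget   RetryBudget = "USER"
	SystemBudget RetryBudget = "SYSTEM"
)

// demystifyJobFailure would run once, centrally, after a plugin reported a
// demystifyable failure: it lists the Pods matching the identity Pod's labels
// and decides which retry budget the failure counts against.
// looksPreempted is a placeholder for the existing Pod-level heuristics.
func demystifyJobFailure(ctx context.Context, kubeClient kubernetes.Interface, namespace string,
	identity *v1.Pod, looksPreempted func(v1.Pod) bool) (RetryBudget, error) {

	selector := labels.Set(identity.Labels).String()
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return SystemBudget, err
	}
	for _, pod := range pods.Items {
		if looksPreempted(pod) {
			// At least one worker Pod was preempted: treat the whole job as
			// preempted, i.e. a system retryable failure, matching today's
			// Pod plugin convention.
			return SystemBudget, nil
		}
	}
	// No evidence of preemption: count the failure against the user-defined
	// retry budget.
	return UserBudget, nil
}
```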
This would avoid the need for potentially slow requests to the Kubernetes API in the plugin's `GetTaskPhase` function, and the problem could be solved once for all current and future plugins requiring this.
3.4 Inform the user which retry budget an attempt counted against
This only improves transparency and does not solve the different behaviour between plugins. (This section also becomes moot if approach 3.2 is chosen.)
5 Drawbacks
Currently the behaviour is opaque and confusing for platform users. We as platform engineers have to look deep into the code to understand which budget a certain failure counts against. Users cannot increase the number of accepted preemptions when their training runs longer.
We should not leave these quirks unaddressed.
6 Alternatives
Currently, several alternatives are discussed under section 3. Once it becomes clear which approach is favoured, the other approaches will be moved here.
7 Potential Impact and Dependencies
8 Unresolved questions
9 Conclusion
The approaches discussed under 3.3 would unify the behaviour of the plugins but would add complexity. Our preferred approach would be 3.1/3.2, as this way the overall complexity would be reduced and the user would get control over the number of acceptable preemptions.