Failure vs Error in regards to RetryPolicy #10429

jmccrumm · 2023-01-30T17:50:40Z

Where is the line drawn between an Error and a Failure when it comes to the retryPolicy for a Retry Strategy?

I ask because we seem to have a lot of intermittent problems with some of our workflows connecting to our internal GitLab instance, in which case a hit of the 'Retry' button in Argo Workflows always solves it. I attempted to put some retry strategies in our workflows so that we don't have to manually retry when these intermittent issues occur, but I set the retryPolicy to OnError, and it seems there are some cases that are considered Failure when I would think they'd be Errors. For instance, we get artifact <blah> failed to load...ssh: handshake failed: EOF, but this is marked as a Failure, when it didn't even get to executing the actual script for that step. I could set the retryPolicy to Always, but I also don't want it to keep retrying when there is a true failure when running the script itself.

Also, what is OnTransientError? If that case occurs, will the Phase within the workflow show exactly that?

Thanks.

I am running v3.4.4 of Argo Workflows.

wesleyscholl · 2024-12-14T02:48:38Z

@jmccrumm

Failure refers to scenarios where the main container of a step executes but terminates with a non-zero exit code, indicating that the process encountered an issue during execution. The default retryPolicy, OnFailure, retries steps whose main container is marked as failed in Kubernetes.

Error refers to issues that prevent the step from executing its main task, such as failures in the init or wait containers, or workflow controller errors. The OnError retryPolicy retries steps that encounter such errors.
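For reference, a minimal sketch of a template that retries only on Errors might look like the following. The template name, image, repository URL, and Secret name are placeholders I made up for illustration, not values from your setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-sketch-
spec:
  entrypoint: build
  templates:
  - name: build                       # hypothetical template name
    retryStrategy:
      limit: "3"                      # give up after 3 retries
      retryPolicy: "OnError"          # retry Errors only, not main-container Failures
      backoff:
        duration: "10s"               # wait before the first retry
        factor: "2"                   # multiply the wait by 2 on each later retry
        maxDuration: "5m"             # cap on the total time spent retrying
    inputs:
      artifacts:
      - name: source
        path: /src
        git:
          repo: git@gitlab.example.com:group/project.git   # placeholder repo
          sshPrivateKeySecret:
            name: gitlab-ssh-key      # placeholder Secret name
            key: ssh-private-key      # placeholder key inside that Secret
    container:
      image: alpine:3.18              # placeholder image
      command: [sh, -c]
      args: ["ls /src"]
```

With this policy, a non-zero exit from the main container is not retried, which matches the goal of not re-running a script that genuinely failed.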

In your situation, intermittent GitLab issues that produce errors like “artifact failed to load...ssh: handshake failed: EOF” should be classified as Errors, because the main container never got to run the primary script. Setting the retryPolicy to OnError should therefore, in theory, address these issues.

However, if you’re seeing these errors marked as Failures, it may come down to how Argo classifies that specific error condition.
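If that turns out to be the case, one possible workaround is to set retryPolicy to Always and narrow it with a retryStrategy expression, so that ordinary script failures are still not retried. The fragment below is only a sketch. The retries documentation describes lastRetry.exitCode as the exit code of the last attempt, or "-1" if not available, but whether an artifact-load failure actually leaves the exit code unavailable is an assumption you would need to verify on v3.4.4.

```yaml
# Fragment of a template spec, not a complete Workflow.
retryStrategy:
  limit: "3"
  retryPolicy: "Always"   # consider both Errors and Failures...
  # ...but only retry when the attempt ended in Error, or when no exit code
  # was recorded (assumption: the main container never ran in that case).
  expression: 'lastRetry.status == "Error" || asInt(lastRetry.exitCode) == -1'
```

A script that runs and exits with a regular non-zero code would then fail the expression and not be retried.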

The OnTransientError policy retries steps that encounter errors defined as transient, for example I/O or TLS handshake timeouts. These are a subset of errors that are usually temporary and may succeed on retry. As far as I know, there is no separate phase for this: a step that hits a transient error still shows one of the usual phases (Error or Failed), with the details in its message, and it is retried if its retryStrategy uses the OnTransientError policy.
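A minimal sketch of that policy on a template follows. The limit is arbitrary. The retries documentation also mentions a TRANSIENT_ERROR_PATTERN environment variable for treating additional error messages as transient, though I have not verified that it would catch the SSH handshake message above.

```yaml
# Fragment of a template spec, not a complete Workflow.
retryStrategy:
  limit: "5"                          # arbitrary retry limit for illustration
  retryPolicy: "OnTransientError"     # retry only errors Argo treats as transient
```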

Resources:

https://argo-workflows.readthedocs.io/en/release-3.4/retries

https://github.com/argoproj/argo-workflows/blob/main/examples/retry-on-error.yaml
