Replies: 1 comment
-
Failure refers to scenarios where the main container of a step runs but terminates with a non-zero exit code, indicating that the process hit a problem during execution. The default retryPolicy, OnFailure, retries steps whose main container is marked as failed in Kubernetes.

Error refers to issues that prevent the step from executing its main task, such as failures in the init or wait containers, or workflow controller errors. The OnError retryPolicy retries steps that encounter such errors.

In your situation, intermittent GitLab issues that produce errors like “artifact failed to load...ssh: handshake failed: EOF” should be classified as Errors, because the main container hasn’t executed the primary script. Setting the retryPolicy to OnError should therefore, in theory, address these issues. However, if you’re observing that these errors are marked as Failures, it might be due to how the system interprets specific error conditions.

The OnTransientError policy retries steps that encounter errors defined as transient, for example I/O or TLS handshake timeouts. These are a subset of errors that are usually temporary and may succeed on retry. If a step fails with a transient error, the workflow’s phase will reflect this, and the step will be retried if the retryStrategy sets retryPolicy to OnTransientError.

Resources:
https://argo-workflows.readthedocs.io/en/release-3.4/retries
https://github.com/argoproj/argo-workflows/blob/main/examples/retry-on-error.yaml
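For reference, here is a minimal sketch of where retryPolicy sits in a template's retryStrategy. The workflow name, template name, image, command, retry limit, and backoff values are all illustrative assumptions and are not taken from the original workflow.

```yaml
# Minimal sketch only: names, image, and retry numbers are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: clone-and-build
  templates:
    - name: clone-and-build
      retryStrategy:
        limit: "3"                 # give up after three retries
        retryPolicy: "OnError"     # alternatives: Always, OnFailure, OnTransientError
        backoff:
          duration: "10s"          # wait before the first retry
          factor: "2"              # multiply the wait after each retry
          maxDuration: "5m"        # stop retrying after this much total time
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo running the step script"]
```

With OnError, a non-zero exit from the script itself would not be retried. If the artifact-loading problem is still reported as a Failure, this policy alone would not catch it.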
-
Where is the line drawn between an Error and a Failure when it comes to the retryPolicy for a Retry Strategy?
I ask because we seem to have a lot of intermittent problems with some of our workflows connecting to our internal GitLab instance, in which case a hit of the 'Retry' button in Argo Workflows always solves it. I attempted to put some retry strategies in our workflows so that we don't have to manually retry when these intermittent issues occur, but I set the retryPolicy to OnError, and it seems there are some cases that are considered Failure when I would think they'd be Errors. For instance, we get artifact <blah> failed to load...ssh: handshake failed: EOF, but this is marked as a Failure, when it didn't even get to executing the actual script for that step. I could set the retryPolicy to Always, but I also don't want it to keep retrying when there is a true failure when running the script itself.
Also, what is OnTransientError? If that case occurs, will the Phase within the workflow show exactly that?
Thanks.
I am running v3.4.4 of Argo Workflows.