[Core feature] Allow flyteadmin to start even if OIDC is unavailable (Improve flyteadmin startup resiliency) #5702
…dmin startup resiliency) Signed-off-by: ddl-rliu <[email protected]>
e3a64b0 to 080a4cf
@@ -72,6 +72,7 @@ var (
"openid",
"profile",
},
OnlyStartIfOIDCIsAvailable: true,
The configurable onlyStartIfOIDCIsAvailable option is true by default. This matches today's behavior, where flyteadmin will not start if OIDC is unavailable (for example, because of DNS issues).
For reference, another project which takes a similar approach to OIDC issues/service startup resiliency is https://github.com/juanfont/headscale/blob/main/hscontrol/app.go#L157-L166
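As a rough illustration of the flag's semantics described above, here is a minimal Go sketch. It is not the PR's code: only the field name OnlyStartIfOIDCIsAvailable comes from the diff, while Config, startAuth, alwaysFail and alwaysOK are hypothetical names invented for this example, and the OIDC call is replaced by a plain function.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the auth configuration.
// Only the field name OnlyStartIfOIDCIsAvailable comes from the PR.
type Config struct {
	OnlyStartIfOIDCIsAvailable bool
}

// alwaysFail simulates an OIDC provider that cannot be reached.
func alwaysFail() error { return errors.New("dns lookup failed") }

// alwaysOK simulates a healthy OIDC provider.
func alwaysOK() error { return nil }

// startAuth runs the (simulated) OIDC initialization once.
// It returns an error only when initialization fails AND the flag
// requires OIDC to be available at startup. Otherwise it reports
// whether OIDC is ready and lets startup continue.
func startAuth(cfg Config, initOIDC func() error) (oidcReady bool, err error) {
	if initErr := initOIDC(); initErr != nil {
		if cfg.OnlyStartIfOIDCIsAvailable {
			return false, fmt.Errorf("OIDC unavailable, refusing to start: %w", initErr)
		}
		// Tolerate the failure; the caller is expected to retry later.
		return false, nil
	}
	return true, nil
}

func main() {
	_, err := startAuth(Config{OnlyStartIfOIDCIsAvailable: true}, alwaysFail)
	fmt.Println("strict mode error:", err)

	ready, err := startAuth(Config{OnlyStartIfOIDCIsAvailable: false}, alwaysFail)
	fmt.Println("lenient mode ready:", ready, "error:", err)
}
```

With the flag set to true the sketch mirrors the pre-existing behavior (startup fails); with it set to false, startup proceeds with OIDC marked as not ready.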
}
}

return &authCtx, nil
NewAuthenticationContext now returns a Context pointer. This way we can add a reusable oidcProvider to the context after transient OIDC errors are resolved.
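To show why a pointer helps here, the following Go sketch uses simplified stand-ins: authContext, provider, setProvider, ready and newAuthContext are hypothetical names, and the mutex is an assumption added for safe concurrent access rather than something stated in the PR. The point is only that every holder of the pointer sees the provider once it is set.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical stand-in for an OIDC provider handle.
type provider struct{ issuer string }

// authContext is a simplified stand-in for the PR's Context type.
// The mutex is an assumption of this sketch, not taken from the PR.
type authContext struct {
	mu           sync.RWMutex
	oidcProvider *provider
}

// newAuthContext returns a pointer so that all callers share one instance.
func newAuthContext() *authContext { return &authContext{} }

// setProvider stores the provider once initialization has succeeded.
func (c *authContext) setProvider(p *provider) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.oidcProvider = p
}

// ready reports whether a provider has been stored yet.
func (c *authContext) ready() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.oidcProvider != nil
}

func main() {
	ctx := newAuthContext()
	handlerView := ctx // e.g. a handler that captured the context at startup

	fmt.Println(handlerView.ready()) // false: OIDC not initialized yet

	// Later, once the transient OIDC error has cleared:
	ctx.setProvider(&provider{issuer: "https://issuer.example"})

	fmt.Println(handlerView.ready()) // true: the same instance was updated
}
```

Had the context been passed by value, the handler would hold a copy and would never observe the late-initialized provider.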
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5702 +/- ##
===========================================
- Coverage 60.92% 36.17% -24.76%
===========================================
Files 796 1302 +506
Lines 51689 109627 +57938
===========================================
+ Hits 31494 39660 +8166
- Misses 17288 65822 +48534
- Partials 2907 4145 +1238
@@ -185,11 +198,20 @@ func NewAuthenticationContext(ctx context.Context, sm core.SecretManager, oauth2
authCtx.authServiceImpl = authMetadataService
authCtx.identityServiceIml = identityService

return authCtx, nil
err = authCtx.InitOIDC()
The authContext logic that depends on OIDC (i.e. oidcProvider, oidcProvider.Endpoint) is extracted into an initOIDC helper.
@Sovietaced brings up a good point regarding this change.
Tracking issue
#5701
Why are the changes needed?
Today, the flyteadmin pod is blocked from starting up until the OIDC provider is healthy and available (the pod gets stuck in Error state). In some Kubernetes configurations, this erroring pod could cause deployment-wide issues. The current behavior could be made more resilient.
(Note that this applies to configurations using useAuth=true)
What changes were proposed in this pull request?
A better approach in these configurations is to allow flyte to start up, even if the OIDC provider is unavailable. Then, try to re-initialize the OIDC provider later in the deployment lifespan. This is a more resilient approach, and it can be made configurable.
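One possible shape for "try to re-initialize later" is a retry loop that the service runs after it has started. The Go sketch below is an assumption about how such a loop could look, not the mechanism used in this PR: retryInit, flakyInit and runDemo are hypothetical names, and the OIDC call is simulated by a function that fails a fixed number of times.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryInit calls initOIDC until it succeeds or ctx is done, waiting
// interval between attempts. It returns the number of attempts made.
func retryInit(ctx context.Context, interval time.Duration, initOIDC func() error) (int, error) {
	attempts := 0
	for {
		attempts++
		if err := initOIDC(); err == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(interval):
			// Wait elapsed; try again.
		}
	}
}

// flakyInit simulates an OIDC provider that is unreachable for the
// first `failures` calls and healthy afterwards.
func flakyInit(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("OIDC provider unreachable")
		}
		return nil
	}
}

// runDemo retries for at most 200ms with a 10ms pause between attempts.
func runDemo(failures int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return retryInit(ctx, 10*time.Millisecond, flakyInit(failures))
}

func main() {
	attempts, err := runDemo(2)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

In a real service the loop would typically run in its own goroutine so that startup is not blocked, and a backoff policy may be preferable to a fixed interval.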
Adds an onlyStartIfOIDCIsAvailable config which controls this behavior.
How was this patch tested?
A writeup is here which shows the "good" flow when onlyStartIfOIDCIsAvailable is enabled and OIDC is unhealthy for a period: https://gist.github.com/ddl-rliu/4c09862404f46a5adbc451025160e0eb
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link