When the ACME server takes more than 60 seconds to sign a cert, I've noticed openshift-acme gets into an odd state that it can't recover from without manual intervention. We have this happen somewhat regularly.
What happened:
Cert renewal was started automatically. The authorizations were validated, the order moved to the ready state, and openshift-acme submitted the signing request. One minute later, openshift-acme errored out:
I0320 12:18:15.052950 1 route.go:650] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" is in "ready" state
I0320 12:18:15.053020 1 route.go:1070] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" successfully validated
I0320 12:19:14.934962 1 route.go:498] Finished syncing Route "webapp/webapp.example.com"
E0320 12:19:14.935082 1 route.go:1308] webapp/webapp.example.com failed with : can't create cert order: context deadline exceeded
I0320 12:19:14.940581 1 route.go:496] Started syncing Route "webapp/webapp.example.com"
I0320 12:19:14.941364 1 route.go:563] Route "webapp/webapp.example.com" needs new certificate: Proactive renewal
I0320 12:19:14.941877 1 route.go:607] Using ACME client with DirectoryURL "https://acme-server.example.com/acme/v2/directory"
I0320 12:19:15.045235 1 route.go:650] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" is in "valid" state
I0320 12:19:15.045285 1 route.go:498] Finished syncing Route "webapp/webapp.example.com"
I0320 12:19:24.631587 1 reflector.go:432] k8s.io/[email protected]/tools/cache/reflector.go:108: Watch close - *v1.Service total 0 items received
I0320 12:19:56.720177 1 reflector.go:432] k8s.io/[email protected]/tools/cache/reflector.go:108: Watch close - *v1.ReplicaSet total 0 items received
I0320 12:19:56.796697 1 reflector.go:338] k8s.io/[email protected]/tools/cache/reflector.go:108: watch of *v1.ReplicaSet ended with: The resourceVersion for the provided watch is too old.
I0320 12:19:57.796935 1 reflector.go:188] Listing and watching *v1.ReplicaSet from k8s.io/[email protected]/tools/cache/reflector.go:108
The cert is usually signed shortly after this and the order is set to valid on the ACME server. However, the acme.openshift.io/status annotation on the route still lists the order status as pending.
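For what it's worth, that "context deadline exceeded" looks like the context wrapping the finalize call expiring before the CA finishes signing. The following is only a guess sketched against the public golang.org/x/crypto/acme API, not openshift-acme's actual code; the 60-second value and all names are assumptions on my part:

```go
// Hedged sketch only, not openshift-acme code: it shows how a fixed deadline
// around order finalization surfaces "context deadline exceeded" when the CA
// is slow, even though the order may still reach "valid" on the server later.
package sketch

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/crypto/acme"
)

func finalizeWithDeadline(client *acme.Client, order *acme.Order, csr []byte) error {
	// Assumed ~60s timeout, matching the one-minute gap seen in the logs.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// CreateOrderCert blocks until the order is finalized or ctx expires.
	_, _, err := client.CreateOrderCert(ctx, order.FinalizeURL, csr, true)
	if err != nil {
		// With a slow CA this is where an error like
		// "can't create cert order: context deadline exceeded" comes from.
		return fmt.Errorf("can't create cert order: %v", err)
	}
	return nil
}
```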
After future restarts of the pod, we'll see log lines like:
I0418 14:56:36.907743 1 route.go:563] Route "webapp/webapp.example.com" needs new certificate: In renewal period
I0418 14:56:36.907960 1 route.go:559] Route "webapp/www.webapp.example.com" doesn't need new certificate.
I0418 14:56:36.908105 1 route.go:607] Using ACME client with DirectoryURL "https://acme-server.example.com/acme/v2/directory"
I0418 14:56:36.908137 1 route.go:498] Finished syncing Route "webapp/www.webapp.example.com"
I0418 14:56:37.047642 1 route.go:650] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" is in "valid" state
I0418 14:56:37.047704 1 route.go:498] Finished syncing Route "webapp/webapp.example.com"
What you expected to happen:
I expected openshift-acme to retry after the timeout. It could either try to download the signed cert or even throw out the previous order and start over. Either option would be preferable to the current situation.
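To make that concrete, here is a hedged sketch of the kind of recovery I mean, written against the public golang.org/x/crypto/acme API rather than openshift-acme's internals; the function name, the orderURL parameter, and the requeue comments are placeholders:

```go
// Hedged sketch of the desired retry behaviour: re-read the order from the
// ACME server instead of trusting the stale status in the route annotation,
// then act on what the server actually reports.
package sketch

import (
	"context"
	"fmt"

	"golang.org/x/crypto/acme"
)

func recoverOrder(ctx context.Context, client *acme.Client, orderURL string) ([][]byte, error) {
	o, err := client.GetOrder(ctx, orderURL)
	if err != nil {
		return nil, err // transient error: requeue the route and retry later
	}

	switch o.Status {
	case acme.StatusValid:
		// The CA finished signing after the timeout; just download the cert.
		return client.FetchCert(ctx, o.CertURL, true)
	case acme.StatusReady, acme.StatusProcessing:
		// Finalization never completed or is still running: requeue and
		// poll / re-finalize on the next sync instead of giving up.
		return nil, fmt.Errorf("order %q is still %q, requeueing", orderURL, o.Status)
	default:
		// Invalid or expired: throw out the old order and start a new one.
		return nil, fmt.Errorf("order %q is %q, starting over", orderURL, o.Status)
	}
}
```

Either branch would get the route unstuck without the manual annotation cleanup described below.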
How to reproduce it (as minimally and precisely as possible):
Have a cert signing take over a minute.
Unfortunately I can't make our ACME server publicly available. You may need to add a sleep or an artificially short timeout to test this against Let's Encrypt.
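Lacking access to a slow CA, one option (untested on my side) is to shrink the finalize deadline so even Let's Encrypt's normal latency exceeds it, reusing the client, order, and csr placeholders from the sketch above:

```go
// Hypothetical reproduction aid only: with an artificially short deadline,
// even a fast CA should trip the same "context deadline exceeded" path while
// the order still turns "valid" on the server shortly afterwards.
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()
_, _, err := client.CreateOrderCert(ctx, order.FinalizeURL, csr, true)
// expect err to wrap context.DeadlineExceeded here
```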
Anything else we need to know?:
Whenever this happens, we can resolve the issue by removing the acme.openshift.io/status annotation from the affected route. It'd be nice to not have to take that manual step.
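Concretely, the manual workaround amounts to deleting that annotation so the controller rebuilds its state on the next sync; with the namespace and route name taken from the logs above (substitute your own), that is something like:

```
oc -n webapp annotate route webapp.example.com acme.openshift.io/status-
```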
We have a 3rd-party ACME server that occasionally takes multiple minutes to sign a cert. The duration of the signing process is outside of our control.