A non-leader unit stuck in "awaiting for member to start" #560
Comments
From another test run that hit the same error, on the unit that fails to connect we have this log message showing the connection being reset by peer while trying to connect. This connection reset happens once on the other units while they are coming up, but it is repeated ad nauseam on the failed unit, which never connects. I'm not seeing anything else in the logs that points to a service failing to start, but it seems like PostgreSQL doesn't start locally, so it can't report as healthy.
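For illustration, here is a minimal sketch (not the charm's actual code) of the kind of readiness poll that would behave this way: the local Patroni REST API is queried repeatedly, and on the stuck unit every attempt ends in a connection error. The endpoint, port, and retry parameters are assumptions based on Patroni defaults.

```python
import time

import requests

# Assumed Patroni default REST API endpoint; the charm's real check may differ.
PATRONI_HEALTH_URL = "http://localhost:8008/health"


def member_started(retries: int = 10, delay: float = 3.0) -> bool:
    """Poll the local Patroni health endpoint until it answers or we give up."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(PATRONI_HEALTH_URL, timeout=5)
            if response.status_code == 200:
                return True
        except requests.exceptions.ConnectionError as err:
            # Healthy units hit this once or twice while coming up; on the
            # stuck unit every attempt fails with "connection reset by peer".
            print(f"attempt {attempt}: {err}")
        time.sleep(delay)
    return False
```

If PostgreSQL never starts locally, a poll like this never succeeds, which would match the unit staying in "awaiting for member to start".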
Hi, @jeffreychang911 and @asbalderson! Thanks for the report. This issue was scheduled for the next pulse.
Hi, @jeffreychang911 and @asbalderson! Do you have any environment we could access to reproduce this issue? I tried both on a VM and a PS6 model but couldn't reproduce it.
I checked our test logs; this issue only happened twice in July with rev 281. We haven't seen it in the last 90+ runs since.
We have another run with a similar symptom: a non-leader node stuck in "awaiting for member to start".
unit-postgresql-k8s-1: 2024-09-06 00:29:09 ERROR unit.postgresql-k8s/1.juju-log Uncaught exception while in charm code:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Just for the record, the same issue was reported in https://warthogs.atlassian.net/browse/DPE-4589
We are not seeing SSLErrors recently. |
Steps to reproduce
Expected behavior
The postgresql-k8s charm should settle shortly after deployment.
Actual behavior
juju status showed one unit stuck in the waiting state until the run timed out after 1 hour (a sketch of this wait follows the status output below).
Unit Workload Agent Address Ports Message
data-integrator/0* active idle 192.168.254.204
postgresql-k8s/0 waiting executing 192.168.252.201 awaiting for member to start
postgresql-k8s/1* active idle 192.168.253.201 Primary
postgresql-k8s/2 active idle 192.168.254.203
self-signed-certificates/0* active idle 192.168.252.200
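For context, here is a hedged sketch of how a test run could wait for the deployment to settle and hit a 1-hour timeout, assuming python-libjuju is used; the actual SolQA harness may use different tooling and parameters.

```python
import asyncio

from juju.model import Model


async def wait_for_postgresql() -> None:
    model = Model()
    await model.connect()  # connect to the currently selected model
    try:
        # Wait up to one hour for postgresql-k8s to reach active/idle; in the
        # failing runs this times out because postgresql-k8s/0 stays in
        # "waiting: awaiting for member to start".
        await model.wait_for_idle(
            apps=["postgresql-k8s"],
            status="active",
            timeout=60 * 60,
        )
    finally:
        await model.disconnect()


asyncio.run(wait_for_postgresql())
```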
Only one ERROR was found in the juju debug-log:
unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)
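The RetryError wrapping a ConnectionError suggests the certificate push is retried a fixed number of times and the workload container never becomes reachable. Below is a minimal sketch of how tenacity produces exactly this error shape, assuming a retry-decorated push helper; the function name and retry parameters are illustrative, not taken from the charm.

```python
from tenacity import RetryError, retry, stop_after_attempt, wait_fixed


@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
def push_tls_certificates() -> None:
    """Stand-in for pushing certificate files into the workload container."""
    # Simulate the container API being unreachable on every attempt.
    raise ConnectionError("connection refused")


try:
    push_tls_certificates()
except RetryError as err:
    # Prints something like:
    # Cannot push TLS certificates: RetryError(<Future at 0x... state=finished raised ConnectionError>)
    print(f"Cannot push TLS certificates: {err!r}")
```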
Versions
Operating system: Jammy
Juju CLI: 3.5.2
Juju agent: 3.5.2
Charm revision: postgresql-k8s charm rev 281
Charmed Kubernetes 1.30/beta (expected to be promoted to 1.30/stable soon without changes).
Log output
Juju debug log:
unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)
Additional context
This was found in a SolQA run: https://solutions.qa.canonical.com/testruns/5dc43cf9-2211-4b4c-9a69-a39d4d61176e
Crashdump - https://oil-jenkins.canonical.com/artifacts/5dc43cf9-2211-4b4c-9a69-a39d4d61176e/generated/generated/postgresql-k8s/crashdump-2024-07-11-08.49.08.tar.gz