A non-leader unit stuck in "awaiting for member to start" #560
Comments
From another test run that hit the same error, on the unit that fails to connect we have this log message showing the connection being reset by peer while trying to connect. This connection reset happens once on the other units while they are coming up, but it is repeated ad nauseam on the failed unit, which never connects. I'm not seeing anything else in the logs that points to a service failing to start, but it seems like PostgreSQL doesn't start locally, so it can't report as healthy.
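For illustration, here is a minimal sketch (not the charm's actual code) of the kind of readiness poll that would behave this way: the local Patroni REST API is queried repeatedly, and on the stuck unit every attempt ends in a connection error. The endpoint, port, and retry parameters are assumptions based on Patroni defaults.

```python
import time

import requests

# Assumed Patroni default REST API endpoint; the charm's real check may differ.
PATRONI_HEALTH_URL = "http://localhost:8008/health"


def member_started(retries: int = 10, delay: float = 3.0) -> bool:
    """Poll the local Patroni health endpoint until it answers or we give up."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(PATRONI_HEALTH_URL, timeout=5)
            if response.status_code == 200:
                return True
        except requests.exceptions.ConnectionError as err:
            # Healthy units hit this once or twice while coming up; on the
            # stuck unit every attempt fails with "connection reset by peer".
            print(f"attempt {attempt}: {err}")
        time.sleep(delay)
    return False
```

If PostgreSQL never starts locally, a poll like this never succeeds, which would match the unit staying in "awaiting for member to start".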
Hi, @jeffreychang911 and @asbalderson! Thanks for the report. This issue was scheduled for the next pulse.
Hi, @jeffreychang911 and @asbalderson! Do you have any environment we could access to reproduce this issue? I tried both on a VM and a PS6 model but couldn't reproduce it.
I checked our test logs; this issue only happened twice in July with rev 281. We haven't seen it in the last 90+ runs since.
We have another run with a similar symptom: a non-leader node stuck in "awaiting for member to start".
unit-postgresql-k8s-1: 2024-09-06 00:29:09 ERROR unit.postgresql-k8s/1.juju-log Uncaught exception while in charm code:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Just for the record, the same issue was reported in https://warthogs.atlassian.net/browse/DPE-4589
We are not seeing SSLErrors recently. |
Steps to reproduce
Expected behavior
The postgresql-k8s charm should settle shortly after deployment.
Actual behavior
juju status showed one unit stuck in the waiting state until the run timed out after 1 hour (a sketch of this wait follows the status output below).
Unit Workload Agent Address Ports Message
data-integrator/0* active idle 192.168.254.204
postgresql-k8s/0 waiting executing 192.168.252.201 awaiting for member to start
postgresql-k8s/1* active idle 192.168.253.201 Primary
postgresql-k8s/2 active idle 192.168.254.203
self-signed-certificates/0* active idle 192.168.252.200
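For context, here is a hedged sketch of how a test run could wait for the deployment to settle and hit a 1-hour timeout, assuming python-libjuju is used; the actual SolQA harness may use different tooling and parameters.

```python
import asyncio

from juju.model import Model


async def wait_for_postgresql() -> None:
    model = Model()
    await model.connect()  # connect to the currently selected model
    try:
        # Wait up to one hour for postgresql-k8s to reach active/idle; in the
        # failing runs this times out because postgresql-k8s/0 stays in
        # "waiting: awaiting for member to start".
        await model.wait_for_idle(
            apps=["postgresql-k8s"],
            status="active",
            timeout=60 * 60,
        )
    finally:
        await model.disconnect()


asyncio.run(wait_for_postgresql())
```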
Only one ERROR was found in the juju debug-log:
unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)
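The RetryError wrapping a ConnectionError suggests the certificate push is retried a fixed number of times and the workload container never becomes reachable. Below is a minimal sketch of how tenacity produces exactly this error shape, assuming a retry-decorated push helper; the function name and retry parameters are illustrative, not taken from the charm.

```python
from tenacity import RetryError, retry, stop_after_attempt, wait_fixed


@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
def push_tls_certificates() -> None:
    """Stand-in for pushing certificate files into the workload container."""
    # Simulate the container API being unreachable on every attempt.
    raise ConnectionError("connection refused")


try:
    push_tls_certificates()
except RetryError as err:
    # Prints something like:
    # Cannot push TLS certificates: RetryError(<Future at 0x... state=finished raised ConnectionError>)
    print(f"Cannot push TLS certificates: {err!r}")
```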
Versions
Operating system: Jammy
Juju CLI: 3.5.2
Juju agent: 3.5.2
Charm revision: postgresql-k8s charm rev 281
Charmed Kubernetes 1.30/beta (expected to be promoted to 1.30/stable soon without changes).
Log output
Juju debug log:
unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)
Additional context
This was found in a SolQA run: https://solutions.qa.canonical.com/testruns/5dc43cf9-2211-4b4c-9a69-a39d4d61176e
Crashdump - https://oil-jenkins.canonical.com/artifacts/5dc43cf9-2211-4b4c-9a69-a39d4d61176e/generated/generated/postgresql-k8s/crashdump-2024-07-11-08.49.08.tar.gz