Cluster of 3 units has no primary unit #796

kelkawi-a · 2024-12-03T17:02:44Z

Steps to reproduce

Deploy 3 units of the postgresql-k8s charm (channel 14/stable, revision 281).

Expected behavior

The units remain in an active state

Actual behavior

After running fine for a while (i.e. all three units were active and functional), two of the three units became stuck in a waiting/maintenance state with the following status:

postgresql-k8s/0    maintenance    idle    reinitialising replica
postgresql-k8s/1    active         idle    Primary
postgresql-k8s/2*   maintenance    idle    reinitialising replica

There were no reported cluster outages or node restarts. On further debugging, it appears that the Patroni K8s service disappeared for an unknown reason. I do not have logs for that, but during debugging the cluster could not identify a primary: all three units were reported as replicas.
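For anyone trying to confirm the same "no primary" state, a minimal sketch is to ask Patroni's REST API for the cluster view and check whether any member holds the leader role. The helper name `has_leader` is hypothetical, and the check assumes the `/cluster` response marks the leader with `"role": "leader"`, which may differ between Patroni versions.

```shell
# has_leader: read a Patroni /cluster JSON document on stdin and succeed
# (exit status 0) only if some member is reported with the leader role.
# Assumption: the leader appears as "role": "leader" in the JSON.
has_leader() {
  grep -Eq '"role"[[:space:]]*:[[:space:]]*"leader"'
}
```

It could be used from inside a unit as `curl -s -k https://<unit_ip>:8008/cluster | has_leader && echo "leader present" || echo "no leader"`. A plain `grep` is used rather than a JSON parser only so the sketch does not depend on tools that may be missing from the container.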

It is worth noting that, as part of the recovery process, we tried to re-initialize each of the units, but could not because of the following error:

Cluster has no leader, can not reinitialize

We also tried a failover, with the following output:

root@postgresql-k8s-1:/# curl -s -k https://<unit_ip>:8008/failover -X POST -d '{"candidate":"postgresql-k8s-1"}'

failover is not possible: no good candidates have been found

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.6.0-ubuntu-amd64

Juju agent: 3.4.4

Charm revision: 281, channel 14/stable

kubectl:
Client Version: v1.31.3
Server Version: v1.26.15

Log output

Juju debug log:

Patroni logs:

Unit 1:

2024-12-02 06:07:30 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:40 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:40 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:00 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:00 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:10 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:10 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:20 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:20 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:30 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:20 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:30 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:30 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:40 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:40 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:50 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:50 UTC [1481760]: INFO: waiting for leader to bootstrap

Unit 2:

2024-12-02 06:07:43 UTC [1463028]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-12-02 06:07:43 UTC [1463028]: INFO: Error communicating with PostgreSQL. Will try again later 
2024-12-02 06:07:43 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 
2024-12-02 06:07:43 UTC [1463028]: INFO: Still starting up as a standby. 
2024-12-02 06:07:43 UTC [1463028]: INFO: establishing a new patroni connection to the postgres cluster 
2024-12-02 06:07:43 UTC [1463028]: INFO: establishing a new patroni connection to the postgres cluster 
2024-12-02 06:07:43 UTC [1463028]: WARNING: Retry got exception: connection problems 
2024-12-02 06:07:43 UTC [1463028]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-12-02 06:07:13 UTC [1463028]: INFO: restarting after failure in progress 
2024-12-02 06:07:23 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 
2024-12-02 06:07:23 UTC [1463028]: INFO: not healthy enough for leader race 
2024-12-02 06:07:23 UTC [1463028]: INFO: restarting after failure in progress 
2024-12-02 06:07:33 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 
2024-12-02 06:07:33 UTC [1463028]: INFO: not healthy enough for leader race 
2024-12-02 06:07:33 UTC [1463028]: INFO: restarting after failure in progress

Unit 3:

2024-12-02 06:07:30 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:30 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:40 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:40 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:50 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:00 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:00 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:10 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:10 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:20 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:20 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:20 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:06:20 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:30 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:06:30 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:40 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:06:40 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:50 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2
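The excerpts above all report `Lock owner: None`. To check this across a longer log without reading every line, a small sketch is to extract the distinct lock owners Patroni printed. The function name `lock_owners` is hypothetical, and the pattern assumes the `Lock owner: <name>; I am <member>` line format seen in these logs.

```shell
# lock_owners: read Patroni log lines on stdin and print each distinct
# value that appeared after "Lock owner:", one per line.
# Lines that do not contain a lock-owner message are ignored.
lock_owners() {
  sed -n 's/.*Lock owner: \([^;]*\); I am .*/\1/p' | sort -u
}
```

If the only output is `None`, no member held the leader lock during the logged period; any other value is the name of a member that held it at some point.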


syncronize-issues-to-jira · 2024-12-03T17:02:53Z

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6142.

This message was autogenerated

kelkawi-a added the bug label Dec 3, 2024
