Cluster of 3 units has no primary unit #796

Open
kelkawi-a opened this issue Dec 3, 2024 · 1 comment
Labels
bug Something isn't working

kelkawi-a commented Dec 3, 2024

Steps to reproduce

  1. Deploy 3 units of postgresql-k8s charm, channel 14/stable revision 281

Expected behavior

The units remain in an active state

Actual behavior

After running fine for a while (i.e. all three units were active and functional), two of the three units became stuck in a waiting/maintenance state with the following status:

postgresql-k8s/0   maintenance  idle  reinitialising replica
postgresql-k8s/1   active       idle  Primary
postgresql-k8s/2*  maintenance  idle  reinitialising replica

There were no reported outages to the cluster or node restarts. Upon further debugging, it seems that the Patroni K8s service disappeared for an unknown reason. I do not have the logs for it, but during the debugging process, the cluster itself could not identify a primary: all three units were identified as replicas.
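To make the "all replicas, no primary" observation concrete, here is a minimal sketch that summarises member roles from a payload shaped like the one Patroni's REST API returns at `/cluster`. The sample payload is hypothetical and only mirrors the state described above; the role names (`leader`, `standby_leader`, `replica`) are assumptions about what the deployed Patroni version reports.

```python
from collections import Counter


def summarize_roles(cluster):
    """Count members by role in a Patroni /cluster-style payload."""
    members = cluster.get("members", [])
    return Counter(m.get("role", "unknown") for m in members)


def has_leader(cluster):
    """Return True if any member reports a leader-type role (assumed names)."""
    leader_roles = {"leader", "standby_leader"}
    return any(m.get("role") in leader_roles for m in cluster.get("members", []))


# Hypothetical payload mirroring the reported state: three replicas, no leader.
sample = {
    "members": [
        {"name": "postgresql-k8s-0", "role": "replica"},
        {"name": "postgresql-k8s-1", "role": "replica"},
        {"name": "postgresql-k8s-2", "role": "replica"},
    ]
}

roles = summarize_roles(sample)
print(dict(roles), "leader present:", has_leader(sample))
```

In a live cluster the payload would come from a GET request to a unit's Patroni port instead of the hard-coded `sample`.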

It is worth noting that as part of the recovery process, we tried to re-initialize each of the units, but could not due to the following:

Cluster has no leader, can not reinitialize

We also tried doing a failover with the following output:

root@postgresql-k8s-1:/# curl -s -k https://<unit_ip>:8008/failover -X POST -d '{"candidate":"postgresql-k8s-1"}'

failover is not possible: no good candidates have been found
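For scripting the same call, the following is a rough Python equivalent of the curl command above. It only builds the request and does not send it. The host `unit-ip.invalid` is a placeholder for the elided `<unit_ip>`, and the helper name `build_failover_request` is made up for illustration.

```python
import json
import urllib.request


def build_failover_request(base_url, candidate):
    """Build (but do not send) a POST to Patroni's /failover endpoint."""
    body = json.dumps({"candidate": candidate}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/failover",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# "unit-ip.invalid" stands in for the <unit_ip> that is elided in the report.
req = build_failover_request("https://unit-ip.invalid:8008", "postgresql-k8s-1")
print(req.get_method(), req.full_url, req.data.decode())
```

Actually sending it would additionally need a TLS context with certificate verification disabled to match curl's `-k` flag, which is only reasonable inside a trusted debugging session.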

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.6.0-ubuntu-amd64

Juju agent: 3.4.4

Charm revision: 281, channel 14/stable

kubectl:
Client Version: v1.31.3
Server Version: v1.26.15

Log output

Juju debug log:

Patroni logs:

Unit 1 (postgresql-k8s-0):

2024-12-02 06:07:30 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:40 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:40 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:00 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:00 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:10 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:10 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:20 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:07:20 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:30 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:20 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:30 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:30 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:40 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:40 UTC [1481760]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:50 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 
2024-12-02 06:06:50 UTC [1481760]: INFO: waiting for leader to bootstrap 

Unit 2 (postgresql-k8s-1):

2024-12-02 06:07:43 UTC [1463028]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-12-02 06:07:43 UTC [1463028]: INFO: Error communicating with PostgreSQL. Will try again later 
2024-12-02 06:07:43 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 
2024-12-02 06:07:43 UTC [1463028]: INFO: Still starting up as a standby. 
2024-12-02 06:07:43 UTC [1463028]: INFO: establishing a new patroni connection to the postgres cluster 
2024-12-02 06:07:43 UTC [1463028]: INFO: establishing a new patroni connection to the postgres cluster 
2024-12-02 06:07:43 UTC [1463028]: WARNING: Retry got exception: connection problems 
2024-12-02 06:07:43 UTC [1463028]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-12-02 06:07:13 UTC [1463028]: INFO: restarting after failure in progress 
2024-12-02 06:07:23 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 
2024-12-02 06:07:23 UTC [1463028]: INFO: not healthy enough for leader race 
2024-12-02 06:07:23 UTC [1463028]: INFO: restarting after failure in progress 
2024-12-02 06:07:33 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 
2024-12-02 06:07:33 UTC [1463028]: INFO: not healthy enough for leader race 
2024-12-02 06:07:33 UTC [1463028]: INFO: restarting after failure in progress 

Unit 3 (postgresql-k8s-2):

2024-12-02 06:07:30 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:30 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:40 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:40 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:50 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:00 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:00 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:10 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:10 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:07:20 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:07:20 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:20 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:06:20 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:30 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:06:30 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:40 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
2024-12-02 06:06:40 UTC [534630]: INFO: waiting for leader to bootstrap 
2024-12-02 06:06:50 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 
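
The excerpts above can be checked mechanically. This is a minimal sketch, assuming the log line format shown in this report, that extracts which lock owner each member reported. The `excerpt` list holds a few lines copied from the logs above; the function name `lock_owners` is made up for illustration.

```python
import re

# Matches lines such as "... INFO: Lock owner: None; I am postgresql-k8s-0"
LINE_RE = re.compile(r"INFO: Lock owner: (?P<owner>\S+); I am (?P<member>\S+)")


def lock_owners(lines):
    """Map each member name to the set of lock owners it reported."""
    seen = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            seen.setdefault(match.group("member"), set()).add(match.group("owner"))
    return seen


# A few lines copied from the three log excerpts above.
excerpt = [
    "2024-12-02 06:07:40 UTC [1481760]: INFO: Lock owner: None; I am postgresql-k8s-0 ",
    "2024-12-02 06:07:43 UTC [1463028]: INFO: Lock owner: None; I am postgresql-k8s-1 ",
    "2024-12-02 06:07:30 UTC [534630]: INFO: Lock owner: None; I am postgresql-k8s-2 ",
    "2024-12-02 06:07:40 UTC [534630]: INFO: waiting for leader to bootstrap ",
]

owners = lock_owners(excerpt)
# True when every member only ever saw an unowned leader lock.
leaderless = all(seen == {"None"} for seen in owners.values())
print(owners, leaderless)
```

Run against the full logs, a result where every member maps only to `None` matches the report that no unit held the leader lock during this window.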

kelkawi-a added the bug label Dec 3, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6142.

This message was autogenerated
