raft cluster doesn't switch #164

Open
AlexFavre opened this issue Aug 24, 2022 · 5 comments

Hi,

Describe the bug
The Raft leader does not switch, and the Postgres leader ends up stuck in read-only mode. We have four nodes in the raft cluster (two dedicated raft nodes and two that also run Postgres).

To Reproduce
Run with a raft data directory that has no free space left.

Expected behavior
The raft directory of our leader filled up (no space left on device), but instead of electing a new leader the raft cluster kept waiting on the previous one, and the Patroni cluster ended up in an error state (leader in read-only mode, replicas unable to start).

To fix the immediate issue we increased the space available to the raft directory, but we don't think that is the right solution.
How can we purge the raft files to save space?
Why doesn't the raft cluster switch to another leader?

Environment

Patroni version: 2.1.4
PostgreSQL version: 13.7
DCS (and its version): pure Raft, version 0.3.8

Extract from the Patroni configuration file:

cat /etc/patroni/patroni.yml
scope: PGSQLGrafanaProd
namespace: /PGSQLGrafana/
name: PGSQLGrafanaProd1

raft:
data_dir: /apps/pgsql/raft
self_addr: 1.1.1.1:5010
partner_addrs: ['1.1.1.2:5010','1.1.1.3:5010','1.1.1.4:5010']
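
For context, and only as a rough sketch (Patroni's actual code in patroni/dcs/raft.py wraps pysyncobj in its own KVStoreTTL class), the raft section above maps approximately onto a standalone pysyncobj setup like this, with the journal and dump files kept under data_dir:

# Rough standalone equivalent of the raft: section above (sketch only).
from pysyncobj import SyncObj, SyncObjConf

conf = SyncObjConf(
    # Files kept under data_dir and named after self_addr:
    journalFile='/apps/pgsql/raft/1.1.1.1:5010.journal',
    fullDumpFile='/apps/pgsql/raft/1.1.1.1:5010.dump',
    dynamicMembershipChange=True,
)

# self_addr plus partner_addrs from the YAML above.
node = SyncObj('1.1.1.1:5010',
               ['1.1.1.2:5010', '1.1.1.3:5010', '1.1.1.4:5010'],
               conf=conf)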

Have you checked Patroni logs?
Before the crash:
Aug 22 20:39:51 s01vl9950891 systemd[1]: patroni.service: Failed with result 'signal'.
Aug 22 20:39:51 s01vl9950891 systemd[1]: patroni.service: Main process exited, code=killed, status=11/SEGV
Aug 22 20:39:46 s01vl9950891 patroni[1906]: 2022-08-22 20:39:46,414 INFO: no action. I am (PGSQLGrafanaProd1), the leader with the lock
Aug 22 20:39:45 s01vl9950891 patroni[1906]: 2022-08-22 20:39:45,672 WARNING: Failed to store full dump
Aug 22 20:39:40 s01vl9950891 patroni[1906]: 2022-08-22 20:39:40,751 WARNING: Failed to store full dump
Aug 22 20:39:36 s01vl9950891 patroni[1906]: 2022-08-22 20:39:36,413 INFO: no action. I am (PGSQLGrafanaProd1), the leader with the lock
Aug 22 20:39:35 s01vl9950891 patroni[1906]: 2022-08-22 20:39:35,927 WARNING: Failed to store full dump
Aug 22 20:39:30 s01vl9950891 patroni[1906]: 2022-08-22 20:39:30,973 WARNING: Failed to store full dump

When we tried to run the daemon manually:
Aug 23 09:47:46 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:46,564 WARNING: Loop time exceeded, rescheduling immediately.
Aug 23 09:47:46 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:46,562 INFO: starting after demotion in progress
Aug 23 09:47:46 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,561 INFO: Lock owner: PGSQLGrafanaProd1; I am PGSQLGrafanaProd1
Aug 23 09:47:44 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - accepting connections
Aug 23 09:47:42 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:41 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:40 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:39 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:38 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:37 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:36 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:36 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections
Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:35.222 CEST [1212938] HINT: Future log output will appear in directory "/apps/pgsql/log".
Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:35.222 CEST [1212938] LOG: redirecting log output to logging collector process
Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - no response
Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:35,064 INFO: postmaster pid=1212938
Aug 23 09:47:34 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,561 WARNING: Loop time exceeded, rescheduling immediately.
Aug 23 09:47:34 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,560 INFO: closed patroni connection to the postgresql cluster
Aug 23 09:47:34 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,559 INFO: demoted self because failed to update leader lock in DCS
Aug 23 09:47:32 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:32,324 INFO: Demoting self (immediate-nolock)
Aug 23 09:47:32 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:32,324 ERROR: failed to update leader lock
Aug 23 09:47:32 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:22,322 INFO: Lock owner: PGSQLGrafanaProd1; I am PGSQLGrafanaProd1
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:22,205 INFO: establishing a new patroni connection to the postgres cluster
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:22,100 INFO: No PostgreSQL configuration items changed, nothing to reload.
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: OSError: [Errno 28] No space left on device
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.__incomingTransmissionFile.close()
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/serializer.py", line 193, in setTransmissionData
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: if self.__serializer.setTransmissionData(serialized):
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/syncobj.py", line 896, in __onMessageReceived
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._onMessageReceivedCallback(node, message)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/transport.py", line 107, in _onMessageReceived
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.__onMessageReceived(message)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/tcp_connection.py", line 173, in __processConnection
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.__descrToCallbacks[descr](descr, eventMask)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/poller.py", line 97, in poll
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._poller.poll(timeToWait)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/syncobj.py", line 616, in _onTick
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: super(DynMemberSyncObj, self)._onTick(timeToWait)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib/python3.6/site-packages/patroni/dcs/raft.py", line 84, in _onTick
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: super(KVStoreTTL, self)._onTick(timeToWait)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib/python3.6/site-packages/patroni/dcs/raft.py", line 238, in _onTick
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._onTick(timeToWait)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/syncobj.py", line 532, in doTick
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.doTick(self.conf.autoTickPeriod)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib/python3.6/site-packages/patroni/dcs/raft.py", line 248, in _autoTickThread
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._target(*self._args, **self._kwargs)
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib64/python3.6/threading.py", line 867, in run
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.run()
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: Traceback (most recent call last):
Aug 23 09:47:22 s01vl9950891 patroni[1212815]: Exception in thread Thread-2:
Aug 23 09:47:17 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:17,084 INFO: waiting on raft
Aug 23 09:47:12 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:12,071 INFO: waiting on raft
Aug 23 09:47:07 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:07,060 INFO: waiting on raft
Aug 23 09:47:02 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:02,049 INFO: waiting on raft

bakwc (Owner) commented Aug 24, 2022

Could you please provide more details?

  1. What size was the raft file? Does it grow indefinitely until you run out of space?
  2. You have 4 nodes in the raft cluster. One node ran out of space while the other three were fine (no space problems), but raft did not switch the leader to another node. Do I understand correctly?

AlexFavre (Author) commented Aug 24, 2022

What size was the raft file? Does it grow indefinitely until you run out of space? It seems so.
You have 4 nodes in the raft cluster. One node ran out of space while the other three were fine (no space problems), but raft did not switch the leader to another node, do I understand correctly? Yes.

The leader has this:
ls -allh /apps/pgsql/raft/
total 907M
drwxr-xr-x 2 postgres postgres 4.0K Aug 24 18:32 .
drwxr-x--- 5 postgres postgres 41 Sep 20 2021 ..
-rw-r--r-- 1 postgres postgres 2.3K Aug 24 18:30 1.1.1.1:5010.dump
-rw-r--r-- 1 postgres postgres 1.0G Aug 24 18:32 1.1.1.1:5010.journal
-rw-r--r-- 1 postgres postgres 34 Aug 24 18:32 1.1.1.1:5010.journal.meta

The second server has this:
ls -allh /apps/pgsql/raft/
total 76K
drwxr-xr-x 2 postgres postgres 4.0K Aug 24 18:34 .
drwxr-x--- 6 postgres postgres 55 Feb 11 2022 ..
-rw-r--r-- 1 postgres postgres 2.3K Aug 24 18:32 1.1.1.2:5010.dump
-rw-r--r-- 1 postgres postgres 64K Aug 24 18:33 1.1.1.2:5010.journal
-rw-r--r-- 1 postgres postgres 34 Aug 24 18:34 1.1.1.2:5010.journal.meta
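
To catch this earlier, here is a small watchdog sketch (a hypothetical helper, not something shipped with Patroni) that warns when a raft journal grows past a threshold:

# Hypothetical watchdog (not part of Patroni): warn when a raft journal
# file grows past a threshold.
import glob
import os

LIMIT_MB = 100  # pick a threshold that fits the volume hosting data_dir

for journal in glob.glob('/apps/pgsql/raft/*.journal'):
    size_mb = os.path.getsize(journal) / (1024 * 1024)
    if size_mb > LIMIT_MB:
        print('WARNING: %s is %.0f MB' % (journal, size_mb))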

bakwc (Owner) commented Aug 24, 2022

Thanks for the report, this is not the expected behavior. We do have log compaction; it is probably disabled in Patroni by default, I will check.
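
For reference, pysyncobj exposes its compaction thresholds roughly like this (a sketch with illustrative values, not the settings Patroni actually passes; that is exactly what needs checking). Once either threshold is reached, pysyncobj is supposed to write a full dump and truncate the journal, which should keep the *.journal file from growing without bound:

# Sketch of pysyncobj's log-compaction knobs (illustrative values only).
from pysyncobj import SyncObjConf

conf = SyncObjConf(
    journalFile='/apps/pgsql/raft/1.1.1.1:5010.journal',
    fullDumpFile='/apps/pgsql/raft/1.1.1.1:5010.dump',
    logCompactionMinEntries=5000,  # compact after this many journal entries
    logCompactionMinTime=300,      # or after this many seconds
)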

AlexFavre (Author) commented

Thanks for your answer

bakwc (Owner) commented Aug 31, 2022

For now I haven't found the reason why the file keeps growing, but I will continue investigating and fix this case.
