Stolon does not recover after temporary full disk on master keeper #912
Comments
This happened again today, but this time the disk was not full on the machine that stopped serving [...]
Logs:
The issue started [...]
I have the same issue. Slaves can't sync with basebackup.
In my situation, the master was fixed by restarting the sentinels and removing the slave keepers. For the followers, removing the keeper, restarting the sentinels, and running pg_resetwal fixed the problem.
What happened
Today we had a production outage of 11 minutes, suspectedly because:
- [...] pg_basebackup requests from follower nodes.
The downtime lasted only 11 minutes because I manually restarted the master stolon-keeper that stopped serving pg_basebackup requests.
I observed the following in our 3-node Stolon cluster composed of nodes node-4, node-5, node-6:
- node-5 temporarily ran out of disk space while being stolon master.
- [...] node-5 master, by running pg_basebackup.
- [...] node-5.
- [...] pg_basebackup, and restarting them did not help.
  pg_basebackup: initiating base backup, waiting for checkpoint to complete
- [...] pg_basebackup responses after the full-disk event, even though disk space was freed up.
- [...] Write...() functions:
What you expected to happen
Stolon should serve pg_basebackup responses when the disk is no longer full.
How to reproduce it (as minimally and precisely as possible)
I have not yet tried to reproduce it exactly, as I am not sure how to trigger the other 2 nodes trying to do pg_basebackup at the correct time.
Logs
Follower logs
Follower logs hanging in pg_basebackup: initiating base backup, waiting for checkpoint to complete:
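The hang described above could be spotted from follower logs with a small check like the following. This is only a sketch, not part of Stolon: the function name and the sample lines are hypothetical, and it assumes that pg_basebackup in verbose mode prints a "checkpoint completed" line once the master has finished the checkpoint.

```python
# Message quoted in this issue: the follower is waiting on the master's checkpoint.
WAITING = "pg_basebackup: initiating base backup, waiting for checkpoint to complete"
# Assumption: verbose pg_basebackup output reports this once the wait is over.
COMPLETED = "pg_basebackup: checkpoint completed"


def stuck_waiting_for_checkpoint(log_lines):
    """Return True if the latest base backup attempt never got past the checkpoint wait."""
    waiting = False
    for line in log_lines:
        if WAITING in line:
            waiting = True   # a (new) attempt started waiting for the checkpoint
        elif COMPLETED in line:
            waiting = False  # the master finished the checkpoint for this attempt
    return waiting


# Hypothetical log excerpt: the last attempt is still waiting.
sample = [WAITING, COMPLETED, WAITING]
print(stuck_waiting_for_checkpoint(sample))
```

A check like this cannot tell a slow checkpoint from a real hang on its own, so in practice it would need to be combined with a timeout on how long the waiting line has been the most recent output.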
Master logs
Logs of the master while it was not serving the other nodes' pg_basebackup requests:
Logs of the master keeper during the entirety of the event, from the initial No space left on device to recovery by my manual restart of the keeper:
Environment
master commit 4bb4107