psmdb-backup irregularly fails to complete due to various reasons #1637
Comments
Tonight's backup failed with
Seeing this as well. We had upgraded our clusters and operator from v1.15 -> v1.16 and these intermittent backup failures started happening daily. percona-backup-mongodb 2.3.0 -> 2.4.1
Reverted back 2.4.1 -> 2.3.0 and all my backups started working again. Clusters and operator remain on v1.16.
It is difficult to understand the root of the issue. Maybe it is some storage connection problem. I can suggest increasing numMaxRetries.
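For context, the retry count mentioned above can be tuned per backup storage in the PerconaServerMongoDB custom resource. A minimal sketch, assuming an S3 storage and recent operator versions; the storage name, bucket, and values below are placeholders, and the exact `retryer` fields may vary by operator version:

```yaml
spec:
  backup:
    storages:
      my-s3-storage:            # placeholder storage name
        type: s3
        s3:
          bucket: my-bucket     # placeholder
          region: us-east-1
          retryer:
            numMaxRetries: 10   # raise for flaky storage connections
            minRetryDelay: 30ms
            maxRetryDelay: 5m
```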
Same problem. Though in my case they're failing for the exact same reason:
Quick update: I've switched back to v2.6.0 and added CPU and memory to my node; haven't had a failure since.
@michaelswierszcz @asemarian PSMDBO 1.18.0 with PBM 2.7.0 was released yesterday. PBM 2.6.0/2.7.0 has a lot of bug fixes. |
Report
We're running several psmdb clusters in multiple kubernetes clusters. For most of them the psmdb-backup failed at least once during their lifetime. We have three psmdb clusters where the backup irregularly fails around once a week.
We found a multitude of seemingly underlying issues that we can't track down to infrastructure or faulty configuration.
We can't find indicators for when a backup will fail and can't see differences when a backup fails versus it working for multiple days back to back again.
The affected clusters differ in storage size and throughput, from around 10 MB to a few GB, but nothing larger than 100 GB.
We first noticed this problem in October 2023.
More about the problem
pbm-agent hangs
To debug the faulty pods, we open a shell to the pbm-agent container and issue `pbm` commands. Sometimes the pbm-agent is unresponsive and needs to be killed from inside the container.
Election winner does nothing
Logs of the backup-agent container state that an election has started to decide which agent is responsible for creating a backup. One gets elected and the others respect that, thus not starting a backup. Unfortunately, sometimes the elected container also does not start the backup.
Stale lock in database
Mostly in combination with the above issue, backup processes sometimes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.
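When a stale lock blocks new backups, a workaround we have seen discussed is to verify that nothing is actually running and then clear PBM's lock collections in the `admin` database. This is a sketch requiring a live cluster; the collection names come from PBM internals and may change between versions, the connection URI is a placeholder, and deleting locks while a real backup is in flight is unsafe:

```shell
# Inside the backup-agent container:
# 1. Confirm no backup/restore is actually in progress.
pbm status

# 2. Inspect and, if stale, remove the lock documents
#    ($PBM_MONGODB_URI is a placeholder for the PBM connection string).
mongosh "$PBM_MONGODB_URI" --eval '
  const admin = db.getSiblingDB("admin");
  admin.pbmLock.find().forEach(printjson);   // review before deleting
  admin.pbmLock.deleteMany({});
'
```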
Starting deadline exceeded
Other times the backup-agent logs
Error: starting deadline exceeded
also often creating a stale lock in the database.
Steps to reproduce
Versions
Anything else?
No response