psmdb-backup irregularly fails to complete due to various reasons #1637

Open
michaelSchmidMaloon opened this issue Sep 2, 2024 · 7 comments

@michaelSchmidMaloon

Report

We're running several psmdb clusters in multiple Kubernetes clusters. For most of them, the psmdb-backup has failed at least once during their lifetime. We have three psmdb clusters where the backup fails irregularly, around once a week.

We found a multitude of apparently underlying issues that we cannot trace back to infrastructure problems or faulty configuration.

We can't find any indicator of when a backup will fail, and we see no difference between a backup that fails and one that then works for multiple days back to back.

The affected clusters differ in storage size and throughput, ranging from some 10 MB to a few GB, but nothing in the range of > 100 GB.

We first noticed this problem in October 2023.

More about the problem

pbm-agent hangs

To debug the faulty pods, we open a shell in the pbm-agent container and issue pbm commands. Sometimes the pbm-agent is unresponsive and has to be killed from inside the container with:

# Send SIGINT to the pbm-agent process, whose PID is extracted from /proc/<pid>/status.
kill -2 $(grep -E "^Pid:" $(grep -l pbm-agent /proc/*[0-9]/status) | rev | cut -f1 | rev)
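
For reference, the same checks can also be run without an interactive shell; a minimal sketch, assuming a hypothetical pod name (my-cluster-rs0-0) and the operator's default backup-agent container name:

# Hypothetical pod name; the backup-agent container ships the pbm CLI.
kubectl exec -it my-cluster-rs0-0 -c backup-agent -- pbm status
kubectl exec -it my-cluster-rs0-0 -c backup-agent -- pbm logs --tail 50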

Election winner does nothing

The logs of the backup-agent container state that an election has started to determine which agent is responsible for creating a backup. One agent gets elected and the others respect that, thus not starting a backup. Unfortunately, sometimes the elected agent does not start the backup either.

Stale lock in database

Mostly in combination with the above issue, backup processes sometimes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.
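
For illustration, a sketch of how such a lock can be inspected, assuming PBM keeps its lock documents in the default admin.pbmLock / admin.pbmLockOp collections and using a hypothetical pod name and connection string; deleting lock documents by hand should only be a last resort after confirming no backup is actually running:

# Check whether PBM itself reports a held lock (hypothetical pod name).
kubectl exec -it my-cluster-rs0-0 -c backup-agent -- pbm status
# Inspect the lock documents directly (assumed default PBM metadata collections).
mongosh "mongodb://clusterAdmin:<password>@my-cluster-rs0-0:27017/admin" --eval '
  db.getSiblingDB("admin").pbmLock.find();
  db.getSiblingDB("admin").pbmLockOp.find();
  // db.getSiblingDB("admin").pbmLock.deleteMany({})  // last resort for a confirmed stale lock
'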

Starting deadline exceeded

Other times the backup-agent logs Error: starting deadline exceeded, which also often leaves a stale lock in the database.
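
The corresponding events can be pulled from the PBM log with severity filtering; a sketch, again with a hypothetical pod name:

# Show only error-severity PBM log entries (hypothetical pod name).
kubectl exec -it my-cluster-rs0-0 -c backup-agent -- pbm logs --tail 100 --severity E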

Steps to reproduce

  1. Create a psmdb cluster (3 nodes in our use case) with the psmdb-operator
  2. Configure daily backups to be written to a cloud provider (e.g. AWS S3); see the sketch after this list
  3. Wait some time
  4. psmdb-backup will fail
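
A minimal sketch of step 2, with hypothetical names throughout (cluster, storage name, bucket, region, credentials secret, schedule); the structure follows the operator's PerconaServerMongoDB custom resource and should be checked against the chart version in use:

# All values below are placeholders.
cat <<'EOF' > backup-patch.yaml
spec:
  backup:
    enabled: true
    storages:
      s3-backups:
        type: s3
        s3:
          bucket: my-backup-bucket
          region: eu-central-1
          credentialsSecret: my-cluster-backup-s3
    tasks:
      - name: daily
        enabled: true
        schedule: "0 2 * * *"
        storageName: s3-backups
EOF
kubectl patch psmdb my-cluster --type merge --patch-file backup-patch.yaml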

Versions

  1. Kubernetes v1.28.11
  2. Operator chart version 1.16.2 / 1.16.3
  3. Database 1.16.3

Anything else?

No response

@michaelSchmidMaloon
Author

Tonight's backup failed with

[...]
2024-09-02T10:07:28.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66d58445d58df8df2b89d091]
2024/09/02 10:07:35 [#################.......]  <database>.<collection>  136762220/190760640  (71.7%)
2024-09-02T10:07:35.760+0000	Mux close namespace <database>.<collection>
2024-09-02T10:07:35.760+0000	Mux finish
2024-09-02T10:07:35.760+0000	archive writer: error writing data for collection `<database>.<collection>` to disk: error writing to file: short write / io: read/write on closed pipe
2024-09-02T10:07:35.000+0000 I [backup/2024-09-02T09:24:21Z] dropping tmp collections
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] created chunk 2024-09-02T10:04:25 - 2024-09-02T10:07:35
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark RS as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark backup as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 E [backup/2024-09-02T09:24:21Z] backup: mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )
2024-09-02T10:07:36.000+0000 D [backup/2024-09-02T09:24:21Z] releasing lock

@michaelswierszcz

seeing this as well. we had upgraded our clusters and operator from v1.15 -> v1.16 and these intermittent backup failures started happening daily.

percona-backup-mongodb 2.3.0 -> 2.4.1

@michaelswierszcz

reverted from 2.4.1 -> 2.3.0 and all my backups started working again. clusters and operator remain on v1.16

@hors
Collaborator

hors commented Sep 11, 2024

Tonight's backup failed with [...] (quoting the log from the comment above)
It is difficult to understand the root cause of the issue. Maybe it is some storage connection problem. I can suggest increasing numMaxRetries.
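
For illustration, a sketch of what raising the retries could look like in the PBM storage configuration, assuming the storage.s3.retryer options are available in the PBM version in use and using placeholder bucket/region values (shown standalone for brevity; the operator manages the PBM configuration, so check where such overrides belong in your deployment):

# Placeholders throughout; credentials omitted and assumed to come from the existing secret/IAM role.
cat <<'EOF' > /tmp/pbm-storage.yaml
storage:
  type: s3
  s3:
    region: eu-central-1
    bucket: my-backup-bucket
    retryer:
      numMaxRetries: 10      # assumed default is lower; raise to ride out transient network errors
      minRetryDelay: 30ms
      maxRetryDelay: 5m
EOF
pbm config --file /tmp/pbm-storage.yaml    # run from inside the backup-agent container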

@asemarian

Same problem, though in my case the backups are failing for the exact same reason: starting deadline exceeded. The last ~7 backups have all failed in the same way. I'm using v1.16.1 of the operator and database server, running on Kubernetes v1.30.5.

@michaelswierszcz

quick update - i've switched back to v2.6.0 and added cpu and memory to my node, haven't had a failure since

@hors
Collaborator

hors commented Nov 15, 2024

@michaelswierszcz @asemarian PSMDBO 1.18.0 with PBM 2.7.0 was released yesterday. PBM 2.6.0/2.7.0 have a lot of bug fixes.
