If you need help, first search for an existing issue; if none has been filed, log into Jira and create a new issue in the OADP project.
- OADP Cheat Sheet
- OADP FAQ
- OADP Official Troubleshooting Documentation
- OADP must-gather
- Debugging Failed Backups
- Debugging Failed Restores
- Debugging OpenShift Virtualization backup/restore
- Deleting Backups
- Debugging Data Mover (OADP 1.2 or below)
- OpenShift ROSA STS and OADP installation
- Common Issues and Misconfigurations
- Credentials Not Properly Formatted
- Errors in the Velero Pod
- Errors in Backup Logs
- Backup/Restore is Stuck In Progress
- Restic - NFS Volumes and rootSquash
- Issue with Backup/Restore of DeploymentConfig using Restic
- New Restic Backup Partially Failing After Clearing Bucket
- Restic Restore Partially Failing on OCP 4.14 Due to Changed PSA Policy
- OpenShift commands
- Check for validation errors in the backup by running the following command:
oc describe backup <backupName> -n openshift-adp
- Check the Velero logs
oc logs -f deploy/velero -n openshift-adp
- If Data Mover (OADP 1.2 or below) is enabled, check the volume-snapshot-mover logs:
oc logs -f deployment.apps/volume-snapshot-mover -n openshift-adp
- Velero commands
- Alias the velero command:
alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'
- Get the backup details:
velero backup describe <backupName> --details
- Get the backup logs:
velero backup logs <backupName>
- Restic backup debug
- Please refer to the restic troubleshooting tips page
- Volume Snapshots debug
- This guide has not yet been published
- CSI Snapshots debug
- This guide has not yet been published
This section describes how to debug a failed restore. For more specific issues related to Restic, CSI, or volume snapshots, see the corresponding sections.
- OpenShift commands
- Check for validation errors in the restore by running the following command:
oc describe restore <restoreName> -n openshift-adp
- Check the Velero logs
oc logs -f deployment.apps/velero -n openshift-adp
- If Data Mover (OADP 1.2 or below) is enabled, check the volume-snapshot-mover logs:
oc logs -f deployment.apps/volume-snapshot-mover -n openshift-adp
- Velero commands
- Alias the velero command:
alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'
- Get the restore details:
velero restore describe <restoreName> --details
- Get the restore logs:
velero restore logs <restoreName>
A common misunderstanding is that the oc delete backup $backup_name
command will delete the backup and its artifacts from object storage. This is not the case. The backup object is deleted only temporarily and is recreated by Velero's sync controller.
To fully delete an OADP backup, you must delete the related objects and the off-cluster artifacts, as described below.
- OpenShift commands
Create a DeleteBackupRequest object:
apiVersion: velero.io/v1
kind: DeleteBackupRequest
metadata:
  name: deletebackuprequest
  namespace: openshift-adp
spec:
  backupName: <backupName>
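A minimal sketch of applying the request and confirming it was processed, assuming the manifest above is saved as deletebackuprequest.yaml (the file name is only an example):
# Apply the DeleteBackupRequest
oc create -f deletebackuprequest.yaml
# Inspect the request status to confirm the deletion was processed
oc get deletebackuprequests.velero.io deletebackuprequest -n openshift-adp -o yaml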
- Velero commands:
velero backup delete --help
Delete the backup:
velero backup delete <backupName>
The related artifacts will be deleted at different times depending on the backup method:
- Restic: Artifacts are deleted in the next full maintenance cycle after the backup is deleted.
- CSI: Artifacts are deleted immediately when the backup is deleted.
- Kopia: Artifacts are deleted after two full maintenance cycles after the backup is deleted.
Note: In OADP 1.3.x and OADP 1.4.x the full maintenance cycle executes once every 24 hours.
For more information on Restic and Kopia, please refer to their respective documentation pages.
Once a backup and its related artifacts have been deleted, and no active backups remain for the associated namespace, it is recommended to delete the backupRepository object:
oc get backuprepositories.velero.io -n openshift-adp
oc delete backuprepository <backupRepositoryName> -n openshift-adp
- Credentials: An example of correct AWS credentials:
[<INSERT_PROFILE_NAME>]
aws_access_key_id=<INSERT_VALUE>
aws_secret_access_key=<INSERT_VALUE>
Note: Do not use quotes around the values that replace the INSERT_VALUE placeholders.
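If the credentials secret needs to be recreated, a minimal sketch assuming the file is saved locally as credentials-velero and the default secret name cloud-credentials is used (both names are assumptions; adjust for your provider):
oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials-velero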
- Error:
Backup storage contains invalid top-level directories: [someDirName]
Problem: Your object storage root or prefix directory contains directories that are not on Velero's approved list.
Solutions:
- Define a prefix directory inside the storage bucket where backups are to be uploaded, instead of using the object storage root. In your Velero CR, set a prefix for Velero to use (see the DPA sketch after this list):
oc explain backupstoragelocations.velero.io.spec.objectStorage
Example:
objectStorage:
  bucket: your-bucket-name
  prefix: <DirName>
- Delete the offending directories from your object storage location.
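A minimal sketch of where the prefix sits in the DataProtectionApplication CR, assuming an AWS provider, the default cloud-credentials secret, and placeholder bucket/prefix values:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-sample
  namespace: openshift-adp
spec:
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: your-bucket-name   # placeholder bucket name
          prefix: <DirName>          # backups land under this prefix, not the bucket root
        credential:
          name: cloud-credentials    # assumed default secret name
          key: cloud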
- Error:
error getting volume info: rpc error: code = Unknown desc = InvalidVolume.NotFound: The volume 'vol-xxxx' does not exist.\n\tstatus code: 400
Problem: The AWS PV and the volume snapshot location are in different regions.
Solution: Change the region in the VolumeSnapshotLocation resource (the volume_snapshot_location configured for Velero) to the region of the PV, and then create a new backup.
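A minimal sketch of the corresponding snapshotLocations entry in the DataProtectionApplication CR, assuming an AWS provider; <pv-region> is a placeholder that must match the PV's region:
spec:
  snapshotLocations:
    - velero:
        provider: aws
        config:
          region: <pv-region>   # must match the region of the PV being snapshotted
          profile: default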
- If a backup or restore is stuck as "In Progress", it is likely that the operation was interrupted; in that case it cannot resume (see the phase-check sketch after the list below).
- For further details on your backup, run the command:
velero backup describe <backup-name> --details # --details is optional
- For more details on your restore, run:
velero restore describe <restore-name> --details # --details is optional
- You can delete the backup with the command:
velero backup delete <backupName>
- You can delete the restore with the command:
velero restore delete <restoreName>
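A minimal sketch for checking the current phase directly on the cluster; the resource names are placeholders:
# Check the phase of a backup that appears stuck
oc get backup <backupName> -n openshift-adp -o jsonpath='{.status.phase}'
# Check the phase of a restore that appears stuck
oc get restore <restoreName> -n openshift-adp -o jsonpath='{.status.phase}'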
- If using NFS volumes while rootSquash is enabled, Restic is mapped to nfsnobody and does not have the proper permissions to perform a backup/restore.
- To work around this, use supplemental groups and apply the same supplemental group to the Restic daemonset, as in the following DataProtectionApplication example (a verification command follows the example):
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-sample
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift
    nodeAgent:
      enable: true
      uploaderType: restic
      supplementalGroups:
        - 1234
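To confirm the supplemental group was propagated, you can inspect the node agent daemonset (a sketch assuming OADP 1.3+, where the daemonset is named node-agent; on older releases it is named restic):
oc get daemonset node-agent -n openshift-adp -o yaml | grep -A2 supplementalGroups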
- (OADP 1.3+) Error:
DeploymentConfigs restore with spec.Replicas==0, or DC pods fail to restart after a crash, when using DCs with volumes or restore hooks
Solution:
The solution is the same as for (OADP 1.1+) below, except that it applies whenever you are restoring DeploymentConfigs and have either volumes or post-restore hooks, regardless of the backup method.
- (OADP 1.1+) Error:
DeploymentConfigs restore with spec.Replicas==0, or DC pods fail to restart after a crash, when using Restic/Kopia restores or restore hooks
Solution:
This is expected behavior on restore if you are restoring DeploymentConfigs and are either using Restic or Kopia for volume restore or have post-restore hooks. The pod and DC plugins make these modifications to ensure that Restic/Kopia and hooks work properly, and dc-post-restore.sh should be run immediately after a successful restore. Usage for this script is
dc-post-restore.sh <restore-name>
- (OADP 1.0.z) Error:
Using Restic as the backup method causes PartiallyFailed/Failed errors in the restore, or post-restore hooks fail to execute
Solution:
To mitigate this error, use a two-step restore process: first perform a restore that excludes the replicationcontroller and deploymentconfig resources, then perform a second restore that includes only those resources. The backup and restore commands are given below for clarity. (The examples cover backup/restore of a single target namespace; a similar strategy can be followed for other cases.)
Please note that this is a temporary fix for this issue and there are ongoing discussions to solve it.
Step 1: Initiate the backup as a normal Restic backup.
velero create backup <backup-name> -n openshift-adp --include-namespaces=<TARGET_NAMESPACE>
Step 2: Initiate a restore excluding the replicationcontroller and deploymentconfig resources.
velero restore create --from-backup=<BACKUP_NAME> -n openshift-adp --include-namespaces <TARGET_NAMESPACE> --exclude-resources replicationcontroller,deploymentconfig,templateinstances.template.openshift.io --restore-volumes=true
Step 3: Initiate a restore including the replicationcontroller and deploymentconfig resources.
velero restore create --from-backup=<BACKUP_NAME> -n openshift-adp --include-namespaces <TARGET_NAMESPACE> --include-resources replicationcontroller,deploymentconfig,templateinstances.template.openshift.io --restore-volumes=true
After creating a backup for a stateful app using Restic on a given namespace, clearing the bucket, and then creating a new backup again using Restic, the backup will partially fail.
- Velero pod logs after attempting to back up the MSSQL app using the steps defined above:
level=error msg="Error checking repository for stale locks" controller=restic-repo error="error running command=restic unlock --repo=s3:s3-us-east-1.amazonaws.com/<bucketname>/<prefix>/restic/mssql-persistent --password-file=/tmp/credentials/openshift-adp/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?\ns3:s3-us-east-1.amazonaws.com/<bucketname>/<prefix>/restic/mssql-persistent\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:293" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:144" name=mssql-persistent-velero-sample-1-ckcj4 namespace=openshift-adp
- Running the command:
velero backup describe <backup-name> --details
results in:
Restic Backups:
Failed:
mssql-persistent/mssql-deployment-1-l7482: mssql-vol
This is a known Velero issue; the expected behavior is still being decided upstream.
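To see which repository object the error refers to, you can inspect the repository CRs in the OADP namespace (a diagnostic sketch, assuming the default openshift-adp namespace; once no active backups reference the namespace, the backupRepository deletion guidance in Deleting Backups applies):
# List the repository objects and inspect their status for the "does not exist" error
oc get backuprepositories.velero.io -n openshift-adp
oc get backuprepositories.velero.io -n openshift-adp -o yaml
# On older OADP releases (Velero < 1.10) the equivalent CRD is resticrepositories.velero.io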
Issue: OCP 4.14 enforces a Pod Security Admission (PSA) policy that can hinder the readiness of pods during a Restic restore process. If a Security Context Constraints (SCC) resource is not found during the creation of a pod, and the PSA policy on the pod is not aligned with the required standards, pod admission is denied. This issue arises due to the resource restore order of Velero.
- Sample error:
\"level=error\" in line#2273: time=\"2023-06-12T06:50:04Z\" level=error msg=\"error restoring mysql-869f9f44f6-tp5lv: pods \\\"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity \\\"restricted:v1.24\\\": privileged (container \\\"mysql\\\" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (containers \\\"restic-wait\\\", \\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\"RuntimeDefault\\\" or \\\"Localhost\\\")\" logSource=\"/remote-source/velero/app/pkg/restore/restore.go:1388\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n velero container contains \"level=error\" in line#2447: time=\"2023-06-12T06:50:05Z\" level=error msg=\"Namespace todolist-mariadb, resource restore error: error restoring pods/todolist-mariadb/mysql-869f9f44f6-tp5lv: pods \\\"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity \\\"restricted:v1.24\\\": privileged (container \\\"mysql\\\" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (containers \\\"restic-wait\\\", \\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\"RuntimeDefault\\\" or \\\"Localhost\\\")\" logSource=\"/remote-source/velero/app/pkg/controller/restore_controller.go:510\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n]",
Solution: Restore the SCC before the pod by setting the restore resource priorities field on the Velero server.
- Example DPA setting:
$ oc get dpa -o yaml
  configuration:
    restic:
      enable: true
    velero:
      args:
        restore-resource-priorities: 'securitycontextconstraints,customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io'
      defaultPlugins:
        - gcp
        - openshift
Please note that this is a temporary fix for this issue, and ongoing discussions are in progress to address it. Also note that if you have an existing restore resource priority list, make sure you combine it with the complete list presented in the example above.
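To verify that the running Velero deployment picked up the argument (a quick check using standard oc commands):
oc get deployment velero -n openshift-adp -o yaml | grep restore-resource-priorities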
- This error can occur regardless of the SCC if the application is not aligned with the security standards. Please ensure that the security standards for the application pods are aligned, as provided in the link below, to prevent deployment warnings.
https://access.redhat.com/solutions/7002730