[BUG] 1.0 PostgreSQL cluster creation fails due to "Back-off restarting failed container" #8512

Closed
tianyue86 opened this issue Nov 25, 2024 · 1 comment
Labels
kind/bug (Something isn't working), severity/major (Great chance user will encounter the same problem)
Describe the env

Kubernetes: v1.31.1-aliyun.1
KubeBlocks: 1.0.0-beta.5
kbcli: 1.0.0-beta.3

To Reproduce
Steps to reproduce the behavior:

  1. Apply the following YAML to create the PostgreSQL cluster
apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
  name: postgres-pefwre
  namespace: default
spec:
  terminationPolicy: WipeOut
  clusterDef: postgresql
  topology: replication
  componentSpecs:
    - name: postgresql
      labels:
        apps.kubeblocks.postgres.patroni/scope: postgres-pefwre-postgresql
      replicas: 2
      serviceAccountName:
      disableExporter: true
      resources:
        limits:
          cpu: 100m
          memory: 0.5Gi
        requests:
          cpu: 100m
          memory: 0.5Gi
      volumeClaimTemplates:
        - name: data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 20Gi
  1. Check the cluster status: Failed
k get cluster -A
NAMESPACE   NAME              CLUSTER-DEFINITION   TERMINATION-POLICY   STATUS   AGE
default     postgres-pefwre   postgresql           WipeOut              Failed   23m
  1. Check the pod status: Init:CrashLoopBackOff
k get pod
NAME                           READY   STATUS                  RESTARTS       AGE
postgres-pefwre-postgresql-0   0/4     Init:CrashLoopBackOff   9 (2m1s ago)   23m
postgres-pefwre-postgresql-1   0/4     Init:CrashLoopBackOff   9 (2m1s ago)   23m
  1. Describe the pod: Back-off restarting failed container kbagent-worker
k describe pod postgres-pefwre-postgresql-0
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               24m                   default-scheduler        Successfully assigned default/postgres-pefwre-postgresql-0 to cn-zhangjiakou.10.0.0.140
  Normal   SuccessfulAttachVolume  24m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "d-8vbc2tijxgn988kelhsf"
  Normal   AllocIPSucceed          24m                   terway-daemon            Alloc IP 10.0.0.190/24 took 43.754728ms
  Normal   Pulling                 24m                   kubelet                  Pulling image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/spilo:16.4.0"
  Normal   Pulled                  24m                   kubelet                  Successfully pulled image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/spilo:16.4.0" in 10.034s (10.034s including waiting). Image size: 240042369 bytes.
  Normal   Created                 24m                   kubelet                  Created container pg-init-container
  Normal   Started                 24m                   kubelet                  Started container pg-init-container
  Normal   Pulling                 24m                   kubelet                  Pulling image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/dbctl:0.1.5"
  Normal   Pulled                  23m                   kubelet                  Successfully pulled image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/dbctl:0.1.5" in 735ms (735ms including waiting). Image size: 22157888 bytes.
  Normal   Created                 23m                   kubelet                  Created container init-dbctl
  Normal   Started                 23m                   kubelet                  Started container init-dbctl
  Normal   Pulled                  23m                   kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks-tools:1.0.0-beta.5" already present on machine
  Normal   Created                 23m                   kubelet                  Created container init-kbagent
  Normal   Started                 23m                   kubelet                  Started container init-kbagent
  Normal   Started                 23m (x3 over 23m)     kubelet                  Started container kbagent-worker
  Normal   Pulled                  23m (x4 over 23m)     kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/spilo:16.4.0" already present on machine
  Normal   Created                 23m (x4 over 23m)     kubelet                  Created container kbagent-worker
  Warning  BackOff                 4m11s (x91 over 23m)  kubelet                  Back-off restarting failed container kbagent-worker in pod postgres-pefwre-postgresql-0_default(cc87737c-74d1-4c30-8914-99cbf58db546)
  1. Check the component (cmp): Failed
k get cmp
NAME                         DEFINITION                    SERVICE-VERSION   STATUS   AGE
postgres-pefwre-postgresql   postgresql-16-1.0.0-alpha.0   16.4.0            Failed   25m

Status:
  Conditions:
    Last Transition Time:  2024-11-25T02:42:30Z
    Message:               The operator has started the provisioning of Cluster: postgres-pefwre-postgresql
    Observed Generation:   1
    Reason:                PreCheckSucceed
    Status:                True
    Type:                  ProvisioningStarted
    Last Transition Time:  2024-11-25T02:42:31Z
    Message:               the component phase is Failed
    Observed Generation:   1
    Reason:                Unavailable
    Status:                False
    Type:                  Available
  Message:
    InstanceSet/postgres-pefwre-postgresql:  ["postgres-pefwre-postgresql-1"]
  Observed Generation:                       1
  Phase:                                     Failed
Events:
  Type    Reason                    Age   From                  Message
  ----    ------                    ----  ----                  -------
  Normal  Unknown                   26m   component-controller  the component phase is unknown
  Normal  ComponentPhaseTransition  26m   component-controller  component is Creating
  Normal  Unavailable               26m   component-controller  the component phase is Creating
  Normal  ComponentPhaseTransition  25m   component-controller  component is Failed
  Normal  Unavailable               25m   component-controller  the component phase is Failed
  1. Check the storage classes (sc)
k get sc
NAME                                 PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
alicloud-disk-efficiency (default)   diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   72m
alicloud-disk-essd                   diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   72m
alicloud-disk-ssd                    diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   72m
alicloud-disk-topology-alltype       diskplugin.csi.alibabacloud.com   Delete          WaitForFirstConsumer   true                   72m
kb-default-sc                        diskplugin.csi.alibabacloud.com   Delete          WaitForFirstConsumer   true                   55m
  1. Check the PVCs: both are Bound
k get pvc
NAME                                STATUS   VOLUME                   CAPACITY   ACCESS MODES   STORAGECLASS    VOLUMEATTRIBUTESCLASS   AGE
data-postgres-pefwre-postgresql-0   Bound    d-8vbc2tijxgn988kelhsf   20Gi       RWO            kb-default-sc   <unset>                 31m
data-postgres-pefwre-postgresql-1   Bound    d-8vb88aki0yj1n8yufble   20Gi       RWO            kb-default-sc   <unset>                 31m
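
A possible follow-up to the steps above: the events only say that the kbagent-worker init container keeps restarting, so the next thing to look at would be that container's logs. The sketch below is not part of the original report. `failing_init_pods` is a hypothetical helper name, and it assumes the default `kubectl get pod` column layout shown above (NAME in column 1, STATUS in column 3).

```shell
# Hypothetical helper: read `kubectl get pod` table output on stdin and
# print the names of pods whose STATUS column is Init:CrashLoopBackOff.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
failing_init_pods() {
  awk 'NR > 1 && $3 == "Init:CrashLoopBackOff" { print $1 }'
}

# Possible usage against a live cluster (not run here): fetch the logs of
# the previous, crashed run of the kbagent-worker init container.
#   kubectl get pod | failing_init_pods | while read -r pod; do
#     kubectl logs "$pod" -c kbagent-worker --previous
#   done
```

The `--previous` flag asks for the logs of the last terminated instance of the container, which is usually the one that holds the actual error when a container is in back-off.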

tianyue86 added the kind/bug and severity/major labels on Nov 25, 2024
tianyue86 added this to the Release 1.0.0 milestone on Nov 25, 2024
@tianyue86 (Author)

The cluster is now created successfully in the following environment:
Kubernetes: v1.31.1-aliyun.1
KubeBlocks: 1.0.0-beta.6
kbcli: 1.0.0-beta.3
