
CML provisioning fails #81

Open
nmarian85 opened this issue Dec 16, 2022 · 0 comments
nmarian85 (Contributor) commented Dec 16, 2022

CDP Control Plane Region: EU-1

Configuration:

    ml_worker:
      instance_type: m6a.8xlarge
      instance_count: 1
      min_instances: 1
      max_instances: 4
      root_volume: 512
      instance_tier: ON_DEMAND
    ml_worker_gpu:
      min_instances: 0
      max_instances: 3
      instance_count: 0
      instance_tier: ON_DEMAND
      instance_type: g4dn.2xlarge
      root_volume: 512
    enable_governance: false

Module configs

- name: Create instance groups
  block:
    - name: Set standard non-gpu instance groups
      set_fact:
        instance_groups:
          - name: cpu_settings
            autoscaling:
              maxInstances: "{{ ml_worker['max_instances'] }}"
              minInstances: "{{ ml_worker['min_instances'] }}"
            instanceType: "{{ ml_worker['instance_type'] }}"
            instanceTier: "{{ ml_worker['instance_tier'] }}"
            rootVolume:
              size: "{{ ml_worker['root_volume'] }}"

    - name: Add GPU instance group if defined
      set_fact:
        instance_groups: "{{ instance_groups + gpu_instance_group }}"
      when: "'ml_worker_gpu' in cml_cluster"
      vars:
        ml_worker_gpu: "{{ cml_cluster['ml_worker_gpu'] }}"
        gpu_instance_group:
          - name: gpu_settings
            autoscaling:
              maxInstances: "{{ ml_worker_gpu['max_instances'] }}"
              minInstances: "{{ ml_worker_gpu['min_instances'] }}"
            instanceType: "{{ ml_worker_gpu['instance_type'] }}"
            instanceTier: "{{ ml_worker_gpu['instance_tier'] }}"
            rootVolume:
              size: "{{ ml_worker_gpu['root_volume'] }}"
  vars:
    ml_worker: "{{ cml_cluster['ml_worker'] }}"

- name: "Install ML workspace {{ cml_cluster_name }}"
  cloudera.cloud.ml:
    name: "{{ cml_cluster_name }}"
    env: "{{ env_name }}"
    k8s_request:
      environmentName: "{{ env_name }}"
      instanceGroups: "{{ instance_groups }}"
      tags: "{{ cml_cluster['tags'] }}"
    governance: "{{ cml_cluster['enable_governance'] }}"
    public_loadbalancer: false
    monitoring: true
    ip_addresses: []
    debug: true
    timeout: 7200
    cp_region: "{{ cp_region }}"
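
For reference, with the values from the configuration block above, the instance_groups fact handed to k8s_request.instanceGroups should render roughly as follows (a sketch assembled from the config and tasks shown here, not captured from a live run; note that with default Ansible settings the templated numbers come through as strings, e.g. "4" rather than 4, unless jinja2_native is enabled):

    instanceGroups:
      - name: cpu_settings
        autoscaling:
          maxInstances: 4
          minInstances: 1
        instanceType: m6a.8xlarge
        instanceTier: ON_DEMAND
        rootVolume:
          size: 512
      - name: gpu_settings
        autoscaling:
          maxInstances: 3
          minInstances: 0
        instanceType: g4dn.2xlarge
        instanceTier: ON_DEMAND
        rootVolume:
          size: 512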

Errors

Normal   Scheduled    5m4s                   default-scheduler  Successfully assigned mlx/ds-operator-5b64cfc648-x7nxp to ip-10-132-9-62.eu-central-1.compute.internal
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-operator-tls" : secret "ds-operator-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-vfs-crt" : secret "ds-vfs-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "s2i-registry-auth-crt" : secret "s2i-registry-auth-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "tgtgen-tls" : secret "tgtgen-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "tcp-ingress-controller-crt" : secret "tcp-ingress-controller-tls2" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "s2i-registry-crt" : secret "s2i-registry-tls2" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "host-ssh-keys" : secret "cdsw-host-ssh-keys" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-web-crt" : secret "web-tls2" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-cdh-client-crt" : secret "ds-cdh-client-tls2" not found
Warning  FailedMount  5m1s                   kubelet            MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found
Warning  FailedMount  4m44s (x2 over 4m49s)  kubelet            (combined from similar events): MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found





Type     Reason     Age                    From               Message
----     ------     ----                   ----               -------
Normal   Scheduled  9m6s                   default-scheduler  Successfully assigned mlx/grafana-core-c88b74df5-nfvlp to ip-10-132-9-95.eu-central-1.compute.internal
Normal   Pulling    8m55s                  kubelet            Pulling image "container.repository.cloudera.com/cloudera/cdsw/cdsw-ubi-minimal:2.0.34-b116"
Normal   Pulled     8m52s                  kubelet            Successfully pulled image "container.repository.cloudera.com/cloudera/cdsw/cdsw-ubi-minimal:2.0.34-b116" in 3.107336927s
Normal   Created    8m52s                  kubelet            Created container grafana-root-migration
Normal   Started    8m52s                  kubelet            Started container grafana-root-migration
Normal   Pulling    8m51s                  kubelet            Pulling image "container.repository.cloudera.com/cloudera_thirdparty/ubi-grafana:6.7.4-ubi-8.5-239.cldr.1"
Normal   Pulled     8m41s                  kubelet            Successfully pulled image "container.repository.cloudera.com/cloudera_thirdparty/ubi-grafana:6.7.4-ubi-8.5-239.cldr.1" in 10.000649349s
Normal   Created    8m41s                  kubelet            Created container grafana-core
Normal   Started    8m41s                  kubelet            Started container grafana-core
Warning  Unhealthy  8m28s (x4 over 8m40s)  kubelet            Readiness probe failed: Get "http://100.100.74.70:3000/login": dial tcp 100.100.74.70:3000: connect: connection refused
Warning  Unhealthy  3m27s (x26 over 7m7s)  kubelet            Readiness probe failed: Get "http://100.100.74.70:3000/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)



Normal   Scheduled    9m43s                   default-scheduler  Successfully assigned mlx/tcp-ingress-controller-56597b95cf-nfpk7 to ip-10-132-9-95.eu-central-1.compute.internal
Warning  FailedMount  9m24s (x6 over 9m40s)   kubelet            MountVolume.SetUp failed for volume "web-crt" : secret "web-tls2" not found
Warning  FailedMount  9m24s (x6 over 9m40s)   kubelet            MountVolume.SetUp failed for volume "operator-crt" : secret "ds-operator-tls2" not found
Warning  FailedMount  9m24s (x6 over 9m40s)   kubelet            MountVolume.SetUp failed for volume "tcp-ingress-controller-tls" : secret "tcp-ingress-controller-tls2" not found
Normal   Pulling      8m58s                   kubelet            Pulling image "container.repository.cloudera.com/cloudera/cdsw/tcp-ingress-controller:2.0.34-b116"
Normal   Pulled       8m54s                   kubelet            Successfully pulled image "container.repository.cloudera.com/cloudera/cdsw/tcp-ingress-controller:2.0.34-b116" in 3.324504111s
Normal   Created      8m54s                   kubelet            Created container tcp-ingress-controller
Normal   Started      8m54s                   kubelet            Started container tcp-ingress-controller
Warning  Unhealthy    8m18s                   kubelet            Liveness probe failed: dial tcp 100.100.74.82:8000: connect: connection refused
Warning  Unhealthy    4m28s (x31 over 8m38s)  kubelet            Readiness probe failed: dial tcp 100.100.74.82:8000: connect: connection refused



Warning  Unhealthy    9m46s (x30 over 13m)  kubelet            Readiness probe failed: Get "http://100.100.74.75:3000/internal/load-balancer/health-ping": dial tcp 100.100.74.75:3000: connect: connection refused
Normal	EnsuredLoadBalancer	60m	Ensured load balancer
2022-12-16T12:14:16.777Z	Service: MLXControlPlane, Message: &ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{LoadBalancerIngress{IP:,Hostname:ac74be2de1e8c4bc6a9d551978d9ab77-4127b221295ea1bb.elb.eu-central-1.amazonaws.com,Ports:[]PortStatus{},},},},Conditions:[]Condition{},}
2022-12-16T12:14:16.965Z	Service: MLXControlPlane, Message: Pod(s) not ready: [api-67488979d7-8h46b ds-reconciler-6dd6ccf448-5kgq6 grafana-core-c88b74df5-6sz96 runtime-addon-trigger-2.0.34-b116-pzhzh web-65c7f5c99c-skmfd]
2022-12-16T12:17:17.208Z	Service: MLXControlPlane, Message: api-67488979d7-8h46b: Warning	BackOff	62m	Back-off restarting failed container
2022-12-16T12:17:17.229Z	Service: MLXControlPlane, Message: ds-reconciler-6dd6ccf448-5kgq6: Warning	BackOff	60m	Back-off restarting failed container
2022-12-16T12:17:17.252Z	Service: MLXControlPlane, Message: grafana-core-c88b74df5-6sz96: Warning	Unhealthy	60m	Readiness probe failed: Get "http://100.100.184.74:3000/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-12-16T12:17:17.269Z	Service: MLXControlPlane, Message: runtime-addon-trigger-2.0.34-b116-pzhzh: Normal	Created	62m	Created container runtime-addon-trigger
Normal	Started	62m	Started container runtime-addon-trigger
Normal	Pulled	61m	Container image "container.repository.cloudera.com/cloudera/cdsw/runtime-addon-loader:2.0.34-b116" already present on machine
Warning	BackOff	60m	Back-off restarting failed container
2022-12-16T12:17:17.291Z	Service: MLXControlPlane, Message: web-65c7f5c99c-skmfd: 
2022-12-16T12:17:17.297Z	Service: MLXControlPlane, Message: Failed to install ML workspace. Reason:client rate limiter Wait returned an error: rate: Wait(n=1) would exceed 
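
The FailedMount warnings above all point at *-tls2 secrets missing in the mlx namespace. A minimal check to confirm whether those secrets exist (a sketch, assuming kubectl access to the workspace's Kubernetes cluster and the kubernetes.core collection; the kubeconfig path is a placeholder, not taken from this issue):

    - name: List secrets in the mlx namespace
      kubernetes.core.k8s_info:
        kind: Secret
        namespace: mlx
        kubeconfig: /path/to/workspace-kubeconfig  # placeholder path
      register: mlx_secrets

    - name: Show which secret names are present
      ansible.builtin.debug:
        msg: "{{ mlx_secrets.resources | map(attribute='metadata.name') | list }}"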
wmudge added the bug and question labels Oct 5, 2023
wmudge added this to the Release 2.1.0 milestone Oct 5, 2023
wmudge modified the milestones: Release 2.1.0, Release 2.2.0 Nov 3, 2023
wmudge modified the milestones: Release 2.2.0, Release 2.3.0 Nov 20, 2023
wmudge self-assigned this Dec 20, 2023
wmudge modified the milestones: Release 2.3.0, Release 2.4.0 Dec 20, 2023
wmudge assigned jimright and unassigned wmudge Dec 20, 2023
wmudge modified the milestones: Release 2.4.0, Release 2.5.0 May 21, 2024
wmudge modified the milestones: Release 2.5.0, Release 2.6.0 Jun 27, 2024