
task and web replicas are scaled to 0 by the operator #1960

Open · 3 tasks done
jdratlif opened this issue Sep 18, 2024 · 3 comments

@jdratlif

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

After installing a new AWX instance with awx-operator, the operator scales the web and task deployments down to 0 and AWX is completely stopped. It never scales the deployments back up.

AWX Operator version

2.19.1

AWX version

1.27.12

Kubernetes platform

kubernetes

Kubernetes/Platform version

k3s

Modifications

no

Steps to reproduce

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: awx-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: le-staging
spec:
  acme:
    privateKeySecretRef:
      name: le-staging
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: jdr1
spec:
  hostname: jdr1.k8s.test.example.com
  ingress_type: ingress
  ingress_annotations: |
    cert-manager.io/issuer: le-staging
    traefik.ingress.kubernetes.io/router.middlewares: default-bastion-office-vpn@kubernetescrd
  ingress_tls_secret: awx-tls-le-staging
  service_type: ClusterIP
  postgres_data_volume_init: true
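
These were applied with the standard kustomize flow (a sketch, assuming the files above sit together in one directory):

kubectl apply -k .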

Expected results

I expected AWX to be running.

Actual results

It starts up, then gets stopped, and doesn't restart without manual intervention.

Additional information

If I use awx-operator 2.18, I don't have this problem. It seems the problem was introduced in the 2.19.0 or 2.19.1 release.

Operator Logs

 TASK [Apply deployment resources] ******************************** 
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined\n\nThe error appears to be in '/opt/ansible/roles/installer/tasks/resources_configuration.yml': line 248, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Apply deployment resources\n  ^ here\n"}

I saw this referenced in #1907, but I'm not upgrading from 2.18, and re-applying the CRDs didn't fix things for me.
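
For anyone else debugging this, one way to check whether the installed CRD actually defines the new property (a sketch; awxs.awx.ansible.com is the name of the CRD the operator ships, and the schema path assumes a single stored version):

kubectl get crd awxs.awx.ansible.com -o json \
  | jq '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties
        | has("web_manage_replicas")'

If this prints false while a 2.19.x operator is running, the operator and the CRD are out of sync.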

@jdratlif (Author)

It's not clear to me how the AWX CRD spec values get translated into Ansible vars, but commit 8ead140 added web_manage_replicas and task_manage_replicas, saying the default is true. No corresponding defaults were added to defaults/main.yml, though, even while web_replicas and task_replicas are set to empty strings there. Don't we need web_manage_replicas and task_manage_replicas set to true in those defaults as well?
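
Something like this is what I would have expected to see added (a sketch only; the variable names come from commit 8ead140, and I'm assuming the installer role reads them as ordinary role defaults):

# roles/installer/defaults/main.yml (hypothetical addition)
web_manage_replicas: true
task_manage_replicas: true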

@jdratlif (Author)

Okay, I think I know what is happening.

Another person is using the awx-operator in our cluster. I didn't think this would matter because we're using different namespaces, but CRDs are cluster-scoped, not namespaced, so they are being overwritten at "random" times, and I lose the fields from the newer CRD definitions. For reference, here's the kustomization I was testing with:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  # - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: sea

Running kustomize on this does not fix the CRDs. But kubectl apply --server-side --force-conflicts -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1" does, at least until whatever Helm job installs the older awx-operator in the other namespace kicks in again. Downgrades work but upgrades don't? Or maybe it's kustomize vs. Helm; I'm not sure.

I do know the CRDs are being overwritten, because after I delete my namespace and start over, I can check for postgres_data_volume_init in the CRD and it will be a field, but if I keep checking, it eventually disappears and postgres_data_path is there instead.
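
To catch the overwrite in the act, a polling loop like this works (a sketch; the CRD name and schema path match the earlier check, and the 30-second interval is arbitrary):

# Report which generation of the AWX CRD is currently installed.
while true; do
  kubectl get crd awxs.awx.ansible.com -o json \
    | jq -r '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties
             | if has("postgres_data_volume_init") then "2.19.x CRD" else "pre-2.19 CRD" end'
  sleep 30
done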

@YaronL16 (Contributor)

If you install two operator versions on the same cluster, use the newer CRDs; otherwise your new instance will not work. I can't guarantee the older operator will work with the new CRDs, but it is more likely to. That's the price of deploying two different versions on the same cluster.

Get the other person to stop installing their old CRDs.
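
One way to arrange that (a sketch; it assumes the other install uses Helm 3, and the repo alias, release name, and namespace below are placeholders): apply the newer CRDs once, cluster-wide, and have the other chart skip its bundled CRDs with Helm's --skip-crds flag.

# Pin the newer CRDs once (server-side apply wins over the chart's copies):
kubectl apply --server-side --force-conflicts \
  -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1"

# Older operator install, without touching CRDs (Helm 3):
helm upgrade --install awx-operator awx-operator/awx-operator \
  --namespace other-namespace --skip-crds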
