Running the example doesn't actually work. #4

Open
spacekitteh opened this issue Aug 4, 2023 · 5 comments

Comments

@spacekitteh
Contributor

None of the ceph-mds daemons manage to successfully start, and so no ceph filesystems can be mounted.

@astro
Owner

astro commented Aug 4, 2023

What's their error msg?

@spacekitteh
Contributor Author

spacekitteh commented Aug 4, 2023

Ok, it seems that there were two issues for ceph-mds:

  1. The /etc/ceph/ceph.mon.keyring and /etc/ceph/ceph.client.admin.keyring needed to have proper permissions set (0444 worked, but it could probably be done better by setting the appropriate users on the services instead)
  2. kmod needed to be in the path for the services which mount the cephfs filesystems.
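
For reference, the two fixes above could be sketched in NixOS configuration roughly as follows. This is only a sketch under assumptions: the unit name `example-cephfs-mount` is hypothetical and stands in for whichever existing service mounts the cephfs filesystems, and the tmpfiles rules assume the keyrings are regular files rather than symlinks into the Nix store.

```nix
{ pkgs, ... }:
{
  # Hypothetical unit name: replace it with the existing service that
  # mounts the cephfs filesystems. This puts kmod's tools (modprobe etc.)
  # on that service's PATH.
  systemd.services."example-cephfs-mount".path = [ pkgs.kmod ];

  # "z" adjusts mode and ownership of paths that already exist.
  # 0444 is the mode reported to work above; it is broader than ideal.
  systemd.tmpfiles.rules = [
    "z /etc/ceph/ceph.mon.keyring 0444 root root -"
    "z /etc/ceph/ceph.client.admin.keyring 0444 root root -"
  ];
}
```

Running the ceph services as a user that can read the keyrings would avoid making them world-readable, but that depends on how the services are defined.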

However, now that the ceph cluster is up and running, there's still an issue: Nomad isn't working.

The journal is just this, over and over and over:

Aug 04 23:53:12 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:12 example1 nomad[2905]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:12 example1 nomad[2905]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:12 example1 nomad[2905]: ==> Starting Nomad agent...
Aug 04 23:53:12 example1 skyflake-install-cache-gc-start[2911]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 23ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:16 example1 nomad[2905]: ==> Error starting agent: server config setup failed: Failed to resolve Serf advertise address ":4648": lookup <nil>: no such host
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:16.185Z [ERROR] agent: error starting agent: error="server config setup failed: Failed to resolve Serf advertise address \":4648\": lookup <nil>: no such host"
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Consumed 25ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 62.
Aug 04 23:53:18 example1 systemd[1]: Stopped Nomad.
Aug 04 23:53:18 example1 systemd[1]: nomad.service: Consumed 25ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: Started Nomad.
Aug 04 23:53:18 example1 systemd[1]: Stopped skyflake-install-cache-gc.service.
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 23ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:18 example1 nomad[2920]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:18 example1 nomad[2920]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:18 example1 nomad[2920]: ==> Starting Nomad agent...
Aug 04 23:53:18 example1 skyflake-install-cache-gc-start[2927]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 22ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:21 example1 nomad[2920]: ==> Error starting agent: server config setup failed: Failed to resolve Serf advertise address ":4648": lookup <nil>: no such host
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.468Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.470Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.470Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:21.935Z [ERROR] agent: error starting agent: error="server config setup failed: Failed to resolve Serf advertise address \":4648\": lookup <nil>: no such host"
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Consumed 29ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 63.
Aug 04 23:53:24 example1 systemd[1]: Stopped Nomad.
Aug 04 23:53:24 example1 systemd[1]: nomad.service: Consumed 29ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: Started Nomad.
Aug 04 23:53:24 example1 systemd[1]: Stopped skyflake-install-cache-gc.service.
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 22ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:24 example1 nomad[2935]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:24 example1 nomad[2935]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:24 example1 nomad[2935]: ==> Starting Nomad agent...
Aug 04 23:53:24 example1 skyflake-install-cache-gc-start[2941]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 17ms CPU time, received 40B IP traffic, sent 60B IP traffic.

I suspect it may be due to having virbr0 set up by libvirtd or podman or something? I'm still trying to figure it out.

@astro
Owner

astro commented Aug 6, 2023

Could you please share your skyflake.nomad config?

@spacekitteh
Contributor Author

It's just the one in example-server.nix:

    nomad = {
      servers = [ "example1" "example2" "example3" ];
      client.meta = {
        example-deployment = "yes";
      };
    };

@astro
Owner

astro commented Sep 6, 2023

The example vms use the fec0::/64 range which is dropped by NixOS 23.05's neat nixos-fw-rpfilter feature.

Quick and dirty workaround: ip6tables -t mangle -F PREROUTING on the host
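
A declarative variant of this workaround, assuming the host runs NixOS with the standard firewall module, might look like the sketch below. `networking.firewall.checkReversePath` is the NixOS option that controls the reverse-path filter; setting it to `false` disables the check for all traffic, not only for fec0::/64, so it is just as blunt as flushing the chain.

```nix
{
  # Host-side setting (not for the example VMs): turns off the
  # reverse-path filter that nixos-fw-rpfilter implements.
  networking.firewall.checkReversePath = false;
}
```

Unlike the manual `ip6tables` flush, which is likely undone whenever the firewall rules are reloaded, this stays in effect across rebuilds and reboots. The option also accepts `"loose"`, but whether that is enough to let the example VMs' traffic through has not been verified here.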

Also, /dev/vdc has become /dev/vdb
