Debugging munge errors? #1621
Replies: 7 comments 2 replies
-
The possible issue I see is that I created the new cluster image by cloning an existing cluster machine where our researchers manually installed the needed dependencies. So there could be files from a previous Slurm deployment interfering; I'm trying to figure out what those are.
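In case it helps narrow that down, a rough sketch of where leftover state from the source cluster often lives (typical default paths, not confirmed for this image):

```bash
# Leftover credentials/state from the source cluster that commonly cause
# munge/Slurm mismatches on a cloned image (typical default paths).
ls -l /etc/munge/munge.key 2>/dev/null             # munge key baked in from the old cluster
ls -ld /etc/slurm /var/spool/slurm* 2>/dev/null    # old Slurm config and state directories
ls -l /var/log/munged /var/log/slurm* 2>/dev/null  # old logs show what ran on the source machine
```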
-
**Short answer / best guess**

You probably saved an image with the file

**Longer advice**

Outwardly, it's not obvious you have a functioning controller. The services and timers I'd expect to be active/running are:
Note: slurmd.service is typically only active on compute nodes and on login nodes that use configless operation. Our solution enables configless so slurmd.service will be active on every node except the controller.
Checking the status of the munge service will probably show that it failed. You could look at /var/log/munged/munged.log for more details. Sometimes a munged failure is a "canary in the coal mine" indicating a different failure. I almost always look at two things for a failed boot:
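A minimal sketch of the munge-side checks mentioned above (the unit name and log path are standard defaults; nothing here is specific to this deployment):

```bash
# On the controller: confirm the munge and Slurm controller services came up.
sudo systemctl status munge slurmctld --no-pager

# If munge failed, its own log usually explains why (key permissions,
# missing/mismatched munge.key, clock skew, ...).
sudo tail -n 50 /var/log/munged/munged.log

# Quick end-to-end test of the munge credential path on a working node.
munge -n | unmunge
```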
-
OK, this definitely did something; I now see
However, now slurm setup failed with
I'm actually not using /opt/apps for anything (is there a point in /opt/apps if you have Filestore?), so I'm checking to see if I can remove this dependency from the yaml.
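If useful, a quick way to confirm whether anything on the running nodes actually depends on /opt/apps (generic commands, nothing toolkit-specific):

```bash
# See whether /opt/apps is mounted (or configured to be) and what backs it.
mount | grep /opt/apps
grep /opt/apps /etc/fstab
df -h /opt/apps 2>/dev/null
```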
-
In the big picture, this approach is going to be fragile. Let me suggest the following:
-
@yaroslavvb I'm sorry, a careful reading of the
-
I'm aware it is fragile and that, long term, everything should be done through Packer scripts and reusable configurations. In the short term, getting Docker/GPUs working on a machine required hours of trial and error, so I'm looking for a workaround to get this configuration deployed.
-
The underlying requirement is to run jobs that use Docker with GPU support, which means installing Docker plus NVIDIA GPU passthrough. Both come with instructions for installing them locally using "sudo apt-get"; I haven't yet found a supported alternative route that can be massaged into a sequence of commands for the gHPC toolkit .yaml. Installing Docker should not interfere with files added by the gHPC toolkit, and a potential way to get this approach to work is to undo all the modifications made by the gHPC toolkit. That is easy if it's just a matter of deleting some files like
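For what it's worth, a rough sketch of what that command sequence could look like as a startup script (assumes a Debian/Ubuntu image with the NVIDIA driver already installed and NVIDIA's apt repository already added per their container-toolkit docs; package names and the image tag are illustrative):

```bash
#!/bin/bash
set -euo pipefail

# Docker from the distro repo (the upstream docker-ce repo is the other common route).
sudo apt-get update
sudo apt-get install -y docker.io

# NVIDIA container toolkit for GPU passthrough into containers
# (assumes the NVIDIA apt repo has already been configured per NVIDIA's docs).
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Smoke test (image tag is illustrative; requires a working GPU driver).
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```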
-
Can anyone suggest how I would go about debugging munged errors?
I've created a new cluster (yaml) and I see the controller and login nodes, but the compute nodes did not get launched.
slurmd is not running, and I see the following in the logs of the controller node:
In syslog I see this: