Debugging munge errors? #1621
Replies: 7 comments 2 replies
-
The possible issue I see is that I created the new cluster image by cloning an existing cluster machine where our researchers manually installed the needed dependencies. So there could be files from a previous Slurm deployment interfering; I'm trying to figure out what those are.
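In case it helps narrow that down, a rough sketch of where leftover state from the source cluster often lives (typical default paths, not confirmed for this image):

```bash
# Leftover credentials/state from the source cluster that commonly cause
# munge/Slurm mismatches on a cloned image (typical default paths).
ls -l /etc/munge/munge.key 2>/dev/null             # munge key baked in from the old cluster
ls -ld /etc/slurm /var/spool/slurm* 2>/dev/null    # old Slurm config and state directories
ls -l /var/log/munged /var/log/slurm* 2>/dev/null  # old logs show what ran on the source machine
```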
-
**Short answer / best guess**

You probably saved an image with the file

**Longer advice**

Outwardly, it's not obvious you have a functioning controller. The services and timers I'd expect to be active/running are:
Note: slurmd.service is typically only active on compute nodes and on login nodes that use configless operation. Our solution enables configless so slurmd.service will be active on every node except the controller.
Checking the status of the munge service will probably show that it failed. You could look at /var/log/munged/munged.log for more details. Sometimes a munged failure is a "canary in the coal mine" indicating a different failure. I almost always look at two things for a failed boot:
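A minimal sketch of the munge-side checks mentioned above (the unit name and log path are standard defaults; nothing here is specific to this deployment):

```bash
# On the controller: confirm the munge and Slurm controller services came up.
sudo systemctl status munge slurmctld --no-pager

# If munge failed, its own log usually explains why (key permissions,
# missing/mismatched munge.key, clock skew, ...).
sudo tail -n 50 /var/log/munged/munged.log

# Quick end-to-end test of the munge credential path on a working node.
munge -n | unmunge
```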
-
OK, this definitely did something; I now see
However, now slurm setup failed with
I'm actually not using /opt/apps for anything (is there a point in /opt/apps if you have Filestore?), so I'm checking to see if I can remove this dependency from the yaml.
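If useful, a quick way to confirm whether anything on the running nodes actually depends on /opt/apps (generic commands, nothing toolkit-specific):

```bash
# See whether /opt/apps is mounted (or configured to be) and what backs it.
mount | grep /opt/apps
grep /opt/apps /etc/fstab
df -h /opt/apps 2>/dev/null
```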
-
In the big picture, this approach is going to be fragile. Let me suggest the following:
-
@yaroslavvb I'm sorry, a careful reading of the
-
I'm aware it is fragile and that, long term, everything should be done through Packer scripts and reusable configurations. In the short term, getting Docker/GPUs working on a machine required hours of trial and error, so I'm looking for a workaround to get this configuration deployed.
-
The underlying requirement is to run jobs that use Docker with GPU support, which means installing Docker plus NVIDIA GPU passthrough. Both come with instructions for installing them locally using "sudo apt-get"; I haven't yet found a supported alternative route that can be massaged into a sequence of commands for the gHPC toolkit .yaml. Installing Docker should not interfere with files added by the gHPC toolkit, and a potential way to get this approach to work is to undo all the modifications made by the gHPC toolkit. That is easy if it's just a matter of deleting some files like
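For what it's worth, a rough sketch of what that command sequence could look like as a startup script (assumes a Debian/Ubuntu image with the NVIDIA driver already installed and NVIDIA's apt repository already added per their container-toolkit docs; package names and the image tag are illustrative):

```bash
#!/bin/bash
set -euo pipefail

# Docker from the distro repo (the upstream docker-ce repo is the other common route).
sudo apt-get update
sudo apt-get install -y docker.io

# NVIDIA container toolkit for GPU passthrough into containers
# (assumes the NVIDIA apt repo has already been configured per NVIDIA's docs).
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Smoke test (image tag is illustrative; requires a working GPU driver).
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```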
-
Can anyone suggest how I would go about debugging munged errors?
I've created a new cluster (yaml) and I see the controller and login nodes, but the compute nodes did not get launched.
slurmd is not running, and I see the following in the logs of the controller node:
In syslog I see this: