GCP toolkit recreates node in an infinite loop #1584

yaroslavvb · 2023-07-14T00:45:21Z

yaroslavvb
Jul 14, 2023

I have a node that gets killed and recreated by the HPC Toolkit.

We use reserved pre-paid instances, so if there's a problem with a machine we get from GCP, we need to tell GCP the nature of the problem so that they can fix it or issue a refund. Any suggestions on how to troubleshoot this problem?

Below are controller logs.

[2023-07-13T23:31:50.215] job_str_signal(3): invalid JobId=2567                                                                                                  
[2023-07-13T23:31:50.215] _slurm_rpc_kill_job: job_str_signal() uid=9302157 JobId=2567 sig=9 returned: Invalid job id specified                                  
[2023-07-13T23:31:51.866] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=257 uid 9302157                                                                            
[2023-07-13T23:31:56.568] sched: Allocate JobId=258 NodeList=alpha5-ultra-ghpc-[3,8] #CPUs=96 Partition=ultra                                                    
[2023-07-13T23:44:58.236] sched: _slurm_rpc_allocate_resources JobId=259 NodeList=alpha5-ultra-ghpc-28 usec=6565                                               
                                                                                                                                                                 
==> resume.log <==                                                                                                                                               
2023-07-13 23:44:58,948 INFO: resume alpha5-ultra-ghpc-28                                                                                                        
                                                                                                                                                                 
==> slurmctld.log <==                                                                                                                                            
[2023-07-13T23:45:32.583] _job_complete: JobId=259 WTERMSIG 126                                                                                                  
[2023-07-13T23:45:32.583] _job_complete: JobId=259 cancelled by interactive user                                                                                 
[2023-07-13T23:45:32.992] _job_complete: JobId=259 done                                                                                                          
[2023-07-13T23:45:32.992] _slurm_rpc_complete_job_allocation: JobId=259 error Job/step already completing or completed                                           
[2023-07-13T23:45:33.000] error: get_addr_info: getaddrinfo() failed: Name or service not known                                                                  
[2023-07-13T23:45:33.000] error: slurm_set_addr: Unable to resolve "alpha5-ultra-ghpc-28"                                                                        
[2023-07-13T23:45:33.000] error: fwd_tree_thread: can't find address for host alpha5-ultra-ghpc-28, check slurm.conf                                             
[2023-07-13T23:45:34.087] sched: _slurm_rpc_allocate_resources JobId=260 NodeList=alpha5-ultra-ghpc-28 usec=4880                                                 
[2023-07-13T23:45:54.370] _job_complete: JobId=260 WTERMSIG 126                                                                                                  
[2023-07-13T23:45:54.370] _job_complete: JobId=260 cancelled by interactive user                                                                                 
[2023-07-13T23:45:54.370] _job_complete: JobId=260 done                                                                                                          
[2023-07-13T23:45:54.371] _slurm_rpc_complete_job_allocation: JobId=260 error Job/step already completing or completed                                           
[2023-07-13T23:45:54.373] _slurm_rpc_complete_job_allocation: JobId=260 error Job/step already completing or completed                                           
[2023-07-13T23:45:54.376] error: get_addr_info: getaddrinfo() failed: Name or service not known                                                                  
[2023-07-13T23:45:54.376] error: slurm_set_addr: Unable to resolve "alpha5-ultra-ghpc-28"                                                                        
[2023-07-13T23:45:54.376] error: fwd_tree_thread: can't find address for host alpha5-ultra-ghpc-28, check slurm.conf                                             
                                                                                                                                                                 
==> resume.log <==                                                                                                                                               
2023-07-13 23:47:01,069 INFO: created 1 instances: nodes=alpha5-ultra-ghpc-28                                                                                    
2023-07-13 23:47:01,073 INFO: create 1 subscriptions (alpha5-ultra-ghpc-28)                                                                                      
2023-07-13 23:47:12,093 INFO: Subscription created: projects/contextual-research-common/subscriptions/alpha5-ultra-ghpc-28                                       
                                                                                                                                                                 
==> slurmctld.log <==                                                                                                                                            
[2023-07-13T23:49:03.675] Node alpha5-ultra-ghpc-28 now responding                                                                                               
[2023-07-13T23:53:42.473] error: Nodes alpha5-ultra-ghpc-28 not responding                                                                                       
                                                                                                                                                                 
==> suspend.log <==                                                                                                                                              
2023-07-13 23:54:04,258 INFO: suspend alpha5-ultra-ghpc-28                                                                                                       
2023-07-13 23:54:04,962 INFO: delete 1 subscriptions (alpha5-ultra-ghpc-28)                                                                                      
2023-07-13 23:54:16,177 INFO: Subscription deleted: projects/contextual-research-common/subscriptions/alpha5-ultra-ghpc-28.                                      
2023-07-13 23:54:16,194 INFO: delete 1 instances (alpha5-ultra-ghpc-28)                                                                                          
2023-07-13 23:55:59,284 INFO: deleted 1 instances alpha5-ultra-ghpc-28                                                                                           
                                                                                                                                                                 
==> slurmsync.log <==                                                           
2023-07-13 23:56:02,780 INFO: 1 nodes to idle (alpha5-ultra-ghpc-28)                                                                                             
                                                                                                                                                                 
==> slurmctld.log <==                                                                                                                                            
[2023-07-13T23:56:02.785] update_node: node alpha5-ultra-ghpc-28 state set to IDLE                                                                               
[2023-07-14T00:11:15.732] sched: _slurm_rpc_allocate_resources JobId=261 NodeList=(null) usec=3719                                                               
[2023-07-14T00:11:18.197] _job_complete: JobId=261 WTERMSIG 126                                                                                                  
[2023-07-14T00:11:18.197] _job_complete: JobId=261 cancelled by interactive user                                                                                 
[2023-07-14T00:11:18.197] _job_complete: JobId=261 done                                                                                                          
[2023-07-14T00:11:20.972] _job_complete: JobId=248 WEXITSTATUS 1                                                                                                 
[2023-07-14T00:11:21.345] _job_complete: JobId=248 done                                                                                                          
[2023-07-14T00:11:31.642] sched: _slurm_rpc_allocate_resources JobId=262 NodeList=alpha5-ultra-ghpc-[0,28] usec=5982                                             
                                                                                                                                                                 
==> resume.log <==                                                                                                                                               
2023-07-14 00:11:32,589 INFO: resume alpha5-ultra-ghpc-28                                                                                                        
2023-07-14 00:13:35,009 INFO: created 1 instances: nodes=alpha5-ultra-ghpc-28                                                                                    
2023-07-14 00:13:35,013 INFO: create 1 subscriptions (alpha5-ultra-ghpc-28)                                                                                      
2023-07-14 00:13:47,133 INFO: Subscription created: projects/contextual-research-common/subscriptions/alpha5-ultra-ghpc-28                                       
                                                                                                                                                                 
==> slurmctld.log <==                                                                                                                                            
[2023-07-14T00:15:38.895] Node alpha5-ultra-ghpc-28 now responding                                                                                               
[2023-07-14T00:16:00.644] job_time_limit: Configuration for JobId=262 complete                                                                                   
[2023-07-14T00:16:00.644] Resetting JobId=262 start time for node power up                                                                                       
[2023-07-14T00:18:46.294] sched: _slurm_rpc_allocate_resources JobId=263 NodeList=(null) usec=4338                                                               
[2023-07-14T00:18:48.916] _job_complete: JobId=263 WTERMSIG 126                                                                                                  
[2023-07-14T00:18:48.917] _job_complete: JobId=263 cancelled by interactive user                                                                                 
[2023-07-14T00:18:48.917] _job_complete: JobId=263 done                                                                                                          
[2023-07-14T00:24:28.862] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=262 uid 1621545018                                                                         
[2023-07-14T00:24:30.284] _job_complete: JobId=239 WTERMSIG 126                                                                                                  
[2023-07-14T00:24:30.285] _job_complete: JobId=239 cancelled by interactive user                                                                                 
[2023-07-14T00:24:30.838] _job_complete: JobId=239 done                                                                                                          
[2023-07-14T00:25:22.779] update_node: node alpha5-ultra-ghpc-28 reason set to: broken                                                                           
[2023-07-14T00:25:22.780] update_node: node alpha5-ultra-ghpc-28 state set to DRAINED                                                                            
[2023-07-14T00:26:26.246] sched: _slurm_rpc_allocate_resources JobId=264 NodeList=alpha5-ultra-ghpc-[14-15] usec=5445                                            
                                                                                                                                                                 
==> suspend.log <==                                                                                                                                              
2023-07-14 00:29:39,096 INFO: suspend alpha5-ultra-ghpc-28                                                                                                       
2023-07-14 00:29:39,856 INFO: delete 1 subscriptions (alpha5-ultra-ghpc-28)                                                                                      
2023-07-14 00:29:50,887 INFO: Subscription deleted: projects/contextual-research-common/subscriptions/alpha5-ultra-ghpc-28.                                      
2023-07-14 00:29:50,904 INFO: delete 1 instances (alpha5-ultra-ghpc-28)                                                                                          
2023-07-14 00:31:39,985 INFO: deleted 1 instances alpha5-ultra-ghpc-28                                                                                           
                                                                                                                                                                 
==> slurmsync.log <==                                                                                                                                            
2023-07-14 00:32:03,533 INFO: 1 nodes to idle (alpha5-ultra-ghpc-28)                                                                                             
                                                                                                                                                                 
==> slurmctld.log <==                                                                                                                                            
[2023-07-14T00:32:03.538] update_node: node alpha5-ultra-ghpc-28 state set to IDLE

nick-stroud · 2023-07-14T15:45:39Z

nick-stroud
Jul 14, 2023
Maintainer

What looks odd to me is the Node alpha5-ultra-ghpc-28 now responding message. When I have seen node cycling previously, the node usually never becomes healthy and the controller gives up at some point and kills the node.

In this case it looks like the node becomes healthy and is up for several minutes (about 5 min for the first occurrence and 10 min for the second occurrence).
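Those two uptime figures can be checked against the slurmctld.log timestamps quoted in the original post. This is a minimal sketch, assuming a `date` that accepts `-d` (GNU coreutils or BusyBox); the helper name `elapsed_seconds` is made up for illustration.

```shell
# Hypothetical helper: seconds between two slurmctld.log timestamps.
# Fractional seconds are dropped and the "T" separator is replaced by a
# space before parsing; both timestamps are treated as UTC.
elapsed_seconds() {
    start=$(date -u -d "$(echo "${1%.*}" | tr T ' ')" +%s)
    end=$(date -u -d "$(echo "${2%.*}" | tr T ' ')" +%s)
    echo $((end - start))
}

# First cycle: "now responding" until "not responding"
elapsed_seconds 2023-07-13T23:49:03.675 2023-07-13T23:53:42.473
# Second cycle: "now responding" until "state set to DRAINED"
elapsed_seconds 2023-07-14T00:15:38.895 2023-07-14T00:25:22.780
```

The two calls work out to 279 and 584 seconds, so a little under 5 and 10 minutes respectively.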

To debug further it would be most helpful to see the logs generated by alpha5-ultra-ghpc-28 while it is alive, particularly if there is any clue as to why it is dying. There will likely be logs in Cloud Logging even after the instance is gone if you search by the instance name or ID.

I would specifically be interested in /var/log/messages and anything in /var/log/slurm/*.

4 replies

yaroslavvb Jul 14, 2023
Author

OK, I'll try to catch it next time this happens. One thing I'm missing is some logging that explains why the node is being brought down. It's a little unnerving to see the toolkit taking down our instances without understanding the reason -- I could be logged into that machine to debug things.

nick-stroud Jul 20, 2023
Maintainer

Just to clarify, it is Slurm that is taking these nodes down with the suspend.py script, not the HPC Toolkit. I have brought this discussion to the attention of the github.com/SchedMD/slurm-gcp developers.

bsngardner Jul 20, 2023

The slurmsync.py script is responsible for bringing down unresponsive nodes, which here means the slurmd on that node never checked in to the controller. You can stop the slurmsync task if you want the instance to stay up. Depending on your version, it's either a cron job owned by the slurm user or (more recently) a systemd timer called slurmcmd.timer (slurm cluster management daemon).

As for what is happening here, you would need to see the slurmd.log and setup.log from the node. Depending on what failed, the slurmd.log might have been uploaded to cloud logging before the node terminated.

edit: To clarify, slurmsync reconciles Slurm's node state with the VMs on GCP. If a node is down in Slurm but has a VM up, it will terminate the VM to allow the node to return to IDLE+POWERED_DOWN.
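As a rough illustration of the reply above, the sketch below prints which command would pause slurmsync on a given controller, without actually running it. The unit name slurmcmd.timer comes from the reply; the function name `slurmsync_pause_hint` is hypothetical, and the cron check assumes the slurm user's crontab entry mentions "slurmsync", which may not hold for every slurm-gcp version.

```shell
# Hypothetical helper: print (do not run) the command that would pause
# slurmsync on this controller.
# Assumption: an older deployment has a crontab line containing "slurmsync".
slurmsync_pause_hint() {
    if systemctl list-unit-files slurmcmd.timer 2>/dev/null | grep -q 'slurmcmd\.timer'; then
        # Newer deployments: systemd timer named in the reply above
        echo "sudo systemctl stop slurmcmd.timer"
    elif sudo -n crontab -u slurm -l 2>/dev/null | grep -q slurmsync; then
        # Older deployments: edit the slurm user's crontab and comment out the entry
        echo "sudo crontab -u slurm -e"
    else
        echo "no slurmsync schedule found" >&2
        return 1
    fi
}
```

Run `slurmsync_pause_hint` on the controller and review the printed command before executing it. Pausing slurmsync also pauses the reconciliation described above, so remember to start the timer (or restore the crontab entry) once debugging is done.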

nick-stroud Jul 20, 2023
Maintainer

This:

==> slurmctld.log <==                                                                                                                                            
[2023-07-13T23:49:03.675] Node alpha5-ultra-ghpc-28 now responding                                                                                               
[2023-07-13T23:53:42.473] error: Nodes alpha5-ultra-ghpc-28 not responding

makes me think that the node does check in with the controller at some point but then stops responding.

+1 to needing logs from the compute node to debug further and understand why it stopped responding. @yaroslavvb, is there anything available in Cloud Logging for alpha5-ultra-ghpc-28 (similar to how you pulled data in #1593)?

tpdownes · 2023-07-26T22:18:53Z

tpdownes
Jul 26, 2023
Maintainer

@yaroslavvb among the most useful diagnostic tools you might find is Cloud Logging. Here are some commands I might advise for retrospective analysis. Suppose you have a machine named "my-vm" that came and went before you could use it or log in to it. The following command will print some pretty useful Cloud Logging entries associated with a VM:

gcloud logging read 'labels."compute.googleapis.com/resource_name"="my-vm"' --format="table(timestamp, jsonPayload.message)" --freshness 1h

Notes:

this presents entries in the reverse of the time order you expect from Linux logs; you can pipe it to tac to get it in "normal" order
you might need to specify --project YOUR-PROJECT if you are trying to read from something other than your default project
alter "freshness" accordingly. There are other ways of filtering for time, but I find freshness the easiest at the command line
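The three notes can be folded into one small wrapper. This is only a sketch: `vm_logs` and its argument order (VM name, freshness, optional project) are made up for illustration, while the gcloud flags are the ones used in the command above.

```shell
# Hypothetical wrapper: vm_logs VM_NAME [FRESHNESS] [PROJECT]
# Prints the VM's Cloud Logging messages oldest-first.
vm_logs() (
    vm=$1
    freshness=${2:-1h}      # same default window as the command above
    project=${3:-}          # empty means "use the default project"
    gcloud logging read \
        "labels.\"compute.googleapis.com/resource_name\"=\"$vm\"" \
        --format="table(timestamp, jsonPayload.message)" \
        --freshness "$freshness" \
        ${project:+--project "$project"} | tac
)

# Example (not run here): vm_logs alpha5-ultra-ghpc-28 24h YOUR-PROJECT
```

The function body runs in a subshell, so its variables do not leak into the calling shell.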

Sometimes you need the "instance ID" of a machine. This is an identifier for the VM that is unique across all of space and time, and it allows us to trace the VM to the hardware it ran on. This is interesting for your problem because a VM with the same name gets a different instance ID every time it is deleted and recreated.

gcloud logging read 'labels."compute.googleapis.com/resource_name"="my-vm"' --format="table(timestamp, resource.labels.instance_id)" --freshness 24h | tac

Combining these two commands you can filter on a specific instance ID

gcloud logging read 'labels."compute.googleapis.com/resource_name"="my-vm" AND resource.labels.instance_id="1846390157162066218"' --format="table(timestamp, jsonPayload.message)" --freshness 24h
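For a node that is being recreated repeatedly, it can help to see each incarnation once, with the time it first logged anything. The sketch below is one way to do that under a few assumptions: `instance_incarnations` is a hypothetical helper name, it relies on gcloud's `value(...)` format printing tab-separated fields, and it relies on the newest-first ordering noted earlier.

```shell
# Hypothetical helper: instance_incarnations VM_NAME [FRESHNESS]
# Prints "first-seen-timestamp <TAB> instance_id", one line per distinct
# instance ID, in chronological order.
instance_incarnations() (
    vm=$1
    freshness=${2:-24h}
    gcloud logging read \
        "labels.\"compute.googleapis.com/resource_name\"=\"$vm\"" \
        --format="value(timestamp, resource.labels.instance_id)" \
        --freshness "$freshness" |
        tac |                          # oldest entries first
        awk 'NF == 2 && !seen[$2]++'   # keep the first line for each ID
)

# Example (not run here): instance_incarnations alpha5-ultra-ghpc-28 24h
```

Each ID printed can then be substituted into the instance_id filter shown above to read the logs of one incarnation only.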

It is also sometimes useful to browse to the Cloud Logging web interface and build queries using the GUI. This is especially useful for constraining queries by time, or for finding entries about "my-vm" that are API calls and so do not carry the resource_name label (that label appears only on logs from the VM itself).

https://console.cloud.google.com/logs/query

1 reply

yaroslavvb Aug 24, 2023
Author

Just used this today to troubleshoot Slurm unexpectedly deleting our GPU nodes (again), pretty useful recipe! @cboneti

tpdownes · 2023-07-26T22:20:01Z

tpdownes
Jul 26, 2023
Maintainer

@yaroslavvb your Cloud Logging entries for this problem may very well still exist. Keep in mind that the filesystem logs you posted use UTC timestamps, while the web interface converts to local time. You can probably dig up some of the info.
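When matching a line from the posted logs against the web interface, converting the UTC timestamp first avoids an off-by-several-hours search. A minimal sketch, assuming a `date` that accepts `-d`; the helper name `utc_to_local` is made up, and the default zone (US Pacific, written as a POSIX TZ string) is only an example, not something stated in this thread.

```shell
# Hypothetical helper: utc_to_local UTC_TIMESTAMP [POSIX_TZ]
# Shows a UTC log timestamp in another time zone.
# Assumption: the default zone below is an arbitrary example.
utc_to_local() (
    zone=${2:-PST8PDT,M3.2.0,M11.1.0}
    # Drop fractional seconds and the "T" separator, then parse as UTC
    epoch=$(date -u -d "$(echo "${1%.*}" | tr T ' ')" +%s)
    TZ=$zone date -d "@$epoch" '+%Y-%m-%d %H:%M:%S %Z'
)

utc_to_local 2023-07-13T23:49:03.675
```

With the example zone, the "now responding" entry from the controller log comes out as 2023-07-13 16:49:03 PDT.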

1 reply

yaroslavvb Jul 26, 2023
Author

So I've looked at the blueprint used in this setting and realized I had dynamic_max: 1. This was around the time when the cluster was pretty full, so it was expected that the Slurm controller would bring the node up and down.

Sorry for the confusion. Being able to see the reason why Slurm is issuing takedown requests would help avoid this in the future. Looking at the Slurm GCP docs, it looks like there's a flag to enable additional logging from Slurm:

https://github.com/SchedMD/slurm-gcp/blob/master/docs/faq.md#how-do-i-enable-additional-logging-for-slurm-gcp

Is this something I should set through Google HPC toolkit yaml file, or should this be done manually after the cluster is up?

tpdownes · 2023-07-26T22:39:02Z

tpdownes
Jul 26, 2023
Maintainer

There's no blueprint/Terraform "native" way to specify these settings, though you could try controller_startup_script on the controller module and startup_script on the login and partition modules.

Based on your response, I'm going to close this discussion. Keep in mind the gcloud logging commands above for retrospective analysis.

0 replies
