GCP toolkit recreates node in an infinite loop #1584
-
I have a node that keeps getting killed and recreated by HPC Toolkit. We use reserved pre-paid instances, so if there's a problem with a machine we get from GCP, we need to tell GCP the nature of the problem so that they can fix it or issue a refund. Any suggestions on how to troubleshoot this problem? Below are the controller logs.
Replies: 4 comments 6 replies
-
What looks odd to me is the In this case it looks like the node becomes healthy and is up for several minutes (5 min for the first occurrence and 10 min for the second occurrence). To debug further, it would be most helpful to see the logs generated by alpha5-ultra-ghpc-28 while it is alive, particularly any clue as to why it is dying. Even if the instance is no longer alive, there will likely be logs in Cloud Logging if you search by the instance name or ID. I would specifically be interested in
-
@yaroslavvb one of the most useful diagnostic tools you will find is Cloud Logging. Here are some commands I would suggest for retrospective analysis. Suppose you have a machine named "my-vm" that came and went before you could use it or log in to it. The following command will print some pretty useful Cloud Logging entries associated with a VM:
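The command itself is not preserved in this thread. As a rough sketch only (not necessarily what was originally posted), a lookup like this could be driven through the gcloud CLI. The helper names `vm_log_command` and `run` are hypothetical, the 7-day window and entry limit are arbitrary choices, and gcloud is assumed to be installed and authenticated.

```python
import subprocess


def vm_log_command(vm_name, project=None, freshness="7d", limit=200):
    """Return an argv list for `gcloud logging read` matching vm_name."""
    # A quoted bare string in the Logging query language matches any field,
    # so this also finds API audit entries that only mention the VM by name.
    log_filter = f'"{vm_name}"'
    cmd = [
        "gcloud", "logging", "read", log_filter,
        f"--freshness={freshness}",  # how far back to search
        f"--limit={limit}",          # cap the number of entries returned
        "--order=asc",               # oldest first, instead of reversing with tac
        "--format=json",
    ]
    if project:
        cmd.append(f"--project={project}")
    return cmd


def run(cmd):
    """Run the command and return its stdout; needs gcloud on PATH."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

For example, `print(run(vm_log_command("my-vm", project="YOUR-PROJECT")))` would dump the matching entries as JSON. Building the argument list separately from running it makes it easy to inspect the exact query before sending it.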
Notes:
Sometimes you need the "instance ID" of a machine. This is a unique (across all of space and time) identifier for the VM that allows us to trace it to the hardware it ran on. This matters for your problem because a VM with the same name gets a different instance ID every time it is deleted and recreated.
Combining these two commands, you can filter on a specific instance ID
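The two commands referred to here are not preserved in the thread, so the following is only a sketch of one way to do the same thing with the gcloud CLI. The helper names are hypothetical, and the filter assumes the standard `gce_instance` monitored resource, whose labels include `instance_id`. Note that `describe` only works while the VM exists; for a VM that is already gone, the ID can be read from the `resource.labels.instance_id` field of its earlier log entries.

```python
def instance_id_command(vm_name, zone, project=None):
    """Return an argv list that prints the numeric instance ID of a live VM."""
    cmd = [
        "gcloud", "compute", "instances", "describe", vm_name,
        f"--zone={zone}",
        "--format=value(id)",  # print only the ID field
    ]
    if project:
        cmd.append(f"--project={project}")
    return cmd


def instance_log_filter(instance_id):
    """Return a Cloud Logging filter for one specific incarnation of a VM."""
    return (
        'resource.type="gce_instance" AND '
        f'resource.labels.instance_id="{instance_id}"'
    )


def instance_log_command(instance_id, project=None, freshness="7d"):
    """Return an argv list for `gcloud logging read` limited to one instance ID."""
    cmd = [
        "gcloud", "logging", "read", instance_log_filter(instance_id),
        f"--freshness={freshness}",
        "--format=json",
    ]
    if project:
        cmd.append(f"--project={project}")
    return cmd
```

Each list can be passed to `subprocess.run(..., check=True)` on a machine where gcloud is installed. Filtering by ID rather than by name separates the log entries of each recreated VM, which is the distinction that matters when a node with the same name keeps being replaced.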
It is also sometimes useful to browse to the Cloud Logging web interface and build queries using the GUI. This is especially useful for constraining queries by time or for finding entries for "my-vm" that are API calls that do not have that
-
@yaroslavvb your Cloud Logging entries for this problem may very well still exist. Those filesystem logs use UTC timestamps, while the web interface converts to local time. You can probably dig up some of the info.
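When matching a line from the filesystem logs against what the web interface shows, it can help to convert the UTC timestamp explicitly. The sketch below assumes a `YYYY-MM-DD HH:MM:SS` timestamp layout, which may not match the actual controller log format; the function name and the sample values are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def utc_stamp_to_local(stamp, local_tz=None, fmt="%Y-%m-%d %H:%M:%S"):
    """Interpret a naive log timestamp as UTC and convert it to local time.

    local_tz=None means "use this machine's local time zone".
    """
    utc_time = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return utc_time.astimezone(local_tz)


# Illustrative values: a UTC log time viewed from a UTC-7 zone.
utc_minus_7 = timezone(timedelta(hours=-7))
print(utc_stamp_to_local("2023-07-20 17:05:00", utc_minus_7).isoformat())
```

Going the other way (local time from the web interface back to UTC) is the same call with `timezone.utc` as the target, applied to a timezone-aware local datetime.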
-
There's no blueprint/Terraform "native" way to specify these settings, though you could try Based on your response, I'm going to close this discussion. Keep in mind the
Notes (for the Cloud Logging command in the reply above):
tac to get it in "normal" order
--project YOUR-PROJECT if you are trying to read from so…