
Timeout errors from qstat cause UNAVAILABLE status #110

Closed
DavidHuber-NOAA opened this issue Aug 9, 2024 · 5 comments

Comments

@DavidHuber-NOAA

On occasion, rocotostat fails to get the status of a job via qstat within 45 seconds. This results in an UNAVAILABLE status being reported for the job.

An example log file is available on WCOSS2 at /u/terry.mcguinness/ROCOTO.org/1.3.5/C96_atmaerosnowDA_d443bf9c/log.20240808:

```
08/08/24 20:56:49 UTC :: C96_atmaerosnowDA_d443bf9c.xml :: WARNING! The command 'qstat -x -f 147475818 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\t/ /g'' timed out after 45 seconds.
08/08/24 20:56:49 UTC :: C96_atmaerosnowDA_d443bf9c.xml :: Timeout::Error
```

More details on this failure are available in these comments:
NOAA-EMC/global-workflow#2755 (comment)
NOAA-EMC/global-workflow#2755 (comment)

@DavidHuber-NOAA
Author

FYI @TerryMcGuinness-NOAA

@christopherwharrop-noaa
Collaborator

Rocoto has no way to control the behavior of the host machine and batch system; the commands take as long as they take. The PBSPro interface has already been highly tuned for performance (after we observed issues on Cheyenne), so while I can take another look, further optimization is very unlikely. PBSPro has a history of scaling/threading problems and it is easy to overwhelm.

You can try running the qstat command manually and see how long it takes:

```shell
qstat -x -f #{joblist} | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\t/ /g'
```

(where `#{joblist}` is the list of job IDs to query).

You can also set JobQueueTimeout and JobAcctTimeout in the rocotorc file to something longer than 45 seconds, but keep in mind that if you're running rocotorun every 60 seconds, you will encounter issues if you make the timeout much longer than 45.
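For reference, a minimal sketch of what those rocotorc settings might look like. This assumes the YAML key/value format Rocoto uses for its configuration file (typically `$HOME/.rocoto/rocotorc`); treat the surrounding keys and the exact file location as assumptions to verify against your Rocoto installation's documentation:

```yaml
---
# Hypothetical rocotorc fragment: raise the batch-system query timeouts
# from the 45-second default. Keep them below your rocotorun cadence
# (e.g. below 60 s if rocotorun fires every minute).
JobQueueTimeout: 55   # seconds to wait for qstat/squeue-style queries
JobAcctTimeout: 55    # seconds to wait for accounting queries
```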

@christopherwharrop-noaa
Collaborator

Another thing to keep in mind: if Rocoto cannot get the status of a job because the status command (e.g. qstat, squeue, etc.) hangs and times out, Rocoto will mark the status as "UNAVAILABLE" because the status is not knowable. Batch systems only keep the status of jobs that completed in the recent past, so if the system issues persist long enough for the job's record to be purged, the "UNAVAILABLE" status becomes permanent. However, if the system recovers before then, a later status update will succeed and the status will be retrieved. There isn't anything Rocoto can do other than make multiple attempts to retrieve the status and time out commands that hang.
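The retry-and-timeout behavior described above can be sketched as follows. This is a hypothetical illustration in Python, not Rocoto's actual implementation (Rocoto is written in Ruby and uses its own Timeout handling); the function name, defaults, and return convention are assumptions for the sake of the example:

```python
import subprocess

def get_job_status(cmd, timeout=45, attempts=2):
    """Run a batch-system status command, retrying if it hangs.

    Returns the command's stdout on success, or "UNAVAILABLE" when
    every attempt times out or fails -- mirroring how a scheduler
    client might report a status it cannot determine.
    """
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True,
                text=True, timeout=timeout,
            )
            if result.returncode == 0:
                return result.stdout.strip()
        except subprocess.TimeoutExpired:
            continue  # the command hung past the limit; try again
    return "UNAVAILABLE"
```

Note that once the batch system purges a completed job's record, every attempt fails the same way, which is why the "UNAVAILABLE" status can become permanent.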

@DavidHuber-NOAA
Author

Thanks for the explanation, @christopherwharrop-noaa. I see now that our CI system does not check for UNAVAILABLE statuses, so I have added handling for such statuses in NOAA-EMC/global-workflow#2820. I'll close this issue once that PR is merged.

@DavidHuber-NOAA
Author

Resolved by NOAA-EMC/global-workflow#2820.
