Global-Workflow GFS Archive Task Fails with Exit Code 72 #2494
Comments
This issue could be related to the HPSS upper-limit for the size of the tarballs.
I've been having issues myself with HPSS the last day or so. I think this is unrelated to workflow.
@ChristianBoyer-NOAA HPSS seems stable right now. Please rewind the archive job and let me know if it still fails.
I have rewound and rerun the archive jobs, and they failed again. I also have separate single-forecast runs of the global-workflow in progress for the dates that failed, to see whether those will archive to HPSS successfully.
Okay, thanks for checking. The internet has been less than helpful on what htar error code 72 is.
@WalterKolczynski-NOAA See https://hpc.llnl.gov/technical-bulletin-497-htar-update, about half-way down.
Is this fixed following #2491?
Hello @ChristianBoyer-NOAA
Hi @aerorahul
Ran a test on WCOSS2 on the dates specified in this PR (2020060100 and 2020060400). Archiving was successful for both cycles. This seems to have been resolved by #2491. Closing.
What is wrong?
The 'gfsarch' task of the global-workflow crashes with exit code 72 when running multiple forecast cycles.
I am attempting to run simulations for the summer months, starting from 2020060100 and ending at 2020083100. I set up the experiment to run a forecast from every set of initial conditions over the summer of 2020. These are available every 3 days, so the experiment produces 144-hr forecasts from initial conditions on 6/1, 6/4, 6/7, and so on.
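For reference, the cycle schedule described above can be enumerated with a short script. This is only a sketch: the function name `cycle_dates` and its parameters are hypothetical and are not part of global-workflow; the start date, end date, and 3-day spacing are taken from the description.

```python
from datetime import datetime, timedelta


def cycle_dates(start: str, end: str, step_days: int = 3) -> list[str]:
    """Return cycle dates in YYYYMMDDHH form from start to end, inclusive.

    Hypothetical helper for illustration only; not a global-workflow function.
    """
    fmt = "%Y%m%d%H"
    current = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    cycles = []
    # Step forward in fixed increments until the end date is passed.
    while current <= last:
        cycles.append(current.strftime(fmt))
        current += timedelta(days=step_days)
    return cycles


cycles = cycle_dates("2020060100", "2020083100")
print(len(cycles), cycles[:3], cycles[-1])
```

With a 3-day step starting on 6/1, the last cycle on or before 8/31 falls on 8/30, so the list holds 31 cycles. A list like this can be handy for checking which cycles have a tarball on HPSS and which do not.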
However, the job crashes within the 'gfsarch' task, which archives the output and other files to HPSS. The archive task succeeds for 6/1 and all tarballs are successfully archived on HPSS, but it then fails on the other dates (e.g., 6/4 and 6/7). The exit code for the failed archive tasks is 72. I have listed the file paths and directories relevant to my issue below under Additional information. Thank you.
Brief snippet of error from gfsarch.log file for 2020060400 forecast cycle:
FATAL ERROR: htar /NCEPDEV/emc-global/1year/Christian.Boyer/HERA/scratch/hr3sum_con/2020060400/gfsa.tar failed
What should have happened?
The 'gfsarch' task of the workflow should successfully create the tarballs and archive them to HPSS.
What machines are impacted?
Hera
Steps to reproduce
./setup_expt.py gfs forecast-only --app S2SW --pslot $PSLOT --configdir $CONFIGDIR --idate 2020060100 --edate 2020063000 --resdetatmos 768 --resdetocean 0.25 --gfs_cyc 1 --comroot $COMROOT --expdir $EXPDIR
Additional information
Directories and Logfile Paths
EXPDIR: /scratch1/NCEPDEV/global/Christian.Boyer/save/para_ufs/hr3sum_con
ROTDIR: /scratch1/NCEPDEV/stmp2/Christian.Boyer/ROTDIRS/hr3sum_con/
Archive log files:
Succeed (6/1): /scratch1/NCEPDEV/stmp2/Christian.Boyer/ROTDIRS/hr3sum_con/logs/2020060100/gfsarch.log
Failed (6/4): /scratch1/NCEPDEV/stmp2/Christian.Boyer/ROTDIRS/hr3sum_con/logs/2020060400/gfsarch.log
Global-Workflow Hash:
4f0f773
Do you have a proposed solution?
No response