
Global-Workflow GFS Archive Task Fails with Exit Code 72 #2494

Closed
ChristianBoyer-NOAA opened this issue Apr 16, 2024 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@ChristianBoyer-NOAA

What is wrong?

The 'gfsarch' task of the global-workflow crashes with exit code 72 when running multiple forecast cycles.

I am attempting to run simulations for the summer months, starting from 2020060100 and ending 2020083100. I set up the experiment to initialize a 144-hr forecast every 3 days over the summer of 2020, i.e., from initial conditions on 6/1, 6/4, 6/7, and so on.

However, the job crashes within the 'gfsarch' task, which archives the output and other files to HPSS. The archive task succeeds for 6/1 and all tarballs are successfully archived on HPSS, but it then fails for the later dates (e.g., 6/4 and 6/7) with exit code 72. I have listed relevant file paths and directories for my issue below under additional information. Thank you.

Brief snippet of error from gfsarch.log file for 2020060400 forecast cycle:

  + exglobal_archive.sh[306]: echo 'FATAL ERROR: htar /NCEPDEV/emc-global/1year/Christian.Boyer/HERA/scratch/hr3sum_con/2020060400/gfsa.tar failed'
    FATAL ERROR: htar /NCEPDEV/emc-global/1year/Christian.Boyer/HERA/scratch/hr3sum_con/2020060400/gfsa.tar failed
  + exglobal_archive.sh[307]: exit 72
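For context on how the task turns an htar failure into exit code 72, here is a minimal sketch of the error-handling pattern the log above suggests. This is not the actual exglobal_archive.sh; `fake_htar` is a stub standing in for the real HPSS htar command so the control flow can run offline, and the tar path is illustrative.

```shell
#!/bin/sh
# Hedged sketch of the archive error handling suggested by the log above
# (not the real exglobal_archive.sh). fake_htar is a stub standing in for
# the actual HPSS htar command.
fake_htar() { return 72; }   # simulate the failure seen in the log

archive() {
  tarfile=$1
  fake_htar -P -cvf "$tarfile" .
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "FATAL ERROR: htar $tarfile failed"
    return "$rc"             # the real script exits here with the same code
  fi
}

archive /tmp/gfsa.tar        # prints the FATAL ERROR line
echo "archive exit status: $?"
```

In other words, 72 is simply htar's own return code propagated by the script, so diagnosing it means looking at htar's behavior rather than the workflow logic.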

What should have happened?

The 'gfsarch' task of the workflow successfully creates the tarballs and archives them to HPSS.

What machines are impacted?

Hera

Steps to reproduce

  1. Clone, install, and build global-workflow
  2. Set up experiment and generate xml file
    ./setup_expt.py gfs forecast-only --app S2SW --pslot $PSLOT --configdir $CONFIGDIR --idate 2020060100 --edate 2020063000 --resdetatmos 768 --resdetocean 0.25 --gfs_cyc 1 --comroot $COMROOT --expdir $EXPDIR
  3. Change the interval in the 'cycledef group="gfs"' entry from 24:00:00 to 72:00:00 for 3-day forecast cycles
  4. Run workflow using rocotorun and cron
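Step 3 above amounts to editing the generated rocoto XML. Here is a minimal sketch of that edit, assuming the cycledef line follows the usual `<cycledef group="gfs">start stop interval</cycledef>` form; the demo writes a sample line to a temp file rather than touching a real $EXPDIR.

```shell
#!/bin/sh
# Hedged sketch of step 3: change the "gfs" cycledef interval from 24:00:00
# to 72:00:00 so a forecast cycle starts every 3 days. The real file lives
# under $EXPDIR; here a sample line goes to a temp file for illustration.
XMLFILE=$(mktemp)
echo '<cycledef group="gfs">202006010000 202008310000 24:00:00</cycledef>' > "$XMLFILE"

sed -i 's|\(group="gfs">[0-9]* [0-9]*\) 24:00:00|\1 72:00:00|' "$XMLFILE"
cat "$XMLFILE"
```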

Additional information

Directories and Logfile Paths
EXPDIR: /scratch1/NCEPDEV/global/Christian.Boyer/save/para_ufs/hr3sum_con
ROTDIR: /scratch1/NCEPDEV/stmp2/Christian.Boyer/ROTDIRS/hr3sum_con/

Archive log files:
Succeed (6/1): /scratch1/NCEPDEV/stmp2/Christian.Boyer/ROTDIRS/hr3sum_con/logs/2020060100/gfsarch.log
Failed (6/4): /scratch1/NCEPDEV/stmp2/Christian.Boyer/ROTDIRS/hr3sum_con/logs/2020060400/gfsarch.log

Global-Workflow Hash:
4f0f773

Do you have a proposed solution?

No response

@ChristianBoyer-NOAA ChristianBoyer-NOAA added the bug and triage labels Apr 16, 2024
Contributor

This issue could be related to the HPSS upper-limit for the size of the tarballs.

@WalterKolczynski-NOAA
Contributor

I've been having issues myself with HPSS the last day or so. I think this is unrelated to workflow.

@WalterKolczynski-NOAA
Contributor

@ChristianBoyer-NOAA HPSS seems stable right now. Please rewind the archive job and let me know if it still fails.

@ChristianBoyer-NOAA
Author

> @ChristianBoyer-NOAA HPSS seems stable right now. Please rewind the archive job and let me know if it still fails.

I have rewound and rerun the archive jobs, and they have failed again. I also have separate single-forecast runs of the global-workflow going for the dates that failed, to see if those will successfully archive to HPSS.

@WalterKolczynski-NOAA
Contributor

Okay, thanks for checking. The internet has been less than helpful on what htar error code 72 is.

@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the triage label Apr 16, 2024
@HenryRWinterbottom
Contributor

@WalterKolczynski-NOAA
Contributor

Is this fixed following #2491?

@aerorahul
Contributor

Hello @ChristianBoyer-NOAA
Have you encountered this issue with a recent version of the global-workflow?
rc=72 indicates that a file marked for archiving is missing.
So it's possible that that file needs to be in the optional section rather than the required one. We can change that.
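The required-vs-optional idea above can be sketched as a pre-flight check: only files on a required list abort the archive when missing, while missing optional files are merely skipped with a warning, so htar is never handed a nonexistent path. This is not the actual global-workflow code, and all file names below are hypothetical.

```shell
#!/bin/sh
# Hedged sketch of the fix described above (not actual global-workflow code):
# a missing required file aborts with rc=72; a missing optional file is only
# warned about. All file names are hypothetical.
cd "$(mktemp -d)" || exit 1
touch gfsa_core.nc                  # pretend the required file exists

REQUIRED="gfsa_core.nc"
OPTIONAL="gfsa_extra_diag.nc"       # deliberately absent in this demo

tolist=""
for f in $REQUIRED; do
  [ -e "$f" ] || { echo "FATAL ERROR: required file $f missing"; exit 72; }
  tolist="$tolist $f"
done
for f in $OPTIONAL; do
  if [ -e "$f" ]; then tolist="$tolist $f"
  else echo "WARNING: optional file $f missing; skipping"; fi
done
echo "would archive:$tolist"
```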

@ChristianBoyer-NOAA
Author

Hi @aerorahul
I have not tried this with a recent version of the workflow; I moved away from running it in this manner and have instead been submitting each model initialization date separately.

@DavidHuber-NOAA DavidHuber-NOAA self-assigned this Oct 7, 2024
@DavidHuber-NOAA
Contributor

Ran a test on WCOSS2 on the dates specified in this issue (2020060100 and 2020060400). Archiving was successful for both cycles. This appears to have been resolved by #2491. Closing.
