
Fit2Obs silent failure due to scrubbed data #3

Closed
CatherineThomas-NOAA opened this issue Apr 7, 2021 · 4 comments

Comments

@CatherineThomas-NOAA

When running global-workflow on Hera, I encountered several failures of Fit2Obs with this message:

PRPI='/scratch1/NCEPDEV/global/glopara/git/verif/global/Fit2Obs/ncf-vqc/batrun/../ush/ACprof missing /scratch2/NCEPDEV/stmp1/Catherine.Thomas/ROTDIRS/comm_gnss_err/gdas.20201224/00/atmos/gdas.t00z.prepbufr or /scratch2/NCEPDEV/stmp1/Catherine.Thomas/ROTDIRS/comm_gnss_err/gdas.20201224/00/atmos/gdas.t00z.prepbufr.acft_profiles'

It looks like the prepbufr files are missing because the gdas directory is getting scrubbed before Fit2Obs has a chance to run. The GDAS directories are set to scrub after 24 hours by default. I was able to circumvent this by extending the window to 30 hours, but that is not a permanent solution, only a band-aid.
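For reference, the workaround was only a one-line change to the experiment configuration. A minimal sketch is below, assuming the retention window for online gdas directories is controlled by a single variable in config.base; the variable name RMOLDSTD is an assumption about this workflow version, not a confirmed setting.

```bash
# Sketch of the temporary workaround (the variable name is an assumption about
# config.base; use whatever controls the gdas.YYYYMMDD retention window).
# Keep the online gdas directories for 30 hours instead of the default 24 so
# the prepbufr files still exist when the off-workflow Fit2Obs job runs.
export RMOLDSTD=30
```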

Also, since Fit2Obs runs in its own job outside of rocoto, this failure does not feed back to the workflow. The experiment continues on with no indication of failure.

@jack-woollen What are your thoughts on this? Could we add the prepbufr files to the vrfyarch directory and have Fit2Obs pull from there instead of gdas.yyyymmdd/hh/atmos? The error handling will probably still have to come from the workflow side, though; I'm open to ideas on that front, too.
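To make the vrfyarch idea concrete, the staging could look something like the sketch below. This assumes the verification step already writes into a $VFYARC tree; the variable names (ROTDIR, VFYARC, CDUMP, PDY, cyc) follow common global-workflow conventions, but the staging step itself is illustrative, not the implemented fix.

```bash
#!/bin/bash
# Illustrative sketch: stage the files Fit2Obs needs into the vrfyarch area
# before the online gdas.YYYYMMDD directory can be scrubbed. The variable
# names follow common global-workflow conventions; the exact staging step is
# an assumption, not the actual fix that was adopted.
set -eu

COMIN="${ROTDIR}/${CDUMP}.${PDY}/${cyc}/atmos"
FITDIR="${VFYARC}/${CDUMP}.${PDY}/${cyc}"
mkdir -p "${FITDIR}"

for f in "${CDUMP}.t${cyc}z.prepbufr" \
         "${CDUMP}.t${cyc}z.prepbufr.acft_profiles" \
         "${CDUMP}.t${cyc}z.cnvstat"; do
  if [[ -s "${COMIN}/${f}" ]]; then
    cp -p "${COMIN}/${f}" "${FITDIR}/"
  fi
done
```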

@jack-woollen
Collaborator

@CatherineThomas-NOAA It needs the prepbufr and the cnvstat files, along with the analysis and 126 hours of forecast files, to run properly. With or without putting it into the workflow, you would still have to block removal of those files until it ran, then signal the workflow that it was finished, good or bad. I suppose a bad signal from "outside" could stop the workflow in its tracks if that was warranted. A good signal could likewise initiate file removal. You probably need to set the priority of the f2o job to at least match the workflow to make sure it runs in a timely way. It's all about coordination.
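One concrete way to picture that coordination: have the Fit2Obs job drop a flag file when it finishes, and have the cleanup step refuse to scrub a cycle's gdas directory until that flag exists. The flag-file name and location below are purely hypothetical, just to illustrate the handshake.

```bash
# Hypothetical sketch of the coordination described above: cleanup skips a
# cycle's gdas directory until the Fit2Obs job has written a completion flag
# for that cycle. The flag path and name are made up for illustration only.
fit2obs_done="${ROTDIR}/logs/${PDY}${cyc}/fit2obs.done"

if [[ -f "${fit2obs_done}" ]]; then
  rm -rf "${ROTDIR}/gdas.${PDY}/${cyc}"
else
  echo "Fit2Obs not finished for ${PDY}${cyc}; deferring scrub of gdas.${PDY}/${cyc}"
fi
```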

@CatherineThomas-NOAA
Author

@jack-woollen Thanks for your comments. We are currently testing a solution that addresses this from inside the workflow only and does not require changes to Fit2Obs. The fix prevents the arch cleanup from removing the needed files from the gdas directories; we already do this for other files needed by the GLDAS step. Please take a look at global-workflow issue #311 for further discussion.

In the longer term, @KateFriedman-NOAA is looking at breaking up the vrfy step into subtasks, which would make Fit2Obs its own job and implement the good/bad signal that you mentioned.

@RussTreadon-NOAA

Please see workflow issue #311 for additional comments on this fit2obs issue. Two options have been implemented in a workflow test: one requires no changes to fit2obs; the other requires changing COM_INA in the subfits scripts.
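For the second option, the change amounts to pointing COM_INA at a retained copy of the needed files rather than the online gdas directory that gets scrubbed. A hedged sketch is below; the ${VFYARC} path is an assumption about where the retained copy lives, not the actual edit made to the subfits scripts.

```bash
# Sketch of the second option: repoint COM_INA in the subfits scripts at a
# retained copy of the input files instead of the scrubbed online directory.
# The ${VFYARC} layout here is an assumption, not the confirmed change.
#export COM_INA="${ROTDIR}/gdas.${PDY}/${cyc}/atmos"   # original: scrubbed after 24 h
export COM_INA="${VFYARC}/gdas.${PDY}/${cyc}"          # retained copy for Fit2Obs
```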

@KateFriedman-NOAA
Member

Global-workflow issue #311 has been closed as completed, so this silent failure should no longer occur. Closing this issue as completed.
