allow booting or force-completing of tasks in drained cycles #15

samtrahan · 2018-06-22T18:35:11Z

Recent large-scale retrospectives on Jet and WCOSS have encountered situations where they need to force a rerun of a task in a cycle whose "final=true" tasks have completed. The current rocotoboot and rocotocomplete refuse to do this.

A branch, feature/naughty-boot, fixes this problem. The rocotoboot and rocotocomplete will give loud warnings and ask if you're sure. Users have tested this feature and confirmed it works.

christopherwharrop · 2018-06-25T15:15:31Z

Booting of draining/expired cycles is expressly denied because there is no way to reactivate the cycle without it immediately being drained/expired again. I don't see where your code deals with that. How does it ensure that the job is tracked and resubmitted until it's successful? It looks like it will be submitted once, but since the cycle is still deactivated, Rocoto will never look at it again. Or do I have that wrong?
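To make the tracking concern concrete, here is a toy model in Python. It is not Rocoto's code: the `Cycle`, `Job`, `boot`, and `poll` names, the task name, and the cycle strings are all hypothetical, and the only assumption carried over from the comment above is that the scheduler pass skips deactivated cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    task: str
    state: str = "SUBMITTED"


@dataclass
class Cycle:
    name: str
    active: bool = True
    jobs: dict = field(default_factory=dict)


def boot(cycle, task):
    """Force a single submission of a task, whatever the cycle state."""
    cycle.jobs[task] = Job(task)


def poll(cycles, failed_tasks):
    """One scheduler pass. Assumption: only active cycles are examined."""
    resubmitted = []
    for cycle in cycles:
        if not cycle.active:
            continue  # drained/expired cycle: its jobs are never revisited
        for job in cycle.jobs.values():
            if job.task in failed_tasks:
                job.state = "SUBMITTED"  # retry the failed job
                resubmitted.append((cycle.name, job.task))
    return resubmitted


drained = Cycle("201806220000", active=False)  # hypothetical drained cycle
live = Cycle("201806220600")                   # hypothetical active cycle
boot(drained, "post")
boot(live, "post")

# Both booted jobs fail, but only the one in the active cycle is retried.
result = poll([drained, live], failed_tasks={"post"})
```

In this sketch the job booted into the drained cycle is submitted exactly once and then ignored, which is the behavior being asked about.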

samtrahan · 2018-06-25T18:19:10Z

Yes, the intent is to force a submit of a job without looking at it again. Booting a job that Rocoto doesn't want to boot is the most frequent feature request I am getting from people running the FV3 GFS workflow. It is absolutely critical to be able to force a resubmit of any job, including ones from drained cycles. At present, there is no way to do that. This has led to people manually assembling a batch card and submitting it rather than having Rocoto do that for them.

We do eventually need a way to "un-drain" a cycle, but that is a large problem that requires changes in other parts of Rocoto. This change set is meant to solve the far more limited problem of allowing a job to be booted. People really cannot wait for major changes throughout Rocoto before they can force a submission of a job.

christopherwharrop · 2018-06-25T19:51:57Z

Ok. So, it's a quick and dirty hack to overcome an immediate issue then. My concern is that the booted jobs won't be tracked, so if they fail, the user will have to track them manually. I'm also concerned that users may be misusing the boot feature. It is meant as a repair tool for rare events when the user made a mistake. It's for correcting the odd problem here and there. It's not meant as a primary means of running jobs. If users are relying on the boot feature for normal workflow operation, then they need to change the design of their workflow dependencies so that those conditions don't happen. If you want to merge develop onto your branch and submit a pull request, I'll reluctantly accept it. But I'd like to know why FV3 GFS workflows need to have tasks booted so often, and what situations are creating that need.

samtrahan · 2018-06-26T15:26:01Z

The intent is to reproduce the "ecflow_client --execute" functionality. If the workflow fails, or the metascheduler fails, or the machine is crazy, it gives the operator a way to take over. The "execute" functionality will force a submission of a job, regardless of dependencies, limits, or existing jobs. It gives you a big, angry "are you sure?" message in the GUI first. We need a command like that for Rocoto. What I have done is sufficient to implement such a thing.
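As an illustration of the "warn loudly, then ask" pattern described here, a minimal Python sketch follows. It is not taken from rocotoboot or ecFlow: the `confirm_force` function, its message text, and the injectable `ask` parameter are hypothetical names chosen for the example.

```python
def confirm_force(task, cycle, ask=input):
    """Return True only if the operator explicitly answers 'yes'.

    `ask` defaults to input() and can be replaced in tests or scripts.
    """
    print(f"WARNING: cycle {cycle} is drained or done.")
    print(f"Task '{task}' will be force-submitted, ignoring dependencies.")
    answer = ask("Are you sure? Type 'yes' to continue: ")
    # Anything other than a full 'yes' is treated as a refusal.
    return answer.strip().lower() == "yes"
```

Requiring the full word "yes" rather than accepting "y" or an empty line makes an accidental force-submit less likely, at the cost of a little extra typing for the operator.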

If you want me to fully implement "un-draining" of a cycle, that will take much longer. As you know, nobody in NOAA is tasked to do development of Rocoto. That is why the project has been dragging so much. I am only tasked to do emergency support, and I had to come up with a fix for this, so I did. To make a more capable change, like un-draining, I would need to be tasked to develop Rocoto for a while, which is unlikely to happen.

christopherwharrop · 2018-07-03T16:42:38Z

If you want this to be part of the next release, please get it in or submit a pull request soon.

samtrahan · 2018-07-05T17:11:54Z

Chris, Could you make a specific deadline for these pulls? It would be nice to have a freeze date. Perhaps next Friday, July 13? Sincerely, Sam Trahan


samtrahan · 2018-07-25T16:52:41Z

We've run into problems with the FV3 GFS workflow that make it urgent for us to correctly handle booting, rewinding, and completing jobs in "drained" and "done" cycles. I have some fixes that will be in a branch shortly.

The branch implements the "principle of least surprise" in those three commands:

1. rocotoboot and rocotocomplete now have rocotorewind's -a option.
2. rocotorewind now has the same -c, -m, and -t options as rocotoboot and rocotocomplete.
3. In rocotoboot, rocotorewind, and rocotocomplete, repeated -t, -c, or -m options are appended instead of only the last occurrence being used.
4. Fix for a bug in workflowdoc.rb that caused an exception to be raised when reading a workflow document in which a <streq> or <strneq> contained a <cyclestr>.
5. "rocotorewind -a" will delete a cycle and all of its jobs from the database, so the cycle appears as if it had never been run.

There are a few more bits and pieces that must be implemented before these features are complete. Specifically, "rocotocomplete -a" should mark a cycle as "done," and the -t option needs to take #final to mean "all final tasks."
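The "append instead of last one wins" behavior for repeated options can be sketched with Python's argparse. This only illustrates the semantics and is not how the Rocoto commands are implemented; the program name, the destination names, the task names, and the assumption that -c, -t, and -m take cycle, task, and metatask names are all illustrative.

```python
import argparse


def parse_args(argv):
    """Parse -c/-t/-m so that repeated options accumulate in order."""
    parser = argparse.ArgumentParser(prog="rewind-sketch")
    # action="append" collects every occurrence instead of keeping the last.
    parser.add_argument("-c", dest="cycles", action="append", default=[])
    parser.add_argument("-t", dest="tasks", action="append", default=[])
    parser.add_argument("-m", dest="metatasks", action="append", default=[])
    # Assumption: -a is a simple flag with no argument.
    parser.add_argument("-a", dest="all_tasks", action="store_true")
    return parser.parse_args(argv)


args = parse_args(
    ["-c", "201806220000", "-t", "fcst", "-t", "post", "-c", "201806220600"]
)
```

With these arguments, `args.tasks` holds both task names and `args.cycles` holds both cycle strings, in the order given on the command line.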

samtrahan · 2018-07-25T19:01:26Z

The feature/naughty-boot branch is superseded by the feature/principle-of-least-surprise branch, which aims to fix the issues that Chris raised earlier in this comment chain, as well as other issues.

christopherwharrop · 2018-07-25T19:36:32Z

When you submit the pull request for this, please make sure that it is self-contained. One pull request per issue, please.

How are you addressing the fact that the booted job will not be tracked because the cycle has been drained/expired?

samtrahan · 2018-07-26T22:17:07Z

Chris,

Issues #15, #17, and #19 are really the same issue. Issues #16, #18, and #20 can each be put in their own branch.

samtrahan · 2018-07-26T22:28:41Z

Duplicate of #17. See the comments there for details on how rocotorewind produces counter-intuitive results.

samtrahan mentioned this issue Jul 26, 2018

inconsistent arguments/behavior between rocoto boot, stat, check, run, complete, and rewind. #17

Closed

samtrahan closed this as completed Jul 26, 2018
