allow booting or force-completing of tasks in drained cycles #15

Closed
samtrahan opened this issue Jun 22, 2018 · 11 comments

Comments

@samtrahan
Contributor

Recent large-scale retrospectives on Jet and WCOSS have encountered situations where they need to force a rerun of a task in a cycle whose "final=true" tasks have completed. The current rocotoboot and rocotocomplete refuse to do this.

A branch, feature/naughty-boot, fixes this problem. The rocotoboot and rocotocomplete will give loud warnings and ask if you're sure. Users have tested this feature and confirmed it works.

@christopherwharrop
Owner

Booting of draining/expired cycles is expressly denied because there is no way to reactivate the cycle without it immediately being drained/expired again. I don't see where your code deals with that. How does it ensure that the job is tracked and resubmitted until it's successful? It looks like it will be submitted once, but since the cycle is still deactivated, Rocoto will never look at it again. Or do I have that wrong?

@samtrahan
Contributor Author

Yes, the intent is to force a submit of a job without looking at it again. Booting a job that Rocoto doesn't want to boot is the most frequent feature request I am getting from people running the FV3 GFS workflow. It is absolutely critical to be able to force a resubmit of any job, including ones from drained cycles. At present, there is no way to do that. This has led to people manually assembling a batch card and submitting it rather than having Rocoto do that for them.

We do eventually need a way to "un-drain" a cycle, but that is a large problem that requires changes in other parts of Rocoto. This change set is meant to solve the far more limited problem of allowing a job to be booted. People really cannot wait for major changes throughout Rocoto before they can force a submission of a job.

@christopherwharrop
Owner

Ok. So, it's a quick and dirty hack to overcome an immediate issue then. My concern is that the booted jobs won't be tracked, so if they fail, the user will have to track them manually. I'm also concerned that users may be misusing the boot feature. It is meant as a repair tool for rare events when the user made a mistake. It's for correcting the odd problem here and there. It's not meant as a primary means of running jobs. If users are relying on the boot feature for normal workflow operation, then they need to change the design of their workflow dependencies so that those conditions don't happen. If you want to merge develop onto your branch and submit a pull request, I'll reluctantly accept it. But I'd like to know why FV3 GFS workflows need to have tasks booted so often, and what situations are creating that need.

@samtrahan
Contributor Author

The intent is to reproduce the "ecflow_client --execute" functionality. If the workflow fails, or metascheduler fails, or the machine is crazy, it gives the operator a way to take over. The "execute" functionality will force a submission of a job, regardless of dependencies, limits, or existing jobs. It gives you a big, angry, "are you sure," message in the GUI first. We need a command like that for Rocoto. What I have done is sufficient to implement such a thing.
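To make the "are you sure" step concrete, here is a minimal sketch of a confirmation gate that a forced boot or complete could pass through before doing anything. This is not the code in the branch: `confirm_forced_action` and its `input:`/`output:` parameters are hypothetical names chosen for illustration, and the warning text is only a guess at what an operator would want to see.

```ruby
require 'stringio'

# Hypothetical helper, not Rocoto's actual code.
# Prints a loud warning and returns true only if the operator types "yes".
def confirm_forced_action(description, input: $stdin, output: $stderr)
  output.puts "WARNING: #{description}"
  output.puts 'The cycle is already drained, so this job may not be tracked or resubmitted.'
  output.print 'Are you sure? Type "yes" to continue: '
  answer = input.gets
  # A nil answer means end of input (for example, a non-interactive run): refuse.
  !answer.nil? && answer.strip.downcase == 'yes'
end

# Usage with canned input, so the sketch runs without a terminal.
confirmed = confirm_forced_action(
  'forcing submission of a task in a drained cycle',
  input: StringIO.new("yes\n"),
  output: StringIO.new
)
puts confirmed
```

Anything other than an explicit "yes", including end of input, is treated as a refusal, so a script that calls the command without a terminal attached would not force a submission by accident.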

If you want me to fully implement "un-draining" of a cycle, that will take much longer. As you know, nobody in NOAA is tasked to do development of Rocoto. That is why the project has been dragging so much. I am only tasked to do emergency support, and I had to come up with a fix for this, so I did. To make a more capable change, like un-draining, I would need to be tasked to develop Rocoto for a while, which is unlikely to happen.

@christopherwharrop
Owner

If you want this to be part of the next release, please get it in or submit a pull request soon.

@samtrahan
Contributor Author

samtrahan commented Jul 5, 2018 via email

@samtrahan
Contributor Author

We've run into problems with the FV3 GFS workflow that make it urgent for us to correctly handle booting, rewinding, and completing jobs in "drained" and "done" cycles. I have some fixes that will be in a branch shortly.

samtrahan pushed a commit that referenced this issue Jul 25, 2018
implements the "principle of least surprise" in those three commands:

  1. rocotoboot and rocotocomplete now have rocotorewind's -a option

  2. rocotorewind now has the same -c, -m, and -t options as
      rocotoboot and rocotocomplete

  3. In rocotoboot, rocotorewind, and rocotocomplete, repeated -t, -c,
     or -m options are appended instead of taking the last
     incarnation.

  4. Fix for a bug in the workflowdoc.rb that caused an exception to
     be raised when reading a workflow document in which a <streq> or
     <strneq> contained a <cyclestr>.

  5. "rocotorewind -a" will delete a cycle and all of its jobs from
     the database, thus having the cycle appear as if it has never
     been run.

There are a few more bits and pieces that must be implemented before
these features are complete.  Specifically, "rocotocomplete -a" should
mark a cycle as "done," and the -t option needs to take #final to mean
"all final tasks."
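
Item 3 of the commit message above (repeated -t, -c, or -m options are appended rather than the last one winning) can be sketched with Ruby's standard `OptionParser`. This is an illustration under assumptions, not the code in the commit: `parse_selection` and the layout of the hash it returns are hypothetical, and only the short flags named in the commit message are used.

```ruby
require 'optparse'

# Hypothetical helper, not Rocoto's actual option handling.
# Collects every occurrence of -c, -m, and -t instead of keeping only the last.
def parse_selection(argv)
  selection = { cycles: [], metatasks: [], tasks: [], all: false }

  parser = OptionParser.new do |opts|
    opts.on('-c CYCLE')    { |value| selection[:cycles] << value }
    opts.on('-m METATASK') { |value| selection[:metatasks] << value }
    opts.on('-t TASK')     { |value| selection[:tasks] << value }
    opts.on('-a')          { selection[:all] = true }
  end

  # parse (without the bang) leaves the caller's argv untouched.
  parser.parse(argv)
  selection
end

selection = parse_selection(%w[-t fcst -c 201806220000 -t post])
puts selection[:tasks].inspect   # prints ["fcst", "post"]
```

Appending in the option callback means the order of the flags on the command line is preserved, which matches what a user would probably expect from repeating an option.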
@samtrahan
Contributor Author

The feature/naughty-boot branch is superseded by the feature/principle-of-least-surprise branch, which aims to fix the issues Chris raised earlier in this comment chain, as well as other issues.

@christopherwharrop
Owner

When you submit the pull request for this, please make sure that it is self-contained. One pull request per issue, please.

How are you addressing the fact that the booted job will not be tracked because the cycle has been drained/expired?

@samtrahan
Contributor Author

Chris,

Issues #15, #17, and #19 are really the same issue. Issues #16, #18, and #20 can each be put in their own branch.

@samtrahan
Contributor Author

Duplicate of #17. See comments about details of how rocotorewind has counter-intuitive results.
