allow booting or force-completing of tasks in drained cycles #15
Comments
Booting of draining/expired cycles is expressly denied because there is no way to reactivate the cycle without it immediately being drained/expired again. I don't see where your code deals with that. How does it ensure that the job is tracked and resubmitted until it's successful? It looks like it will be submitted once, but since the cycle is still deactivated, Rocoto will never look at it again. Or do I have that wrong?
Yes, the intent is to force a submit of a job without looking at it again. Booting a job that Rocoto doesn't want to boot is the most frequent feature request I am getting from people running the FV3 GFS workflow. It is absolutely critical to be able to force a resubmit of any job, including ones from drained cycles. At present, there is no way to do that. This has led to people manually assembling a batch card and submitting it rather than having Rocoto do that for them. We do eventually need a way to "un-drain" a cycle, but that is a large problem that requires changes in other parts of Rocoto. This change set is meant to solve the far more limited problem of allowing a job to be booted. People really cannot wait for major changes throughout Rocoto before they can force a submission of a job.
Ok. So, it's a quick and dirty hack to overcome an immediate issue then. My concern is that the booted jobs won't be tracked, so if they fail, the user will have to track them manually. I'm also concerned that users may be misusing the boot feature. It is meant as a repair tool for rare events when the user made a mistake. It's for correcting the odd problem here and there. It's not meant as a primary means of running jobs. If users are relying on the boot feature for normal workflow operation, then they need to change the design of their workflow dependencies so that those conditions don't happen. If you want to merge develop onto your branch and submit a pull request, I'll reluctantly accept it. But I'd like to know why FV3 GFS workflows need to have tasks booted so often, and what situations are creating that need.
The intent is to reproduce the "ecflow_client --execute" functionality. If the workflow fails, or the metascheduler fails, or the machine is crazy, it gives the operator a way to take over. The "execute" functionality will force a submission of a job, regardless of dependencies, limits, or existing jobs. It gives you a big, angry "are you sure" message in the GUI first. We need a command like that for Rocoto. What I have done is sufficient to implement such a thing. If you want me to fully implement "un-draining" of a cycle, that will take much longer. As you know, nobody in NOAA is tasked to do development of Rocoto. That is why the project has been dragging so much. I am only tasked to do emergency support, and I had to come up with a fix for this, so I did. To make a more capable change, like un-draining, I would need to be tasked to develop Rocoto for a while, which is unlikely to happen.
If you want this to be part of the next release, please get it in or submit a pull request soon.
Chris,
Could you make a specific deadline for these pulls? It would be nice to
have a freeze date. Perhaps next Friday, July 13?
Sincerely,
Sam Trahan
On Tue, 3 Jul 2018, Christopher Harrop wrote:
If you want this to be part of the next release, please get it in or submit a pull request soon.
We've run into problems with the FV3 GFS workflow that make it urgent for us to correctly handle booting, rewinding, and completing jobs in "drained" and "done" cycles. I have some fixes that will be in a branch shortly.
implements the "principle of least surprise" in those three commands:
1. rocotoboot and rocotocomplete now have rocotorewind's -a option
2. rocotorewind now has the same -c, -m, and -t options as rocotoboot and rocotocomplete
3. In rocotoboot, rocotorewind, and rocotocomplete, repeated -t, -c, or -m options are appended instead of taking the last incarnation.
4. Fix for a bug in the workflowdoc.rb that caused an exception to be raised when reading a workflow document in which a <streq> or <strneq> contained a <cyclestr>.
5. "rocotorewind -a" will delete a cycle and all of its jobs from the database, thus having the cycle appear as if it has never been run.
There are a few more bits and pieces that must be implemented before these features are complete. Specifically, "rocotocomplete -a" should mark a cycle as "done," and the -t option needs to take #final to mean "all final tasks."
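The "appended instead of taking the last incarnation" behavior in item 3 can be sketched with Ruby's standard OptionParser. This is only an illustration, not Rocoto's actual option-handling code: the helper name `parse_selection`, the long option names, and the comma-separated list format are all assumptions made for the example.

```ruby
require 'optparse'

# Hypothetical helper (not part of Rocoto): collects every -t, -c, and -m
# occurrence into one list instead of keeping only the last one.
def parse_selection(argv)
  selection = { tasks: [], cycles: [], metatasks: [], all: false }

  parser = OptionParser.new do |opts|
    # The Array acceptor splits a comma-separated argument into a list.
    # Each block appends to the running list, so repeated options accumulate.
    opts.on('-t', '--tasks LIST', Array)     { |list| selection[:tasks].concat(list) }
    opts.on('-c', '--cycles LIST', Array)    { |list| selection[:cycles].concat(list) }
    opts.on('-m', '--metatasks LIST', Array) { |list| selection[:metatasks].concat(list) }
    # Assumed spelling of the "all" switch described in items 1 and 5.
    opts.on('-a', '--all')                   { selection[:all] = true }
  end

  parser.parse(argv) # non-destructive: argv itself is left untouched
  selection
end
```

Under these assumptions, `parse_selection(%w[-t post -t fcst,anal])` yields a task list containing all three names, whereas a plain "last value wins" assignment would silently keep only the second -t argument.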
The feature/naughty-boot branch is obviated by the feature/principle-of-least-surprise branch, which aims to fix the previously mentioned issues that Chris brought up in this comment chain, as well as other issues.
When you submit the pull request for this, please make sure that it is self-contained. One pull request per issue, please. How are you addressing the fact that the booted job will not be tracked because the cycle has been drained/expired?
Duplicate of #17. See comments about details of how rocotorewind has counter-intuitive results.
Recent large-scale retrospectives on Jet and WCOSS have encountered situations where they need to force a rerun of a task in a cycle whose "final=true" tasks have completed. The current rocotoboot and rocotocomplete refuse to do this.
A branch, feature/naughty-boot, fixes this problem. With it, rocotoboot and rocotocomplete give loud warnings and ask if you're sure before proceeding. Users have tested this feature and confirmed it works.
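The "loud warnings and ask if you're sure" step could look roughly like the following. This is a minimal sketch and not the code in feature/naughty-boot: the helper name `confirm_forced_boot?`, its parameters, and the warning text are hypothetical, chosen only to show the shape of a confirmation gate in front of a forced submission.

```ruby
# Hypothetical helper (not Rocoto's actual code): warns that the cycle is
# drained or expired, then requires an explicit "y" before a forced boot.
# The input and output streams are parameters so the prompt can be scripted.
def confirm_forced_boot?(task, cycle, input: $stdin, output: $stderr)
  output.puts "WARNING: cycle #{cycle} is drained or expired."
  output.puts "WARNING: task #{task} will be submitted once and will not be tracked or resubmitted."
  output.print 'Are you sure you want to boot it? (y/n) '

  answer = input.gets
  # End of input (nil) or anything not starting with "y" counts as a refusal.
  !answer.nil? && answer.strip.downcase.start_with?('y')
end
```

Defaulting to refusal on empty or missing input means a non-interactive caller cannot force a submission by accident.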