
NOBatchSystem class: run a Rocoto workflow without a batch system #18

Open
samtrahan opened this issue Jul 25, 2018 · 21 comments

@samtrahan
Contributor

samtrahan commented Jul 25, 2018

Rocoto currently requires that all workflow tasks run as batch jobs. While Rocoto was designed for workflows running on HPC resources, which are always managed by batch systems, it is sometimes desirable to use Rocoto to manage small workflows on workstations or other systems that do not have a batch system. Additionally, there may be times, such as when tasks are extremely small computationally and of very short duration, when it is more appropriate to run a task on the local host rather than via the batch system.

@samtrahan
Contributor Author

I am moving the MOABSHBatchSystem to its own issue.

@christopherwharrop
Owner

christopherwharrop commented Jul 27, 2018

I want to emphasize here that the NoBatchSystem scheduler type should be designed and implemented as a stand-alone scheduler method selectable by users for use on any system, including a laptop or workstation. This has been requested before, but the use cases at the time were not compelling enough to devote time to it. Please ensure your solution will provide this capability in a general, robust way for everyone, everywhere.

@samtrahan samtrahan changed the title Run part or all of a workflow with no batch system NOBatchSystem class: run part or all of a workflow with no batch system Jul 27, 2018
@samtrahan
Contributor Author

samtrahan commented Jul 27, 2018

Chris,

Yes, that is how it is designed. We already have potential customers who need it to run the FV3 GFS Beta. It has no knowledge of batch systems. Instead, it tracks daemon processes using a directory on a filesystem to exchange information. You can even kill a job from a remote machine by creating a $jobname.kill file in that directory.
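For example, a remote kill might look like this (a hypothetical sketch; the host, directory, and job name are made up for illustration):

# Hypothetical sketch: host, tracking directory, and job name are made up.
# Creating $jobname.kill in the shared tracking directory asks the daemon
# tracking that job to terminate it.
ssh my-workstation touch /scratch/rocoto_jobs/gfs_forecast.kill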

The best part is:

<workflow scheduler="no">

as in, "stop bugging me about a scheduler, and just run the jobs."

Sincerely,
Sam Trahan

@samtrahan samtrahan changed the title NOBatchSystem class: run part or all of a workflow with no batch system NOBatchSystem class: run a Rocoto workflow without a batch system Jul 27, 2018
@christopherwharrop
Owner

Are you using ~/.rocoto for storing the information about the processes being tracked?

@samtrahan
Contributor Author

Chris,

No. I let the user specify the directory. Using the home directory for scrub space or metadata-heavy activities is risky because the home quota is often small and the home partition is often less capable than others on the machine.

<workflow scheduler="no">
  <job_id_dir>/path/to/some/scrub/area</job_id_dir>  <!-- default is /tmp -->
  ...
</workflow>

The job_id_dir does not take a cyclestr because the BatchSystem classes do not know which cycle they are submitting for. Instead, it is all one area, as it would be for a real batch system. There is one file per job ($jobname.job), and rocotorewind kills a job by creating a kill file, $jobname.kill.
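For illustration, the tracking area might look like this while a workflow is running (the task names are hypothetical):

$ ls /path/to/some/scrub/area
gfs_forecast.job   # tracking file for the running gfs_forecast task
gfs_post.job       # tracking file for the running gfs_post task
gfs_post.kill      # created by rocotorewind to request that gfs_post be killed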

I'm thinking of having no default for the job_id_dir, and forcing the user to specify it, as a safety measure.

@christopherwharrop
Owner

Exposing those sorts of details makes Rocoto less usable for novices. Please keep the default path set to /tmp or ~/.rocoto. Also, the tag you suggest doesn't make any sense except when the NoBatchSystem scheduler is chosen, so it needs to be specified in a different way. Please provide a ":TempDirectory" configuration option in the rocotorc file as a means for users who want to override it.
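For example, a user could override the default with an entry like this in ~/.rocoto/rocotorc (a sketch only; it assumes the colon-prefixed YAML key style that rocotorc uses for its other options, and the path is made up):

$ cat ~/.rocoto/rocotorc
:TempDirectory: /scratch/jdoe/rocoto_tmp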

@samtrahan
Contributor Author

Chris,

That is an excellent idea. I'll work on that soon.

@samtrahan
Contributor Author

Chris,

I do like your idea of configuring this in ~/.rocoto, but on second thought, we will also need a way for users to override the job pid directory on a per-workflow basis. Some workflows will have thousands or tens of thousands of active jobs, which will generate so much metadata traffic for pid tracking that you may have to split them across multiple filesystems or filesets.

I suggest we add a way to pass custom options to the batch system in a consistent manner. This would be documented as "advanced usage." Here is an example of how one might put the NOBatchSystem and LSFBatchSystem into a single workflow and configure them separately.

<workflow scheduler="lsf,no">
    <scheduler name="lsf">  <!-- for configuring the big jobs that contain shell jobs -->
        <setting><name>calculate_affinity</name><value>true</value></setting>
    </scheduler>

    <scheduler name="no">  <!-- for configuring the "NOBatchSystem" part with the shell jobs -->
        <setting><name>job_pid_dir</name><value>/path/to/some/dir</value></setting>
    </scheduler>

    <metatask>
        ...
        <task ...>
            <scheduler name="no">  <!-- for configuring the "NOBatchSystem" part with the shell jobs -->
                <setting><name>job_pid_dir</name><value>/path/to/some/dir</value></setting>
            </scheduler>
        </task>
    </metatask>
</workflow>

This could have more interesting applications, like allowing the workflow to be split across multiple machines.

<task name="task1">
    <scheduler name="lsf">
        <setting><name>submit_command</name><value>ssh luna bsub</value></setting>
        <setting><name>stat_command</name><value>ssh luna bjobs</value></setting>
        <setting><name>hist_command</name><value>ssh luna bhist</value></setting>
    </scheduler>
</task>

<task name="task2">
    <scheduler name="lsf">
        <setting><name>submit_command</name><value>ssh tide bsub</value></setting>
        <setting><name>stat_command</name><value>ssh tide bjobs</value></setting>
        <setting><name>hist_command</name><value>ssh tide bhist</value></setting>
    </scheduler>
</task>

@samtrahan
Contributor Author

Un-closing. I closed this by accident.

@samtrahan samtrahan reopened this Jul 31, 2018
@christopherwharrop
Owner

Running a workflow with thousands of active tasks without a batch system is madness. That is not something that should be supported.

@samtrahan
Contributor Author

Chris,

Well then, there's the other matter of cluttering up the user's ~/.rocoto directory. We've already had people hit their quota because Rocoto makes a copy of its configuration file, and generates huge log files, every time it runs. If you add to this pid files that are not deleted when a user prematurely ends a Rocoto workflow, the problem will get even worse.

@christopherwharrop
Owner

The pid files are extremely small, and there must be a way to ensure stale files do not accumulate. The other issues are, or were, bugs that have been or need to be fixed. You can put the pid files in a tmp directory of the user's choosing (via the config option), and you can create subdirectories under that if you want to group them by workflow.

@samtrahan
Contributor Author

samtrahan commented Jul 31, 2018

Chris,

The NOBatchSystem deletes the old files once it has recorded the job's status in the workflow database. If the user stops running rocotorun, then there will be some stale files lying around. If the user does that many times, for large workflows, then there will be thousands of files after a few months. Switching to /tmp eliminates the usefulness of the rocotorewind command, which is able to remotely kill a job by making a #{jobid}.kill file; because /tmp is local to each host, a kill file created there would never be seen by jobs running on other machines. The only clean solution is to have the user specify the job pid directory, just like we do with the log directory now.
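A periodic scrub along these lines could keep the accumulation in check (a sketch only; the directory path and the 7-day age threshold are assumptions):

# Sketch: remove tracking and kill files untouched for more than 7 days.
find /path/to/job_id_dir \( -name '*.job' -o -name '*.kill' \) -mtime +7 -delete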

Sincerely,
Sam Trahan

@christopherwharrop
Owner

If the user stops running rocotorun, then there will be some stale files lying around. If the user does that many times, for large workflows, then there will be thousands of files after a few months.

Then find a way to prevent that from happening or to clean up the stale files.

@samtrahan
Contributor Author

This is part of a larger problem: Rocoto does not scrub its ~/.rocoto/ files. I opened issue #27, and I will consider this issue (#18) unresolved until #27 is implemented.

@samtrahan
Contributor Author

I would like to mention that this feature is extensively tested and very stable. Apart from the job_id_dir and scrubbing changes requested by Chris, it is definitely ready for a pull request.

@danielabdi-noaa

danielabdi-noaa commented Dec 7, 2022

@samtrahan Does the "NoBatchSystem" still work in Rocoto? I tried specifying "no", "none", and "" as the scheduler, but Rocoto complains at the line where that is done.

12/07/22 07:05:04 MST :: FV3LAM_wflow.xml :: Error: Element workflow failed to validate attributes.
12/07/22 07:05:04 MST :: FV3LAM_wflow.xml :: Error: Invalid attribute scheduler for element workflow at FV3LAM_wflow.xml:112.
12/07/22 07:05:04 MST :: FV3LAM_wflow.xml :: Error: Invalid attribute scheduler for element workflow at FV3LAM_wflow.xml:112.

Line 112

<workflow realtime="F" scheduler="&SCHED;" cyclethrottle="20">

where SCHED is defined as

<!ENTITY SCHED         "no">

You mentioned the need to specify job_pid_dir for SCHED="no", which I have not done yet.
I have a "hackish" solution for srw-app that works, but it would be great to use the "nobatchsystem" feature if it works!
ufs-community/ufs-srweather-app#508

Thanks

@christopherwharrop-noaa
Collaborator

@danielabdi-noaa - This issue and the associated code are more than 4 years old. While maintenance of existing capabilities and bug fixes has been a high priority, time for development of major new features has not been available for the past few years, so this has not been given the attention it deserves. I am not convinced the implementation here is as robust as it needs to be to support execution of the UFS. We can talk offline about this if you want.

@danielabdi-noaa

@christopherwharrop Thanks for the info. I thought nobatchsystem.rb was added because of this issue, but it looks like it was there from the beginning. I would like to know more about it when you have time. Thanks

@christopherwharrop-noaa
Collaborator

Yes, it's confusing. That file was created as a placeholder a long time ago, but the feature never materialized. It probably should be deleted. If this feature is ever added, the one Sam made would take its place.

@aerorahul

Looks like renewed interest in this feature from years back. Hope to see some action. 🤞🏾
