This tutorial is intended for those assembling a bioinformatics pipeline with bionode-watermill for the first time.
This tutorial assumes that you have npm, git and node installed. Node.js version 7 or higher is required for the full tutorial.
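You can check which versions you have installed from the command line:
node --version
npm --version
git --version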
To set up and test the scripts within this tutorial, follow these simple steps:
git clone https://github.com/bionode/bionode-watermill-tutorial.git
cd bionode-watermill-tutorial
npm install bionode-watermill
Watermill is a tool that lets you orchestrate tasks. So, let's first understand how to define a task.
To define a task we first need to require bionode-watermill:
const watermill = require('bionode-watermill')
const task = watermill.task /* we have to specify task because the watermill
object has more properties than just task */
Then, we can use the task variable to define a given task:
- Using standard JavaScript style:
// this is a kiss example of how tasks work with shell
const simpleTask = task({
  output: '*.txt', // checks if the output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the task function
  name: 'This is the task name' // defines the name of the task
}, function (resolvedProps) {
  const params = resolvedProps.params
  return 'touch ' + params
})
- Or you can do the same in ES6 syntax, using arrow functions:
// this is a kiss example of how tasks work with shell
const simpleTask = task({
  output: '*.txt', // checks if the output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the task function
  name: 'This is the task name' // defines the name of the task
}, ({ params }) => `touch ${params}`)
Note: Template literals are very useful since they allow you to include placeholders (${ }) within strings. Template literals are enclosed by backticks (` `), as exemplified above.
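For instance, the concatenation and template-literal styles above build exactly the same command string:

const params = 'test_file.txt'
const a = 'touch ' + params   // string concatenation
const b = `touch ${params}`   // template literal
console.log(a === b)          // true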
Then, after defining the task, it may be executed like this:
// runs the task and returns a promise (a callback can also be passed)
simpleTask()
This task will create a new (empty) file inside a directory named "data/<uid>/". You may also notice that quite a lot of text is printed to the terminal; this output can be useful for debugging your pipelines.
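Because the invocation returns a promise, you can also chain code to run once the task has finished. This is only a sketch; the exact shape of the resolved value depends on bionode-watermill's results object:

simpleTask()
  .then(results => console.log('task finished:', results))
  .catch(err => console.error('task failed:', err))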
The above example is available in the tutorial repository as simple_task.js.
You can test it by running: node simple_task.js
Although already discussed elsewhere in the bionode-watermill documentation, in this tutorial I intend to explain how input/output are managed by bionode-watermill. First, you can hardcode the input to something like:
{ input: 'ERR1229296.sra' }
or instead you can specify glob patterns, which are better explained here. Basically, what you need to know is that you can set the input to something like:
{ input: '*.sra' }
This tells bionode-watermill to crawl the data directory in search of the first hit that matches this pattern. So, pay attention when specifying these glob patterns if you have multiple .sra files within this folder, or files generated by tasks other than your target task (the last one that generated a .sra file, in this example). To circumvent this you can provide file names that you can easily manage. For instance, if you have one file named ERR1229296.sra and another named ERR1229297.sra, and you want just the first one, you can pass the input as follows:
{ input: '*6.sra' }
or, of course, hardcode it.
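As a sketch of such a task (inspectSra, the listing.txt filename and the ls command are only illustrative, not part of the tutorial's example scripts):

const inspectSra = task({
  input: '*6.sra', // matches ERR1229296.sra but not ERR1229297.sra
  output: '*.txt', // checks if the output file matches the specified pattern
  params: 'listing.txt',
  name: 'Inspect the matched .sra file'
}, ({ input, params }) => `ls -l ${input} > ${params}`)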
Output works in a very similar way; however, there are a few specificities the user must be aware of:
- The output object is not the output filename; it is used only to match a pattern (e.g. the file extension) against the expected result of the task. It is nevertheless required for the task to resolve properly.
// this won't work!!!
{ output: 'myfile.txt' }

// rather, you should provide it as follows:
{
  output: '*.txt',
  params: { output: 'myfile.txt' }
}
Remember, task.output is used to match the output file pattern; if you want to give the output a specific filename, use the task.params.output object instead, where you can freely specify the output file name.
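To make this concrete, here is a sketch of a task that gives its output an explicit filename (copyFile and renamed.txt are hypothetical names used only for illustration):

// hypothetical sketch: the output pattern and the concrete filename live in different places
const copyFile = task({
  input: '*.txt',                    // matches the file produced by an earlier task
  output: '*.txt',                   // pattern the produced file must match
  params: { output: 'renamed.txt' }, // the actual output filename is chosen here
  name: 'Copy with an explicit output name'
}, ({ input, params }) => `cp ${input} ${params.output}`)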
Join is an operator that lets you run a number of tasks in a given order. For instance, suppose we are interested in creating a file and then writing to it, in two different steps. Let's first define a new task that we can perform after the task we called simpleTask:
const writeToFile = task({
  input: '*.txt', // specifies the pattern of the expected input
  output: '*.txt', // checks if the output file matches the specified pattern
  name: 'Write to file' // defines the name of the task
}, ({ input }) => `echo "some string" >> ${input}`)
So, the writeToFile task writes "some string" to the file we just created in the simpleTask task. However, the file needs to be created first, and only then can we write something to it. To achieve this ordering we use join:
Before building the pipeline, we first need to require join:
// === WATERMILL ===
const {
  task,
  join
} = require('bionode-watermill')
And then:
// this is a kiss example of how join works
const pipeline = join(simpleTask, writeToFile)

// executes the join itself
pipeline()
This operation will generate two directories inside the data folder: one for the first task (simpleTask), which creates a new file called test_file.txt, and one for the second task (writeToFile), which symlinks test_file.txt and writes to it, since we indicated that we would like to write to the same file we take as input. Note that, once again, files will be inside a directory named "data/<uid>/" (but in this case you will have two directories with distinct uids).
The above example is available in the tutorial repository as simple_join.js.
You can test the above example by running: node simple_join.js
Unlike join, junction allows you to run multiple tasks in parallel.
However, we will have to create a new task: if we simply replaced join with junction in the previous pipeline, we would end up with a file named test_file.txt with nothing written inside, because if you create the file and write to it at the same time, the write won't work, even though the file gets created.
But first, don't forget to:
// === WATERMILL ===
const {
  task,
  join,
  junction
} = require('bionode-watermill')
And only then:
// this will not produce the file with text in it!
const pipeline = junction(simpleTask, writeToFile)
So, we will define a new simple task:
const writeAnotherFile = task({
  output: '*.file', // checks if the output file matches the specified pattern
  params: 'another_test_file.file', // defines parameters to be passed to the task function
  name: 'Yet another task'
}, ({ params }) => `touch ${params} | echo "some new string" >> ${params}`)
And then execute the new pipeline:
// this is a kiss example of how junction works
const pipeline = junction(
  join(simpleTask, writeToFile), /* these joined tasks will be executed at the
  same time as the task below */
  writeAnotherFile
)

// executes the pipeline itself
pipeline()
This new pipeline consists of creating two files and writing text to them. Note that the writeAnotherFile task uses a shell pipe ("|") along with the shell commands touch and echo; this is a feature that bionode-watermill also supports. Of course, these are simple tasks that can be performed with shell commands alone (they are merely illustrative). Instead, as mentioned above, you can use JavaScript callback functions or promises as the final return of a task.
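As an illustrative sketch of that last point, a task can return a promise instead of a shell string. This example is an assumption for illustration only (in particular, whether a relative path like this resolves inside the task's data/<uid> directory depends on how bionode-watermill sets the working directory):

const fs = require('fs')

// hypothetical promise-based task: no shell command involved
const writeWithPromise = task({
  output: '*.txt', // checks if the output file matches the specified pattern
  params: 'promise_file.txt',
  name: 'Write a file from a promise'
}, ({ params }) => new Promise((resolve, reject) => {
  fs.writeFile(params, 'written from javascript\n', err => {
    if (err) return reject(err)
    resolve(params)
  })
}))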
Nevertheless, if you browse to the data folder, you should find three directories (because you ran three tasks): one with the text file generated by the first task, another with a symlink to that file (used to write to it), and finally a third in which you should have the file generated and written by the third task (named another_test_file.file).
The above example is available in the tutorial repository as simple_junction.js.
You can test the above example by running: node simple_junction.js
While junction handles two or more tasks at the same time, fork allows you to pass the output of two or more different tasks to the next task. Imagine you have two different files being generated by two different tasks and want to process them with the same task in the next step. In this case bionode-watermill uses fork to split the pipeline into two distinct branches that are then processed independently.
If you have something like:
join(
  taskA,
  fork(taskB, taskC),
  taskD
)
This will result in something like taskA -> taskB -> taskD' and taskA -> taskC -> taskD'', with two distinct final outputs for the pipeline. This is quite a useful feature for benchmarking programs, or for running multiple programs that perform the same type of analysis and comparing their results.
Importantly, the same type of pipeline with junction instead of fork,
join(
  taskA,
  junction(taskB, taskC),
  taskD
)
would result in the following workflow: taskA -> taskB, taskC -> taskD, where taskD has only one final result.
But enough talk, let's get to work!
First:
// === WATERMILL ===
const {
  task,
  join,
  fork
} = require('bionode-watermill')
For the fork tutorial, two tasks will be defined. Each of them creates a file and writes to it:
const simpleTask1 = task({
  output: '*.txt', // checks if the output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the task function
  name: 'task1: creating file 1' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "this is a string from first file" >> ${params}`)
const simpleTask2 = task({
  output: '*.txt', // checks if the output file matches the specified pattern
  params: 'another_test_file.txt', // defines parameters to be passed to the task function
  name: 'task 2: creating file 2'
}, ({ params }) => `touch ${params} | echo "this is a string from second file" >> ${params}`)
Then, a task to be performed after the fork, which will add the same text to these files:
const appendFiles = task({
  input: '*.txt', // specifies the pattern of the expected input
  output: '*.txt', // checks if the output file matches the specified pattern
  name: 'Write to files' // defines the name of the task
}, ({ input }) => `echo "after fork string" >> ${input}`)
And finally our pipeline execution:
// this is a kiss example of how fork works
const pipeline = join(
  fork(simpleTask1, simpleTask2),
  appendFiles
)

// executes the pipeline itself
pipeline()
This should result in four output directories in our data folder. Notice that, contrary to junction, where three tasks would render three output directories, with fork our pipeline produces four output directories, because the outputs from simpleTask1 and simpleTask2 were both processed by the appendFiles task.
The above example is available in the tutorial repository as simple_fork.js.
You can test the above example by running: node simple_fork.js