
Pydpiper is a set of Python libraries and associated executables intended primarily for running image processing pipelines on compute grids. It provides a domain-specific language (DSL) for constructing pipelines out of smaller components, wrappers for numerous command-line tools (currently largely MINC-centric, but expanding to some NIfTI- and ITK-based tools), code for constructing common pipeline topologies, and command-line wrappers to run some core pipelines.

Conceptual overview

Pydpiper code can be used from within Python or packaged into an application and called from the shell. Roughly speaking, the process is as follows: first, executing Pydpiper code determines the overall topology of a pipeline and the filenames of the inputs and outputs of each step, compiling a graph of "stages" to be scheduled for execution; second, the Pydpiper server spawns "executors" (either remote jobs on a compute grid or subprocesses on a local machine), which fetch stages (usually shell commands) from the server as their dependencies are satisfied and run them.

Overview of common arguments

General arguments include --pipeline-name (to avoid annoying defaults) and --num-executors. The latter could in principle be determined semi-automatically, but this is not currently done. If your grid setup assigns one processor per executor, a good choice is roughly the maximum "width" of the expensive parts of your pipeline graph. For model building alone, this is usually just the number of input files; if also running MAGeT, it might be 25 times the number of input files.
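
As a back-of-envelope illustration of that heuristic (a sketch only, assuming one processor per executor; the factor of 25 is simply the MAGeT multiplier quoted above):

```python
# Rough heuristic for choosing --num-executors, assuming one processor per
# executor (this is arithmetic only, not a Pydpiper API call).
n_inputs = 40                              # hypothetical number of input images

num_executors_model_building = n_inputs    # model building alone
num_executors_with_maget = 25 * n_inputs   # if also running MAGeT

print(num_executors_model_building, num_executors_with_maget)  # 40 1000
```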

Another important registration argument is --init-model, which specifies the path to a (currently MINC) file to which the input files will be rigidly aligned. Somewhat counterintuitively, the code expects the presence of a number of auxiliary files (TODO: link to documentation elsewhere). Alternatively, one may specify --initial-target or --bootstrap, but this is not recommended since these don't result in masks being applied to your inputs (although this is OK if you've masked them yourself and provided the masks to Pydpiper using the --csv-file argument).

You can (and in some cases must) specify various registration protocols in a Pydpiper-specific format; see the applications_testing/test_data directory for examples.

Other execution-related arguments (--mem, --time, --max-walltime) can be set in a site-wide config file specified by either the --config-file flag or the $PYDPIPER_CONFIG_FILE shell variable. See the config/ directory in the code repository for some examples.

Monitoring an executing pipeline

Running the included check_pipeline_status.py script with a pipeline's <pipeline_name>_uri file as its argument will provide a summary of running and finished stages, the number of running executors, and other information.

An important source of truth is the pipeline.log file created in the pipeline's output directory. You can control the logging level by setting the shell environment variable PYRO_LOGLEVEL (before program start) to one of DEBUG, INFO (the default), WARN, or ERROR. INFO reports information about stages starting and finishing, while WARN and ERROR will only report potential problems with execution.
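
As a minimal sketch, assuming you launch your pipeline from Python rather than exporting the variable in your shell (the MBM.py arguments here are placeholders, not a recommended invocation):

```python
import os
import subprocess

# PYRO_LOGLEVEL must be set in the environment before the pipeline starts.
env = {**os.environ, "PYRO_LOGLEVEL": "DEBUG"}

# Hypothetical invocation -- substitute your own application and arguments.
subprocess.run(["MBM.py", "--pipeline-name", "my_pipeline", "img_1.mnc"], env=env)
```

From a shell, exporting PYRO_LOGLEVEL (or prefixing the command with PYRO_LOGLEVEL=DEBUG) before starting the pipeline has the same effect.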

The <pipeline_name>_finished_stages file contains a rather uninformative list of completed stages by their number; in addition to counting the lines in this file, you can perform a join (using, e.g., Python's Pandas or R's tidyverse) with the <pipeline_name>_stages.txt file to determine which commands have run.
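
A minimal sketch of such a join with Pandas, assuming <pipeline_name>_finished_stages contains one completed stage number per line and <pipeline_name>_stages.txt lists one command per line in stage order (inspect your own files before relying on this layout):

```python
import pandas as pd

pipeline = "my_pipeline"  # hypothetical --pipeline-name

# Assumed layout: one completed stage number per line.
finished = pd.read_csv(f"{pipeline}_finished_stages", header=None, names=["stage"])

# Assumed layout: one command per line, ordered by stage number.
with open(f"{pipeline}_stages.txt") as f:
    stages = pd.DataFrame({"command": [line.rstrip("\n") for line in f]})
stages["stage"] = range(len(stages))

completed = finished.merge(stages, on="stage")
print(f"{len(completed)} of {len(stages)} stages finished")
print(completed["command"].tail())
```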

Each executor typically creates its own log file; these can be accessed in the logs/ subdirectory of the pipeline output directory, although it's sometimes a bit tedious to associate an executor with its stages. For the moment, grep is often a good option.

Individual stages also redirect their stdout/stderr to a log file; its path is reported in the pipeline.log file and at the command line in case of an error, and for single-output stages is typically of the form "[dir of output]/../log/[command producing output]/[output name without extension].log".
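
A small sketch of reconstructing that path for a given output file, following the pattern quoted above (illustrative only; the exact layout can vary between stage types):

```python
import os

def stage_log_path(output_file, command_name):
    """Follow the pattern '[dir of output]/../log/[command]/[output name
    without extension].log' described above."""
    out_dir = os.path.dirname(output_file)
    base, _ext = os.path.splitext(os.path.basename(output_file))
    return os.path.join(out_dir, "..", "log", command_name, base + ".log")

# Hypothetical output file and command name:
print(stage_log_path("pipeline/img_1/resampled/img_1_lsq6.mnc", "mincresample"))
# pipeline/img_1/resampled/../log/mincresample/img_1_lsq6.log
```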

Restarting

Pydpiper applications which don't finish processing (due to a crash, user cancellation, etc.) will remember previously finished computation across runs, provided the relevant directories and options aren't changed. This is based on a traversal of the dependency graph, so, e.g., restarting with a change to the nonlinear registration options won't affect the earlier parts of the pipeline (unless these options are also used for masking). At the moment, restarting uses the <pipeline_name>_finished_stages file rather than the actual output files on disk, so deleting output files will cause errors on a restart, while deleting the finished_stages file will cause the pipeline to restart from the beginning.

Tips

  • Keep your pipeline name (--pipeline-name) and ideally your input filenames relatively short. Our filename propagation is currently rather unwieldy, and longer paths risk exceeding certain program-specific filename length limits, preventing the pipeline from starting.
  • In principle one can start additional executors (via pipeline_executor.py --uri-file ... --num-executors ...) from the command line, but as we rarely do this, we're not certain how well it works.
  • You can supply a CSV file listing your input files and, optionally, their masks; it should have two columns named file and mask_file. This works for MBM.py and MAGeT.py (not for NLIN.py); see the sketch after this list.
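
A minimal sketch of building such a CSV with Pandas (the filenames are hypothetical, and per the note above the masks are optional):

```python
import pandas as pd

# The 'file' and 'mask_file' column names are the ones Pydpiper expects;
# the image and mask filenames below are purely hypothetical.
inputs = pd.DataFrame({
    "file":      ["img_1.mnc", "img_2.mnc", "img_3.mnc"],
    "mask_file": ["img_1_mask.mnc", "img_2_mask.mnc", "img_3_mask.mnc"],
})
inputs.to_csv("inputs.csv", index=False)
```

The resulting file can then be passed to MBM.py or MAGeT.py via the --csv-file argument mentioned above.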

Pydpiper 2

  • MBM.py
  • MAGeT.py
  • twolevel_model_building.py
  • registration_chain.py

Pydpiper 1

The pydpiper (version 1) wiki currently lives here:

https://wiki.mouseimaging.ca/display/MICePub/Pydpiper