Package website: release | dev
R package to interface some popular parallelization back-ends with a unified interface.
-
Official CRAN release site: https://cran.r-project.org/web/packages/parallelMap/index.html
-
Development version:
remotes::install_github("mlr-org/parallelMap")
parallelMap is considered retired from the mlr-org team. We won't add new features anymore and will only fix severe bugs. We suggest to use other parallelization frameworks such as the future package. The new mlr3 framework also relies on the future package for parallelization and not on parallelMap anymore as mlr did.
parallelMap was written with users in mind who want a unified parallelization procedure in R that
- Works equally well in interactive operations as in developing packages where some operations should offer the possibility to be run in parallel by the client user of your package
- Allows the client user of your developed package to completely configure the parallelization from the outside
- Allows you to be lazy and forgetful. This entails: The same interface for every back-end and everything is easily configurable via options
- Supports the most important parallelization modes. For me, these currently are: usage of multiple cores on a single machine, socket mode (because it also works on Windows), MPI and HPC clusters (the latter interfaced by our BatchJobs package)
- Does not make debugging annoying and tedious.
Here is a short tutorial that already contains the most important concepts and operations:
##### Example 1) #####
library(parallelMap)
parallelStartSocket(2) # start in socket mode and create 2 processes on localhost
f = function(i) i + 5 # define our job
y = parallelMap(f, 1:2) # like R's Map but in parallel
parallelStop() # turn parallelization off again
If you want to use other modes of parallelization, call the appropriate initialization procedure, all of them are documented in parallelStart. parallelStart()
is a catch-all procedure, that allows to set all possible options of the package, but for every mode a variant of parallelStart()
exists with a smaller, appropriate interface.
In many (more complex) applications you somehow need to initialize the slave processes, especially for MPI, socket and BatchJobs mode, where fresh R processes are started. This means: loading of packages, sourcing files with function and object definitions and exporting R objects to the global environment of the slaves.
parallelMap supports these operations with the following three functions
Let's start with loading a package on the slaves. Of course you could put a require("mypackage")
into the body of f
, but you can also use a parallelLibrary()
before calling parallelMap()
.
##### Example 2) #####
library(parallelMap)
parallelStartSocket(2)
parallelLibrary("MASS")
# subsample iris, fit an LDA model and return prediction error
f = function(i) {
n = nrow(iris)
train = sample(n, n/2)
test = setdiff(1:n, train)
model = lda(Species~., data=iris[train,])
pred = predict(model, newdata=iris[test,])
mean(pred$class != iris[test,]$Species)
}
y = parallelMap(f, 1:2)
parallelStop()
And here is a further example where we export a big matrix to the slaves, then apply a preprocessing function to it, which is defined in source file.
##### Example 3) #####
library(parallelMap)
parallelStartSocket(2)
parallelSource("preproc.R") # contains definition of preproc()
bigmatrix = matrix(1, nrow=500, ncol=500)
parallelExport("bigmatrix")
f = function(i) {
p = preproc(bigmatrix)
p + i
}
y = parallelMap(f, 1:2)
parallelStop()
On a given system, you will probably always parallelize you operations in a similar fashion. For this reason, parallelMap()
allows you to define defaults for all relevant settings through R's option mechanism in , e.g., your R profile.
Let's assume on your office PC you run some Unix-like operating system and have 4 cores at your disposal.
You are also an experienced user and don't need parallelMap()
's "chatting" on the console anymore.
Define these lines in your R profile:
options(
parallelMap.default.mode = "multicore",
parallelMap.default.cpus = 4,
parallelMap.default.show.info = FALSE
)
This allows you to save some typing as running parallelStart()
will now be equivalent to parallelStart(mode = "multicore", cpus = 4, show.info = FALSE)
so "Example 1" would become:
parallelStart()
f = function(i) i + 5
y = parallelMap(f, 1:2)
parallelStop()
You can later always overwrite settings be explicitly passing them to parallelStart()
, so
parallelStart(cpus=2)
f = function(i) i + 5
y = parallelMap(f, 1:2)
parallelStop()
would use your default "multicore" mode and still disable parallelMap()
's info messages on the console, but decrease cpu usage to 2.
The following options are currently available:
parallelMap.default.mode = "local" / "multicore" / "socket" / "mpi" / "BatchJobs"
parallelMap.default.cpus = <integer>
parallelMap.default.level = <string> or NA
parallelMap.default.socket.hosts = character vector of host names where to spawn in socket mode
parallelMap.default.show.info = TRUE / FALSE
parallelMap.default.logging = TRUE / FALSE
parallelMap.default.storagedir = <path>, must be on a shared file system for master / slaves
For their precise meaning please read the documentation of parallelStart()
.
Sometimes it is useful to have more control over which parallelMap()
operation is actually parallelized.
You can tag parallelMap operations with a so-called "level", basically a name
or category associated with the operation. Usually you would do this in a client package, but you can also do it in custom code.
For packages, register the level(s) that you define in zzz.R
to tell parallelMap
about them.
Here is an example from mlr's
zzz.R
where we call this in .onAttach()
.onAttach = function(libname, pkgname) {
# ...
parallelRegisterLevels(package = "mlr", levels = c("benchmark", "resample", "selectFeatures", "tuneParams"))
}
Later on the user can ask what levels are available, for example
library(mlr)
parallelGetRegisteredLevels()
> mlr: mlr.benchmark, mlr.resample, mlr.selectFeatures, mlr.tuneParams
The output shows the registered levels for each package; in this example, only one package is loaded that provides levels.
In the client package, the tagging of the parallelMap
operation is done through
the level
argument:
parallelMap(myfun, 1:n, level = "package.levelname")
In mlr, we tag parallel operations with such a level, e.g., here.
The user of the package can now set the level when starting the parallel backend, again through the level
argument:
parallelStartSocket(ncpus = 2L, level = "package.levelname")
Parallelization is now performed as follows:
- If no level is set in
parallelStart()
, the first encounteredparallelMap()
call on the master is parallelized, whether it has a tag or not. - If a level is set in the call to
parallelStart()
, only theparallelMap()
calls which have exactly this level set and run on the master are parallelised. - No further parallelization is done if we are already on a slave, i.e. if the
parent call has already been parallelised through
parallelMap()
.
Please read the documentation of
for more detailed information regarding this topic.