SLURM in Linux Containers
A set of scripts to easily deploy a SLURM cluster on a single machine using Linux Containers. The goal is mostly SLURM development. Any other ideas/usages :)?
Prerequisites: the `screen` tool.
- Install Linux Containers (LXC)
- Configure LXC (the following is Ubuntu/Mint specific; for other distributions, check their manuals for the proper paths and configuration file names):
  - Set up LXC networking (`/etc/default/lxc-net`):

    ```
    USE_LXC_BRIDGE="true"
    LXC_DHCP_CONFILE=/etc/lxc/dnsmasq.conf
    LXC_DOMAIN="lxc"
    ```
  - Change `/etc/lxc/dnsmasq.conf`, adding the following line:

    ```
    conf-file=$SLXC_PATH/build/dnsmasq.conf
    ```
  - If you face problems, check https://github.com/lxc/lxc/pull/285/files (look in /etc/apparmor.d/abstractions/lxc/start-container).
- Install Munge in `MUNGE_PATH` (under `someuser`). NOTE that munge-0.5.11 has problems with user-defined prefix installation (see https://code.google.com/p/munge/issues/detail?id=34 for the details). The mentioned issue report includes a patch that temporarily fixes this problem; alternatively, use a more recent version that has the problem fixed.
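  Munge uses a standard autotools build, so a user-prefix install is a sketch like the following (the version number and prefix are illustrative, not prescribed by these scripts):

  ```
  $ export MUNGE_PATH=$HOME/munge                  # example prefix
  $ tar xf munge-0.5.14.tar.xz && cd munge-0.5.14  # any version without issue 34
  $ ./configure --prefix=$MUNGE_PATH
  $ make && make install
  ```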
- [Optional] If the SLURM_USER is not root and you plan to submit jobs as a user USER1 != SLURM_USER:
  - Apply the patch from the SLURM source directory:

    ```
    patch -p1 < <slxc_path>/patch/start_from_user.patch
    ```
- Install SLURM in `SLURM_PATH` (under `someuser`). Make additional directories in SLURM's prefix:

  ```
  mkdir $SLURM_PATH/var $SLURM_PATH/etc
  ```
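  As with Munge, SLURM uses an autotools build; a minimal sketch, assuming the `MUNGE_PATH`/`SLURM_PATH` prefixes from the previous steps (`--with-munge` points SLURM at the custom Munge install):

  ```
  $ export SLURM_PATH=$HOME/slurm        # example prefix
  $ ./configure --prefix=$SLURM_PATH --with-munge=$MUNGE_PATH
  $ make && make install
  $ mkdir $SLURM_PATH/var $SLURM_PATH/etc
  ```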
- Configure SLURM and put its configuration in `$SLURM_PATH/etc/slurm.conf`. While configuring, select your favorite domain names for the frontend and compute nodes. Here we will use `frontend` and `cnX`.
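  A minimal illustrative fragment of such a configuration (all values are examples; older SLURM versions use `ControlMachine` where newer ones use `SlurmctldHost`, and paths must be spelled out literally because slurm.conf does not expand environment variables):

  ```
  ClusterName=slxc
  ControlMachine=frontend                      # SlurmctldHost=frontend on newer SLURM
  SlurmUser=someuser
  StateSaveLocation=/home/someuser/slurm/var   # i.e. your $SLURM_PATH/var
  NodeName=cn[1-2] CPUs=1 State=UNKNOWN
  PartitionName=debug Nodes=cn[1-2] Default=YES State=UP
  ```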
- Put the SLURM and Munge installation paths into `$SLXC_PATH/slxc.conf`.
- Set `SLURM_USER` to `someuser` in `$SLXC_PATH/slxc.conf`.
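  Together, the two settings above give an `slxc.conf` along these lines (the variable names here follow this README's naming; check the slxc.conf shipped with the scripts for the exact spelling):

  ```
  SLURM_PATH=/home/someuser/slurm
  MUNGE_PATH=/home/someuser/munge
  SLURM_USER=someuser
  ```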
- Create the cluster machines with `slxc-new-node.sh`. Its only argument is the machine hostname. NOTE that you must use the same frontend/compute node names as in `$SLURM_PATH/etc/slurm.conf`.
  - Create the frontend first (let's call it "frontend", for example):

    ```
    $ $SLXC_PATH/slxc-new-node.sh frontend
    ```
  - Create the compute nodes (cn1, cn2, ..., cnN):

    ```
    $ for i in $(seq 1 N); do $SLXC_PATH/slxc-new-node.sh cn$i; done
    ```
- [Optional] Add the Munge and SLURM installation paths to your `PATH` environment variable, and `export SLURM_CONF=$SLURM_PATH/etc/slurm.conf` to let `sinfo`, `sbatch`, and the other client commands know how to reach `slurmctld`.
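  For example, in your shell profile (the exact bin/sbin layout is the usual autotools one and is illustrative here):

  ```
  export PATH=$SLURM_PATH/bin:$SLURM_PATH/sbin:$MUNGE_PATH/bin:$PATH
  export SLURM_CONF=$SLURM_PATH/etc/slurm.conf
  ```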
- Restart the lxc-net service (for Ubuntu/Mint):

  ```
  $ sudo service lxc-net restart
  ```
- [Optional] If the SLURM_USER is not root and you plan to submit jobs as a user USER1 != SLURM_USER:
  - Set up the SLURM capabilities:

    ```
    $ sudo ./slurm-set-capabilities.sh
    ```
- Start your cluster:

  ```
  $ sudo ./slxc-run-cluster.sh
  ```
- Verify that everything is OK (both tools should show all your virtual "machines" running):

  ```
  $ sudo screen -ls
  $ sudo lxc-ls --active
  ```
- Now you can attach to any machine with:

  ```
  $ sudo lxc-attach -n $nodename
  ```
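  For instance, a quick smoke test of the cluster (assuming the node names used above; `sinfo` and `srun` are run inside the container):

  ```
  $ sudo lxc-attach -n frontend   # enter the frontend container
  $ sinfo                         # inside: should list the cnX nodes
  $ srun -N2 hostname             # inside: trivial two-node test job
  ```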
- To shut down your cluster, use:

  ```
  $ ./slxc-stop-cluster.sh
  ```

- NOTE that it may take a while. You can speed up this process by setting `LXC_SHUTDOWN_TIMEOUT` in `/etc/default/lxc` (for Ubuntu and Mint).
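  For example (the timeout is in seconds; the value here is illustrative, not a recommendation):

  ```
  # /etc/default/lxc
  LXC_SHUTDOWN_TIMEOUT=10
  ```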
That seems to be all. Enjoy!