Horovod in Docker

To streamline installation on GPU machines, we have published a reference Dockerfile so you can get started with Horovod in minutes. The container includes examples in the /examples directory.

Pre-built Docker containers with Horovod are available on DockerHub.

Building

Before building, you can modify Dockerfile.gpu to your liking, e.g. select a different CUDA, TensorFlow or Python version.

$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu

For users without GPUs available in their environments, we've also published a CPU Dockerfile you can build and run similarly.

Running on a single machine

After the image is built, run it using nvidia-docker.

Note: Instead of building the image yourself, you can replace horovod:latest with a specific pre-built Horovod image from DockerHub.

$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py

If you don't run your container in privileged mode, you may see the following message:

[a8c9914754d2:00040] Read -1, expected 131072, errno = 1

You can ignore this message.
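
To see what the errno value in that message refers to, the short sketch below looks up error number 1 with Python's standard library. On Linux it maps to EPERM (operation not permitted), which is consistent with the message appearing only when the container is not privileged. The helper name describe_errno is hypothetical and is not part of Horovod.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # errno = 1 is the value reported in the message above.
    name, message = describe_errno(1)
    print(f"errno 1 -> {name}: {message}")
```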

Running on multiple machines

This example assumes a shared filesystem mounted at /mnt/share and a common port, 12345, for the SSH daemon that runs in every container. The directory /mnt/share/ssh contains a typical id_rsa and authorized_keys pair that allows passwordless authentication.

Note: These are not hard requirements, but they make the example more concise. Instead of using a shared filesystem, you can rsync the SSH configuration and code across machines. Instead of using a common SSH port, you can define machine-specific ports in the /root/.ssh/config file.

Primary worker:

host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py

Secondary workers:

host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
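
In the primary-worker command, the -np value (16) matches the total number of slots listed after -H (four hosts with four slots each). As a minimal sketch, assuming you keep the host list in a Python dictionary, the hypothetical helper below derives both arguments from that dictionary so the two values stay in step. The function name and the dictionary layout are illustrative and are not part of Horovod; only the flags shown in the command above are used.

```python
def horovodrun_command(slots_per_host, ssh_port, script):
    """Build a horovodrun argument list from a {hostname: slots} mapping."""
    if not slots_per_host:
        raise ValueError("at least one host is required")
    # -H takes comma-separated host:slots pairs, as in the example above.
    hosts = ",".join(f"{host}:{slots}" for host, slots in slots_per_host.items())
    # -np is set to the total number of slots across all hosts.
    total = sum(slots_per_host.values())
    return ["horovodrun", "-np", str(total), "-H", hosts,
            "-p", str(ssh_port), "python", script]


if __name__ == "__main__":
    cmd = horovodrun_command(
        {"host1": 4, "host2": 4, "host3": 4, "host4": 4},
        ssh_port=12345,
        script="keras_mnist_advanced.py",
    )
    print(" ".join(cmd))
```

Returning a list rather than a single string means the result can be passed to subprocess.run without shell quoting.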

Adding Mellanox RDMA support

If you have Mellanox NICs, we recommend that you mount your Mellanox devices (/dev/infiniband) in the container and enable the IPC_LOCK capability for memory registration:

$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...

You need to specify these additional options on both the primary and secondary workers.
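
Because the same options have to be repeated on every worker, it can help to assemble the command in one place. The sketch below is a hypothetical Python helper (not part of Horovod or Docker) that builds the nvidia-docker argument list from the flags used in this document and adds the two RDMA-related options only when requested.

```python
def docker_run_args(image, ssh_dir, rdma=False, command=None):
    """Build an nvidia-docker argument list using the flags from this guide."""
    args = ["nvidia-docker", "run", "-it", "--network=host",
            "-v", f"{ssh_dir}:/root/.ssh"]
    if rdma:
        # Allow memory registration and expose the Mellanox devices.
        args += ["--cap-add=IPC_LOCK", "--device=/dev/infiniband"]
    args.append(image)
    if command:
        # Secondary workers pass a command, e.g. one that starts sshd.
        args += ["bash", "-c", command]
    return args


if __name__ == "__main__":
    # Secondary worker with RDMA options enabled.
    print(docker_run_args(
        "horovod:latest", "/mnt/share/ssh", rdma=True,
        command="/usr/sbin/sshd -p 12345; sleep infinity",
    ))
```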

Running containers with different ports

To run in situations without a common SSH port (e.g., multiple containers on the same host):

1. Configure your ~/.ssh/config file to assign a custom host name and port to each container:

Host host1
  HostName 192.168.1.10
  Port 1234

Host host2
  HostName 192.168.1.10
  Port 2345

2. Use horovodrun directly, as though each container were a separate host with its own IP:
```
$ horovodrun -np 8 -H host1:4,host2:4 python keras_mnist_advanced.py
```
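
When many containers share a host, writing the SSH entries by hand becomes repetitive. As a minimal sketch, the hypothetical helper below renders Host entries in the same format as the configuration above from a list of (alias, address, port) tuples. The function name and input layout are assumptions made for illustration.

```python
def render_ssh_config(entries):
    """Render ~/.ssh/config Host blocks from (alias, hostname, port) tuples."""
    blocks = []
    for alias, hostname, port in entries:
        blocks.append(
            f"Host {alias}\n"
            f"  HostName {hostname}\n"
            f"  Port {port}\n"
        )
    # Separate blocks with a blank line, as in the example above.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Two containers on the same machine, reachable on different ports.
    print(render_ssh_config([
        ("host1", "192.168.1.10", 1234),
        ("host2", "192.168.1.10", 2345),
    ]))
```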
