Added a Hydra+wandb+submitit-launcher example (#79)
* WIP started coding the wandb hydra example for JZ

* added a mnist example in the hydra style

* slight corrections in train mnist

* added configurations for hydra

* added wandb to example

* expanded readme for the example

* corrected config relative placement

* corrected project id

* corrected timeout min

* made the wandb imports lazy

* added more self promotion and potential correction to qos

* corrected typo on wandb in config

* added the functions in the script to keep the imports of tensorflow and wandb lazy

* made compile an integral part of the config

* corrected name of sync all package

* made sync all script a gist to have more stability

* removed need for project id and got it from the env

* removed default in project id

* added explanation on sbatch -c

* added an introduction and some alternatives to the packages introduced

* replaced my sub scripts (ill-suited) with the custom launcher I developed, listed in similar resources

* added spec that my plugin needs to specify wandb mode to offline

* specified that wandb requires an account

* added the example to the page tree

Co-authored-by: zaccharieramzi <[email protected]>
zaccharieramzi authored Jan 11, 2022
1 parent 3de69fe commit 912226b
Showing 6 changed files with 204 additions and 0 deletions.
88 changes: 88 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/README.md
@@ -0,0 +1,88 @@
# [Weights&Biases - Hydra](https://github.com/jean-zay-users/jean-zay-doc/tree/master/docs/examples/tf/tf_wandb_hydra)

Weights&Biases and Hydra are two tools commonly used in machine learning projects.
Weights&Biases lets you easily save a lot of information about your experiments in the cloud: metadata, system data, model weights and, of course, your metrics and logs.
Hydra is a configuration management tool that allows you to build command-line interfaces and write robust, readable configuration files.
These two tools combine elegantly, but setting them up on Jean Zay is not straightforward.
This example shows how to set up both tools on Jean Zay with a TensorFlow training script.

## Installation

To run this example, you need to clone the jean-zay-doc repo in your `$WORK` dir:
```
cd $WORK &&\
git clone https://github.com/jean-zay-users/jean-zay-doc.git
```

You can then install the requirements:
```
module purge
module load tensorflow-gpu/py3/2.6.0
pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt
```

## Run
In order to run the example on SLURM you can just issue the following command from the example directory:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1
```
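
If you just want a quick check of the script itself, you can also run it without the launcher, on a front-end node or your own machine. Below is a minimal sketch, assuming that `wandb.init` accepts `mode='disabled'` (which recent `wandb` versions do), so no W&B account is needed for this test:
```
python train_mnist.py wandb.mode=disabled fit.epochs=1
```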

### SLURM parametrization
Different parameters can be set for the SLURM job using the `hydra.launcher` config group.
For example, to launch a longer job you can use:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'
```

If you want to use more GPUs:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4
```
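
If you are unsure which launcher parameters can be overridden this way, Hydra can print the launcher config without submitting anything. This is a sketch, assuming a recent Hydra version where the `--cfg` and `--package` flags are available:
```
python train_mnist.py hydra/launcher=base +hours=1 --cfg hydra --package hydra.launcher
```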

### Weights&Biases
Using `wandb` requires a [Weights&Biases account](https://wandb.ai/).
`wandb` runs in offline mode because the compute nodes are not connected to the internet.
In order to have the results uploaded to the cloud, you need to sync them manually with the `wandb sync run_dir` command.
The run directories are located in `$SCRATCH/wandb/jean-zay-doc`, but this can be changed using the `wandb.dir` config variable.
You can also sync runs from a front-end node before they are finished, for example with the script available [here](https://gist.github.com/zaccharieramzi/3e1abc67aefac106ede2883c56ac8e1a).
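
Below is a minimal sketch of how such a sync could be scripted from a front-end node, assuming the default `wandb.dir` used in this example and the usual `wandb` layout where offline runs live in `wandb/offline-run-*` subdirectories:
```
for run_dir in $SCRATCH/wandb/jean-zay-doc/wandb/offline-run-*; do
    wandb sync "$run_dir"
done
```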

### Hydra and submitit outputs
The outputs created by Hydra and submitit are located in the `multirun` directory.
You can change this location by setting the `hydra.sweep.dir` config variable.
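For example, to group all multirun outputs of an experiment under a dedicated directory (a sketch, assuming a Hydra 1.1-style configuration where the multirun output directory is `hydra.sweep.dir`):
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 hydra.sweep.dir=$SCRATCH/hydra_outputs/mnist
```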

### Batch jobs
In order to batch multiple similar jobs, you can use Hydra's sweep feature.
For example, to run several trainings with different batch sizes:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128
```

This can be extended to a grid search over a Cartesian product of parameters, for example:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
```

## Similar resources

- [slurm-hydra-submitit](https://github.com/RaphaelMeudec/slurm-hydra-submitit) presents a similar setup in a more general form, for any SLURM cluster and without W&B. In particular, it shows [how to run a grid search over specific parameter combinations](https://github.com/RaphaelMeudec/slurm-hydra-submitit#specific-parameters-combinations).
- [jz-hydra-submitit-launcher](https://github.com/zaccharieramzi/jz-hydra-submitit-launcher) is a pip-installable (`pip install jz-hydra-submitit-launcher`) custom launcher with the correct defaults for Jean Zay and several preset configurations:
```
hydra-submitit-launch train_mnist.py dev hydra.launcher.setup=\["'#SBATCH -C v100-32g'","'export WANDB_MODE=offline'"\]
```


## References
- Weights&Biases: https://wandb.ai/site
- Hydra: https://hydra.cc/
- Submitit: https://github.com/facebookincubator/submitit
- Hydra submitit launcher: https://hydra.cc/docs/plugins/submitit_launcher/

## Alternatives

To Weights&Biases:
- MLflow
- TensorBoard

To Hydra:
- argparse
- click
24 changes: 24 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/conf/config.yaml
@@ -0,0 +1,24 @@
data:
  n_features: 784
  n_classes: 10

model:
  input_shape: ${data.n_features}
  output_num: ${data.n_classes}

fit:
  epochs: 5
  batch_size: 64
  validation_split: 0.2

compile:
  optimizer: rmsprop

wandb:
  project: jean-zay-doc
  notes: "Hydra-wandb-submitit exp"
  tags:
    - hydra
    - tuto
  dir: "${oc.env:SCRATCH,.}/wandb/jean-zay-doc"
  mode: null
19 changes: 19 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
@@ -0,0 +1,19 @@
defaults:
  - submitit_slurm

timeout_min: 60
gpus_per_node: 1
tasks_per_node: 1
gres: "gpu:${hydra.launcher.gpus_per_node}"
qos: qos_gpu-dev
cpus_per_gpu: 10
gpus_per_task: ${hydra.launcher.gpus_per_node}
additional_parameters:
  account: ${oc.env:IDRPROJ}@gpu
  distribution: "block:block"
  hint: nomultithread
  time: "${hours}:00:00"
setup:
  - "#SBATCH -C v100-32g"  # this setup is needed here and not in additional parameters
  # because otherwise it will be difficult to remove at run time.
  - "export WANDB_MODE=offline"
3 changes: 3 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/requirements.txt
@@ -0,0 +1,3 @@
hydra-core
wandb
hydra-submitit-launcher
69 changes: 69 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/train_mnist.py
@@ -0,0 +1,69 @@
# all taken from https://www.tensorflow.org/guide/keras/functional
from pathlib import Path

import hydra
from omegaconf import OmegaConf


@hydra.main(config_path='conf', config_name='config')
def train_dense_model_main(cfg):
    return train_dense_model(cfg)


def train_dense_model(cfg):
    # limit imports outside the call to the function, in order to launch quickly
    # when using submitit
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    import wandb
    from wandb.keras import WandbCallback

    def my_model(input_shape=784, output_num=10, activation='relu', hidden_size=64):
        inputs = keras.Input(shape=input_shape)
        x = layers.Dense(hidden_size, activation=activation)(inputs)
        x = layers.Dense(hidden_size, activation=activation)(x)
        outputs = layers.Dense(output_num)(x)
        return keras.Model(inputs=inputs, outputs=outputs, name='mnist_model')

    def model_compile(model, loss='xent', optimizer='rmsprop'):
        if loss == 'xent':
            loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=['accuracy'])

    def data(n_train=60_000, n_test=10_000, n_features=784, n_classes=10):
        # training and inference data
        # the network is not reachable from the compute nodes, so we use random data
        x_train = tf.random.normal((n_train, n_features), dtype='float32')
        x_test = tf.random.normal((n_test, n_features), dtype='float32')
        y_train = tf.random.uniform((n_train,), minval=0, maxval=n_classes, dtype='int32')
        y_test = tf.random.uniform((n_test,), minval=0, maxval=n_classes, dtype='int32')
        return x_train, x_test, y_train, y_test

    # wandb setup
    Path(cfg.wandb.dir).mkdir(exist_ok=True, parents=True)
    wandb.init(
        config=OmegaConf.to_container(cfg, resolve=True),
        **cfg.wandb,
    )
    callbacks = [
        WandbCallback(monitor='loss', save_weights_only=True),
    ]

    # model building
    tf.keras.backend.clear_session()
    model = my_model(**cfg.model)
    model_compile(model, **cfg.compile)
    x_train, x_test, y_train, y_test = data(**cfg.data)
    history = model.fit(x_train, y_train, **cfg.fit, callbacks=callbacks)
    test_scores = model.evaluate(x_test, y_test, verbose=2)
    print('Test loss:', test_scores[0])
    print('Test accuracy:', test_scores[1])
    return True


if __name__ == '__main__':
    train_dense_model_main()
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -73,3 +73,4 @@ nav:
- Single node: examples/tf/tf_simple/README.md
- Distributed with SlurmClusterResolver: examples/tf/tf_distributed/README.md
- Distributed with Horovod: examples/tf/tf_mpi/README.md
- Weights&Biases and Hydra: examples/tf/tf_wandb_hydra/README.md
