From 912226b4187e2dea7614bfef94af725de62f2345 Mon Sep 17 00:00:00 2001
From: Zaccharie Ramzi
Date: Tue, 11 Jan 2022 17:10:48 +0100
Subject: [PATCH] Added a Hydra+wandb+submitit-launcher example (#79)

* WIP started coding the wandb hydra example for JZ
* added a mnist example in the hydra style
* slight corrections in train mnist
* added configurations for hydra
* added wandb to example
* expanded readme for the example
* corrected config relative placement
* corrected project id
* corrected timeout min
* made the wandb imports lazy
* added more self promotion and potential correction to qos
* corrected typo on wandb in config
* added the functions in the script to keep the imports of tensorflow and wandb lazy
* made compile an integral part of the config
* corrected name of sync all package
* made sync all script a gist to have more stability
* removed need for project id and got it from the env
* removed default in project id
* added explanation on sbatch -c
* added an introduction and some alternatives to the packages introduced
* replaced my sub scripts (inadapted) with the custom launcher I developped in similar resources
* added spec that my plugin needs to specify wandb mode to offline
* specified that wandb requires an account
* added the example to the page tree

Co-authored-by: zaccharieramzi
---
 docs/examples/tf/tf_wandb_hydra/README.md     | 88 +++++++++++++++++++
 .../tf/tf_wandb_hydra/conf/config.yaml        | 24 +++++
 .../conf/hydra/launcher/base.yaml             | 19 ++++
 .../tf/tf_wandb_hydra/requirements.txt        |  3 +
 .../examples/tf/tf_wandb_hydra/train_mnist.py | 69 +++++++++++++++
 mkdocs.yml                                    |  1 +
 6 files changed, 204 insertions(+)
 create mode 100644 docs/examples/tf/tf_wandb_hydra/README.md
 create mode 100644 docs/examples/tf/tf_wandb_hydra/conf/config.yaml
 create mode 100644 docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
 create mode 100644 docs/examples/tf/tf_wandb_hydra/requirements.txt
 create mode 100644 docs/examples/tf/tf_wandb_hydra/train_mnist.py

diff --git a/docs/examples/tf/tf_wandb_hydra/README.md b/docs/examples/tf/tf_wandb_hydra/README.md
new file mode 100644
index 0000000..f45e3b0
--- /dev/null
+++ b/docs/examples/tf/tf_wandb_hydra/README.md
@@ -0,0 +1,88 @@
+# [Weights&Biases - Hydra](https://github.com/jean-zay-users/jean-zay-doc/tree/master/docs/examples/tf/tf_wandb_hydra)
+
+Weights&Biases and Hydra are two tools commonly used in machine learning projects.
+Weights&Biases lets you save a wealth of information about your experiments in the cloud: metadata, system data, model weights and, of course, your metrics and logs.
+Hydra is a configuration management tool that lets you build command line interfaces and write robust, readable configuration files.
+These two tools combine very elegantly and easily, but their setup on Jean Zay is not straightforward.
+In this example, we show how to set up both tools on Jean Zay with a TensorFlow example.
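+
+As a first taste of how the two tools fit together, here is a minimal sketch (an illustration only, not one of the example files below) of a Hydra-decorated entry point that starts an offline Weights&Biases run; the full, working script is `train_mnist.py` in this directory:
+
+```python
+# minimal sketch: Hydra builds the config, W&B records the run
+from pathlib import Path
+
+import hydra
+from omegaconf import OmegaConf
+import wandb
+
+
+@hydra.main(config_path='conf', config_name='config')
+def main(cfg):
+    # compute nodes have no internet access: the run is written locally
+    # and uploaded later with `wandb sync`
+    Path(cfg.wandb.dir).mkdir(exist_ok=True, parents=True)
+    wandb.init(config=OmegaConf.to_container(cfg, resolve=True), **cfg.wandb)
+    wandb.log({'loss': 0.0})  # replace with real training metrics
+    wandb.finish()
+
+
+if __name__ == '__main__':
+    main()
+```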
+
+## Installation
+
+To run this example, you need to clone the jean-zay-doc repo in your `$WORK` dir:
+```
+cd $WORK &&\
+git clone https://github.com/jean-zay-users/jean-zay-doc.git
+```
+
+You can then install the requirements:
+```
+module purge
+module load tensorflow-gpu/py3/2.6.0
+pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt
+```
+
+## Run
+To run the example on SLURM, simply issue the following command from the example directory:
+```
+python train_mnist.py --multirun hydra/launcher=base +hours=1
+```
+
+### SLURM parametrization
+Different parameters can be set for the SLURM job using the `hydra.launcher` config group.
+For example, to launch a longer job you can use:
+```
+python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'
+```
+
+If you want to use more GPUs:
+```
+python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4
+```
+
+### Weights&Biases
+Using `wandb` requires a [Weights&Biases account](https://wandb.ai/).
+`wandb` is run offline because the compute nodes are not connected to the internet.
+To have the results uploaded to the cloud, you need to sync them manually with the `wandb sync run_dir` command.
+The run directories are located in `$SCRATCH/wandb/jean-zay-doc`, but this can be changed via the `wandb.dir` config variable.
+You can also run a script on a front node to sync the runs before they finish, for example the script [here](https://gist.github.com/zaccharieramzi/3e1abc67aefac106ede2883c56ac8e1a).
+
+### Hydra and submitit outputs
+The outputs created by Hydra and submitit are located in the `multirun` directory.
+You can change this location by setting the `hydra.dir` config variable.
+
+### Batch jobs
+To batch multiple similar jobs, you can use Hydra's sweep feature.
+For example, to run multiple trainings with different batch sizes:
+```
+python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128
+```
+
+This extends to a grid search over a Cartesian product, for example:
+```
+python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
+```
+
+## Similar resources
+
+- [slurm-hydra-submitit](https://github.com/RaphaelMeudec/slurm-hydra-submitit) presents a similar setup in a more general form for any SLURM cluster, without W&B. In particular, it explains [how to run a grid search over specific parameter combinations](https://github.com/RaphaelMeudec/slurm-hydra-submitit#specific-parameters-combinations).
+- [jz-hydra-submitit-launcher](https://github.com/zaccharieramzi/jz-hydra-submitit-launcher) is a pip-installable (`pip install jz-hydra-submitit-launcher`) custom launcher with the correct defaults for Jean Zay and several preset configurations:
+```
+hydra-submitit-launch train_mnist.py dev hydra.launcher.setup=\["'#SBATCH -C v100-32g'","'export WANDB_MODE=offline'"\]
+```
+
+
+## References
+- Weights&Biases: https://wandb.ai/site
+- Hydra: https://hydra.cc/
+- Submitit: https://github.com/facebookincubator/submitit
+- Hydra submitit launcher: https://hydra.cc/docs/plugins/submitit_launcher/
+
+## Alternatives
+
+To Weights&Biases:
+- MLflow
+- TensorBoard
+
+To Hydra:
+- argparse
+- click
\ No newline at end of file
diff --git a/docs/examples/tf/tf_wandb_hydra/conf/config.yaml b/docs/examples/tf/tf_wandb_hydra/conf/config.yaml
new file mode 100644
index 0000000..2f31545
--- /dev/null
+++ b/docs/examples/tf/tf_wandb_hydra/conf/config.yaml
@@ -0,0 +1,24 @@
+data:
+  n_features: 784
+  n_classes: 10
+
+model:
+  input_shape: ${data.n_features}
+  output_num: ${data.n_classes}
+
+fit:
+  epochs: 5
+  batch_size: 64
+  validation_split: 0.2
+
+compile:
+  optimizer: rmsprop
+
+wandb:
+  project: jean-zay-doc
+  notes: "Hydra-wandb-submitit exp"
+  tags:
+    - hydra
+    - tuto
+  dir: "${oc.env:SCRATCH,.}/wandb/jean-zay-doc"
+  mode: null
\ No newline at end of file
diff --git a/docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml b/docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
new file mode 100644
index 0000000..9e13e0b
--- /dev/null
+++ b/docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
@@ -0,0 +1,19 @@
+defaults:
+  - submitit_slurm
+
+timeout_min: 60
+gpus_per_node: 1
+tasks_per_node: 1
+gres: "gpu:${hydra.launcher.gpus_per_node}"
+qos: qos_gpu-dev
+cpus_per_gpu: 10
+gpus_per_task: ${hydra.launcher.gpus_per_node}
+additional_parameters:
+  account: ${oc.env:IDRPROJ}@gpu
+  distribution: "block:block"
+  hint: nomultithread
+  time: "${hours}:00:00"
+setup:
+  - "#SBATCH -C v100-32g"  # this constraint is set here and not in additional_parameters
+  # because otherwise it would be difficult to remove at run time.
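+  # wandb must run offline because the compute nodes have no internet access (see the README)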
+  - "export WANDB_MODE=offline"
\ No newline at end of file
diff --git a/docs/examples/tf/tf_wandb_hydra/requirements.txt b/docs/examples/tf/tf_wandb_hydra/requirements.txt
new file mode 100644
index 0000000..b38d6a6
--- /dev/null
+++ b/docs/examples/tf/tf_wandb_hydra/requirements.txt
@@ -0,0 +1,3 @@
+hydra-core
+wandb
+hydra-submitit-launcher
\ No newline at end of file
diff --git a/docs/examples/tf/tf_wandb_hydra/train_mnist.py b/docs/examples/tf/tf_wandb_hydra/train_mnist.py
new file mode 100644
index 0000000..32c59e8
--- /dev/null
+++ b/docs/examples/tf/tf_wandb_hydra/train_mnist.py
@@ -0,0 +1,69 @@
+# all taken from https://www.tensorflow.org/guide/keras/functional
+from pathlib import Path
+
+import hydra
+from omegaconf import OmegaConf
+
+
+@hydra.main(config_path='conf', config_name='config')
+def train_dense_model_main(cfg):
+    return train_dense_model(cfg)
+
+
+def train_dense_model(cfg):
+    # keep the heavy imports inside the function so that the job launches quickly
+    # when using submitit
+    import tensorflow as tf
+    from tensorflow import keras
+    from tensorflow.keras import layers
+    import wandb
+    from wandb.keras import WandbCallback
+
+    def my_model(input_shape=784, output_num=10, activation='relu', hidden_size=64):
+        inputs = keras.Input(shape=input_shape)
+        x = layers.Dense(hidden_size, activation=activation)(inputs)
+        x = layers.Dense(hidden_size, activation=activation)(x)
+        outputs = layers.Dense(output_num)(x)
+        return keras.Model(inputs=inputs, outputs=outputs, name='mnist_model')
+
+    def model_compile(model, loss='xent', optimizer='rmsprop'):
+        if loss == 'xent':
+            loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        model.compile(loss=loss,
+                      optimizer=optimizer,
+                      metrics=['accuracy'])
+
+    def data(n_train=60_000, n_test=10_000, n_features=784, n_classes=10):
+        # the network is not reachable from the compute nodes,
+        # so we use random data for training and inference
+        x_train = tf.random.normal((n_train, n_features), dtype='float32')
+        x_test = tf.random.normal((n_test, n_features), dtype='float32')
+        y_train = tf.random.uniform((n_train,), minval=0, maxval=n_classes, dtype='int32')
+        y_test = tf.random.uniform((n_test,), minval=0, maxval=n_classes, dtype='int32')
+        return x_train, x_test, y_train, y_test
+
+    # wandb setup
+    Path(cfg.wandb.dir).mkdir(exist_ok=True, parents=True)
+    wandb.init(
+        config=OmegaConf.to_container(cfg, resolve=True),
+        **cfg.wandb,
+    )
+    callbacks = [
+        WandbCallback(monitor='loss', save_weights_only=True),
+    ]
+
+    # model building, training and evaluation
+    tf.keras.backend.clear_session()
+    model = my_model(**cfg.model)
+    model_compile(model, **cfg.compile)
+    x_train, x_test, y_train, y_test = data(**cfg.data)
+    model.fit(x_train, y_train, **cfg.fit, callbacks=callbacks)
+    test_scores = model.evaluate(x_test, y_test, verbose=2)
+    print('Test loss:', test_scores[0])
+    print('Test accuracy:', test_scores[1])
+    return True
+
+
+if __name__ == '__main__':
+    train_dense_model_main()
diff --git a/mkdocs.yml b/mkdocs.yml
index 266761d..41f3fdf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -73,3 +73,4 @@ nav:
       - Single node: examples/tf/tf_simple/README.md
       - Distributed with SlurmClusterResolver: examples/tf/tf_distributed/README.md
       - Distributed with Horovod: examples/tf/tf_mpi/README.md
+      - Weights&Biases and Hydra: examples/tf/tf_wandb_hydra/README.md
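
For reference, the "sync all" helper mentioned in the README's Weights&Biases section can be as simple as the sketch below. This is an illustration rather than the linked gist: the exact layout of the offline run directories under `$SCRATCH/wandb/jean-zay-doc` (the `wandb.dir` default from `conf/config.yaml`) is an assumption and may vary across `wandb` versions, hence the recursive glob.

```python
# sketch: sync every offline W&B run from a front node (assumes `wandb` is on PATH)
import os
import subprocess
from pathlib import Path

# default run location set by `wandb.dir` in conf/config.yaml
base_dir = Path(os.environ.get('SCRATCH', '.')) / 'wandb' / 'jean-zay-doc'

# offline runs are usually stored in folders named `offline-run-*`
for run_dir in sorted(base_dir.glob('**/offline-run-*')):
    print(f'Syncing {run_dir}')
    # `wandb sync <dir>` uploads a single offline run to the cloud
    subprocess.run(['wandb', 'sync', str(run_dir)], check=False)
```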