Added a Hydra+wandb+submitit-launcher example (#79)
* WIP started coding the wandb hydra example for JZ

* added a mnist example in the hydra style

* slight corrections in train mnist

* added configurations for hydra

* added wandb to example

* expanded readme for the example

* corrected config relative placement

* corrected project id

* corrected timeout min

* made the wandb imports lazy

* added more self promotion and potential correction to qos

* corrected typo on wandb in config

* added the functions in the script to keep the imports of tensorflow and wandb lazy

* made compile an integral part of the config

* corrected name of sync all package

* made sync all script a gist to have more stability

* removed need for project id and got it from the env

* removed default in project id

* added explanation on sbatch -c

* added an introduction and some alternatives to the packages introduced

* replaced my sub scripts (ill-suited) with the custom launcher I developed, listed in similar resources

* added spec that my plugin needs to specify wandb mode to offline

* specified that wandb requires an account

* added the example to the page tree

Co-authored-by: zaccharieramzi <[email protected]>
zaccharieramzi authored Jan 11, 2022
1 parent 3de69fe commit 912226b
Showing 6 changed files with 204 additions and 0 deletions.
88 changes: 88 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/README.md
@@ -0,0 +1,88 @@
# [Weights&Biases - Hydra](https://github.com/jean-zay-users/jean-zay-doc/tree/master/docs/examples/tf/tf_wandb_hydra)

Weights&Biases and Hydra are two tools commonly used in machine learning projects.
Weights&Biases lets you easily save a lot of information about your experiments in the cloud: metadata, system data, model weights and, of course, your metrics and logs.
Hydra is a configuration management tool that allows you to build command-line interfaces and write robust, readable configuration files.
These two tools combine elegantly, but setting them up on Jean Zay is not straightforward.
This example shows how to set up both tools on Jean Zay with a TensorFlow training script.

## Installation

To run this example, you need to clone the jean-zay-doc repo in your `$WORK` dir:
```
cd $WORK &&\
git clone https://github.com/jean-zay-users/jean-zay-doc.git
```

You can then install the requirements:
```
module purge
module load tensorflow-gpu/py3/2.6.0
pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt
```

## Run
In order to run the example on SLURM you can just issue the following command from the example directory:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1
```
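
If you just want a quick check of the script itself, you can also run it without the launcher, on a front-end node or your own machine. Below is a minimal sketch, assuming that `wandb.init` accepts `mode='disabled'` (which recent `wandb` versions do), so no W&B account is needed for this test:
```
python train_mnist.py wandb.mode=disabled fit.epochs=1
```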

### SLURM parametrization
Different parameters can be set for the SLURM job using the `hydra.launcher` config group.
For example, to launch a longer job you can use:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'
```

If you want to use more GPUs:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4
```
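
If you are unsure which launcher parameters can be overridden this way, Hydra can print the launcher config without submitting anything. This is a sketch, assuming a recent Hydra version where the `--cfg` and `--package` flags are available:
```
python train_mnist.py hydra/launcher=base +hours=1 --cfg hydra --package hydra.launcher
```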

### Weights&Biases
Using `wandb` requires a [Weights&Biases account](https://wandb.ai/).
`wandb` runs in offline mode because the compute nodes are not connected to the internet.
In order to have the results uploaded to the cloud, you need to sync them manually with the `wandb sync run_dir` command.
The run directories are located in `$SCRATCH/wandb/jean-zay-doc`, but this can be changed using the `wandb.dir` config variable.
You can also sync runs from a front-end node before they are finished, for example with the script available [here](https://gist.github.com/zaccharieramzi/3e1abc67aefac106ede2883c56ac8e1a).
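
Below is a minimal sketch of how such a sync could be scripted from a front-end node, assuming the default `wandb.dir` used in this example and the usual `wandb` layout where offline runs live in `wandb/offline-run-*` subdirectories:
```
for run_dir in $SCRATCH/wandb/jean-zay-doc/wandb/offline-run-*; do
    wandb sync "$run_dir"
done
```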

### Hydra and submitit outputs
The outputs created by Hydra and submitit are located in the `multirun` directory.
You can change this location by setting the `hydra.sweep.dir` config variable.
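For example, to group all multirun outputs of an experiment under a dedicated directory (a sketch, assuming a Hydra 1.1-style configuration where the multirun output directory is `hydra.sweep.dir`):
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 hydra.sweep.dir=$SCRATCH/hydra_outputs/mnist
```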

### Batch jobs
In order to batch multiple similar jobs, you can use Hydra's sweep feature.
For example, to run several trainings with different batch sizes:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128
```

This can be extended to a grid search over a Cartesian product of parameters, for example:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
```

## Similar resources

- [slurm-hydra-submitit](https://github.com/RaphaelMeudec/slurm-hydra-submitit) presents a similar setup in a more general form, for any SLURM cluster and without W&B. In particular, it shows [how to run a grid search over specific parameter combinations](https://github.com/RaphaelMeudec/slurm-hydra-submitit#specific-parameters-combinations).
- [jz-hydra-submitit-launcher](https://github.com/zaccharieramzi/jz-hydra-submitit-launcher) is a pip-installable (`pip install jz-hydra-submitit-launcher`) custom launcher with the correct defaults for Jean Zay and several preset configurations:
```
hydra-submitit-launch train_mnist.py dev hydra.launcher.setup=\["'#SBATCH -C v100-32g'","'export WANDB_MODE=offline'"\]
```


## References
- Weights&Biases: https://wandb.ai/site
- Hydra: https://hydra.cc/
- Submitit: https://github.com/facebookincubator/submitit
- Hydra submitit launcher: https://hydra.cc/docs/plugins/submitit_launcher/

## Alternatives

To Weights&Biases:
- MLflow
- TensorBoard

To Hydra:
- argparse
- click
24 changes: 24 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/conf/config.yaml
@@ -0,0 +1,24 @@
data:
  n_features: 784
  n_classes: 10

model:
  input_shape: ${data.n_features}
  output_num: ${data.n_classes}

fit:
  epochs: 5
  batch_size: 64
  validation_split: 0.2

compile:
  optimizer: rmsprop

wandb:
  project: jean-zay-doc
  notes: "Hydra-wandb-submitit exp"
  tags:
    - hydra
    - tuto
  dir: "${oc.env:SCRATCH,.}/wandb/jean-zay-doc"
  mode: null
19 changes: 19 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
@@ -0,0 +1,19 @@
defaults:
  - submitit_slurm

timeout_min: 60
gpus_per_node: 1
tasks_per_node: 1
gres: "gpu:${hydra.launcher.gpus_per_node}"
qos: qos_gpu-dev
cpus_per_gpu: 10
gpus_per_task: ${hydra.launcher.gpus_per_node}
additional_parameters:
  account: ${oc.env:IDRPROJ}@gpu
  distribution: "block:block"
  hint: nomultithread
  time: "${hours}:00:00"
setup:
  - "#SBATCH -C v100-32g"  # this setup is needed here and not in additional parameters
  # because otherwise it will be difficult to remove at run time.
  - "export WANDB_MODE=offline"
3 changes: 3 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/requirements.txt
@@ -0,0 +1,3 @@
hydra-core
wandb
hydra-submitit-launcher
69 changes: 69 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/train_mnist.py
@@ -0,0 +1,69 @@
# all taken from https://www.tensorflow.org/guide/keras/functional
from pathlib import Path

import hydra
from omegaconf import OmegaConf


@hydra.main(config_path='conf', config_name='config')
def train_dense_model_main(cfg):
    return train_dense_model(cfg)


def train_dense_model(cfg):
    # limit imports outside the call to the function, in order to launch quickly
    # when using submitit
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    import wandb
    from wandb.keras import WandbCallback

    def my_model(input_shape=784, output_num=10, activation='relu', hidden_size=64):
        inputs = keras.Input(shape=input_shape)
        x = layers.Dense(hidden_size, activation=activation)(inputs)
        x = layers.Dense(hidden_size, activation=activation)(x)
        outputs = layers.Dense(output_num)(x)
        return keras.Model(inputs=inputs, outputs=outputs, name='mnist_model')

    def model_compile(model, loss='xent', optimizer='rmsprop'):
        if loss == 'xent':
            loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=['accuracy'])

    def data(n_train=60_000, n_test=10_000, n_features=784, n_classes=10):
        # training and inference data
        # the network is not reachable from the compute nodes, so we use random data
        x_train = tf.random.normal((n_train, n_features), dtype='float32')
        x_test = tf.random.normal((n_test, n_features), dtype='float32')
        y_train = tf.random.uniform((n_train,), minval=0, maxval=n_classes, dtype='int32')
        y_test = tf.random.uniform((n_test,), minval=0, maxval=n_classes, dtype='int32')
        return x_train, x_test, y_train, y_test

    # wandb setup
    Path(cfg.wandb.dir).mkdir(exist_ok=True, parents=True)
    wandb.init(
        config=OmegaConf.to_container(cfg, resolve=True),
        **cfg.wandb,
    )
    callbacks = [
        WandbCallback(monitor='loss', save_weights_only=True),
    ]

    # model building
    tf.keras.backend.clear_session()
    model = my_model(**cfg.model)
    model_compile(model, **cfg.compile)
    x_train, x_test, y_train, y_test = data(**cfg.data)
    history = model.fit(x_train, y_train, **cfg.fit, callbacks=callbacks)
    test_scores = model.evaluate(x_test, y_test, verbose=2)
    print('Test loss:', test_scores[0])
    print('Test accuracy:', test_scores[1])
    return True


if __name__ == '__main__':
    train_dense_model_main()
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -73,3 +73,4 @@ nav:
- Single node: examples/tf/tf_simple/README.md
- Distributed with SlurmClusterResolver: examples/tf/tf_distributed/README.md
- Distributed with Horovod: examples/tf/tf_mpi/README.md
- Weights&Biases and Hydra: examples/tf/tf_wandb_hydra/README.md
