
Added a Hydra+wandb+submitit-launcher example #79

Merged on Jan 11, 2022. The diff below shows the changes from 22 of the 24 commits.

Commits
- 5aaae95 WIP started coding the wandb hydra example for JZ (Jan 6, 2022)
- ad57780 added a mnist example in the hydra style (Jan 6, 2022)
- 786a614 slight corrections in train mnist (Jan 6, 2022)
- ed9d2c7 added configurations for hydra (Jan 6, 2022)
- 1249c93 added wandb to example (Jan 6, 2022)
- a5a5c1a expanded readme for the example (Jan 6, 2022)
- a255edf corrected config relative placement (zaccharieramzi, Jan 7, 2022)
- 8e31a47 corrected project id (zaccharieramzi, Jan 7, 2022)
- 193e6c6 corrected timeout min (zaccharieramzi, Jan 7, 2022)
- 9b5ce42 made the wandb imports lazy (zaccharieramzi, Jan 7, 2022)
- 4f8a1d2 added more self promotion and potential correction to qos (zaccharieramzi, Jan 7, 2022)
- 349ef11 corrected typo on wandb in config (zaccharieramzi, Jan 7, 2022)
- 4859905 added the functions in the script to keep the imports of tensorflow a… (zaccharieramzi, Jan 7, 2022)
- 12a880c made compile an integral part of the config (zaccharieramzi, Jan 7, 2022)
- 987ed8c corrected name of sync all package (zaccharieramzi, Jan 7, 2022)
- 560e509 made sync all script a gist to have more stability (zaccharieramzi, Jan 7, 2022)
- 025662e removed need for project id and got it from the env (zaccharieramzi, Jan 7, 2022)
- 7927128 removed default in project id (zaccharieramzi, Jan 7, 2022)
- a23b913 added explanation on sbatch -c (zaccharieramzi, Jan 10, 2022)
- 7cd061c added an introduction and some alternatives to the packages introduced (zaccharieramzi, Jan 10, 2022)
- b4cde5a replaced my sub scripts (inadapted) with the custom launcher I develo… (zaccharieramzi, Jan 10, 2022)
- b0903c3 added spec that my plugin needs to specify wandb mode to offline (zaccharieramzi, Jan 10, 2022)
- fc38216 specified that wandb requires an account (Jan 11, 2022)
- 750322a added the example to the page tree (Jan 11, 2022)

87 changes: 87 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/README.md
@@ -0,0 +1,87 @@
# [Weights&Biases - Hydra](https://github.com/jean-zay-users/jean-zay-doc/tree/master/docs/examples/tf/tf_wandb_hydra)

Weights&Biases and Hydra are two tools commonly used in machine learning projects.
Weights&Biases lets you easily save a lot of information about your experiments in the cloud: metadata, system data, model weights and, of course, your metrics and logs.
Hydra is a configuration management tool that lets you build command-line interfaces and write robust, readable configuration files.
These two tools work together elegantly, but setting them up on Jean Zay is not straightforward.
In this example, we show how to set up both tools on Jean Zay with a TensorFlow training script.

## Installation

To run this example, you need to clone the jean-zay-doc repo in your `$WORK` dir:
```
cd $WORK &&\
git clone https://github.com/jean-zay-users/jean-zay-doc.git
```

You can then install the requirements:
```
module purge
module load tensorflow-gpu/py3/2.6.0
pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt
```

## Run
In order to run the example on SLURM, you can issue the following command from the example directory:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1
```

### SLURM parametrization
Different parameters can be set for the SLURM job using the `hydra.launcher` config group.
For example, to launch a longer job, you can use:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'
```

If you want to use more GPUs:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4
```

### Weights&Biases
`wandb` is run offline because the compute nodes are not connected to the internet.
In order to have the results uploaded to the cloud, you need to sync them manually using the `wandb sync run_dir` command.
The run directories are located in `$SCRATCH/wandb/jean-zay-doc`, but this can be changed using the `wandb.dir` config variable.
You can also run a script on a front node to sync the runs before they are finished, for example the script available [here](https://gist.github.com/zaccharieramzi/3e1abc67aefac106ede2883c56ac8e1a).
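As a minimal sketch of such a sync loop (assuming the default `wandb.dir` above, and that offline runs end up in `offline-run-*` sub-directories of a `wandb/` folder), you could run:

```
# sync every offline run found under the wandb directory (path assumed from the default config)
for run_dir in $SCRATCH/wandb/jean-zay-doc/wandb/offline-run-*; do
    wandb sync "$run_dir"
done
```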

### Hydra and submitit outputs
The outputs created by Hydra and submitit are located in the `multirun` directory.
You can change this location by setting the `hydra.dir` config variable.

### Batch jobs
In order to batch multiple similar jobs, you can use the sweep feature of Hydra.
For example, if you want to run multiple trainings with different batch sizes, you can do the following:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128
```

This can be extended to a grid search over a Cartesian product of parameters, for example:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
```

## Similar resources

- [slurm-hydra-submitit](https://github.com/RaphaelMeudec/slurm-hydra-submitit) presents a similar concept in a more general setting, for any SLURM cluster and without W&B. In particular, it shows [how to run a grid search over specific parameter combinations](https://github.com/RaphaelMeudec/slurm-hydra-submitit#specific-parameters-combinations).
- [jz-hydra-submitit-launcher](https://github.com/zaccharieramzi/jz-hydra-submitit-launcher) is a pip-installable (`pip install jz-hydra-submitit-launcher`) custom launcher that has the correct defaults for Jean Zay, as well as several preset configurations:
```
hydra-submitit-launch train_mnist.py dev hydra.launcher.setup=\["'#SBATCH -C v100-32g'","'export WANDB_MODE=offline'"\]
```


## References
- Weights&Biases: https://wandb.ai/site
- Hydra: https://hydra.cc/
- Submitit: https://github.com/facebookincubator/submitit
- Hydra submitit launcher: https://hydra.cc/docs/plugins/submitit_launcher/

## Alternatives

To Weights&Biases:

- MLflow
- TensorBoard

To Hydra:

- argparse
- click
24 changes: 24 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/conf/config.yaml
@@ -0,0 +1,24 @@
data:
  n_features: 784
  n_classes: 10

model:
  input_shape: ${data.n_features}
  output_num: ${data.n_classes}

fit:
  epochs: 5
  batch_size: 64
  validation_split: 0.2

compile:
  optimizer: rmsprop

wandb:
  project: jean-zay-doc
  notes: "Hydra-wandb-submitit exp"
  tags:
    - hydra
    - tuto
  dir: "${oc.env:SCRATCH,.}/wandb/jean-zay-doc"
  mode: null
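Any of these values can be overridden from the command line with Hydra's dotted syntax. As an illustration only (the override values below are examples, not part of the original configuration):

```
python train_mnist.py --multirun hydra/launcher=base +hours=1 \
    compile.optimizer=adam fit.epochs=10 \
    wandb.dir=$SCRATCH/wandb/jean-zay-doc
```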
19 changes: 19 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
@@ -0,0 +1,19 @@
defaults:
  - submitit_slurm

timeout_min: 60
gpus_per_node: 1
tasks_per_node: 1
gres: "gpu:${hydra.launcher.gpus_per_node}"
qos: qos_gpu-dev
cpus_per_gpu: 10
gpus_per_task: ${hydra.launcher.gpus_per_node}
additional_parameters:
  account: ${oc.env:IDRPROJ}@gpu
  distribution: "block:block"
  hint: nomultithread
  time: "${hours}:00:00"
setup:
  - "#SBATCH -C v100-32g" # this setup is needed here and not in additional parameters
  # because otherwise it will be difficult to remove at run time.
  - "export WANDB_MODE=offline"
3 changes: 3 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/requirements.txt
@@ -0,0 +1,3 @@
hydra-core
wandb
hydra-submitit-launcher
69 changes: 69 additions & 0 deletions docs/examples/tf/tf_wandb_hydra/train_mnist.py
@@ -0,0 +1,69 @@
# all taken from https://www.tensorflow.org/guide/keras/functional
from pathlib import Path

import hydra
from omegaconf import OmegaConf


@hydra.main(config_path='conf', config_name='config')
def train_dense_model_main(cfg):
    return train_dense_model(cfg)


def train_dense_model(cfg):
    # limit imports outside the call to the function, in order to launch quickly
    # when using dask
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    import wandb
    from wandb.keras import WandbCallback

    def my_model(input_shape=784, output_num=10, activation='relu', hidden_size=64):
        # simple 2-hidden-layer dense classifier
        inputs = keras.Input(shape=input_shape)
        x = layers.Dense(hidden_size, activation=activation)(inputs)
        x = layers.Dense(hidden_size, activation=activation)(x)
        outputs = layers.Dense(output_num)(x)
        return keras.Model(inputs=inputs, outputs=outputs, name='mnist_model')

    def model_compile(model, loss='xent', optimizer='rmsprop'):
        if loss == 'xent':
            loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=['accuracy'])

    def data(n_train=60_000, n_test=10_000, n_features=784, n_classes=10):
        # training and inference data
        # the compute nodes have no internet access, so we use random data
        # instead of downloading the real MNIST dataset
        x_train = tf.random.normal((n_train, n_features), dtype='float32')
        x_test = tf.random.normal((n_test, n_features), dtype='float32')
        y_train = tf.random.uniform((n_train,), minval=0, maxval=n_classes, dtype='int32')
        y_test = tf.random.uniform((n_test,), minval=0, maxval=n_classes, dtype='int32')
        return x_train, x_test, y_train, y_test

    # wandb setup
    Path(cfg.wandb.dir).mkdir(exist_ok=True, parents=True)
    wandb.init(
        config=OmegaConf.to_container(cfg, resolve=True),
        **cfg.wandb,
    )
    callbacks = [
        WandbCallback(monitor='loss', save_weights_only=True),
    ]

    # model building
    tf.keras.backend.clear_session()
    model = my_model(**cfg.model)
    model_compile(model, **cfg.compile)
    x_train, x_test, y_train, y_test = data(**cfg.data)
    history = model.fit(x_train, y_train, **cfg.fit, callbacks=callbacks)
    test_scores = model.evaluate(x_test, y_test, verbose=2)
    print('Test loss:', test_scores[0])
    print('Test accuracy:', test_scores[1])
    return True


if __name__ == '__main__':
    train_dense_model_main()
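For a quick local sanity check outside SLURM, the script can in principle also be run with Hydra's default launcher. A sketch, assuming a `wandb.mode=disabled` override is acceptable for `wandb.init` (this override is an assumption, not part of the original example):

```
python train_mnist.py fit.epochs=1 wandb.mode=disabled
```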