Added a Hydra+wandb+submitit-launcher example (#79)
* WIP started coding the wandb hydra example for JZ
* added a mnist example in the hydra style
* slight corrections in train mnist
* added configurations for hydra
* added wandb to example
* expanded readme for the example
* corrected config relative placement
* corrected project id
* corrected timeout min
* made the wandb imports lazy
* added more self promotion and potential correction to qos
* corrected typo on wandb in config
* added the functions in the script to keep the imports of tensorflow and wandb lazy
* made compile an integral part of the config
* corrected name of sync all package
* made sync all script a gist to have more stability
* removed need for project id and got it from the env
* removed default in project id
* added explanation on sbatch -c
* added an introduction and some alternatives to the packages introduced
* replaced my sub scripts (inadapted) with the custom launcher I developped in similar resources
* added spec that my plugin needs to specify wandb mode to offline
* specified that wandb requires an account
* added the example to the page tree

Co-authored-by: zaccharieramzi <[email protected]>
1 parent 3de69fe, commit 912226b
Showing 6 changed files with 204 additions and 0 deletions.
@@ -0,0 +1,88 @@
# [Weights&Biases - Hydra](https://github.com/jean-zay-users/jean-zay-doc/tree/master/docs/examples/tf/tf_wandb_hydra)

Weights&Biases and Hydra are two tools used in machine learning projects.
Weights&Biases lets you save information about your experiments in the cloud: metadata, system data, model weights, and of course your metrics and logs.
Hydra is a configuration management tool that lets you build command-line interfaces and write robust, readable configuration files.
The two tools work well together, but setting them up on Jean Zay is not straightforward.
This example shows how to set up both tools on Jean Zay for a TensorFlow training script.

## Installation

To run this example, clone the jean-zay-doc repository into your `$WORK` directory:
```
cd $WORK &&\
git clone https://github.com/jean-zay-users/jean-zay-doc.git
```

You can then install the requirements:
```
module purge
module load tensorflow-gpu/py3/2.6.0
pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt
```

## Run
To run the example on SLURM, issue the following command from the example directory:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1
```

### SLURM parametrization
The SLURM job parameters can be set through the `hydra.launcher` config group.
For example, to launch a longer job:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'
```

To use more GPUs:
```
python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4
```
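
The launch commands in this section follow a regular pattern, so they can also be assembled programmatically, for instance from a wrapper script. The following is a minimal sketch: `build_launch_command` is a hypothetical helper that is not part of the example, and it only formats the override strings already shown above.

```python
import shlex


def build_launch_command(hours, qos=None, gpus_per_node=None,
                         script="train_mnist.py", launcher="base"):
    """Assemble the Hydra multirun command line as a list of arguments.

    Hypothetical helper: it only formats the overrides that the
    commands in this section pass by hand.
    """
    args = [
        "python", script, "--multirun",
        f"hydra/launcher={launcher}",
        # The leading `+` adds `hours`, a key that config.yaml does not define
        # but that the launcher config reads through `${hours}`.
        f"+hours={hours}",
    ]
    if qos is not None:
        args.append(f"hydra.launcher.qos={qos}")
    if gpus_per_node is not None:
        args.append(f"hydra.launcher.gpus_per_node={gpus_per_node}")
    return args


if __name__ == "__main__":
    cmd = build_launch_command(hours=10, qos="qos_gpu-t3", gpus_per_node=4)
    # shlex.join quotes arguments where the shell would need it.
    print(shlex.join(cmd))
```

Returning a list instead of a single string keeps the arguments separate, so the result can be passed directly to `subprocess.run` without extra shell quoting.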

### Weights&Biases
Using Weights&Biases requires you to create a [Weights&Biases account](https://wandb.ai/).
`wandb` runs in offline mode because the compute nodes are not connected to the internet.
To upload the results to the cloud, you need to sync them manually with the `wandb sync run_dir` command.
The run directories are located in `$SCRATCH/wandb/jean-zay-doc`; this can be changed with the `wandb.dir` config variable.
You can also run a script on a front-end node to sync the runs before they are finished, for example the script [here](https://gist.github.com/zaccharieramzi/3e1abc67aefac106ede2883c56ac8e1a).
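
The manual sync step can be scripted. The following is a minimal sketch, not the gist linked above. It assumes that offline runs are stored in directories named `offline-run-*` somewhere below the `wandb.dir` location, which is the usual wandb layout but should be checked against your own run directory. The helper names are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def find_offline_runs(root):
    """Return the offline run directories found anywhere below `root`.

    Assumption: offline runs live in directories named `offline-run-*`.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("offline-run-*") if p.is_dir())


def sync_commands(root):
    """Build one `wandb sync <run_dir>` command per offline run."""
    return [["wandb", "sync", str(run_dir)] for run_dir in find_offline_runs(root)]


def sync_all(root, dry_run=True):
    """Print each command, and execute it unless `dry_run` is set."""
    for cmd in sync_commands(root):
        print(" ".join(cmd))
        if not dry_run:
            # Needs internet access, so run this on a front-end node.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    scratch = os.environ.get("SCRATCH", ".")
    sync_all(Path(scratch) / "wandb" / "jean-zay-doc", dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a convenient way to check which runs would be uploaded before actually syncing.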

### Hydra and submitit outputs
The outputs created by Hydra and submitit are located in the `multirun` directory.
You can change this location by setting the `hydra.sweep.dir` config variable.
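
To find out which job wrote which output directory, the overrides that Hydra records for each job can be read back. The following is a minimal sketch assuming Hydra's default layout, in which every job directory below `multirun` contains a `.hydra/overrides.yaml` file with one `- key=value` entry per line. `list_job_overrides` is a hypothetical helper, and the layout may differ if the Hydra output settings were customized.

```python
from pathlib import Path


def list_job_overrides(sweep_root="multirun"):
    """Map each job directory to the overrides Hydra recorded for it.

    Assumption: each job directory holds `.hydra/overrides.yaml`,
    written as a simple YAML list (`- key=value` per line).
    """
    jobs = {}
    root = Path(sweep_root)
    if not root.is_dir():
        return jobs
    for overrides_file in sorted(root.rglob("overrides.yaml")):
        if overrides_file.parent.name != ".hydra":
            continue  # unrelated file that happens to share the name
        job_dir = overrides_file.parent.parent
        lines = overrides_file.read_text().splitlines()
        jobs[str(job_dir)] = [line[2:].strip() for line in lines
                              if line.startswith("- ")]
    return jobs


if __name__ == "__main__":
    for job_dir, overrides in list_job_overrides().items():
        print(job_dir, overrides)
```

The file is parsed line by line to avoid a YAML dependency; for anything beyond flat `key=value` overrides, loading it with OmegaConf would be more robust.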

### Batch jobs
To batch multiple similar jobs, you can use the sweep feature of Hydra.
For example, to run several trainings with different batch sizes:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128
```

This extends to a grid search over the Cartesian product of several parameters, for example:
```
python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
```
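
To get a feel for how many jobs a sweep submits, the Cartesian product can be enumerated by hand. The following is a minimal sketch that mimics the expansion with `itertools.product`. `expand_sweep` is a hypothetical helper, it does not call Hydra, and the order in which Hydra numbers the jobs is not guaranteed to match.

```python
from itertools import product


def expand_sweep(sweep):
    """Expand comma-separated sweep values into one override list per job."""
    keys = list(sweep)
    value_lists = [str(sweep[key]).split(",") for key in keys]
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*value_lists)
    ]


if __name__ == "__main__":
    jobs = expand_sweep({
        "fit.batch_size": "32,64,128",
        "compile.optimizer": "rmsprop,adam",
    })
    for job_id, overrides in enumerate(jobs):
        print(job_id, " ".join(overrides))
```

For the second command above, this yields 3 × 2 = 6 combinations, so six SLURM jobs would be submitted.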

## Similar resources

- [slurm-hydra-submitit](https://github.com/RaphaelMeudec/slurm-hydra-submitit) presents a similar concept in the more general case of any SLURM cluster, without W&B. In particular, it explains [how to run a grid search over specific parameter combinations](https://github.com/RaphaelMeudec/slurm-hydra-submitit#specific-parameters-combinations).
- [jz-hydra-submitit-launcher](https://github.com/zaccharieramzi/jz-hydra-submitit-launcher) is a pip-installable (`pip install jz-hydra-submitit-launcher`) custom launcher that has the correct defaults for JZ and several default configurations:
```
hydra-submitit-launch train_mnist.py dev hydra.launcher.setup=\["'#SBATCH -C v100-32g'","'export WANDB_MODE=offline'"\]
```

## References
- Weights&Biases: https://wandb.ai/site
- Hydra: https://hydra.cc/
- Submitit: https://github.com/facebookincubator/submitit
- Hydra submitit launcher: https://hydra.cc/docs/plugins/submitit_launcher/

## Alternatives

To Weights&Biases:
- MLflow
- TensorBoard

To Hydra:
- argparse
- click
@@ -0,0 +1,24 @@
data:
  n_features: 784
  n_classes: 10

model:
  input_shape: ${data.n_features}
  output_num: ${data.n_classes}

fit:
  epochs: 5
  batch_size: 64
  validation_split: 0.2

compile:
  optimizer: rmsprop

wandb:
  project: jean-zay-doc
  notes: "Hydra-wandb-submitit exp"
  tags:
    - hydra
    - tuto
  dir: "${oc.env:SCRATCH,.}/wandb/jean-zay-doc"
  mode: null
docs/examples/tf/tf_wandb_hydra/conf/hydra/launcher/base.yaml
19 changes: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
defaults:
  - submitit_slurm

timeout_min: 60
gpus_per_node: 1
tasks_per_node: 1
gres: "gpu:${hydra.launcher.gpus_per_node}"
qos: qos_gpu-dev
cpus_per_gpu: 10
gpus_per_task: ${hydra.launcher.gpus_per_node}
additional_parameters:
  account: ${oc.env:IDRPROJ}@gpu
  distribution: "block:block"
  hint: nomultithread
  time: "${hours}:00:00"
setup:
  - "#SBATCH -C v100-32g"  # this setup is needed here and not in additional parameters
                           # because otherwise it will be difficult to remove at run time.
  - "export WANDB_MODE=offline"
@@ -0,0 +1,3 @@
hydra-core
wandb
hydra-submitit-launcher
@@ -0,0 +1,69 @@
# all taken from https://www.tensorflow.org/guide/keras/functional
from pathlib import Path

import hydra
from omegaconf import OmegaConf


@hydra.main(config_path='conf', config_name='config')
def train_dense_model_main(cfg):
    return train_dense_model(cfg)


def train_dense_model(cfg):
    # limit imports outside the call to the function, in order to launch quickly
    # when using dask
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    import wandb
    from wandb.keras import WandbCallback


    def my_model(input_shape=784, output_num=10, activation='relu', hidden_size=64):
        inputs = keras.Input(shape=input_shape)
        x = layers.Dense(hidden_size, activation=activation)(inputs)
        x = layers.Dense(hidden_size, activation=activation)(x)
        outputs = layers.Dense(output_num)(x)
        return keras.Model(inputs=inputs, outputs=outputs, name='mnist_model')

    def model_compile(model, loss='xent', optimizer='rmsprop'):
        if loss == 'xent':
            loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=['accuracy'])

    def data(n_train=60_000, n_test=10_000, n_features=784, n_classes=10):
        # training and inference
        # network is not reachable, so we use random data
        x_train = tf.random.normal((n_train, n_features), dtype='float32')
        x_test = tf.random.normal((n_test, n_features), dtype='float32')
        y_train = tf.random.uniform((n_train,), minval=0, maxval=n_classes, dtype='int32')
        y_test = tf.random.uniform((n_test,), minval=0, maxval=n_classes, dtype='int32')
        return x_train, x_test, y_train, y_test


    # wandb setup
    Path(cfg.wandb.dir).mkdir(exist_ok=True, parents=True)
    wandb.init(
        config=OmegaConf.to_container(cfg, resolve=True),
        **cfg.wandb,
    )
    callbacks = [
        WandbCallback(monitor='loss', save_weights_only=True),
    ]

    # model building
    tf.keras.backend.clear_session()
    model = my_model(**cfg.model)
    model_compile(model, **cfg.compile)
    x_train, x_test, y_train, y_test = data(**cfg.data)
    history = model.fit(x_train, y_train, **cfg.fit, callbacks=callbacks)
    test_scores = model.evaluate(x_test, y_test, verbose=2)
    print('Test loss:', test_scores[0])
    print('Test accuracy:', test_scores[1])
    return True

if __name__ == '__main__':
    train_dense_model_main()