From c0c71818b47f23876a09cdfd2a78d8beb21c64db Mon Sep 17 00:00:00 2001
From: Kilian
Date: Tue, 31 Oct 2023 15:33:02 -0400
Subject: [PATCH] update README in runner presenting V0

---
 runner/README.md | 395 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 395 insertions(+)

diff --git a/runner/README.md b/runner/README.md
index e69de29..eaa336e 100644
--- a/runner/README.md
+++ b/runner/README.md
@@ -0,0 +1,395 @@

In abstracting out the relevant losses from our `pytorch-lightning` implementation, we have moved all of the `pytorch-lightning` code to `runner`. Every command that previously worked for Lightning should now be run from within the `runner` directory. V0 is now frozen in the `V0` branch.

Train a model with the default configuration:

```bash
cd runner

# train on CPU
python src/train.py trainer=cpu

# train on GPU
python src/train.py trainer=gpu
```

Train a model with a chosen experiment configuration from [configs/experiment/](configs/experiment/):

```bash
python src/train.py experiment=experiment_name
```

You can override any parameter from the command line like this:

```bash
python src/train.py trainer.max_epochs=20 datamodule.batch_size=64
```

You can also train a large set of models in parallel with SLURM, as shown in `scripts/two-dim-cfm.sh`, which trains the models used in the first 3 lines of Table 2.


## Code Contributions

This repo was extracted from a larger private codebase, which means the original commit history, including work from the other authors of the papers, has been lost.


## Project Structure

The directory structure of the project looks like this:

```
│
├── runner                   <- Lightning + Hydra training code
│   ├── data                 <- Project data
│   ├── logs                 <- Logs generated by hydra and lightning loggers
│   ├── scripts              <- Shell scripts
│   ├── configs              <- Hydra configuration files
│   │   ├── callbacks        <- Callbacks configs
│   │   ├── debug            <- Debugging configs
│   │   ├── datamodule       <- Datamodule configs
│   │   ├── experiment       <- Experiment configs
│   │   ├── extras           <- Extra utilities configs
│   │   ├── hparams_search   <- Hyperparameter search configs
│   │   ├── hydra            <- Hydra configs
│   │   ├── launcher         <- Hydra launcher configs
│   │   ├── local            <- Local configs
│   │   ├── logger           <- Logger configs
│   │   ├── model            <- Model configs
│   │   ├── paths            <- Project paths configs
│   │   ├── trainer          <- Trainer configs
│   │   │
│   │   ├── eval.yaml        <- Main config for evaluation
│   │   └── train.yaml       <- Main config for training
│   ├── src                  <- Source code
│   │   ├── datamodules      <- Lightning datamodules
│   │   ├── models           <- Lightning models
│   │   ├── utils            <- Utility scripts
│   │   │
│   │   ├── eval.py          <- Run evaluation
│   │   └── train.py         <- Run training
│   │
│   ├── tests                <- Tests of any kind
│   └── README.md
```

## ⚡  Your Superpowers
Override any config parameter from the command line

```bash
python train.py trainer.max_epochs=20 model.optimizer.lr=1e-4
```

> **Note**: You can also add new parameters with the `+` sign.

```bash
python train.py +model.new_param="owo"
```
+ +
Train on CPU, GPU, multi-GPU and TPU

```bash
# train on CPU
python train.py trainer=cpu

# train on 1 GPU
python train.py trainer=gpu

# train on TPU
python train.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python train.py trainer=ddp_sim trainer.devices=2

# accelerate training on Mac (MPS)
python train.py trainer=mps
```

> **Warning**: DDP mode currently has known problems; read [this issue](https://github.com/ashleve/lightning-hydra-template/issues/393) to learn more.
+ +
Train with mixed precision

```bash
# train with PyTorch native automatic mixed precision (AMP)
python train.py trainer=gpu +trainer.precision=16
```
+ + + +
Train model with any logger available in PyTorch Lightning, like W&B or TensorBoard

```yaml
# set project and entity names in `configs/logger/wandb`
wandb:
  project: "your_project_name"
  entity: "your_wandb_team_name"
```

```bash
# train model with Weights&Biases (a link to the wandb dashboard should appear in the terminal)
python train.py logger=wandb
```

> **Note**: Lightning provides convenient integrations with most popular logging frameworks; see the Lightning docs for the full list.

> **Note**: Using wandb requires you to [set up an account](https://www.wandb.com/) first. After that, just complete the config as shown above.

> **Note**: Click [here](https://wandb.ai/hobglob/template-dashboard/) to see an example wandb dashboard generated with this template.
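For context, the logger that Hydra instantiates for W&B lives in a small YAML file under `configs/logger/`. The sketch below shows the typical shape of such a file; the field values are illustrative and may differ from the actual `configs/logger/wandb.yaml`:

```yaml
# Illustrative sketch of a W&B logger config (not the exact configs/logger/wandb.yaml)
wandb:
  _target_: pytorch_lightning.loggers.wandb.WandbLogger
  save_dir: "${paths.output_dir}"  # assumes the `paths` config group from the tree above
  project: "your_project_name"
  entity: "your_wandb_team_name"
  log_model: False                 # set True to upload checkpoints to W&B
```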
+ +
+Train model with chosen experiment config + +```bash +python train.py experiment=example +``` + +> **Note**: Experiment configs are placed in [configs/experiment/](configs/experiment/). + +
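An experiment config bundles a set of overrides into a single named file. The sketch below shows the usual structure of such a file; the config group names and values are placeholders rather than the actual contents of `configs/experiment/`:

```yaml
# Illustrative sketch of an experiment config (names and values are placeholders)
# @package _global_

# swap entire config groups relative to the defaults chosen in train.yaml
defaults:
  - override /datamodule: mnist.yaml
  - override /model: mnist.yaml
  - override /trainer: default.yaml

# then override individual parameters
tags: ["mnist", "simple_dense_net"]
seed: 12345

trainer:
  max_epochs: 10
  gradient_clip_val: 0.5

model:
  optimizer:
    lr: 0.002

datamodule:
  batch_size: 64
```

The `# @package _global_` directive is what lets the file override keys at the top level of the composed config instead of nesting them under `experiment`.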
+ +
Attach some callbacks to run

```bash
python train.py callbacks=default
```

> **Note**: Callbacks can be used for things such as model checkpointing, early stopping and [many more](https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html#built-in-callbacks).

> **Note**: Callbacks configs are placed in [configs/callbacks/](configs/callbacks/).
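A callbacks config is just a dictionary of Lightning callback objects for Hydra to instantiate. A minimal sketch, assuming the usual checkpoint/early-stopping pair and a project-specific `val/acc` metric (monitor names and paths here are assumptions):

```yaml
# Illustrative sketch of a callbacks config (monitor names and paths are assumptions)
model_checkpoint:
  _target_: pytorch_lightning.callbacks.ModelCheckpoint
  dirpath: ${paths.output_dir}/checkpoints
  monitor: "val/acc"
  mode: "max"
  save_last: True

early_stopping:
  _target_: pytorch_lightning.callbacks.EarlyStopping
  monitor: "val/acc"
  patience: 10
  mode: "max"
```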
+ +
Use different tricks available in PyTorch Lightning

```bash
# gradient clipping may be enabled to avoid exploding gradients
python train.py +trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python train.py +trainer.val_check_interval=0.25

# accumulate gradients
python train.py +trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python train.py +trainer.max_time="00:12:00:00"
```

> **Note**: PyTorch Lightning provides more than [40 useful trainer flags](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#trainer-flags).
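If you rely on some of these flags for every run, they can also live in a trainer config instead of being passed on the command line each time. A minimal sketch with illustrative values (not the actual contents of `configs/trainer/`):

```yaml
# Illustrative sketch of a trainer config carrying the same flags permanently
_target_: pytorch_lightning.Trainer
max_epochs: 100
gradient_clip_val: 0.5
accumulate_grad_batches: 10
val_check_interval: 0.25
max_time: "00:12:00:00"
```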
+ +
Easily debug

```bash
# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python train.py debug=default

# run 1 train, val and test loop, using only 1 batch
python train.py debug=fdr

# print execution time profiling
python train.py debug=profiler

# try overfitting to 1 batch
python train.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python train.py +trainer.detect_anomaly=true

# track the 2-norm of the model's gradients
python train.py +trainer.track_grad_norm=2

# use only 20% of the data
python train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2
```

> **Note**: Visit [configs/debug/](configs/debug/) for different debugging configs.
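Under the hood, a debug config is an ordinary Hydra config applied at the global level that forces a short, verbose run. The sketch below illustrates the kind of overrides such a config applies; it is not the exact contents of `configs/debug/default.yaml`:

```yaml
# Illustrative sketch of a debug config (not the exact file)
# @package _global_

trainer:
  max_epochs: 1
  accelerator: cpu
  devices: 1
  detect_anomaly: true       # raise on NaN/inf in tensors
  limit_train_batches: 0.05  # touch only a small fraction of the data
  limit_val_batches: 0.05
  limit_test_batches: 0.05
```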
+ +
Resume training from checkpoint

```bash
python train.py ckpt_path="/path/to/ckpt/name.ckpt"
```

> **Note**: The checkpoint can be either a path or a URL.

> **Note**: Currently loading a ckpt doesn't resume the logger experiment, but it will be supported in a future Lightning release.
+ +
Evaluate checkpoint on test dataset

```bash
python eval.py ckpt_path="/path/to/ckpt/name.ckpt"
```

> **Note**: The checkpoint can be either a path or a URL.
+ +
+Create a sweep over hyperparameters + +```bash +# this will run 6 experiments one after the other, +# each with different combination of batch_size and learning rate +python train.py -m datamodule.batch_size=32,64,128 model.lr=0.001,0.0005 +``` + +> **Note**: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted. + +
+ +
Create a sweep over hyperparameters with Optuna

```bash
# this will run the hyperparameter search defined in `configs/hparams_search/mnist_optuna.yaml`
# over the chosen experiment config
python train.py -m hparams_search=mnist_optuna experiment=example
```

> **Note**: Using [Optuna Sweeper](https://hydra.cc/docs/next/plugins/optuna_sweeper) doesn't require you to add any boilerplate to your code; everything is defined in a [single config file](configs/hparams_search/mnist_optuna.yaml).

> **Warning**: Optuna sweeps are not failure-resistant (if one job crashes, the whole sweep crashes).
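For reference, a Hydra/Optuna search config generally swaps in the Optuna sweeper via the defaults list and declares the search space under `hydra.sweeper.params`. The sketch below shows that shape; the metric name, ranges and trial count are illustrative, not the actual contents of `configs/hparams_search/mnist_optuna.yaml`:

```yaml
# Illustrative sketch of an Optuna hparams-search config (not the exact file)
# @package _global_

defaults:
  - override /hydra/sweeper: optuna

# name of the logged metric to optimize (assumed to be returned by train.py)
optimized_metric: "val/acc_best"

hydra:
  mode: "MULTIRUN"
  sweeper:
    direction: maximize
    n_trials: 20
    params:
      model.optimizer.lr: interval(0.0001, 0.1)
      datamodule.batch_size: choice(32, 64, 128)
```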
+ +
+Execute all experiments from folder + +```bash +python train.py -m 'experiment=glob(*)' +``` + +> **Note**: Hydra provides special syntax for controlling behavior of multiruns. Learn more [here](https://hydra.cc/docs/next/tutorials/basic/running_your_app/multi-run). The command above executes all experiments from [configs/experiment/](configs/experiment/). + +
+ +
Execute run for multiple different seeds

```bash
python train.py -m seed=1,2,3,4,5 trainer.deterministic=True logger=csv tags=["benchmark"]
```

> **Note**: `trainer.deterministic=True` makes PyTorch more deterministic but impacts performance.
+ +
Execute sweep on a remote AWS cluster

> **Note**: This should be achievable with a simple config using the [Ray AWS launcher for Hydra](https://hydra.cc/docs/next/plugins/ray_launcher). An example is not implemented in this template.
+ + + +
Use Hydra tab completion

> **Note**: Hydra allows you to autocomplete config argument overrides in the shell as you write them, by pressing the `tab` key. Read the [docs](https://hydra.cc/docs/tutorials/basic/running_your_app/tab_completion).
+ +
Apply pre-commit hooks

```bash
pre-commit run -a
```

> **Note**: Apply pre-commit hooks to do things like auto-formatting code and configs, performing code analysis or removing output from Jupyter notebooks. See the Best Practices section of the upstream lightning-hydra-template README for more.
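The hooks themselves are declared in a `.pre-commit-config.yaml` at the repository root. A minimal sketch; the actual hook list and pinned versions in this repo may differ:

```yaml
# Illustrative sketch of .pre-commit-config.yaml (hook set and versions are assumptions)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
  - repo: https://github.com/psf/black
    rev: 23.1.0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout  # removes output from Jupyter notebooks
```

Running `pre-commit install` once makes these hooks run automatically on every commit, while `pre-commit run -a` applies them to the whole repo.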
+ +
+Run tests + +```bash +# run all tests +pytest + +# run tests from specific file +pytest tests/test_train.py + +# run all tests except the ones marked as slow +pytest -k "not slow" +``` + +
+ +
Use tags

Each experiment should be tagged so that you can easily filter experiments across files or in the logger UI:

```bash
python train.py tags=["mnist","experiment_X"]
```

If no tags are provided, you will be asked to input them from the command line:

```bash
>>> python train.py tags=[]
[2022-07-11 15:40:09,358][src.utils.utils][INFO] - Enforcing tags!
[2022-07-11 15:40:09,359][src.utils.rich_utils][WARNING] - No tags provided in config. Prompting user to input tags...
Enter a list of comma separated tags (dev):
```

If no tags are provided for a multirun, an error will be raised:

```bash
>>> python train.py -m +x=1,2,3 tags=[]
ValueError: Specify tags before launching a multirun!
```

> **Note**: Appending lists from the command line is currently not supported in Hydra :(
+ +
\ No newline at end of file