In abstracting the relevant losses out of our `pytorch-lightning` implementation, we have moved all of the `pytorch-lightning` code to `runner`. Every lightning command that worked before should now be run from within the `runner` directory. V0 is now frozen in the `V0` branch.

Train model with default configuration

```bash
cd runner

# train on CPU
python src/train.py trainer=cpu

# train on GPU
python src/train.py trainer=gpu
```

Train model with chosen experiment configuration from [configs/experiment/](configs/experiment/)

```bash
python src/train.py experiment=experiment_name
```

You can override any parameter from the command line like this

```bash
python src/train.py trainer.max_epochs=20 datamodule.batch_size=64
```

You can also train a large set of models in parallel with SLURM, as shown in `scripts/two-dim-cfm.sh`, which trains the models used in the first three rows of Table 2. A rough sketch of this pattern follows.
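The script itself is not reproduced here; as an illustration only, a SLURM array job that launches one experiment config per task might look like the following sketch. The job name, resource flags, and experiment names (`otcfm`, `icfm`, `fm`) are placeholders, not the actual contents of `scripts/two-dim-cfm.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=two-dim-cfm
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --array=0-2

# Placeholder experiment names -- substitute the real configs from configs/experiment/.
EXPERIMENTS=(otcfm icfm fm)

cd runner
python src/train.py experiment="${EXPERIMENTS[$SLURM_ARRAY_TASK_ID]}"
```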


## Code Contributions

This repo is extracted from a larger private codebase, which loses the original commit history containing work from other authors of the papers.


## Project Structure

The directory structure of the project looks like this:

```
runner
├── data                   <- Project data
├── logs                   <- Logs generated by hydra and lightning loggers
├── scripts                <- Shell scripts
├── configs                <- Hydra configuration files
│   ├── callbacks          <- Callbacks configs
│   ├── debug              <- Debugging configs
│   ├── datamodule         <- Datamodule configs
│   ├── experiment         <- Experiment configs
│   ├── extras             <- Extra utilities configs
│   ├── hparams_search     <- Hyperparameter search configs
│   ├── hydra              <- Hydra configs
│   ├── launcher           <- Hydra launcher configs
│   ├── local              <- Local configs
│   ├── logger             <- Logger configs
│   ├── model              <- Model configs
│   ├── paths              <- Project paths configs
│   ├── trainer            <- Trainer configs
│   │
│   ├── eval.yaml          <- Main config for evaluation
│   └── train.yaml         <- Main config for training
├── src                    <- Source code
│   ├── datamodules        <- Lightning datamodules
│   ├── models             <- Lightning models
│   ├── utils              <- Utility scripts
│   │
│   ├── eval.py            <- Run evaluation
│   └── train.py           <- Run training
│
├── tests                  <- Tests of any kind
└── README.md
```

## ⚡  Your Superpowers

<details>
<summary><b>Override any config parameter from command line</b></summary>

```bash
python train.py trainer.max_epochs=20 model.optimizer.lr=1e-4
```

> **Note**: You can also add new parameters with the `+` sign.
```bash
python train.py +model.new_param="owo"
```

</details>

<details>
<summary><b>Train on CPU, GPU, multi-GPU and TPU</b></summary>

```bash
# train on CPU
python train.py trainer=cpu

# train on 1 GPU
python train.py trainer=gpu

# train on TPU
python train.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (2 nodes, 4 GPUs each)
python train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python train.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python train.py trainer=mps
```

> **Warning**: Currently there are problems with DDP mode, read [this issue](https://github.com/ashleve/lightning-hydra-template/issues/393) to learn more.
</details>

<details>
<summary><b>Train with mixed precision</b></summary>

```bash
# train with pytorch native automatic mixed precision (AMP)
python train.py trainer=gpu +trainer.precision=16
```

</details>

<!-- deepspeed support still in beta
<details>
<summary><b>Optimize large scale models on multiple GPUs with Deepspeed</b></summary>
```bash
python train.py +trainer.
```
</details>
-->

<details>
<summary><b>Train model with any logger available in PyTorch Lightning, like W&B or Tensorboard</b></summary>

```yaml
# set project and entity names in `configs/logger/wandb`
wandb:
  project: "your_project_name"
  entity: "your_wandb_team_name"
```
```bash
# train model with Weights&Biases (link to wandb dashboard should appear in the terminal)
python train.py logger=wandb
```

> **Note**: Lightning provides convenient integrations with most popular logging frameworks. Learn more [here](#experiment-tracking).
> **Note**: Using wandb requires you to [set up an account](https://www.wandb.com/) first. After that, just complete the config as above.
> **Note**: Click [here](https://wandb.ai/hobglob/template-dashboard/) to see example wandb dashboard generated with this template.
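
If you have not authenticated on the machine before, the standard `wandb` CLI login (independent of this template) is:

```bash
# paste your API key from https://wandb.ai/authorize when prompted
wandb login
```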
</details>

<details>
<summary><b>Train model with chosen experiment config</b></summary>

```bash
python train.py experiment=example
```

> **Note**: Experiment configs are placed in [configs/experiment/](configs/experiment/).
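
As a rough illustration of what such a config typically contains (the group names and values below are placeholders, not a file from this repo), an experiment config overrides pieces of the main `train.yaml`:

```yaml
# @package _global_
# hypothetical configs/experiment/example.yaml

defaults:
  - override /datamodule: mnist
  - override /model: mnist
  - override /trainer: default

tags: ["mnist", "example"]
seed: 12345

trainer:
  max_epochs: 10

datamodule:
  batch_size: 64
```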
</details>

<details>
<summary><b>Attach some callbacks to run</b></summary>

```bash
python train.py callbacks=default
```

> **Note**: Callbacks can be used for things such as model checkpointing, early stopping, and [many more](https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html#built-in-callbacks).
> **Note**: Callbacks configs are placed in [configs/callbacks/](configs/callbacks/).
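
As a sketch of the usual pattern (assuming the standard Lightning callbacks and a `val/loss` metric; not the exact contents of `configs/callbacks/default.yaml`):

```yaml
# hypothetical configs/callbacks/default.yaml
model_checkpoint:
  _target_: pytorch_lightning.callbacks.ModelCheckpoint
  monitor: "val/loss"   # metric name is an assumption
  mode: "min"
  save_top_k: 1

early_stopping:
  _target_: pytorch_lightning.callbacks.EarlyStopping
  monitor: "val/loss"
  patience: 3
  mode: "min"
```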
</details>

<details>
<summary><b>Use different tricks available in Pytorch Lightning</b></summary>

```bash
# gradient clipping may be enabled to avoid exploding gradients
python train.py +trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python train.py +trainer.val_check_interval=0.25

# accumulate gradients
python train.py +trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python train.py +trainer.max_time="00:12:00:00"
```

> **Note**: PyTorch Lightning provides about [40+ useful trainer flags](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#trainer-flags).
</details>

<details>
<summary><b>Easily debug</b></summary>

```bash
# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python train.py debug=default

# run 1 train, val and test loop, using only 1 batch
python train.py debug=fdr

# print execution time profiling
python train.py debug=profiler

# try overfitting to 1 batch
python train.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python train.py +trainer.detect_anomaly=true

# log the 2-norm of the model's gradients
python train.py +trainer.track_grad_norm=2

# use only 20% of the data
python train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2
```

> **Note**: Visit [configs/debug/](configs/debug/) for different debugging configs.
</details>

<details>
<summary><b>Resume training from checkpoint</b></summary>

```bash
python train.py ckpt_path="/path/to/ckpt/name.ckpt"
```

> **Note**: Checkpoint can be either path or URL.
> **Note**: Currently, loading a ckpt doesn't resume the logger experiment, but it will be supported in a future Lightning release.
</details>

<details>
<summary><b>Evaluate checkpoint on test dataset</b></summary>

```bash
python eval.py ckpt_path="/path/to/ckpt/name.ckpt"
```

> **Note**: Checkpoint can be either path or URL.
</details>

<details>
<summary><b>Create a sweep over hyperparameters</b></summary>

```bash
# this will run 6 experiments one after the other,
# each with a different combination of batch_size and learning rate
python train.py -m datamodule.batch_size=32,64,128 model.lr=0.001,0.0005
```

> **Note**: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
</details>

<details>
<summary><b>Create a sweep over hyperparameters with Optuna</b></summary>

```bash
# this will run hyperparameter search defined in `configs/hparams_search/mnist_optuna.yaml`
# over chosen experiment config
python train.py -m hparams_search=mnist_optuna experiment=example
```

> **Note**: Using [Optuna Sweeper](https://hydra.cc/docs/next/plugins/optuna_sweeper) doesn't require you to add any boilerplate to your code, everything is defined in a [single config file](configs/hparams_search/mnist_optuna.yaml).
> **Warning**: Optuna sweeps are not failure-resistant (if one job crashes then the whole sweep crashes).
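
For orientation, a minimal sweeper config following the Hydra Optuna plugin docs might look like this (the metric name, parameter names, and ranges are placeholders, not the repo's actual `mnist_optuna.yaml`):

```yaml
# hypothetical hparams_search config
defaults:
  - override /hydra/sweeper: optuna

# metric to optimize, assumed to be logged by the model
optimized_metric: "val/acc_best"

hydra:
  sweeper:
    direction: maximize
    n_trials: 20
    params:
      model.optimizer.lr: interval(0.0001, 0.1)
      datamodule.batch_size: choice(32, 64, 128)
```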
</details>

<details>
<summary><b>Execute all experiments from folder</b></summary>

```bash
python train.py -m 'experiment=glob(*)'
```

> **Note**: Hydra provides special syntax for controlling behavior of multiruns. Learn more [here](https://hydra.cc/docs/next/tutorials/basic/running_your_app/multi-run). The command above executes all experiments from [configs/experiment/](configs/experiment/).
</details>

<details>
<summary><b>Execute run for multiple different seeds</b></summary>

```bash
python train.py -m seed=1,2,3,4,5 trainer.deterministic=True logger=csv tags=["benchmark"]
```

> **Note**: `trainer.deterministic=True` makes PyTorch more deterministic but may impact performance.
</details>

<details>
<summary><b>Execute sweep on a remote AWS cluster</b></summary>

> **Note**: This should be achievable with simple config using [Ray AWS launcher for Hydra](https://hydra.cc/docs/next/plugins/ray_launcher). Example is not implemented in this template.
</details>

<!-- <details>
<summary><b>Execute sweep on a SLURM cluster</b></summary>
> This should be achievable with either [the right lightning trainer flags](https://pytorch-lightning.readthedocs.io/en/latest/clouds/cluster.html?highlight=SLURM#slurm-managed-cluster) or simple config using [Submitit launcher for Hydra](https://hydra.cc/docs/plugins/submitit_launcher). Example is not yet implemented in this template.
</details> -->

<details>
<summary><b>Use Hydra tab completion</b></summary>

> **Note**: Hydra allows you to autocomplete config argument overrides in shell as you write them, by pressing `tab` key. Read the [docs](https://hydra.cc/docs/tutorials/basic/running_your_app/tab_completion).
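
For example, to enable completion for the current Bash session (this uses Hydra's generic `--shell-completion`/`-sc` flag; adapt the script path, e.g. `src/train.py`, to where you run it from):

```bash
# install tab completion for this app in the current shell
eval "$(python train.py -sc install=bash)"
```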
</details>

<details>
<summary><b>Apply pre-commit hooks</b></summary>

```bash
pre-commit run -a
```

> **Note**: Apply pre-commit hooks to do things like auto-formatting code and configs, performing code analysis, or removing output from Jupyter notebooks. See [# Best Practices](#best-practices) for more.
</details>

<details>
<summary><b>Run tests</b></summary>

```bash
# run all tests
pytest

# run tests from specific file
pytest tests/test_train.py

# run all tests except the ones marked as slow
pytest -k "not slow"
```

</details>

<details>
<summary><b>Use tags</b></summary>

Each experiment should be tagged in order to easily filter runs across files or in the logger UI:

```bash
python train.py tags=["mnist","experiment_X"]
```

If no tags are provided, you will be asked to input them from the command line:

```bash
>>> python train.py tags=[]
[2022-07-11 15:40:09,358][src.utils.utils][INFO] - Enforcing tags! <cfg.extras.enforce_tags=True>
[2022-07-11 15:40:09,359][src.utils.rich_utils][WARNING] - No tags provided in config. Prompting user to input tags...
Enter a list of comma separated tags (dev):
```

If no tags are provided for multirun, an error will be raised:

```bash
>>> python train.py -m +x=1,2,3 tags=[]
ValueError: Specify tags before launching a multirun!
```

> **Note**: Appending lists from the command line is currently not supported in Hydra :(
</details>

<br>
