Evaluate multiple models/datasets/languages using the CLI directly (#56)
* Evaluate multiple models/datasets/languages using the CLI directly

* minor fix: forgot import

* output file template: make sure the `pretrained` and `dataset` parts do not contain a "/"

* add multilingual_mscoco_captions to get_dataset_default_task (was missing)

* minor

* update README

* instructions on how to run the benchmark and build the CSV. Remove run.sh, no longer needed.

* support and use multilingual openclip model collection

* Support skipping evaluations that are already done

* document skip_existing on README

* fix pretrained name for g-14

* add build_csv.py into the main CLI

* add missing args to the test

* update README

* update README

* update README
mehdidc authored Dec 26, 2022
1 parent 06f26c8 commit 79d28fa
Showing 8 changed files with 368 additions and 137 deletions.
84 changes: 59 additions & 25 deletions README.md
@@ -44,7 +44,9 @@ the results are written into a JSON file.

Here is an example for CIFAR-10 zero-shot classification using OpenCLIP's pre-trained model on LAION-400m:

`clip_benchmark eval --dataset=cifar10 --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`

The dataset is downloaded into the directory given by `--dataset_root`, which defaults to `root`.
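
For instance, to store the datasets under a custom directory instead (the path below is only an illustration):

```bash
clip_benchmark eval --dataset=cifar10 --task=zeroshot_classification \
    --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu \
    --dataset_root=/data/clip_benchmark_datasets \
    --output=result.json --batch_size=64
```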

Here is the content of `result.json` after the evaluation is done:

@@ -56,11 +58,12 @@ Here is the content of `result.json` after the evaluation is done:
}
```


### VOC2007 example

Here is another example with VOC2007, which is a multi-label classification dataset.

`clip_benchmark eval --dataset=voc2007_multilabel --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`

Here is the content of `result.json` after the evaluation is done:

@@ -77,21 +80,16 @@ First, you need to install VTAB's dedicated package.

`pip install task_adaptation==0.1`

The name of the dataset follows the template `vtab/<TASK_NAME>`.
To get the list of the 19 classification tasks used in VTAB, you can run:

`python -c 'from clip_benchmark.datasets.builder import VTAB_19TASKS;print("\n".join(VTAB_19TASKS))'`


Then, you can run it by providing the full dataset name.
Example with `eurosat`:

`clip_benchmark eval --dataset=vtab/eurosat --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`

See [clip_benchmark/datasets/builder.py#L634](clip_benchmark/datasets/builder.py#L634) for the full list of datasets in the VTAB collection.


### TensorFlow dataset example

Here is an example of how to run the benchmark on [TensorFlow datasets](https://www.tensorflow.org/datasets).
First, you need to install `tfds-nightly` and `timm`.
@@ -103,22 +101,18 @@ The name of the dataset follows the template `tfds/<DATASET_NAME>`.

Example with `cifar10`:

`clip_benchmark eval --dataset=tfds/cifar10 --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`


### COCO captions example

Here is an example for COCO captions zero-shot retrieval:

`clip_benchmark eval --dataset=mscoco_captions --task=zeroshot_retrieval --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`

Note that for using COCO, you also need to install `pycocotools` (e.g., using `pip install pycocotools`).


### Webdataset example

Here is an example of how to run the benchmark on [webdataset](https://github.com/webdataset/webdataset).
First, you need to install `webdataset`.
@@ -162,17 +156,57 @@ The name of the dataset follows the template `wds/<DATASET_NAME>`.
Example with `cifar10`:

```
$ clip_benchmark eval --dataset wds/cifar10 --dataset_root ROOT_DIR/wds_cifar10/
$ clip_benchmark eval --dataset wds/cifar10 --dataset_root https://huggingface.co/datasets/djghosh/wds_cifar10_test/tree/main
```

All other arguments remain the same as in the other examples.
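
For instance, a full invocation might combine the webdataset source with the usual evaluation flags (a sketch reusing the model and task from the examples above):

```bash
clip_benchmark eval --dataset wds/cifar10 \
    --dataset_root https://huggingface.co/datasets/djghosh/wds_cifar10_test/tree/main \
    --task zeroshot_classification \
    --pretrained laion400m_e32 --model ViT-B-32-quickgelu \
    --output result.json --batch_size 64
```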

## Evaluate multiple models on multiple datasets

For the purpose of benchmarking, it is possible to run the CLI with multiple
pre-trained models on multiple datasets.


### Pretrained models and datasets list as arguments

For models, we can provide a list of pretrained model names in the form 'model,pretrained' (`model` and `pretrained` separated by a comma). For datasets, we can provide a list of dataset names, and for languages, a list of language codes.
Example:

```bash
clip_benchmark eval --pretrained_model ViT-B-32-quickgelu,laion400m_e32 ViT-L-14,laion400m_e32 \
--dataset cifar10 cifar100 --dataset_root "clip_benchmark_datasets/{dataset}" --language en jp \
--verbose --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```

Note that `--dataset_root` and `--output` can now be templates that depend on the dataset/model/language/task (for `--output`) and on the dataset name (for `--dataset_root`).

If the benchmark fails at some point, it is possible to resume it, skipping the already-evaluated runs, by passing `--skip_existing`.
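
For example, re-running the same sweep with `--skip_existing` appended only evaluates the combinations whose results are not already present (a sketch based on the example above):

```bash
clip_benchmark eval --pretrained_model ViT-B-32-quickgelu,laion400m_e32 ViT-L-14,laion400m_e32 \
    --dataset cifar10 cifar100 --dataset_root "clip_benchmark_datasets/{dataset}" --language en jp \
    --verbose --output "{dataset}_{pretrained}_{model}_{language}_{task}.json" --skip_existing
```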

### Pretrained models and datasets list as files

We can also provide paths to files containing the models (one 'model,pretrained' pair per line, with `model` and `pretrained` separated by a comma) and the datasets (one dataset per line):

```bash
clip_benchmark eval --pretrained_model benchmark/models.txt \
--dataset benchmark/datasets.txt --dataset_root "clip_benchmark_datasets/{dataset}" \
--verbose --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```

Examples are available in [benchmark/datasets.txt](benchmark/datasets.txt) and [benchmark/models.txt](benchmark/models.txt).
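
For illustration, such files might look as follows (these contents are hypothetical, reusing names from the examples above):

```bash
# Hypothetical benchmark/models.txt: one 'model,pretrained' pair per line.
cat > benchmark/models.txt <<'EOF'
ViT-B-32-quickgelu,laion400m_e32
ViT-L-14,laion400m_e32
EOF

# Hypothetical benchmark/datasets.txt: one dataset per line.
cat > benchmark/datasets.txt <<'EOF'
cifar10
cifar100
vtab/eurosat
EOF
```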

### Model and dataset collections

We can also provide model collection names (`openai`, `openclip_base`, `openclip_multilingual`, `openclip_full` are supported) or dataset collection names (`vtab`, `vtab+`, `retrieval`, `imagenet_robustness` are supported):

```bash
clip_benchmark eval --pretrained_model openai openclip_base --dataset vtab+ retrieval \
--dataset_root "clip_benchmark_datasets/{dataset}" --verbose \
--output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```

See [clip_benchmark/models.py#L6](clip_benchmark/models.py#L6) and [clip_benchmark/datasets/builder.py#L634](clip_benchmark/datasets/builder.py#L634) for more information
about the collections.

## Credits

39 changes: 35 additions & 4 deletions benchmark/README.md
Original file line number Diff line number Diff line change
@@ -5,10 +5,41 @@ You can visualize the results in the [notebook](results.ipynb)

# How to reproduce the CLIP benchmark results

## VTAB+ and retrieval datasets (MS-COCO, Flickr30k, Flickr8k)

```bash
clip_benchmark eval --pretrained_model openai openclip_base --dataset vtab+ retrieval \
--dataset_root "clip_benchmark_datasets/{dataset}" \
--output "vtab_plus_and_retrieval_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
(Change `--dataset_root` accordingly)

Once the evaluation finishes, you can construct a CSV with all the results:

```bash
clip_benchmark build vtab_plus_and_retrieval*.json --output=benchmark.csv
```

## Multilingual ImageNet benchmark

To run the multilingual ImageNet benchmark, use:

```bash
clip_benchmark eval --pretrained_model openclip_multilingual openclip_base openai --dataset imagenet1k --language cn it jp en \
--dataset_root "clip_benchmark_datasets/{dataset}" \
--output "multilingual_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
(Change `--dataset_root` accordingly)

## Multilingual MS-COCO benchmark

To run the multilingual MS-COCO benchmark, use:

```bash
clip_benchmark eval --pretrained_model openclip_multilingual openclip_base openai --dataset multilingual_mscoco_captions --language es it ko pl ru tr zh en \
--dataset_root "clip_benchmark_datasets/{dataset}" \
--output "multilingual_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```

(Change `--dataset_root` accordingly)
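
As with the VTAB+ run, the per-run JSON files can then be aggregated into a single CSV (the output filename below is only an illustration):

```bash
clip_benchmark build multilingual_*.json --output=benchmark_multilingual.csv
```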
14 changes: 0 additions & 14 deletions benchmark/build_csv.py

This file was deleted.

33 changes: 0 additions & 33 deletions benchmark/run.sh

This file was deleted.

