Evaluate multiple models/datasets/languages using the CLI directly #56
Conversation
Great! What do you think about putting the same explanation as this PR description directly into the README?
About what else we need: I think these would be great to add in future PRs, to increase usability some more:
Yes, good idea.
Alright, thanks, I will have a look at these afterwards.
Added a README with instructions on how to run the zero-shot benchmark without run.sh (and also the multilingual benchmark). I will do some tests to find out if there are any issues, but otherwise I believe it's fine.
benchmark/build_csv.py
Outdated
@@ -1,8 +1,14 @@
import argparse
Maybe let's integrate this into the main CLI?
Thanks, done, it indeed looks better like that. Perhaps we should do the same for clip_benchmark_export_wds (in another PR).
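For reference, a hypothetical invocation of such an integrated subcommand could look like the sketch below; the subcommand name `build`, its arguments, and the file names are assumptions, not something confirmed in this thread.

```bash
# Hypothetical: aggregate the per-run JSON result files into a single CSV
# via a subcommand of the main CLI instead of a standalone build_csv.py script.
clip_benchmark build benchmark_*.json --output benchmark.csv
```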
Looks great.
OK, I re-ran the full VTAB+ and retrieval benchmark with the new CLI and everything worked fine. I just see some variation (compared to the current results). Below I show the delta between the current results and the benchmark I re-ran: the variation seems to happen in fer2013, mnist, renderedsst2, diabetic retinopathy, kitti, and pcam, with diabetic retinopathy being the worst. Retrieval looks fine.
Looks good to me, let's merge it!
OK then, let's merge.
Issue #43
So there are multiple ways to do that with this PR (will add documentation in the README later).
For models, we can provide a list of pretrained model names in the form of 'model,pretrained' (so `model` and `pretrained` are comma separated). For datasets, we can provide a list of datasets. For languages, we can provide a list of languages.

Example:
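A rough sketch of what such a command could look like; the flag names (`--pretrained_model`, `--dataset`, `--language`, `--dataset_root`, `--output`), the template placeholders, and the model/dataset names are illustrative assumptions rather than the exact invocation:

```bash
# Evaluate several "model,pretrained" pairs on several datasets and a language
# in a single call; {dataset}, {pretrained}, {model}, {language}, {task} are
# assumed template fields resolved per evaluation run.
clip_benchmark eval \
    --pretrained_model ViT-B-32-quickgelu,laion400m_e32 ViT-L-14,openai \
    --dataset cifar10 imagenet1k \
    --language en \
    --dataset_root "clip_benchmark_datasets/{dataset}" \
    --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```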
Note that `--dataset_root` and `--output` can now be in the form of a template that depends on the dataset/model/language/task (for `--output`) and on the dataset name (for `--dataset_root`).

We can also provide files with lists of models or datasets (one per line):
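For example, assuming the CLI accepts file paths in place of explicit lists (the file names and contents below are illustrative, and the flags are the same assumed ones as above):

```bash
# models.txt: one "model,pretrained" pair per line (illustrative contents)
cat > models.txt <<'EOF'
ViT-B-32-quickgelu,laion400m_e32
ViT-L-14,openai
EOF

# datasets.txt: one dataset name per line (illustrative contents)
cat > datasets.txt <<'EOF'
cifar10
cifar100
EOF

clip_benchmark eval \
    --pretrained_model models.txt \
    --dataset datasets.txt \
    --dataset_root "clip_benchmark_datasets/{dataset}" \
    --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```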
We can also provide model collection names (`openai`, `openclip_base`, `openclip_full` are supported) or dataset collection names (`vtab`, `vtab+`, `retrieval`, `imagenet_robustness` are supported):
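For instance, a sketch along these lines; only the collection names come from the list above, while the flags and output template are the same assumptions as in the earlier sketches:

```bash
# Evaluate the "openai" model collection on the "vtab+" and "retrieval"
# dataset collections (collection names taken from the supported lists above).
clip_benchmark eval \
    --pretrained_model openai \
    --dataset vtab+ retrieval \
    --dataset_root "clip_benchmark_datasets/{dataset}" \
    --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```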
(`openclip_base` is the same as benchmark/models.txt, while `openclip_full` uses OpenCLIP's `open_clip.list_pretrained_models()`.)

The evaluation is sequential, but we can think in the future about how to run evaluations in parallel on multiple GPUs (out of the scope of this PR).
@rom1504 what do you think, is there anything else we need?