Evaluate multiple models/datasets/languages using the CLI directly #56

Merged (16 commits) on Dec 26, 2022

Conversation

@mehdidc (Collaborator) commented Dec 21, 2022

Addresses issue #43.

This PR adds multiple ways to do that (I will add documentation to the README later).

For models, we can provide a list of pretrained model names in the form 'model,pretrained' (model and pretrained are comma-separated). For datasets, we can provide a list of dataset names. For languages, we can provide a list of languages.
Example:

clip_benchmark --pretrained_model  ViT-B-32-quickgelu,laion400m_e32 ViT-L-14,laion400m_e32  \
--dataset cifar10 cifar100 --dataset_root "clip_benchmark_datasets/{dataset}" --language en jp \
--verbose --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"

Note that --dataset_root and --output can now be templates that depend on the dataset/model/language/task (for --output) or on the dataset name (for --dataset_root).
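
For illustration, the first combination in the command above (cifar10 with ViT-B-32-quickgelu/laion400m_e32 in English) would then write its results to something like the file below, assuming the zero-shot classification task is reported as zeroshot_classification:

cifar10_laion400m_e32_ViT-B-32-quickgelu_en_zeroshot_classification.json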

We can also provide files containing the list of models or datasets (one per line):

clip_benchmark --pretrained_model  benchmark/models.txt \
--dataset benchmark/datasets.txt --dataset_root "clip_benchmark_datasets/{dataset}"  \
--verbose --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
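
Here, as a hypothetical illustration reusing the names from the first example, benchmark/models.txt would contain one 'model,pretrained' pair per line and benchmark/datasets.txt one dataset name per line:

benchmark/models.txt:
ViT-B-32-quickgelu,laion400m_e32
ViT-L-14,laion400m_e32

benchmark/datasets.txt:
cifar10
cifar100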

We can also provide model collection names (openai, openclip_base, openclip_full are supported) or dataset collection names (vtab, vtab+, retrieval, imagenet_robustness are supported):

clip_benchmark --pretrained_model  openai openclip_base  --dataset vtab+ retrieval \
--dataset_root "clip_benchmark_datasets/{dataset}" --verbose \
--output "{dataset}_{pretrained}_{model}_{language}_{task}.json"

(openclip_base is the same as benchmark/models.txt, while openclip_full uses OpenCLIP's open_clip.list_pretrained_models())
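
For reference, a rough sketch of how such a full list of OpenCLIP pairs can be enumerated; this sketch assumes open_clip.list_pretrained(), which returns (model_name, pretrained_tag) tuples, so the exact helper may differ from the one the collection uses internally:

import open_clip

# Print every (model, pretrained) pair shipped with OpenCLIP in the same
# 'model,pretrained' format that --pretrained_model accepts.
for model_name, pretrained_tag in open_clip.list_pretrained():
    print(f"{model_name},{pretrained_tag}")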

The evaluation is sequential, but in the future we can think about how to run evaluations in parallel on multiple GPUs (out of scope for this PR).

@rom1504 what do you think, is there anything else we need?

@rom1504 (Contributor) commented Dec 21, 2022

great!

What do you think about putting the same explanation as this PR description directly into the README?

@rom1504 (Contributor) commented Dec 21, 2022

About what else we need: I think these would be great future PRs to increase usability some more:

  • Generate the important diagrams from the notebook automatically; a picture is much easier to read than a CSV!
  • Put all datasets in wds (#52) to make it even easier to get the datasets

@mehdidc (Collaborator, Author) commented Dec 22, 2022

> great!
>
> What do you think about putting the same explanation as this PR description directly into the README?

Yes, good idea

@mehdidc (Collaborator, Author) commented Dec 22, 2022

> About what else we need: I think these would be great future PRs to increase usability some more:
>
> * Generate the important diagrams from the notebook automatically; a picture is much easier to read than a CSV!
> * [put all datasets in wds #52](https://github.com/LAION-AI/CLIP_benchmark/issues/52) to make it even easier to get the datasets

Alright, thanks, I will have a look at these afterwards

@mehdidc (Collaborator, Author) commented Dec 23, 2022

Added to the README instructions on how to run the zero-shot benchmark without run.sh (including the multilingual benchmark). I will do some tests to find out if there are any issues, but otherwise I believe it's fine.

@@ -1,8 +1,14 @@
import argparse

@rom1504 (Contributor):

Let's maybe integrate this into the main CLI?

@mehdidc (Collaborator, Author):

Thanks, done, it indeed looks better like that. Perhaps we should do the same for clip_benchmark_export_wds (in another PR).

@rom1504 (Contributor) commented Dec 23, 2022

Looks great

@mehdidc (Collaborator, Author) commented Dec 26, 2022

OK, I re-ran the full VTAB+ and retrieval benchmarks with the new CLI and everything worked fine. I just see some variation (compared to the current benchmark.csv) in the numbers at the 3rd digit after the decimal point, and a few times at the 2nd digit, perhaps because of AMP (which I use now and did not use before). Variation at the 3rd digit after the decimal point is probably okay; at the 2nd digit I'd be more worried, so I will need to investigate the effect of AMP after this PR.

Below I show the delta between the current benchmark.csv and the re-run:

[delta plot image]

The variation seems to happen in fer2013, mnist, renderedsst2, diabetic retinopathy, kitti, and pcam, with diabetic retinopathy being the worst.

Retrieval looks fine:

[retrieval results image]

@rom1504 (Contributor) commented Dec 26, 2022

Looks good to me, let's merge it!

@mehdidc (Collaborator, Author) commented Dec 26, 2022

Ok then let's merge

mehdidc merged commit 79d28fa into main on Dec 26, 2022