Evaluate multiple models/datasets/languages using the CLI directly #56
Conversation
Great! What do you think about putting the same explanation as this PR description directly into the README?
About what else we need: I think these would be great to add in future PRs, to increase usability some more:
Yes, good idea.
Alright, thanks, I will have a look at these afterwards.
Added a README with instructions on how to run the zero-shot benchmark without run.sh (and also the multilingual benchmark). I will do some tests to find out if there are any issues, but otherwise I believe it's fine.
benchmark/build_csv.py
Outdated
@@ -1,8 +1,14 @@
import argparse
Maybe let's integrate this into the main CLI?
Thanks, done, it indeed looks better like that. Perhaps we should do the same for clip_benchmark_export_wds (in another PR).
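For reference, a hypothetical invocation of such an integrated subcommand could look like the sketch below; the subcommand name `build`, its arguments, and the file names are assumptions, not something confirmed in this thread.

```bash
# Hypothetical: aggregate the per-run JSON result files into a single CSV
# via a subcommand of the main CLI instead of a standalone build_csv.py script.
clip_benchmark build benchmark_*.json --output benchmark.csv
```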
Looks great.
OK, I re-ran the full VTAB+ and retrieval benchmark with the new CLI and everything worked fine. I just see some variation (compared to the current results). Below I show the delta between the current results and the benchmark I re-ran: the variation seems to happen in fer2013, mnist, renderedsst2, diabetic retinopathy, kitti, and pcam, with diabetic retinopathy being the worst. Retrieval looks fine.
Looks good to me, let's merge it!
OK then, let's merge.
Issue #43
So there are multiple ways to do that with this PR (will add documentation in the README later).
For models, we can provide a list of pretrained model names in the form of 'model,pretrained' (so `model` and `pretrained` are comma separated). For datasets, we can provide a list of datasets. For languages, we can provide a list of languages.

Example:
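A rough sketch of what such a command could look like; the flag names (`--pretrained_model`, `--dataset`, `--language`, `--dataset_root`, `--output`), the template placeholders, and the model/dataset names are illustrative assumptions rather than the exact invocation:

```bash
# Evaluate several "model,pretrained" pairs on several datasets and a language
# in a single call; {dataset}, {pretrained}, {model}, {language}, {task} are
# assumed template fields resolved per evaluation run.
clip_benchmark eval \
    --pretrained_model ViT-B-32-quickgelu,laion400m_e32 ViT-L-14,openai \
    --dataset cifar10 imagenet1k \
    --language en \
    --dataset_root "clip_benchmark_datasets/{dataset}" \
    --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```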
Note that `--dataset_root` and `--output` can now be in the form of a template that depends on the dataset/model/language/task (for `--output`) and on the dataset name (for `--dataset_root`).

We can also provide files with lists of models or datasets (one per line):
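For example, assuming the CLI accepts file paths in place of explicit lists (the file names and contents below are illustrative, and the flags are the same assumed ones as above):

```bash
# models.txt: one "model,pretrained" pair per line (illustrative contents)
cat > models.txt <<'EOF'
ViT-B-32-quickgelu,laion400m_e32
ViT-L-14,openai
EOF

# datasets.txt: one dataset name per line (illustrative contents)
cat > datasets.txt <<'EOF'
cifar10
cifar100
EOF

clip_benchmark eval \
    --pretrained_model models.txt \
    --dataset datasets.txt \
    --dataset_root "clip_benchmark_datasets/{dataset}" \
    --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```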
We can also provide model collection names (`openai`, `openclip_base`, `openclip_full` are supported) or dataset collection names (`vtab`, `vtab+`, `retrieval`, `imagenet_robustness` are supported):
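For instance, a sketch along these lines; only the collection names come from the list above, while the flags and output template are the same assumptions as in the earlier sketches:

```bash
# Evaluate the "openai" model collection on the "vtab+" and "retrieval"
# dataset collections (collection names taken from the supported lists above).
clip_benchmark eval \
    --pretrained_model openai \
    --dataset vtab+ retrieval \
    --dataset_root "clip_benchmark_datasets/{dataset}" \
    --output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```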
(`openclip_base` is the same as benchmark/models.txt, while `openclip_full` uses OpenCLIP's `open_clip.list_pretrained_models()`.)

The evaluation is sequential, but we can think in the future about how to run evaluations in parallel on multiple GPUs (out of the scope of this PR).
@rom1504 what do you think, is there anything else we need?