put all datasets in wds #52

rom1504 · 2022-12-13T19:58:49Z

Follow-up to #47.

Put all datasets on Hugging Face (HF),
then reference all the dataset links in a file,
and adapt run.sh to have an option to use webdataset as a source, so it works for all datasets.
It is important to have enough shards (at least 4).
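
A minimal sketch of the shard-count requirement, assuming a hypothetical layout where shards are named `0.tar`, `1.tar`, ... under a split folder. The function name, the base URL, and the naming scheme are illustrative assumptions, not the actual layout of the hosted datasets. The issue does not give the reason for the minimum; one common reason is that webdataset distributes shards across data loader workers.

```python
from typing import List


def shard_urls(base_url: str, num_shards: int, min_shards: int = 4) -> List[str]:
    """Build the list of shard URLs for one split and check the shard count.

    The `<base_url>/<index>.tar` naming is an assumption made for this sketch.
    """
    if num_shards < min_shards:
        raise ValueError(
            f"expected at least {min_shards} shards, got {num_shards}"
        )
    return [f"{base_url}/{i}.tar" for i in range(num_shards)]


# Hypothetical dataset location, used only for illustration.
urls = shard_urls("https://example.com/my_dataset/test", num_shards=4)
print(urls[0])  # https://example.com/my_dataset/test/0.tar
```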

djghosh13 · 2022-12-14T23:30:44Z

Will leave a note here to also add an option to download (cache) when loading from HF. Mainly, I need to add another CLI option/parameter somewhere, since both the source URL (currently specified by --dataset_root) and the destination download path need to be specified.
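
One way this could look, sketched with `argparse`. Only `--dataset_root` is named in this thread; the `--cache_dir` option and its default are hypothetical and may not match what the tool ends up exposing.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch: source and cache options")
    # Option named in the thread: the source URL (or local path) of the dataset.
    parser.add_argument("--dataset_root", type=str, required=True,
                        help="source URL or local path of the dataset")
    # Hypothetical option: where downloaded shards would be cached locally.
    # None means "do not cache, stream from the source".
    parser.add_argument("--cache_dir", type=str, default=None,
                        help="local directory used to cache downloaded shards")
    return parser


# Example invocation with illustrative values.
args = build_parser().parse_args(
    ["--dataset_root", "https://example.com/my_dataset", "--cache_dir", "/tmp/wds_cache"]
)
print(args.dataset_root, args.cache_dir)
```

Keeping the two values in separate options means the source can stay a remote URL while the cache location remains optional.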

djghosh13 · 2022-12-18T19:00:07Z

Another related note: adding some form of support for different languages of classnames and templates to avoid needing to duplicate the whole dataset for multilingual eval

rom1504 · 2022-12-26T22:55:46Z

@djghosh13 it would definitely be great if you could take on this issue, if you're still interested

usuyama · 2022-12-30T06:06:17Z

+1 for wds!

I tried HF datasets for images before, but somehow didn't like it that much. Maybe because Arrow wasn't very flexible / intuitive for images.

djghosh13 · 2022-12-31T19:12:53Z

Definitely still interested! What are your thoughts on the implementation of this point?

> Another related note: adding some form of support for different languages of classnames and templates to avoid needing to duplicate the whole dataset for multilingual eval

Everything else should be straightforward once I actually get around to it.

rom1504 · 2022-12-31T22:56:58Z

We already have support for that; check the code.

Currently, the approach taken is to put the prompts and classnames for the other languages directly in this repo.

djghosh13 · 2022-12-31T23:09:23Z

Hm, yeah. I guess the question is: do we want to use the same procedure for wds? Currently, the wds loader expects a classnames.txt and a zeroshot_classification_templates.txt in the same folder/HF repo as the data.
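
For illustration, reading those two files from a local folder could look like the sketch below. The file names come from the comment above; the one-entry-per-line format, the local-folder assumption, and the function names are assumptions of this sketch, not the loader's actual code.

```python
from pathlib import Path
from typing import List, Tuple


def read_lines(path: Path) -> List[str]:
    """Return the non-empty lines of a text file, stripped of whitespace."""
    text = path.read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_zeroshot_metadata(root: str) -> Tuple[List[str], List[str]]:
    """Load class names and prompt templates stored next to the data.

    Assumes both files live directly under `root`, one entry per line.
    """
    root_path = Path(root)
    classnames = read_lines(root_path / "classnames.txt")
    templates = read_lines(root_path / "zeroshot_classification_templates.txt")
    return classnames, templates
```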

rom1504 · 2023-01-01T00:16:43Z

I think it makes sense to put in HF/wds the same content as in the original source, so the English classnames and prompts for most datasets,
and to put new content that we provide by other means (for example, multilingual classnames/prompt lists) directly in clip benchmark.

You can think of it as an override.

The reasoning is that we cannot add more things in the original source, so doing it that way (rather than adding more languages in all the wds) will keep the source of truth in a single place (clip benchmark repo) for both original source and wds formats.
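
A rough sketch of the override idea, assuming the repo-provided lists are kept in a dictionary keyed by dataset and language. The function name, the dictionary layout, and the sample entries are all hypothetical and only illustrate the fallback order.

```python
from typing import Dict, List, Tuple

# Hypothetical store of repo-provided lists, keyed by (dataset, language).
Overrides = Dict[Tuple[str, str], List[str]]


def resolve_classnames(
    dataset_classnames: List[str],
    overrides: Overrides,
    dataset: str,
    language: str,
) -> List[str]:
    """Prefer the repo-provided list for this dataset/language, if any.

    Falls back to the class names shipped with the dataset (original source
    or wds) when no override exists.
    """
    override = overrides.get((dataset, language))
    return override if override is not None else dataset_classnames


# Illustrative data only.
shipped = ["cat", "dog"]
repo_overrides: Overrides = {("my_dataset", "fr"): ["chat", "chien"]}

print(resolve_classnames(shipped, repo_overrides, "my_dataset", "fr"))  # ['chat', 'chien']
print(resolve_classnames(shipped, repo_overrides, "my_dataset", "en"))  # ['cat', 'dog']
```

The same lookup would apply to prompt templates, and it behaves identically whether the shipped lists came from the original source or from wds.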

rom1504 · 2023-02-03T23:36:59Z

done

rom1504 mentioned this issue Dec 13, 2022

Added support for loading webdatasets #47

Merged

rom1504 mentioned this issue Dec 21, 2022

Evaluate multiple models/datasets/languages using the CLI directly #56

Merged

rom1504 mentioned this issue Dec 26, 2022

put all datasets in hugging face datasets #4

Closed

djghosh13 mentioned this issue Jan 31, 2023

Webdataset updates #75

Merged

rom1504 closed this as completed Feb 3, 2023
