put all datasets in wds #52

Closed
rom1504 opened this issue Dec 13, 2022 · 9 comments

Comments

@rom1504
Contributor

rom1504 commented Dec 13, 2022

follow up of #47

  • put all datasets on HF (Hugging Face)
  • then reference all the dataset links in a file
  • and adapt run.sh to have an option to use webdataset as a source, so it works for all datasets
  • important to have enough shards (at least 4)
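
A minimal sketch of the sharding point above, assuming samples are (key, image bytes, label) tuples. It uses only the standard-library `tarfile` module rather than the webdataset library's own writer, and the function name `write_shards`, the `0.tar`-style file names, and the `.jpg`/`.cls` extensions are illustrative choices, not the layout the project actually uses.

```python
import io
import tarfile
from pathlib import Path

MIN_SHARDS = 4  # lower bound mentioned in the list above


def write_shards(samples, out_dir, num_shards=MIN_SHARDS):
    """Write (key, image_bytes, label) samples round-robin into tar shards.

    Files sharing a basename (e.g. s0001.jpg and s0001.cls) belong to one
    sample, which is the grouping convention WebDataset-style tars rely on.
    """
    if num_shards < MIN_SHARDS:
        raise ValueError(f"need at least {MIN_SHARDS} shards, got {num_shards}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: 0.tar, 1.tar, ...
    paths = [out_dir / f"{i}.tar" for i in range(num_shards)]
    tars = [tarfile.open(p, "w") for p in paths]
    try:
        for idx, (key, image_bytes, label) in enumerate(samples):
            tar = tars[idx % num_shards]  # round-robin assignment
            for suffix, payload in ((".jpg", image_bytes),
                                    (".cls", str(label).encode())):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    finally:
        for tar in tars:
            tar.close()
    return paths
```

Round-robin assignment keeps the shards roughly the same size; a real conversion script might instead cap each shard by sample count or byte size.
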
@djghosh13
Contributor

Will leave a note here to also add an option to download (cache) when loading from HF. Mainly, I need to add another CLI option/parameter somewhere, since both the source URL (currently specified by --dataset_root) and the destination download path need to be specified.
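
A rough sketch of what the two options could look like with `argparse`. Only `--dataset_root` comes from the comment above; `--wds_cache_dir` is a hypothetical name for the destination path, and the helpers `build_parser` and `describe` are made up for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of source/cache options")
    # --dataset_root is the existing option named above; here it may hold
    # either a local path or a remote (e.g. HF) URL.
    parser.add_argument("--dataset_root", required=True,
                        help="source location of the dataset (path or URL)")
    # --wds_cache_dir is a HYPOTHETICAL name for the destination download path.
    parser.add_argument("--wds_cache_dir", default=None,
                        help="optional local directory to cache downloaded shards")
    return parser


def describe(args):
    """Return a short summary of where data is read from and cached to."""
    if args.wds_cache_dir is None:
        return f"stream from {args.dataset_root} without caching"
    return f"download {args.dataset_root} into {args.wds_cache_dir}"
```

Leaving the cache option unset by default keeps the current behavior, so existing invocations that only pass `--dataset_root` would not need to change.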

@djghosh13
Contributor

Another related note: adding some form of support for different languages of classnames and templates to avoid needing to duplicate the whole dataset for multilingual eval

@rom1504
Contributor Author

rom1504 commented Dec 26, 2022

@djghosh13 would definitely be great to do this issue if you're still interested

@usuyama

usuyama commented Dec 30, 2022

+1 for wds!

I tried HF datasets for images before, but somehow didn't like it that much. Maybe because Arrow wasn't very flexible / intuitive for images.

@djghosh13
Contributor

Definitely still interested! What are your thoughts on the implementation of this point?

Another related note: adding some form of support for different languages of classnames and templates to avoid needing to duplicate the whole dataset for multilingual eval

Everything else should be straightforward once I actually get around to it.

@rom1504
Contributor Author

rom1504 commented Dec 31, 2022

We already have support for that, check the code.

Currently, the approach taken is to put the prompts and classnames for other languages directly in this repo.

@djghosh13
Contributor

Hm, yeah. I guess the question is: do we want to use the same procedure for wds? Currently, the wds loader expects a classnames.txt and zeroshot_classification_templates.txt in the same folder/HF repo as the data.
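
For reference, reading those two files from a local folder could look roughly like this. The file names are the ones mentioned above; the one-entry-per-line format and the helper names `read_lines` and `load_zeroshot_metadata` are assumptions, not the actual loader code.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty, stripped lines of a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_zeroshot_metadata(root):
    """Read the two text files expected next to the shards.

    ASSUMPTION: one classname or template per line.
    """
    root = Path(root)
    classnames = read_lines(root / "classnames.txt")
    templates = read_lines(root / "zeroshot_classification_templates.txt")
    return classnames, templates
```

This only covers a local folder; loading from an HF repo would need a download step first.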

@rom1504
Contributor Author

rom1504 commented Jan 1, 2023

I think it makes sense to put in HF/wds the same content as in the original source, so the English classnames and prompts for most datasets.
And to put new content that we provide by other means (for example, multilingual classnames/prompt lists) directly in clip benchmark.

Can think of it as an override

The reasoning is that we cannot add more things in the original source, so doing it that way (rather than adding more languages in all the wds) will keep the source of truth in a single place (clip benchmark repo) for both original source and wds formats.
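
The override idea can be sketched as a simple lookup with fallback. Everything here is hypothetical: the function name `resolve_metadata`, the dictionary layout, and the `(dataset, language)` key are illustrative and do not reflect how the repo actually stores its multilingual lists.

```python
def resolve_metadata(shipped, overrides, dataset, language="en"):
    """Pick classnames/templates, letting repo-side entries override shipped ones.

    shipped:   {"classnames": [...], "templates": [...]} read from the
               dataset itself (original source or wds)
    overrides: {(dataset, language): {"classnames": [...], "templates": [...]}}
               kept in the benchmark repo (HYPOTHETICAL layout)
    """
    override = overrides.get((dataset, language), {})
    return {
        # Fall back to the shipped value when the repo has no override.
        "classnames": override.get("classnames", shipped["classnames"]),
        "templates": override.get("templates", shipped["templates"]),
    }
```

With this shape, the same override table applies whether the data came from the original source or from wds, which is the single-source-of-truth property described above.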

@rom1504
Contributor Author

rom1504 commented Feb 3, 2023

done

@rom1504 rom1504 closed this as completed Feb 3, 2023