Webdataset updates #75

Merged
7 commits merged into LAION-AI:main on Feb 3, 2023

Conversation

djghosh13
Contributor

(Mostly) addresses issues #52 and #67

  • Support for conversion and evaluation of retrieval datasets (clip_benchmark_export_wds --retrieval)
  • More complete API for converting other classification/retrieval datasets (import clip_benchmark.webdataset_builder)
  • (Not in commit) All VTAB+ and retrieval datasets have been uploaded to HF
  • Readmes have been updated accordingly; the default suggestion in benchmark/README.md is now to use webdataset

Not completed:

  • Multilingual support, i.e., overriding default classnames & templates with other languages
  • Converting voc2007_multilabel

@djghosh13
Contributor Author

Note: I've only tested the benchmark code with a single model, so I haven't run the complete experiments. There are some minor differences in numbers from what @rom1504 gave me, but no differences from what I get with the original datasets when I run them myself.

@rom1504
Contributor

rom1504 commented Feb 1, 2023

Nice, will check it out

@mehdidc
Collaborator

mehdidc commented Feb 1, 2023

Really cool, thanks @djghosh13! The differences in numbers might be related to issue #59.

@djghosh13
Contributor Author

I see, yeah, I think the differences I saw were also in the 0.001s range.

@rom1504
Contributor

rom1504 commented Feb 2, 2023

Are datasets getting cached locally? If so, where, and is it tweakable?

@djghosh13
Contributor Author

I forgot to add this to the readme. By default, no, but there is a new --wds_cache_dir parameter in the CLI, which is passed directly to WebDataset(cache_dir=) if a path is given.
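To make the behavior described above concrete, here is a minimal sketch of how such a flag could be wired through. It is not the code from this PR: the helper names parse_args, webdataset_kwargs, and build_dataset are hypothetical, and only the --wds_cache_dir flag and the cache_dir keyword of webdataset's WebDataset come from the discussion.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical CLI parser exposing only the caching flag."""
    parser = argparse.ArgumentParser()
    # Default is None, i.e. no local caching unless a path is given.
    parser.add_argument("--wds_cache_dir", default=None, type=str)
    return parser.parse_args(argv)


def webdataset_kwargs(args):
    """Translate CLI args into keyword arguments for WebDataset."""
    # Treat a missing or empty path as "do not cache".
    cache_dir = args.wds_cache_dir or None
    return {"cache_dir": cache_dir}


def build_dataset(shard_urls, args):
    """Create the dataset; needs the third-party webdataset package."""
    import webdataset as wds  # imported lazily so the helpers above work without it

    return wds.WebDataset(shard_urls, **webdataset_kwargs(args))
```

With no flag, cache_dir stays None and shards are streamed without being kept on disk; with --wds_cache_dir /some/path, that path is forwarded unchanged.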

@djghosh13
Contributor Author

I hadn't actually tested it before, but it looks like it saves the .tar files inside the specified cache directory, in a subdirectory with a name like datasets_clip-benchmark_wds_vtab-cifar10_resolve_main_test (for example).
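The subdirectory name above looks like a URL path with its slashes replaced by underscores. The sketch below only illustrates that observation; it is a guess, not webdataset's actual naming logic. The shard URL is a hypothetical example chosen to reproduce the name quoted above, and guess_cache_subdir is a made-up helper.

```python
from urllib.parse import urlparse


def guess_cache_subdir(shard_dir_url):
    """Guess the cache subdirectory name for a shard directory URL.

    Assumption: the name is the URL path with "/" replaced by "_".
    """
    path = urlparse(shard_dir_url).path.strip("/")
    return path.replace("/", "_")


# Hypothetical URL, picked only because it yields the name reported above.
example_url = "https://huggingface.co/datasets/clip-benchmark/wds_vtab-cifar10/resolve/main/test"
print(guess_cache_subdir(example_url))
```

If the guess holds, each dataset split ends up in its own subdirectory, so several datasets can share one cache directory without colliding.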

@rom1504
Contributor

rom1504 commented Feb 2, 2023

ok, if it doesn't cache by default, that's good

@rom1504
Contributor

rom1504 commented Feb 2, 2023

I'll test this

@rom1504
Contributor

rom1504 commented Feb 3, 2023

ok, so one thing here:
it's not caused by your PR, but I really think we should put the main eval command at the beginning of the readme, not hidden in the benchmark/ folder

@rom1504
Contributor

rom1504 commented Feb 3, 2023

#76 a minor point, but quite nice for UX

@rom1504
Contributor

rom1504 commented Feb 3, 2023

yeah, this is much faster than the file-based option

will run it to the end and compare numbers; if all is good, will merge

@rom1504
Contributor

rom1504 commented Feb 3, 2023

ok it does work, let's go

@rom1504 rom1504 merged commit 2de524d into LAION-AI:main Feb 3, 2023
3 participants