GPU stalling during validation #58

Open
farzadab opened this issue Jul 25, 2024 · 0 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

farzadab commented Jul 25, 2024

Depending on the dataset used, our validation gets stuck when running with DDP on 8 GPUs.

This happens fairly consistently when the dataset has only 1 shard. Re-uploading the dataset with 8 shards (or more?) usually resolves the issue, but the root cause is still unknown.
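
One hypothesis worth checking (not confirmed anywhere in this issue) is that shards are distributed across ranks in a round-robin fashion, so a 1-shard dataset leaves most ranks with nothing to read while the others wait on them. The sketch below only illustrates that arithmetic. `shards_for_rank` and `idle_ranks` are hypothetical helpers, and the round-robin assignment is an assumption, not the actual loader logic in this project.

```python
# Hypothetical illustration: which ranks would receive no shards if shards
# were assigned round-robin (shard i goes to rank i % world_size).
# This is an assumed assignment scheme, not the project's real data loader.


def shards_for_rank(num_shards: int, rank: int, world_size: int) -> list[int]:
    """Return the shard indices a given rank would read under round-robin."""
    return list(range(rank, num_shards, world_size))


def idle_ranks(num_shards: int, world_size: int) -> list[int]:
    """Return the ranks that would end up with no shards at all."""
    return [
        rank
        for rank in range(world_size)
        if not shards_for_rank(num_shards, rank, world_size)
    ]


if __name__ == "__main__":
    for num_shards in (1, 8):
        print(f"{num_shards} shard(s) on 8 ranks, idle ranks:", idle_ranks(num_shards, 8))
```

If the real loader behaves like this, ranks with no data would run a different number of validation steps than the others, which is a common way for collective operations to hang under DDP. Logging the per-rank batch count during validation would confirm or rule this out.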

As a direct mitigation, we removed the matchtrain validation set. This is not ideal, since without it there is no easy way to check for overfitting.

farzadab added the bug and help wanted labels on Jul 25, 2024