Depending on the dataset used, our validation gets stuck when using DDP on 8 GPUs.
This happens pretty consistently when the dataset has only 1 shard. Re-uploading the dataset with 8 shards (or more?) usually resolves the issue, but the cause is still unknown.
A direct mitigation for this issue was to remove the `matchtrain` validation set. This is not ideal, as without it there's no easy way to check for overfitting.
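
One possible explanation, not confirmed here, is that shards are split across ranks, so a 1-shard dataset leaves most of the 8 processes with no validation data while the others wait on a collective call. The sketch below only illustrates that arithmetic. The round-robin scheme and the names `assign_shards` and `idle_ranks` are hypothetical and are not taken from our data loader.

```python
# Hypothetical illustration: round-robin shard assignment across DDP ranks.
# This is an assumption about how shards might be distributed, not the
# actual loader logic used in this project.

def assign_shards(num_shards: int, world_size: int) -> dict[int, list[int]]:
    """Map each rank to the shard indices it would read under round-robin."""
    return {
        rank: list(range(rank, num_shards, world_size))
        for rank in range(world_size)
    }


def idle_ranks(num_shards: int, world_size: int) -> list[int]:
    """Return the ranks that receive no shards at all."""
    assignment = assign_shards(num_shards, world_size)
    return [rank for rank, shards in assignment.items() if not shards]


if __name__ == "__main__":
    # With 1 shard on 8 GPUs, only rank 0 has data to validate on.
    print("1 shard, 8 ranks, idle:", idle_ranks(1, 8))
    # With 8 shards on 8 GPUs, every rank has exactly one shard.
    print("8 shards, 8 ranks, idle:", idle_ranks(8, 8))
```

If this is what is happening, ranks with no data would finish validation immediately while rank 0 is still iterating, which could stall any step that requires all ranks to participate. That would also be consistent with 8 shards resolving the hang, but it still needs to be verified against the actual loader.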