fix nccl test eks #455
@@ -39,10 +39,6 @@ spec:
- -x
- FI_PROVIDER=efa
- -x
- FI_EFA_USE_DEVICE_RDMA=1
Keep those variables to be consistent with Slurm.
@roywei do you plan to update this PR?
Hi @mhuguesaws, thanks for the review and sorry for the late reply. I think these flags are no longer needed per the OFI docs, whether on Slurm or k8s. Also, setting FI_EFA_USE_DEVICE_RDMA on some old p3 instances with newer OFI and libfabric versions may cause a crash. See https://github.com/aws/aws-ofi-nccl/blob/master/doc/efa-env-var.md
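To illustrate the concern, here is a minimal sketch of passing the RDMA flag only on instance families that are expected to support it. The helper name `efa_env_args` and the `RDMA_CAPABLE_FAMILIES` allowlist are hypothetical and not part of this PR or of aws-ofi-nccl; the allowlist contents are an assumption that should be checked against the EFA documentation.

```python
from typing import List

# Assumption for illustration only: families treated as RDMA-capable.
# Verify against the EFA / aws-ofi-nccl docs before relying on this set.
RDMA_CAPABLE_FAMILIES = {"p4d", "p5"}


def efa_env_args(instance_type: str) -> List[str]:
    """Build mpirun `-x VAR=value` arguments for an EFA-backed run.

    Hypothetical helper: FI_EFA_USE_DEVICE_RDMA is added only when the
    instance family is in the allowlist above.
    """
    # "p5.48xlarge" -> "p5"
    family = instance_type.split(".", 1)[0]

    env = {"FI_PROVIDER": "efa"}
    if family in RDMA_CAPABLE_FAMILIES:
        env["FI_EFA_USE_DEVICE_RDMA"] = "1"

    args: List[str] = []
    for name, value in env.items():
        args += ["-x", f"{name}={value}"]
    return args
```

With this sketch, `efa_env_args("p5.48xlarge")` includes the RDMA flag, while `efa_env_args("p3dn.24xlarge")` passes only `FI_PROVIDER=efa`.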
This test is specific to p5. Please keep it as is, or make it consistent in Slurm as well. Thank you.
Left comments
Let me know if I missed anything about the flags. I can also update the Slurm nccl-tests flags if needed.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.