Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nccl test eks #455

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

fix nccl test eks #455

wants to merge 1 commit into from

Conversation

roywei
Copy link

@roywei roywei commented Oct 9, 2024

Issue #, if available:

Description of changes:

  1. Fixed image pull policy config, to resolve this error during deployment
Error from server (BadRequest): error when creating "nccl-tests.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.Launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.Worker.template.spec.imagePullPolicy"  yaml file apiVersion: 
  1. removed unnecessary flags for EFA, these are auto-set in OFI plugin now.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -39,10 +39,6 @@ spec:
- -x
- FI_PROVIDER=efa
- -x
- FI_EFA_USE_DEVICE_RDMA=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep those variable to be consistent with slurm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roywei do you plan to update this PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mhuguesaws thanks for the review and sorry for late reply. I think these flags are not needed anymore per OFI docs, no matter if this is on slurm or k8s. Also if you set FI_EFA_USE_DEVICE_RDMA on some old p3 instances with new OFI and libfabric version if may cause crash. See here https://github.com/aws/aws-ofi-nccl/blob/master/doc/efa-env-var.md

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is specific to p5. Please keep as it is or make it consistent in Slurm as well. Thank you

Copy link
Contributor

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments

Copy link
Author

@roywei roywei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if i missed anything about the flags. I can also update the slurm nccl-tests flags if needed

@@ -39,10 +39,6 @@ spec:
- -x
- FI_PROVIDER=efa
- -x
- FI_EFA_USE_DEVICE_RDMA=1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mhuguesaws thanks for the review and sorry for late reply. I think these flags are not needed anymore per OFI docs, no matter if this is on slurm or k8s. Also if you set FI_EFA_USE_DEVICE_RDMA on some old p3 instances with new OFI and libfabric version if may cause crash. See here https://github.com/aws/aws-ofi-nccl/blob/master/doc/efa-env-var.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants