multinode python: Legion error 67 alongside NCCL errors. #1480

Closed
stelleg opened this issue Aug 30, 2024 · 5 comments · Fixed by #1509
Labels
inference: Features and fixes related to the inference project.

Comments

@stelleg
Collaborator

stelleg commented Aug 30, 2024

Trying to run a number of the python examples (at least the mnist and cifar examples) results in the following error:

[0 - 400036f6f840]    3.522337 {5}{runtime}: [error 67] LEGION ERROR: Mapper FlexFlow Mapper selected a concurrent variant 2 for point task unnamed_task_149 (UID 128) of a concurrent task launch but selected a different concurrent variant 1 for a different point task. All point tasks in a concurrent index task launch must use the same concurrent task variant. (from file /vast/home/stelleg/hpai/flexflow/deps/legion/runtime/legion/legion_tasks.cc:7506)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_67

*** Caught a fatal signal (proc 0): SIGABRT(6)
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
WARNING: ODP shutdown in signal context
Failed, NCCL error /vast/home/stelleg/hpai/flexflow/src/runtime/model.cc:604 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
../../../build-aarch64/flexflow_python: line 17: 575895 Aborted                 $BUILD_FOLDER/deps/legion/bin/legion_python "${legion_python_args[@]}"

The examples run fine on a single node.

This is being run on two Grace Hopper nodes with a gasnetex+ibv network.
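For context, a sketch of how such a build might be configured, assuming FlexFlow's usual config/config.linux knobs; the variable names and values here are assumptions, not taken from this report:

    # hypothetical: configure FlexFlow for a GASNet + InfiniBand Verbs build
    export FF_LEGION_NETWORKS=gasnet   # assumed knob: build Legion with GASNet networking
    export FF_GASNET_CONDUIT=ibv       # assumed knob: InfiniBand Verbs conduit
    ./config/config.linux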

I've attached a log captured with GASNET_BACKTRACE=1 and NCCL_DEBUG=INFO set:
mnist_cnn.log
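For anyone reproducing this, a minimal sketch of such a rerun with the diagnostics the runtime asks for, assuming an MPI-style launcher; the launcher, example path, and Legion flags below are illustrative, not the exact command used:

    # illustrative two-node rerun with debug output enabled
    export GASNET_BACKTRACE=1   # makes GASNet print a backtrace on the SIGABRT
    export NCCL_DEBUG=INFO      # expands NCCL's "unhandled system error" message
    mpirun -n 2 $BUILD_FOLDER/flexflow_python examples/python/keras/mnist_cnn.py \
      -ll:gpu 1 -ll:fsize 14000 -ll:zsize 8192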

@stelleg
Collaborator Author

stelleg commented Aug 30, 2024

I've also run some of the C++ examples just fine on both nodes (e.g. resnet and resnext50), so I expect the problem is specific to the Python interface.
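For comparison, the successful C++ runs look roughly like this; the binary name and flags are illustrative:

    # illustrative: the C++ ResNet example completes on both nodes with the same launcher
    mpirun -n 2 ./resnet -ll:gpu 1 -ll:fsize 14000 -ll:zsize 8192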

@lockshaw
Collaborator

lockshaw commented Sep 1, 2024

Which branch is this on?

@stelleg added the inference label Sep 3, 2024
@stelleg
Collaborator Author

stelleg commented Sep 3, 2024

inference

@lockshaw
Collaborator

lockshaw commented Sep 3, 2024

@goliaro @suranap @jiazhihao Thoughts?

@jiazhihao
Collaborator

Based on the log, it seems the issue occurs during model compilation (before training starts). We are looking into it now.
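For anyone triaging from the Python side, the failure point corresponds to the compile step in the Keras-style examples, before any training call runs. A paraphrased sketch of that structure, assuming the flexflow.keras front end; the layer arguments and data loading are approximations, not the exact example source:

    # paraphrased Keras-style FlexFlow example; details approximate
    from flexflow.keras.models import Sequential
    from flexflow.keras.layers import Conv2D, Flatten, Dense
    from flexflow.keras.optimizers import SGD

    model = Sequential()
    model.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1),
                     activation="relu", input_shape=(1, 28, 28)))
    model.add(Flatten())
    model.add(Dense(10, activation="softmax"))

    # per the log, the multinode abort happens around here, during compilation ...
    model.compile(optimizer=SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # ... so the training call below is never reached in the failing runs
    # (x_train/y_train: MNIST arrays, loading omitted in this sketch)
    model.fit(x_train, y_train, epochs=1)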
