Trying to run a number of the Python examples (at least the mnist and cifar examples) results in the following error:
[0 - 400036f6f840] 3.522337 {5}{runtime}: [error 67] LEGION ERROR: Mapper FlexFlow Mapper selected a concurrent variant 2 for point task unnamed_task_149 (UID 128) of a concurrent task launch but selected a different concurrent variant 1 for a different point task. All point tasks in a concurrent index task launch must use the same concurrent task variant. (from file /vast/home/stelleg/hpai/flexflow/deps/legion/runtime/legion/legion_tasks.cc:7506)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_67
*** Caught a fatal signal (proc 0): SIGABRT(6)
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
WARNING: ODP shutdown in signal context
Failed, NCCL error /vast/home/stelleg/hpai/flexflow/src/runtime/model.cc:604 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
../../../build-aarch64/flexflow_python: line 17: 575895 Aborted $BUILD_FOLDER/deps/legion/bin/legion_python "${legion_python_args[@]}"
The examples run fine on a single node.
This is being run on two Grace Hopper nodes with the gasnetex+ibv network.
I've attached a log captured with GASNET_BACKTRACE=1 and NCCL_DEBUG=INFO: mnist_cnn.log
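For anyone trying to reproduce this, here is a rough sketch of how a log like the attached one could be captured. It only sets the two environment variables named in the NOTICE and NCCL messages above and redirects output to a file. The helper names `build_debug_env` and `run_with_debug_logging` are made up for illustration, and the command to run is whatever launcher invocation you normally use. Whether the variables actually reach the second node depends on how the job launcher forwards the environment, which this sketch does not handle.

```python
import os
import subprocess


def build_debug_env(base=None):
    """Return a copy of the environment with the two debug variables set."""
    env = dict(os.environ if base is None else base)
    env["GASNET_BACKTRACE"] = "1"  # asked for by the GASNet NOTICE in the log
    env["NCCL_DEBUG"] = "INFO"     # asked for by the NCCL error message
    return env


def run_with_debug_logging(cmd, log_path="mnist_cnn.log"):
    """Run cmd (a list of strings) and write stdout and stderr to log_path.

    cmd is a placeholder for your usual launcher command line; nothing
    here is specific to FlexFlow or Legion.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd,
            env=build_debug_env(),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

The return code is passed back so a wrapper script can tell an abort (as in the SIGABRT above) from a clean exit.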
I've also run some of the C++ examples just fine on both nodes, e.g. resnet and resnext50, so I suspect it has something to do with the Python interface.