
Unweighted Graphs Error with RBCs on ARCHER2 #815

Open
c-denham opened this issue Mar 1, 2024 · 7 comments
c-denham commented Mar 1, 2024

Hello,

I have run a test case with RBCs on ARCHER2 and have copied the slurm.out below.

![0.0s]Reading configuration from /work/e283/e283/cd3nham/config_files/rbc_tests/test_clipped.xml
![0.0s]RBC insertion random seed: 0x17b81088669b7379
![0.0s]Krueger format meshes are deprecated, move to VTK when you can.
![0.0s]Beginning Initialisation.
![0.0s]Loading and decomposing geometry file /work/e283/e283/cd3nham/config_files/rbc_tests/test_clipped.gmy.
![0.0s]Opened config file /work/e283/e283/cd3nham/config_files/rbc_tests/test_clipped.gmy
![0.1s]Creating block-level octree
![0.1s]Beginning initial decomposition
![0.9s]Optimising the domain decomposition.
![4.1s]Initialising domain.
![4.1s]Processing sites assigned to each MPI process
![4.3s]Assigning local indices to sites and associated data
![4.3s]Initialising neighbour lookups
![5.0s]Initialising field data.
![5.0s]Initialising neighbouring data manager.
![5.0s]Initialising LBM.
![5.0s]Initialising RBCs.
![5.0s]Krueger format meshes are deprecated, move to VTK when you can.
![5.0s]Computing which ranks are within a cell's size
![160.2s]Checking the neighbourhoods are self-consistent
![316.0s]Create the graph communicator
![316.0s]Creating coordinate to rank map
![317.2s]Beginning to run simulation.
[Rank 0000000, 317.2s, mem: 0052484]: Only support unweighted graphs
MPICH ERROR [Rank 0] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[Rank 0000001, 317.2s, mem: 0058032]: Only support unweighted graphs
MPICH ERROR [Rank 1] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
MPICH ERROR [Rank 2] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[Rank 0000003, 2.4s, mem: 0031436]: ParMetis cut 15884 edges.
[Rank 0000003, 317.2s, mem: 0050612]: Only support unweighted graphs
MPICH ERROR [Rank 3] [job id 5754290.0] [Wed Feb 28 15:35:53 2024] [nid003862] - Abort(-1) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[Rank 0000002, 317.2s, mem: 0061888]: Only support unweighted graphs
srun: error: nid003862: tasks 0-3: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=5754290.0

I do not have a copy of the output from when we compiled it last week, @rupertnash, but I recall the unweighted graphs error appearing when we compiled, though not during the fluid-only test case.
Many thanks in advance for your advice.

@mobernabeu
Contributor

It looks as though MpiCommunicator::DistGraphAdjacent is creating an unweighted graph, but when the graph is then queried in MpiCommunicator::GetNeighborsCount, it reports itself as weighted.

This may be an MPI implementation issue, or we may not be using the creation interface correctly. It needs further investigation.

mobernabeu commented Mar 15, 2024

A bit more digging shows that this (otherwise very sensible) check for weighted/unweighted graphs only appeared when we moved from graph to distributed graph communicators: 9689943

I don't see what we could be doing wrong in the distributed graph creation, so my suggestion would be to refine the logic of the check: assert that all weights are equal when the graph wrongly believes itself to be weighted (an implementation issue?). Any thoughts @rupertnash ?

@rupertnash
Member

Can you put the case in the shared folder (/work/e283/e283/shared) so I can reproduce?

I'd like to run it under a debugger to investigate, as this may be a bug in the MPI library (the standard is clear on what should happen: "false if MPI_UNWEIGHTED was supplied during creation, true otherwise").

@c-denham
Author

I have copied the config_files folder over to the shared space; it should include everything you need to reproduce, @rupertnash.

@mobernabeu
Contributor

Hi @rupertnash, I was trying to help @c-denham make some progress on this issue by investigating whether we can use an alternative MPI implementation potentially available on ARCHER2 (e.g. Open MPI). Looking through module avail doesn't show anything obvious. Do you know if there is one?

I also tried swapping the default programming environment from gnu to PrgEnv-cray or PrgEnv-aocc in the hope that other versions of MPICH might be built there, but that seems broken at the system level currently.

@rupertnash
Member

So I have investigated and think this is likely a bug in the MPI library. I've reported it to the Helpdesk, who have passed it on to HPE's MPICH team. They have reproduced the behaviour in HemeLB and are trying to understand the problem.

I did not trigger the bug when running on a larger number of processors, however, so maybe try that? Disabling the check is probably OK, although if the communicators have been corrupted somehow (whether internally or by HemeLB) then things may go wrong later...

@mobernabeu
Contributor

Thanks for investigating further, @rupertnash. We are going to try running with a larger core count. How many did you go for?
