Using the mpich build 'mpich/20231026/icc-all-pmix-gpu' on sunspot, with darshan and vtune for performance analysis, I am seeing what appears to be very bad performance in the messaging layer for the ROMIO collective buffering aggregation. I am using the HDF5 h5bench exerciser benchmark, which uses collective MPI-IO for the backend. This is on just 1 node, so intra-node communication only. Looking at darshan with 2 ranks, for example, I see:
Time is in seconds. The total MPI-IO time is 0.79 sec, and within that the POSIX (lustre I/O) time is only 0.27 sec to write and 0.10 sec to read (if doing read-modify-write), so the delta is most likely the messaging layer, and with 16 ranks it gets much worse:
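(For the 2-rank numbers above, that delta is roughly 0.79 - (0.27 + 0.10) = ~0.42 sec, i.e. more than half of the total MPI-IO time.)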
So for 16 ranks the share of MPI-IO time spent in the messaging layer is a lot higher. HDF5 is using collective MPI-IO aggregation, so darshan has a POSIX section with the times for the actual lustre filesystem interaction and an MPIIO section whose times include all the messaging plus the POSIX time; taking the delta between them roughly gives the messaging time for the aggregation. With Vtune I can see that almost all of the time for MPI-IO writing (MPI_File_write_at_all) is in ofi. So for 1 node and 16 ranks the question is: out of 37.61 seconds of MPIIO time, only 2.6 seconds are spent writing to lustre, leaving over 35 seconds doing what I presume is MPI communication for the aggregation. To reproduce on sunspot running against lustre (gila):
Start interactive job on 1 node:
qsub -lwalltime=60:00 -lselect=1 -A Aurora_deployment -q workq -I
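The remaining steps are roughly as follows; the exerciser arguments, darshan library path, and darshan log location below are placeholders, not the exact commands used:

# Run the h5bench exerciser under darshan; if darshan is not already linked
# into the MPI stack, preload it (library path is a placeholder).
LD_PRELOAD=/path/to/libdarshan.so mpiexec -n 16 -ppn 16 ./h5bench_exerciser <exerciser args>

# Convert the binary darshan log to a text report (log path is a placeholder).
darshan-parser /path/to/logfile.darshan > darshan-output.txt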
Talked to @pkcoff offline. He's going to test without unsetting the collective tuning file envvars to see if there's any impact on the performance and report back.
@raffenet advised me to NOT unset the collective tuning json vars, so I left them set:

pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir> echo $MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
/soft/restricted/CNDA/updates/mpich/tuning/20230818-1024/CH4_coll_tuning.json
pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir> echo $MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
/soft/restricted/CNDA/updates/mpich/20231026/mpich-ofi-all-icc-default-pmix-gpu-drop20231026/json-files/POSIX_coll_tuning.json
pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir> echo $MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
/soft/restricted/CNDA/updates/mpich/tuning/20230818-1024/MPIR_Coll_tuning.json
But that didn't impact performance at all - I still see a ton of time in what I think is MPIIO aggregation per the darshan results:

POSIX -1 9899278149064962649 POSIX_F_READ_TIME 0.829620 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-2129464776 /lus/gila lustre
POSIX -1 9899278149064962649 POSIX_F_WRITE_TIME 1.672418 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-2129464776 /lus/gila lustre
MPI-IO -1 9899278149064962649 MPIIO_F_WRITE_TIME 36.682667 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-2129464776 /lus/gila lustre
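As a rough way to quantify that gap from the parsed darshan counters above, something like this works (it assumes the standard darshan-parser column layout, with the counter name in field 4 and its value in field 5; the text file name is a placeholder):

# Approximate the aggregation/messaging time as the MPI-IO write time minus
# the POSIX read+write time reported by darshan.
awk '$4 == "POSIX_F_READ_TIME" || $4 == "POSIX_F_WRITE_TIME" { posix += $5 }
     $4 == "MPIIO_F_WRITE_TIME" { mpiio += $5 }
     END { printf "posix: %.2f s  mpiio: %.2f s  delta: %.2f s\n", posix, mpiio, mpiio - posix }' darshan-output.txt

With the counters above that comes out to about 36.68 - (1.67 + 0.83) = ~34.2 seconds that is not accounted for by the lustre I/O itself.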
At the advice of Ken I also unset all these:

unset MPIR_CVAR_ENABLE_GPU
unset MPIR_CVAR_BCAST_POSIX_INTRA_ALGORITHM
unset MPIR_CVAR_ALLREDUCE_POSIX_INTRA_ALGORITHM
unset MPIR_CVAR_BARRIER_POSIX_INTRA_ALGORITHM
unset MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM
But performance was actually a bit worse:

POSIX -1 5703841420005174308 POSIX_F_READ_TIME 0.774989 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-1873931539 /lus/gila lustre
POSIX -1 5703841420005174308 POSIX_F_WRITE_TIME 1.770138 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-1873931539 /lus/gila lustre
MPI-IO -1 5703841420005174308 MPIIO_F_WRITE_TIME 38.300559 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-1873931539 /lus/gila lustre