Performance degradation over time on Frontier #1218

Open
5 tasks
pgrete opened this issue Dec 23, 2024 · 1 comment
Labels
bug Something isn't working performance

Comments

@pgrete
Collaborator

pgrete commented Dec 23, 2024

Here's a representative performance plot from AthenaPK on Frontier (1068 nodes) for an SMR simulation (static mesh refinement, thus constant workload per cycle), showing wall seconds per cycle versus cycle.

[Figure: wall seconds per cycle versus cycle]

Restarting within a single job "resets" the performance, so this is not a general problem with individual node performance.
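To put a number on the slowdown within one run segment (and to confirm that a restart brings it back to about 1), something like the following could work. This is only a sketch: the helper name `slowdown_ratio` and the two-column input format (`cycle wall_seconds`, one line per cycle) are assumptions, so the timings would first have to be extracted from the actual log into that form.

```shell
# slowdown_ratio FILE [N]
# Assumed input format (hypothetical): two whitespace-separated columns,
# "cycle wall_seconds", one line per cycle.
# Prints mean(last N cycles) / mean(first N cycles).
# A value near 1 means no degradation over the run segment.
slowdown_ratio() {
  awk -v n="${2:-100}" '
    NF >= 2 { t[++count] = $2 }          # collect wall seconds per cycle
    END {
      if (count < 2 * n) {
        print "need at least " (2 * n) " samples" > "/dev/stderr"
        exit 1
      }
      for (i = 1; i <= n; i++) {
        head += t[i]                     # first N cycles
        tail += t[count - n + i]         # last N cycles
      }
      printf "%.2f\n", tail / head       # equal window sizes, so sums suffice
    }' "$1"
}
```

Running it separately on the cycles before and after a restart would show whether the ratio really drops back after the restart.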

I'm reporting here because (as discussed during sync) @bprather has also seen this (on smaller scales, like 8 nodes), so it's likely related to Parthenon and/or Frontier and not AthenaPK itself.

For reference:
AthenaPK Dec24 is f8497c5 (with associated submodule)
AthenaPK Oct23 is 3ce0a88

Stack Mid24 is

module restore
module load cpe/24.07 PrgEnv-amd cray-mpich/8.1.30 craype-accel-amd-gfx90a amd/6.2.0 rocm/6.2.0
module load cmake cray-python
module unload darshan-runtime

export LD_LIBRARY_PATH=${CRAY_LD_LIBRARY_PATH}:${LD_LIBRARY_PATH}
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_IPC_CACHE_MAX_SIZE=1000
export MPICH_MPIIO_HINTS="*:romio_cb_write=disable"
export FI_MR_CACHE_MONITOR=disable

Stack Dec24 is

module load PrgEnv-cray
module load cpe/23.12
module load craype-accel-amd-gfx90a
module load rocm/5.7.1
module load cray-mpich
module load cce/17.0.0
module load cray-hdf5-parallel/1.14.3.1

export MPICH_GPU_SUPPORT_ENABLED=1
export FI_CXI_RX_MATCH_MODE=software
export MPICH_MPIIO_HINTS="*:romio_cb_write=disable" 
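
To narrow down the environment variable angle, the two setups can be compared mechanically. A minimal sketch, assuming each stack listing is saved to its own file (the file names `mid24.sh` and `dec24.sh` and both helper names are hypothetical):

```shell
# exported_names FILE
# Print the sorted, unique names of variables set via "export NAME=..." in FILE.
exported_names() {
  sed -n 's/^[[:space:]]*export[[:space:]]\{1,\}\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u
}

# env_only_in A B
# Print the variable names exported in file A but not in file B.
env_only_in() {
  names_a=$(mktemp)
  names_b=$(mktemp)
  exported_names "$1" > "$names_a"
  exported_names "$2" > "$names_b"
  comm -23 "$names_a" "$names_b"       # lines unique to the first file
  rm -f "$names_a" "$names_b"
}

# Example (hypothetical file names):
#   env_only_in mid24.sh dec24.sh
#   env_only_in dec24.sh mid24.sh
```

For the two listings above this would single out `FI_MR_CACHE_MONITOR`, `LD_LIBRARY_PATH` and `MPICH_GPU_IPC_CACHE_MAX_SIZE` as set only in Mid24, and `FI_CXI_RX_MATCH_MODE` as set only in Dec24. It only compares names, not values or loaded modules.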

Next year todos:

  • smaller reproducer (ideally in Parthenon)
  • check if this is tied to
    • environment vars
    • cray versus amd module
  • potentially bisect Parthenon between known good and current version
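
For the bisect item, `git bisect run` needs a script whose exit code classifies each commit: 0 means good, 125 means skip, and any other code from 1 to 127 means bad. A rough sketch of just that decision step follows. The helper name `bisect_verdict`, the slowdown ratio as its input, and the threshold are all assumptions, and building and running a reproducer for each commit is not shown.

```shell
# bisect_verdict RATIO THRESHOLD
# Map a measured slowdown ratio (late cycles / early cycles) to the exit
# codes understood by "git bisect run":
#   0   -> good (ratio at or below threshold)
#   1   -> bad  (ratio above threshold)
#   125 -> skip (no measurement, e.g. the build or run failed)
bisect_verdict() {
  [ -n "$1" ] || return 125
  awk -v r="$1" -v t="$2" 'BEGIN { exit ((r + 0 > t + 0) ? 1 : 0) }'
}

# Intended use inside a per-commit script (refs and script name are
# placeholders, not known values):
#   git bisect start <bad-ref> <good-ref>
#   git bisect run ./bisect_step.sh
```

Since the degradation only shows up over many cycles, each bisect step would need a run long enough for the ratio to be meaningful, which is another reason a smaller reproducer should come first.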
@pgrete pgrete added bug Something isn't working performance labels Dec 23, 2024
@BenWibking
Collaborator

Does it reproduce with the older cray-mpich and the new binary?
