
ALBEDO FESOM2 writing 3d output stream issue #396

Open
patrickscholz opened this issue Dec 15, 2022 · 18 comments
Labels: bug (Something isn't working)

@patrickscholz
Contributor

There seems to be an issue on ALBEDO when writing 3D output variables! The 3D output does seem to get written to file; at least the written 3D files are of full size:

-rw-r--r-- 1 pscholz hpc_user 574M Dec 15 17:17 u.fesom.1958.nc

...but afterwards the stream seems to get stuck at...

initializing I/O file for u
associating mean I/O file /albedo/work/user/pscholz/results/test_dart_linfs_pc0/chain/u.fesom.1958.nc
u: current mean I/O counter =            1
writing mean record for u; rec. count =            1
run: Job step aborted: Waiting up to 62 seconds for job step to finish.
forrtl: error (78): process killed (SIGTERM)

This only happens when defining output streams for 3D variables; as long as I only define 2D variables, the model output seems to be fine.

  • NEC-approved compiler settings on albedo (a condensed build sketch combining these settings follows after this list):
-march=core-avx2 -O3 -ip -fPIC -qopt-malloc-options=2 -qopt-prefetch=5 -unroll-aggressive
  • Modules and environment settings for albedo are:
# make the contents as shell agnostic as possible so we can include them with bash, zsh and others
module load intel-oneapi-compilers 
export FC="mpiifort -qmkl" CC=mpiicc CXX=mpiicpc
module load intel-oneapi-mpi/2021.6.0
module load intel-oneapi-mkl/2022.1.0
module load netcdf-fortran/4.5.4-intel-oneapi-mpi2021.6.0-oneapi2022.1.0
module load netcdf-c/4.8.1-intel-oneapi-mpi2021.6.0-oneapi2022.1.0
# from DKRZ-recommended environment variables on levante
# (https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html) 
export HCOLL_ENABLE_MCAST_ALL="0"
export HCOLL_MAIN_IB=mlx5_0:1
export UCX_IB_ADDR_TYPE=ib_global
export UCX_NET_DEVICES=mlx5_0:1
export UCX_TLS=mm,knem,cma,dc_mlx5,dc_x,self # this line brings the biggest speedup, factor ~1.5
export UCX_UNIFIED_MODE=y
export UCX_HANDLE_ERRORS=bt
export HDF5_USE_FILE_LOCKING=FALSE
export I_MPI_PMI=pmi2
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
  • With these compiler options and environment variables we reach the same performance on albedo (neglecting the output) as on levante.

  • Things compile fine; no error message can be triggered!

  • I tried with and without asynchronous I/O (DISABLE_MULTITHREADING ON/OFF); both times the same problem.
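
For completeness, here is a condensed sketch of how these pieces might fit together when building on albedo (file names and steps are illustrative only; the actual environment file and scripts live in the refactoring_albedo_env branch):

source env/albedo/shell            # module loads and UCX/HCOLL exports listed above
export FC="mpiifort -qmkl" CC=mpiicc CXX=mpiicpc
mkdir -p build && cd build
cmake ..                           # compiler flags (-march=core-avx2 -O3 ...) come from the FESOM2 CMake configuration
make -j 8                          # builds the fesom.x executable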

Does anybody (@hegish) have an idea what the problem could be?

patrickscholz added the bug (Something isn't working) label on Dec 15, 2022
@patrickscholz
Contributor Author

OK?! It also seems to be mesh dependent:
3D output streams work for the core mesh (127K) but not for the much bigger dart mesh (3.1M, partitioned for 48 nodes, 127 CPUs).

@pgierz
Member

pgierz commented Dec 15, 2022

Hi Patrick, I'll look into this. Do you have a template compile script somewhere if I need to play around with it?

@pgierz
Member

pgierz commented Dec 15, 2022

Never mind, it's already posted. Sorry for not seeing that ;-)

@patrickscholz
Contributor Author

patrickscholz commented Dec 15, 2022

You can basically use the branch refactoring_albedo_env; there I played around with compiler options, environment variables, albedo job scripts, etc. to make FESOM2 work on albedo.

@pgierz
Member

pgierz commented Dec 16, 2022

Are you sure everything is checked in, Patrick? I'd like to directly reproduce your problem on my account, but your job script links in a namelist.config with paths pointing to ollie. I can modify it to work, but I would need to know where you stored your mesh ;-)

@patrickscholz
Contributor Author

Ahh, no, I did not edit the default namelists for albedo!!! But you can take them directly from my albedo directory /albedo/home/pscholz/fesom2_refactoring/work_dart_pc0. I only edited the environment files and the job scripts for albedo use.

@pgierz
Member

pgierz commented Dec 16, 2022

OK, I found the log files through slurm. Can you please try this also using -g -traceback? It seems as if your MPI crashes, but I cannot see any other interesting info yet. I will keep looking:

From /albedo/home/pscholz/fesom2_refactoring/work_dart_pc0/fesom2_dart6096_test_srelax:0_754731.out

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
fesom.x            000000000069D0FB  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  0000155548ECACE0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   0000155549D85FAB  Unknown               Unknown  Unknown
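
For reference, a minimal sketch of how the flags could be passed when reconfiguring FESOM (illustrative only; the exact mechanism depends on how the build picks up Fortran flags, and the Unknown frames above are in libmpi, which would need its own debug build):

cmake .. -DCMAKE_Fortran_FLAGS="-g -traceback"
make -j 8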

@pgierz
Member

pgierz commented Dec 16, 2022

At first glance it also seems to point to an out-of-memory or invalid memory access error...? Especially if it works in the smaller case but not in the larger one.

@patrickscholz
Contributor Author

OK, that's different. I always used the full debug options -g -traceback -check all,noarg_temp_created,bounds,uninit; with those it gets stuck at a much earlier point, which I think is not related to this problem, because here it only gets stuck after the data have already been written to file ...

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable AUX_R4 when it is not allocated

Image              PC                Routine            Line        Source             
fesom.x            000000000266060F  Unknown               Unknown  Unknown
fesom.x            00000000014EC4E8  io_meandata_mp_wr         818  io_meandata.F90
fesom.x            00000000014F54C4  io_meandata_mp_do        1034  io_meandata.F90
fesom.x            0000000000490D92  async_threads_exe         104  async_threads_module.F90
fesom.x            0000000000490663  async_threads_mod          74  async_threads_module.F90
fesom.x            00000000014F503B  io_meandata_mp_ou        1014  io_meandata.F90
fesom.x            00000000005EECCC  fesom_module_mp_f         378  fesom_module.F90
fesom.x            0000000002618A89  MAIN__                     15  fesom_main.F90

@patrickscholz
Contributor Author

patrickscholz commented Dec 16, 2022

@pgierz I just tried with only -g -traceback; I can't trigger any error message. Is the error message you found simply from exceeding the wall clock time limit and the node killing everything off?

@pgierz
Member

pgierz commented Dec 16, 2022

No, I was just trying to get to the root of these Unknown frames, but of course that means I would need to compile an MPI library with traceback, not just FESOM, so I was too quick with my comment in that case...

@trackow
Contributor

trackow commented Dec 16, 2022

The symptoms remind me a bit of #173, but it's probably not related? Output was either hanging or super slow in my original runs. @hegish will know.

@pgierz
Member

pgierz commented Dec 16, 2022

@patrickscholz: Then maybe it is indeed some kind of allocation error. So if I understand correctly:

Case 1: you compile with full debug flags for FESOM, and then get a complaint:

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable AUX_R4 when it is not allocated

Here's the variable it is complaining about:

real(real32), allocatable :: aux_r4(:)

It also seems to be allocated here:

if(.not. allocated(entry%aux_r4)) allocate(entry%aux_r4(size2))

Case 2: you compile with NEC settings, and then run into a segmentation fault.

Maybe something is messing up in the allocation if you have a lot more MPI tasks?

@patrickscholz
Contributor Author

I'm not sure ... this allocation should happen before writing, not afterwards, and the data are already stored in the file when it hangs. I have the impression this is another issue.

@patrickscholz
Contributor Author

patrickscholz commented Dec 20, 2022

@hegish & @pgierz I have further narrowed down the issue; it is here:

call assert_nf( nf_put_vara_real(entry%ncid, entry%varID, (/1, lev, entry%rec_count/), (/size2, 1, 1/), entry%aux_r4, 1), __LINE__)

fesom2/src/io_meandata.F90, lines 808 to 830 at 0d7d80d:

else if (entry%accuracy == i_real4) then
   if(entry%p_partit%mype==entry%root_rank) then
      if(.not. allocated(entry%aux_r4)) allocate(entry%aux_r4(size2))
   end if
   do lev=1, size1
#ifdef ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS
      ! aleph cray-mpich workaround
      call MPI_Barrier(entry%comm, mpierr)
#endif
      if(.not. entry%is_elem_based) then
         call gather_real4_nod2D (entry%local_values_r4_copy(lev,1:size(entry%local_values_r4_copy,dim=2)), entry%aux_r4, entry%root_rank, tag, entry%comm, entry%p_partit)
      else
         call gather_real4_elem2D(entry%local_values_r4_copy(lev,1:size(entry%local_values_r4_copy,dim=2)), entry%aux_r4, entry%root_rank, tag, entry%comm, entry%p_partit)
      end if
      if (entry%p_partit%mype==entry%root_rank) then
         if (entry%ndim==1) then
            call assert_nf( nf_put_vara_real(entry%ncid, entry%varID, (/1, entry%rec_count/), (/size2, 1/), entry%aux_r4, 1), __LINE__)
         elseif (entry%ndim==2) then
            call assert_nf( nf_put_vara_real(entry%ncid, entry%varID, (/1, lev, entry%rec_count/), (/size2, 1, 1/), entry%aux_r4, 1), __LINE__)
         end if
      end if
   end do
end if

Every time the model writes a 2D slice of the 3D data via call assert_nf(...), it takes longer and longer to write that slice, until it looks like the model hangs. No idea what the exact cause is ...

u: current mean I/O counter =            1
 writing mean record for u; rec. count =            1
  --> call update_atm_forcing(n)
  --> call ice_timestep(n)
      --> call EVPdynamics...
                                     root_rank       lvl   time(sec)
  -I/O-> after nf_put_vara_real        3810           1  0.905330052581121     
  -I/O-> after nf_put_vara_real        3810           2   2.01729261684522     
  -I/O-> after nf_put_vara_real        3810           3   3.03888718470262     
  -I/O-> after nf_put_vara_real        3810           4   4.41701754259338     
  -I/O-> after nf_put_vara_real        3810           5   6.05037017439099     
  -I/O-> after nf_put_vara_real        3810           6   7.84644225670127     
  -I/O-> after nf_put_vara_real        3810           7   9.91847573239647     
  -I/O-> after nf_put_vara_real        3810           8   12.1049769922247     
  -I/O-> after nf_put_vara_real        3810           9   14.2242100060575     
  -I/O-> after nf_put_vara_real        3810          10   16.4367477671913     

... But it looks like this problem can be solved by using Jan's ALEPH workaround on ALBEDO as well:

fesom2/src/io_meandata.F90, lines 813 to 816 at 0d7d80d:

#ifdef ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS
! aleph cray-mpich workaround
call MPI_Barrier(entry%comm, mpierr)
#endif

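One way to try this on albedo might be to define the macro at configure time (hypothetical invocation; FESOM2 may instead expose a dedicated CMake switch for the workaround):

cmake .. -DENABLE_ALEPH_CRAYMPICH_WORKAROUNDS=ON   # only effective if such a CMake option exists
# otherwise, adding -DENABLE_ALEPH_CRAYMPICH_WORKAROUNDS to the Fortran preprocessor flags should have the same effect
make -j 8
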
When applying Jan's ALEPH workaround, the writing time for each 2D data slice drops to ...

u: current mean I/O counter =            1
 writing mean record for u; rec. count =            1
  -I/O-> after nf_put_vara_real        3810           1  0.101245949332224     
  -I/O-> after nf_put_vara_real        3810           2  0.143262744899403     
  -I/O-> after nf_put_vara_real        3810           3  0.142665563165792     
  -I/O-> after nf_put_vara_real        3810           4  0.142954538841877     
  -I/O-> after nf_put_vara_real        3810           5  0.144681202767970     
  -I/O-> after nf_put_vara_real        3810           6  0.144374006733415     
  -I/O-> after nf_put_vara_real        3810           7  0.143825595958333     
  -I/O-> after nf_put_vara_real        3810           8  0.145763394029927     
  -I/O-> after nf_put_vara_real        3810           9  0.143234394341562     
  -I/O-> after nf_put_vara_real        3810          10  0.171776635153947

... and the model finishes properly!!!

@hegish
Collaborator

hegish commented Dec 23, 2022

If ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS solves things, then there is an issue with MPI on Albedo. Which output precision do you set in namelist.io?
And how can I get access to Albedo?

@patrickscholz
Contributor Author

@hegish I played around with single-precision output. For granting access to albedo, I think Malte Thoma and maybe Paul Gierz are responsible, but I'm not sure they will be available over the holidays.
Best regards,
Patrick

@hegish
Collaborator

hegish commented Dec 23, 2022

The weird thing on Aleph was: the slowdown per level was much worse for single-precision (real4) output; real8 was much faster.
