
ALBEDO FESOM2 writing 3d output stream issue #396

Open
patrickscholz opened this issue Dec 15, 2022 · 18 comments
Labels: bug (Something isn't working)

@patrickscholz
Contributor

There seems to be an issue on ALBEDO when writing 3D output variables! The 3D output does seem to get written to file; at least the written 3D files are of full size:

-rw-r--r-- 1 pscholz hpc_user 574M Dec 15 17:17 u.fesom.1958.nc

...but afterwards the stream seems to get stuck at...

initializing I/O file for u
associating mean I/O file /albedo/work/user/pscholz/results/test_dart_linfs_pc0/chain/u.fesom.1958.nc
u: current mean I/O counter =            1
writing mean record for u; rec. count =            1
run: Job step aborted: Waiting up to 62 seconds for job step to finish.
forrtl: error (78): process killed (SIGTERM)

This only happens when defining output streams for 3D variables; as long as I only define 2D variables, the model output seems to be fine.

  • NEC-approved compiler settings on albedo (a condensed build sketch combining these settings follows after this list):
-march=core-avx2 -O3 -ip -fPIC -qopt-malloc-options=2 -qopt-prefetch=5 -unroll-aggressive
  • Modules and environment settings for albedo are:
# make the contents as shell agnostic as possible so we can include them with bash, zsh and others
module load intel-oneapi-compilers 
export FC="mpiifort -qmkl" CC=mpiicc CXX=mpiicpc
module load intel-oneapi-mpi/2021.6.0
module load intel-oneapi-mkl/2022.1.0
module load netcdf-fortran/4.5.4-intel-oneapi-mpi2021.6.0-oneapi2022.1.0
module load netcdf-c/4.8.1-intel-oneapi-mpi2021.6.0-oneapi2022.1.0
# from DKRZ-recommended environment variables on levante
# (https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html) 
export HCOLL_ENABLE_MCAST_ALL="0"
export HCOLL_MAIN_IB=mlx5_0:1
export UCX_IB_ADDR_TYPE=ib_global
export UCX_NET_DEVICES=mlx5_0:1
export UCX_TLS=mm,knem,cma,dc_mlx5,dc_x,self # this line brings the biggest speedup, factor ~1.5
export UCX_UNIFIED_MODE=y
export UCX_HANDLE_ERRORS=bt
export HDF5_USE_FILE_LOCKING=FALSE
export I_MPI_PMI=pmi2
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
  • With these compiler options and environment variables we reach the same performance on albedo (neglecting the output) as on levante.

  • Things compile fine; no error message can be triggered!

  • I tried with and without asynchronous I/O (DISABLE_MULTITHREADING ON/OFF); both times the same problem.
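
For completeness, here is a condensed sketch of how these pieces might fit together when building on albedo (file names and steps are illustrative only; the actual environment file and scripts live in the refactoring_albedo_env branch):

source env/albedo/shell            # module loads and UCX/HCOLL exports listed above
export FC="mpiifort -qmkl" CC=mpiicc CXX=mpiicpc
mkdir -p build && cd build
cmake ..                           # compiler flags (-march=core-avx2 -O3 ...) come from the FESOM2 CMake configuration
make -j 8                          # builds the fesom.x executable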

Does anybody (@hegish) have an idea what the problem could be?

patrickscholz added the bug (Something isn't working) label on Dec 15, 2022
@patrickscholz
Contributor Author

OK?! It also seems to be mesh dependent:
3D output streams work for the core mesh (127K) but not for the much bigger dart mesh (3.1M, partitioned for 48 nodes, 127 CPUs).

@pgierz
Member

pgierz commented Dec 15, 2022

Hi Patrick, I'll look into this. Do you have a template compile script somewhere if I need to play around with it?

@pgierz
Member

pgierz commented Dec 15, 2022

Never mind, it's already posted. Sorry for not seeing that ;-)

@patrickscholz
Contributor Author

patrickscholz commented Dec 15, 2022

You can basically use the branch refactoring_albedo_env; there I played around with compiler options, environment variables, albedo job scripts, etc. to make FESOM2 work on albedo.

@pgierz
Member

pgierz commented Dec 16, 2022

Are you sure everything is checked in, Patrick? I'd like to directly reproduce your problem on my account, but your job script links in a namelist.config with paths pointing to ollie. I can modify it to work, but I would need to know where you stored your mesh ;-)

@patrickscholz
Contributor Author

Ahh, no, I did not edit the default namelists for albedo!!! But you can take them directly from my albedo directory /albedo/home/pscholz/fesom2_refactoring/work_dart_pc0. I only edited the environment files and the job scripts for albedo use.

@pgierz
Member

pgierz commented Dec 16, 2022

OK, I found the log files through slurm. Can you please try this also using -g -traceback? It seems as if your MPI crashes, but I cannot see any other interesting info yet. I will keep looking:

From /albedo/home/pscholz/fesom2_refactoring/work_dart_pc0/fesom2_dart6096_test_srelax:0_754731.out

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
fesom.x            000000000069D0FB  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  0000155548ECACE0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   0000155549D85FAB  Unknown               Unknown  Unknown
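
For reference, a minimal sketch of how the flags could be passed when reconfiguring FESOM (illustrative only; the exact mechanism depends on how the build picks up Fortran flags, and the Unknown frames above are in libmpi, which would need its own debug build):

cmake .. -DCMAKE_Fortran_FLAGS="-g -traceback"
make -j 8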

@pgierz
Member

pgierz commented Dec 16, 2022

At first glance it also seems to point to an out-of-memory or invalid memory access error...? Especially if it works in the smaller case but not in the larger one.

@patrickscholz
Contributor Author

OK, that's different. I always used the full debug options -g -traceback -check all,noarg_temp_created,bounds,uninit; with those it gets stuck at a much earlier point, which I think is not related to this problem, because here it only gets stuck after the data have already been written to file ...

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable AUX_R4 when it is not allocated

Image              PC                Routine            Line        Source             
fesom.x            000000000266060F  Unknown               Unknown  Unknown
fesom.x            00000000014EC4E8  io_meandata_mp_wr         818  io_meandata.F90
fesom.x            00000000014F54C4  io_meandata_mp_do        1034  io_meandata.F90
fesom.x            0000000000490D92  async_threads_exe         104  async_threads_module.F90
fesom.x            0000000000490663  async_threads_mod          74  async_threads_module.F90
fesom.x            00000000014F503B  io_meandata_mp_ou        1014  io_meandata.F90
fesom.x            00000000005EECCC  fesom_module_mp_f         378  fesom_module.F90
fesom.x            0000000002618A89  MAIN__                     15  fesom_main.F90

@patrickscholz
Contributor Author

patrickscholz commented Dec 16, 2022

@pgierz I just tried with only -g -traceback; I can't trigger any error message. Is the error message you found simply from exceeding the wall clock time limit and the node killing everything off?

@pgierz
Member

pgierz commented Dec 16, 2022

No, I was just trying to get to the root of these Unknown frames, but of course that means I would need to compile an MPI library with traceback, not just FESOM, so I was too quick with my comment in that case...

@trackow
Contributor

trackow commented Dec 16, 2022

The symptoms remind me a bit of #173, but it's probably not related? Output was either hanging or super slow in my original runs. @hegish will know.

@pgierz
Member

pgierz commented Dec 16, 2022

@patrickscholz: Then maybe it is indeed some kind of allocation error. So if I understand correctly:

Case 1: you compile with full debug flags for FESOM, and then get a complaint:

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable AUX_R4 when it is not allocated

Here's the variable it is complaining about:

real(real32), allocatable :: aux_r4(:)

It also seems to be allocated here:

if(.not. allocated(entry%aux_r4)) allocate(entry%aux_r4(size2))

Case 2: you compile with NEC settings, and then run into a segmentation fault.

Maybe something is messing up in the allocation if you have a lot more MPI tasks?

@patrickscholz
Contributor Author

I'm not sure ... this allocation should happen before writing, not afterwards, and the data are already stored in the file when it hangs. I have the impression this is another issue.

@patrickscholz
Contributor Author

patrickscholz commented Dec 20, 2022

@hegish & @pgierz I have further narrowed down the issue; it is here:

call assert_nf( nf_put_vara_real(entry%ncid, entry%varID, (/1, lev, entry%rec_count/), (/size2, 1, 1/), entry%aux_r4, 1), __LINE__)

fesom2/src/io_meandata.F90, lines 808 to 830 at 0d7d80d:

else if (entry%accuracy == i_real4) then
   if(entry%p_partit%mype==entry%root_rank) then
      if(.not. allocated(entry%aux_r4)) allocate(entry%aux_r4(size2))
   end if
   do lev=1, size1
#ifdef ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS
      ! aleph cray-mpich workaround
      call MPI_Barrier(entry%comm, mpierr)
#endif
      if(.not. entry%is_elem_based) then
         call gather_real4_nod2D (entry%local_values_r4_copy(lev,1:size(entry%local_values_r4_copy,dim=2)), entry%aux_r4, entry%root_rank, tag, entry%comm, entry%p_partit)
      else
         call gather_real4_elem2D(entry%local_values_r4_copy(lev,1:size(entry%local_values_r4_copy,dim=2)), entry%aux_r4, entry%root_rank, tag, entry%comm, entry%p_partit)
      end if
      if (entry%p_partit%mype==entry%root_rank) then
         if (entry%ndim==1) then
            call assert_nf( nf_put_vara_real(entry%ncid, entry%varID, (/1, entry%rec_count/), (/size2, 1/), entry%aux_r4, 1), __LINE__)
         elseif (entry%ndim==2) then
            call assert_nf( nf_put_vara_real(entry%ncid, entry%varID, (/1, lev, entry%rec_count/), (/size2, 1, 1/), entry%aux_r4, 1), __LINE__)
         end if
      end if
   end do
end if

Every time the model writes a 2D slice of the 3D data via call assert_nf(...), it takes longer and longer to write that slice, until it looks like the model hangs. No idea what the exact cause is ...

u: current mean I/O counter =            1
 writing mean record for u; rec. count =            1
  --> call update_atm_forcing(n)
  --> call ice_timestep(n)
      --> call EVPdynamics...
                                     root_rank       lvl   time(sec)
  -I/O-> after nf_put_vara_real        3810           1  0.905330052581121     
  -I/O-> after nf_put_vara_real        3810           2   2.01729261684522     
  -I/O-> after nf_put_vara_real        3810           3   3.03888718470262     
  -I/O-> after nf_put_vara_real        3810           4   4.41701754259338     
  -I/O-> after nf_put_vara_real        3810           5   6.05037017439099     
  -I/O-> after nf_put_vara_real        3810           6   7.84644225670127     
  -I/O-> after nf_put_vara_real        3810           7   9.91847573239647     
  -I/O-> after nf_put_vara_real        3810           8   12.1049769922247     
  -I/O-> after nf_put_vara_real        3810           9   14.2242100060575     
  -I/O-> after nf_put_vara_real        3810          10   16.4367477671913     

... But it looks like this problem can be solved by using Jan's ALEPH workaround on ALBEDO as well:

fesom2/src/io_meandata.F90, lines 813 to 816 at 0d7d80d:

#ifdef ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS
! aleph cray-mpich workaround
call MPI_Barrier(entry%comm, mpierr)
#endif

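One way to try this on albedo might be to define the macro at configure time (hypothetical invocation; FESOM2 may instead expose a dedicated CMake switch for the workaround):

cmake .. -DENABLE_ALEPH_CRAYMPICH_WORKAROUNDS=ON   # only effective if such a CMake option exists
# otherwise, adding -DENABLE_ALEPH_CRAYMPICH_WORKAROUNDS to the Fortran preprocessor flags should have the same effect
make -j 8
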
When applying Jan's ALEPH workaround, the writing time for each 2D data slice drops to ...

u: current mean I/O counter =            1
 writing mean record for u; rec. count =            1
  -I/O-> after nf_put_vara_real        3810           1  0.101245949332224     
  -I/O-> after nf_put_vara_real        3810           2  0.143262744899403     
  -I/O-> after nf_put_vara_real        3810           3  0.142665563165792     
  -I/O-> after nf_put_vara_real        3810           4  0.142954538841877     
  -I/O-> after nf_put_vara_real        3810           5  0.144681202767970     
  -I/O-> after nf_put_vara_real        3810           6  0.144374006733415     
  -I/O-> after nf_put_vara_real        3810           7  0.143825595958333     
  -I/O-> after nf_put_vara_real        3810           8  0.145763394029927     
  -I/O-> after nf_put_vara_real        3810           9  0.143234394341562     
  -I/O-> after nf_put_vara_real        3810          10  0.171776635153947

... and the model finishes properly!!!

@hegish
Collaborator

hegish commented Dec 23, 2022

If ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS solves things, then there is an issue with MPI on Albedo. Which output precision do you set in namelist.io?
And how can I get access to Albedo?

@patrickscholz
Contributor Author

@hegish I played around with single-precision output. For granting access to albedo, I think Malte Thoma and maybe Paul Gierz are responsible, but I'm not sure they will be available over the holidays.
Best regards,
Patrick

@hegish
Collaborator

hegish commented Dec 23, 2022

The weird thing on Aleph was: the slowdown per level was much worse for single-precision (real4) output; real8 was much faster.
