Run fails with seg fault (invalid memory reference) #528

Closed
tovogt opened this issue Oct 11, 2021 · 34 comments

@tovogt
Contributor

tovogt commented Oct 11, 2021

While tracking down issue #525, I found a GeoClaw setup that crashes due to a segmentation fault on our cluster computer with 16 OpenMP threads, but doesn't crash on my local desktop computer. I run GeoClaw a lot on our cluster and have never had a segmentation fault before. Here is the terminal output: stdout.txt

Have you ever seen something like this?

Maybe it's related to the following AMR settings:

7                    =: amr_levels_max      
2 2 2 2 2 4          =: refinement_ratios_x 

I was adding the seventh AMR level when the error occurred for the first time. When using only 6 levels, the run does not fail.
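
For reference, these values would be set in setrun.py; a minimal sketch of the corresponding lines (the _y and _t ratios are my assumption, since only the x ratios are quoted above):

rundata.amrdata.amr_levels_max = 7
rundata.amrdata.refinement_ratios_x = [2, 2, 2, 2, 2, 4]
rundata.amrdata.refinement_ratios_y = [2, 2, 2, 2, 2, 4]  # assumed equal to x
rundata.amrdata.refinement_ratios_t = [2, 2, 2, 2, 2, 4]  # assumed equal to x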

@mjberger
Contributor

mjberger commented Oct 11, 2021 via email

@mandli
Member

mandli commented Oct 11, 2021

I was thinking that may be a thing to check as well. There's some info in the clawpack docs, although I am not sure that's up to date anymore.

At the very least, if you have the time, I would try a few things:

  1. Using fewer threads. This can test the stack size idea.
  2. Compiling with debugging and stack-tracing turned on. I commonly use -O0 -W -Wall -fbounds-check -fcheck=all -Wunderflow -fbacktrace -ffpe-trap=invalid,zero,overflow -g. You can also keep using OpenMP with this, although the stack trace may get confusing.
  3. Send us the fort.amr file from the _output directory, as it will have some statistics regarding the grids that might be helpful. Also, turning on more verbosity in the refinement and output routines will give you more info about the number of grids and refinement characteristics.

@tovogt
Contributor Author

tovogt commented Oct 13, 2021

As always, thanks a lot for your quick responses! 🙏

  • Reducing the number of threads doesn't change the behavior, even when setting the number to 1. The code fails at the exact same step in the computation, and in this case I get the stack trace in the logs for a single thread. The value of ulimit -s has been unlimited anyway, and increasing the stack size for OpenMP to 32M or even 1G doesn't change the behavior either.
  • Here is the log output with increased verbosity, 8 threads, stacksize set to 1G: stdout.txt fort.amr.txt
  • With debugging and stack-tracing turned on (and only one OpenMP thread to keep the stack trace clean): stdout.txt The error message is now much clearer:
At line 39 of file $CLAW/amrclaw/src/2d/fluxsv.f
Fortran runtime error: Index '0' of dimension 2 of array 'node' below lower bound of 1

Error termination. Backtrace:
#0  0x4955c9 in fluxsv_
	at $CLAW/amrclaw/src/2d/fluxsv.f:39
#1  0x53aa6c in par_advanc_
	at $CLAW/geoclaw/src/2d/shallow/advanc.f:260
#2  0x53cd6c in advanc_._omp_fn.1
	at $CLAW/geoclaw/src/2d/shallow/advanc.f:123
#3  0x2b967165eaae in GOMP_parallel
	at ../.././libgomp/parallel.c:168
#4  0x53b90e in advanc_
	at $CLAW/geoclaw/src/2d/shallow/advanc.f:124
#5  0x52b741 in tick_
	at $CLAW/geoclaw/src/2d/shallow/tick.f:303
#6  0x5440a8 in amr2
	at $CLAW/geoclaw/src/2d/shallow/amr2.f90:646
#7  0x5477aa in main
	at $CLAW/geoclaw/src/2d/shallow/amr2.f90:59

@mjberger
Contributor

mjberger commented Oct 13, 2021 via email

@mandli
Member

mandli commented Oct 13, 2021

That is interesting. At least it's not a parallel bug. Hopefully the Intel compiler can reproduce it as well. I don't suppose you have that compiler handy, @tovogt?

@tovogt
Contributor Author

tovogt commented Oct 13, 2021

This is with a clean checkout of the clawpack master branch and its submodules, commit clawpack/clawpack@bf55916, which is basically version 5.8.0. Only for the geoclaw submodule do I use the most recent master (e8d0bf3) instead of the submodule specification from clawpack (which would be 01c9a8e).

Furthermore, the runs on our cluster are with gfortran 7.3.0 (OpenMP 4.5/201511) and Linux kernel version 4.4. But I have also reproduced this with gfortran 10.2.0 (same version of OpenMP).

Here are the rundata files: gc_segfault_rundata.zip Before running make .output on your machine, make sure to specify the absolute path to your work directory in topo.data and surge.data.

Regarding your question about an Intel compiler: we have ifort version 17.0.0 on our cluster and will try that later.

@tovogt
Contributor Author

tovogt commented Oct 13, 2021

In fact, I can't reproduce this error with the intel compiler (ifort 17.0.0)!

@mandli
Member

mandli commented Oct 13, 2021

Hmmm. Well, we will have to see what might be going on with gfortran then.

@tovogt
Contributor Author

tovogt commented Nov 2, 2021

Since the last activity in this issue, I have mostly continued working with gfortran. But yesterday, I started a whole bunch of runs with Intel's ifort compiler (tested both versions 17 and 19, with ulimit -s unlimited and OMP_STACKSIZE=500M) and found that a lot of those runs were failing due to segmentation faults. Output with the -traceback compiler flag:

forrtl: severe (408): fort: (3): Subscript #2 of the array NODE has value -2 which is less than the lower bound of 1

Apparently, the ifort compiler is also affected by this issue, but it fails at a different point and for different input data. By the way, the same runs fail at a different point in the simulation when run with gfortran (same backtrace as in my post above).

To be honest, I'm a bit puzzled by this: The topography I provide to GeoClaw has a resolution of 30 arc-seconds (~1 km), and the mesh resolution at maximum refinement is 7 arc-seconds (200 m). In the TC literature, in the TC examples, and from what I hear from other GeoClaw users (in a TC context), a lot of them use much higher resolution. How is it possible that I still run into these problems while it apparently works for other people? I even talked to people that use refinement_ratios_x = [2, 2, 2, 6, 8, 8] (which is orders of magnitude higher resolution than what I use) on the same hardware (16 cores with 4 GB memory each). Nobody ever mentioned any issues like the one described here or in #525.
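
For concreteness, the overall linear refinement factor is just the product of the per-level ratios; a quick check in plain Python, using the two ratio lists mentioned above:

from math import prod  # Python 3.8+

mine = prod([2, 2, 2, 2, 2, 4])    # = 128
theirs = prod([2, 2, 2, 6, 8, 8])  # = 3072
print(theirs / mine)               # = 24.0, i.e. 24x finer linearly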

@mandli
Member

mandli commented Nov 2, 2021

My guess is that this is not a problem with topography resolution but perhaps a bug, although the fact that you see it this often is a bit puzzling to me as well. My best guess, without looking into it further with @mjberger, is that there is something about your location that is causing the problem, but who knows. If you can send us your setup as @mjberger suggested, we can hopefully figure out what's going wrong.

@tovogt
Contributor Author

tovogt commented Nov 3, 2021

I described the setup in #528 (comment) and provided the rundata files. Do you need any other information from my side?

@mandli
Member

mandli commented Nov 3, 2021

Sorry, forgot about that. We will try and take a look and see if we can reproduce and debug the problem.

@mjberger
Contributor

mjberger commented Nov 6, 2021 via email

mandli added a commit to mandli/amrclaw that referenced this issue Nov 9, 2021
Added back zeroing out of cfluxptr on coarse grids that was removed in PR clawpack#268. Also correctly typed old_memsize as integer.
@tovogt
Contributor Author

tovogt commented Nov 12, 2021

After stress testing this with a whole batch of jobs, I only have a single run that fails with a segfault (with gfortran). But the error message changed, so it might be completely unrelated to this issue:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2ae8cc720fdf in ???
#1  0x5173ca in gfixup_
	at $CLAW/geoclaw/src/2d/shallow/gfixup.f:234
#2  0x49690d in regrid_
	at $CLAW/amrclaw/src/2d/regrid.f:71
#3  0x50f081 in tick_
	at $CLAW/geoclaw/src/2d/shallow/tick.f:226
#4  0x525304 in amr2
	at $CLAW/geoclaw/src/2d/shallow/amr2.f90:646
#5  0x52873c in main
	at $CLAW/geoclaw/src/2d/shallow/amr2.f90:59
Traceback (most recent call last):
  File "$CLAW/clawutil/src/python/clawutil/runclaw.py", line 249, in runclaw
    proc = subprocess.check_call(cmd_split,
  File "$CONDA_PREFIX/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['$WORKDIR/xgeoclaw']' died with <Signals.SIGFPE: 8>.

Here is the setup: gc_segfpe.zip

Would you prefer to have a separate issue for this or continue to discuss this here?

@mjberger
Contributor

mjberger commented Nov 12, 2021 via email

@tovogt
Contributor Author

tovogt commented Nov 12, 2021

The above error output is with gfortran 7.3.0 and the following flags:

FFLAGS='-fopenmp -O0 -W -Wall -fbounds-check -fcheck=all -Wunderflow -fbacktrace -ffpe-trap=invalid,zero,overflow -g'

I am currently trying to reproduce this with ifort. The process has been running for 7 hours without failing, but it's not finished yet ...

@mandli
Member

mandli commented Nov 12, 2021

I am running it through my setup with checkpoints turned on. I will report back if I can reproduce the problem.

@mandli
Member

mandli commented Nov 12, 2021

Oh, and given the location of the signal, I am somewhat suspicious that this may be a related problem.

@tovogt
Contributor Author

tovogt commented Nov 15, 2021

Unfortunately, I still can't tell whether this also applies to the Intel compiler because the process was cancelled by our cluster after 24 hours due to a time limit. I had to restart with a longer time limit, and that will take a while now...

Note that the original gfortran run took 65 minutes to fail, and 3 hours 45 minutes with debugging flags. I just started another Intel compiler run without debugging flags so that we can see more quickly whether it fails at all.

@mjberger
Contributor

mjberger commented Nov 15, 2021 via email

@mandli
Member

mandli commented Nov 15, 2021

I did finish the run on my laptop; it took nearly 48 hours on an 8-core i9. Looking at the output, it looks like the run is refining everywhere, even after the storm is well inland. One thing that I am somewhat concerned about is that the size of the domain relative to the storm is quite small, which can cause boundary condition problems.

@tovogt
Contributor Author

tovogt commented Nov 15, 2021

When you say "the size of the domain relative to the storm is quite small", what do you take as a rule of thumb to determine an appropriate domain size?

Currently, I select the domain to accommodate the storm_radius, which is orders of magnitude larger than the max_wind_radius storm variable. I assumed that most surge dynamics would be concentrated within that area. Would you go for twice the storm_radius? Or three times, or 10 times, ...?

My experience with larger domain sizes is that they dramatically increase runtimes in most cases, don't change the results in the landfall area of the storm significantly, and sometimes cause GeoClaw to spend a lot of time computing complex interactions of the dynamics with topographic features that are at a large distance from the region of interest.

@mandli
Member

mandli commented Nov 15, 2021

I often run with large domains where the storm is at least 2-3 storm_radius interior to the domain, but lately I have also had storm_radius set to a larger number, as that parameter is often not provided. The key here, though, is to restrict the resolution in most of the domain to be low while keeping the refinement criteria free to refine either only near shore or, even better, only along the track of the storm.
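
For illustration, a minimal sketch of what this can look like in setrun.py using the standard AMR region mechanism; the coordinates, times, and level caps are placeholders, not values from this thread:

# format of each region: [minlevel, maxlevel, t1, t2, x1, x2, y1, y2]
clawdata = rundata.clawdata
regions = rundata.regiondata.regions

# whole domain: cap refinement at a coarse level for all times
regions.append([1, 2, 0.0, 1.0e10,
                clawdata.lower[0], clawdata.upper[0],
                clawdata.lower[1], clawdata.upper[1]])

# narrow window around the storm track / landfall (placeholder box):
# only here is refinement allowed up to the maximum level
regions.append([1, rundata.amrdata.amr_levels_max, 0.0, 1.0e10,
                -75.0, -70.0, 38.0, 42.0])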

@tovogt
Contributor Author

tovogt commented Nov 15, 2021

Thanks, Kyle, this is really very helpful!

In fact, in this case, enlarging the domain to 2.5 storm_radius and enforcing a low resolution outside of a quite narrow window around the storm does help: there is no segmentation fault, and the whole simulation finishes on our cluster in less than 25 minutes (previously, it ran for over an hour until the segmentation fault happened).

Still, it's a bit puzzling that you don't seem to be able to reproduce the latest segfault I reported. 😞

@mandli
Member

mandli commented Nov 15, 2021

Yes, that's still worrying, but I now wonder if it's somehow related to either the length of the run or the amount of memory that the process is requesting.

@mjberger
Contributor

mjberger commented Nov 15, 2021 via email

@tovogt
Contributor Author

tovogt commented Nov 15, 2021

Currently, I use the following settings:

export OMP_STACKSIZE=500M
export GOMP_STACKSIZE=$OMP_STACKSIZE
export KMP_STACKSIZE=$OMP_STACKSIZE

ulimit -t unlimited              # cputime
ulimit -f unlimited              # filesize
ulimit -d unlimited              # datasize
ulimit -s unlimited              # stacksize
ulimit -c unlimited              # coredumpsize
ulimit -v unlimited              # vmemoryuse
ulimit -l unlimited              # memorylocked

I will rerun with gfortran and a higher OMP_STACKSIZE and see whether it still fails or whether the time of failure changes.

@mandli
Member

mandli commented Nov 15, 2021

I definitely had upwards of 4-5 GB of memory dedicated to the run, so the stack size may well be the issue. I have seen the variables GOMP_STACKSIZE and KMP_STACKSIZE override the ulimit settings, so I am now wondering if that is the real problem.

@tovogt
Contributor Author

tovogt commented Nov 15, 2021

Sure, in my setup, the threads also have 4 GB of memory each. But isn't the stack size a different thing, which will typically not be larger than a few megabytes? If anything, I thought that a larger stack size reduces the amount of available "heap" memory (because memory is reserved for the stack), so that it's not advisable to increase the stack size indefinitely either.

@tovogt
Contributor Author

tovogt commented Nov 16, 2021

I can confirm that this runs through with the Intel compiler (even with STACKSIZE set to 500M, as above) after almost 19 hours (!).

Furthermore, when running with gfortran and a larger STACKSIZE (1000M), there is no segmentation fault. Still, the run fails after 68 minutes with

 **** Too many dt reductions ****
 **** Stopping calculation   ****

How is it possible that the behavior of GeoClaw depends on the compiler in such a fundamental way? It's not only the performance that differs; the numerical stability seems to be different as well.

@mjberger
Contributor

mjberger commented Nov 16, 2021 via email

@mandli
Member

mandli commented Nov 16, 2021

Optimization flags between compilers can be very different and at times can lead to incorrect arithmetic being produced, enough so that it impacts us. It's unfortunately a bit of a moving target, as it changes between compiler versions. My best guess as to why the stack size thing is happening is that gfortran handles stack memory differently (and less efficiently) than ifort.

As for the dt reductions, I would also hazard that the run is blowing up. Not sure if you can plot the results to see.

@tovogt
Contributor Author

tovogt commented Nov 17, 2021

Yes, maybe let's agree that this issue can be closed once clawpack/amrclaw#272 is merged (thanks again for that!).

As for the "too many dt reductions" failure, your remark about enlarging the domain while capping the refinement level far away from the storm has already helped a lot.

@mandli
Member

mandli commented Nov 17, 2021

Great to hear! As you suggested, once clawpack/amrclaw#272 is merged let's close this and open up something else if needed.

rjleveque added a commit to clawpack/amrclaw that referenced this issue Nov 24, 2021
@tovogt tovogt closed this as completed Dec 16, 2021