Run fails with seg fault (invalid memory reference) #528
While tracing down issue #525, I found a GeoClaw setup that crashes due to a segmentation fault on our cluster computer with 16 OpenMP threads, but doesn't crash on my local desktop computer. I run GeoClaw a lot on our cluster and have never had a segmentation fault before. Here is the terminal output: stdout.txt

Have you ever seen something like this? Maybe it's related to the following AMR settings:

7 =: amr_levels_max
2 2 2 2 2 4 =: refinement_ratios_x

I was adding the seventh AMR level when the error occurred for the first time. When using only 6 levels, the run does not fail.

This is a shot in the dark, but you may have to set KMP_STACKSIZE to a larger number with so many threads and grids.
— Marsha
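For anyone following along: the per-thread OpenMP stack size is set through the environment before launching the executable. A minimal sketch of doing this from a Python launcher, in the spirit of clawutil's runclaw; the executable path and the sizes are assumptions, not the poster's setup:

```python
import os
import subprocess

# Copy the current environment and raise the per-thread stack size.
# OMP_STACKSIZE is the standard OpenMP variable; KMP_STACKSIZE is the
# Intel-runtime equivalent suggested above.
env = os.environ.copy()
env["OMP_NUM_THREADS"] = "16"
env["OMP_STACKSIZE"] = "32M"   # try larger values (e.g. 1G) for many grids
env["KMP_STACKSIZE"] = "32M"   # only honored by the Intel OpenMP runtime

# Launch the GeoClaw executable (path is an assumption for this sketch).
subprocess.check_call(["./xgeoclaw"], env=env)
```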
I was thinking that may be worth checking as well. There is some information in the Clawpack docs, although I am not sure it's still up to date. At the very least, if you have the time, I would try a few things:
As always, thanks a lot for your quick responses! 🙏

Reducing the number of threads doesn't change the behavior, even when setting the number to 1. The code fails at the exact same step in the computation, and in that case I get the stack trace in the logs for a single thread. The value of ulimit -s has been unlimited anyway, and increasing the stack size for OpenMP to 32M or even 1G doesn't change the behavior either.

Here is the log output with increased verbosity, 8 threads, and the stack size set to 1G: stdout.txt fort.amr.txt

With debugging and stack tracing turned on (and only one OpenMP thread to keep the stack trace clean): stdout.txt The error message is now much clearer:

At line 39 of file $CLAW/amrclaw/src/2d/fluxsv.f
Fortran runtime error: Index '0' of dimension 2 of array 'node' below lower bound of 1
Error termination. Backtrace:
#0 0x4955c9 in fluxsv_
   at $CLAW/amrclaw/src/2d/fluxsv.f:39
#1 0x53aa6c in par_advanc_
   at $CLAW/geoclaw/src/2d/shallow/advanc.f:260
#2 0x53cd6c in advanc_._omp_fn.1
   at $CLAW/geoclaw/src/2d/shallow/advanc.f:123
#3 0x2b967165eaae in GOMP_parallel
   at ../.././libgomp/parallel.c:168
#4 0x53b90e in advanc_
   at $CLAW/geoclaw/src/2d/shallow/advanc.f:124
#5 0x52b741 in tick_
   at $CLAW/geoclaw/src/2d/shallow/tick.f:303
#6 0x5440a8 in amr2
   at $CLAW/geoclaw/src/2d/shallow/amr2.f90:646
#7 0x5477aa in main
   at $CLAW/geoclaw/src/2d/shallow/amr2.f90:59

That sounds like an actual bug, in a complicated place. If you send me your setup, I will try to reproduce it and take a look. What version of Clawpack are you running with?
— Marsha
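The informative "below lower bound" message above comes from gfortran's run-time checks. A sketch of rebuilding with those checks, passed through the FFLAGS variable that the Clawpack Makefiles read; the exact flag set here is an assumption, not necessarily what was used in this run:

```python
import os
import subprocess

# Rebuild the GeoClaw executable with gfortran run-time checks enabled.
# FFLAGS is picked up by the Clawpack Makefiles; bounds checking plus
# backtraces turns a bare segfault into a message like the one above.
env = os.environ.copy()
env["FC"] = "gfortran"
env["FFLAGS"] = "-g -O0 -fbacktrace -fcheck=bounds -fopenmp"

subprocess.check_call(["make", "new"], env=env)      # full rebuild
subprocess.check_call(["make", ".output"], env=env)  # rerun the case
```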
That is interesting. At least it's not a parallel bug. Hopefully the Intel compiler can reproduce it as well. I don't suppose you have that compiler handy, @tovogt?
This is with a clean checkout of the clawpack master branch and its submodules, commit clawpack/clawpack@bf55916, which is basically version 5.8.0. Only for the geoclaw submodule, I use the most recent master (e8d0bf3) instead of the submodule specification from clawpack (which would be 01c9a8e).

Furthermore, the runs on our cluster are with gfortran 7.3.0 (OpenMP 4.5/201511) and Linux kernel version 4.4. But I also reproduced this with gfortran 10.2.0 (same version of OpenMP).

Here are the rundata files: gc_segfault_rundata.zip Before running make .output on your machine, make sure to specify the absolute path to your work directory in topo.data and surge.data.

Regarding your question about an Intel compiler: we have ifort version 17.0.0 on our cluster; I will try that later.
In fact, I can't reproduce this error with the Intel compiler (ifort 17.0.0)!
Hmmm. Well, we will have to see what might be going on with gfortran then.
Since the last activity in this issue, I mostly continued working with gfortran. But yesterday, I started a whole batch of runs with Intel's ifort compiler.

Apparently, the …

To be honest, I'm a bit puzzled by this: the topography I provide to GeoClaw has a resolution of 30 arc-seconds (~1 km), and the mesh resolution at maximum refinement is 7 arc-seconds (~200 m). In the TC literature, in the TC examples, and from what I hear from other GeoClaw users (in a TC context), a lot of them use much higher resolution. How is it possible that I still run into these problems while it apparently works for other people? I even talked to people that use …
My guess is that this is not a problem with topography resolution but perhaps a bug, although the fact that you see it this often is a bit puzzling to me as well. My best guess, without looking into it further with @mjberger, is that there is something about your location that is causing the problem, but who knows. If you can send us your setup as @mjberger suggested, we can hopefully figure out what's going wrong.
I described the setup in #528 (comment) and provided the rundata files. Do you need any other information from my side?
Sorry, forgot about that. We will take a look and see if we can reproduce and debug the problem.
Hi Thomas,

I've started looking at your issue. I've duplicated the error with the gfortran-8 compiler. Interestingly, it runs to completion with -g, and only dies with optimization -O0 on. That makes it harder to debug, but I will give it a try.

Anyhow, two things:
1. Your zip file had an empty setrun.py. It would make things easier to have one, e.g. so I could more easily checkpoint before the error.
2. Why don't you send me your direct email address, so we don't have to use the clawpack notifications? Mine is ***@***.***

Best,
— Marsha
Added back zeroing out of cfluxptr on coarse grids that was removed in PR clawpack#268. Also correctly typed old_memsize as integer.
After stress testing this with a whole batch of jobs, I only have a single run that fails with a segfault (with gfortran). But the error message changed, so it might be completely unrelated to this issue:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0 0x2ae8cc720fdf in ???
#1 0x5173ca in gfixup_
   at $CLAW/geoclaw/src/2d/shallow/gfixup.f:234
#2 0x49690d in regrid_
   at $CLAW/amrclaw/src/2d/regrid.f:71
#3 0x50f081 in tick_
   at $CLAW/geoclaw/src/2d/shallow/tick.f:226
#4 0x525304 in amr2
   at $CLAW/geoclaw/src/2d/shallow/amr2.f90:646
#5 0x52873c in main
   at $CLAW/geoclaw/src/2d/shallow/amr2.f90:59

Traceback (most recent call last):
  File "$CLAW/clawutil/src/python/clawutil/runclaw.py", line 249, in runclaw
    proc = subprocess.check_call(cmd_split,
  File "$CONDA_PREFIX/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['$WORKDIR/xgeoclaw']' died with <Signals.SIGFPE: 8>.

Here is the setup: gc_segfpe.zip Would you prefer to have a separate issue for this, or continue to discuss it here?
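One way to localize a SIGFPE like the one above is to rebuild with gfortran's floating-point exception trapping, so the program stops at the first invalid operation rather than somewhere downstream. A hedged sketch, again via the FFLAGS variable that the Clawpack Makefiles read (the flag selection is an assumption, not what this run actually used):

```python
import os
import subprocess

# Rebuild with FPE trapping so the backtrace points at the first
# invalid operation, division by zero, or overflow.
env = os.environ.copy()
env["FC"] = "gfortran"
env["FFLAGS"] = "-g -O0 -fbacktrace -ffpe-trap=invalid,zero,overflow -fopenmp"

subprocess.check_call(["make", "new"], env=env)
subprocess.check_call(["make", ".output"], env=env)
```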
What compiler options, and which compiler? Both?
— Marsha
The above error output is with gfortran 7.3.0 and the following flags: …

I am currently trying to reproduce this with the Intel compiler.
I am running it through my setup with checkpoints turned on. I will report back if I can reproduce the problem.
Oh, and given the location of the signal, I am somewhat suspicious that this may be a related problem.
Unfortunately, I still can't tell whether this also applies to the Intel compiler, because the process was cancelled by our cluster after 24 hours due to a time limit. I had to restart with a longer time limit, and that will take a while now... Note that the original gfortran run took 65 minutes to fail, and 3 hours 45 minutes with debugging flags. I just started another Intel compiler run without debugging flags so that we can see more quickly whether it fails at all.
The Intel compiler was the one where the stack size helped for me, though that was on my Mac; with the gfortran-8 compiler it never broke.

We have a checkpoint option that might help with these long runs. It alternates between saving, and then overwriting, two checkpoint files, and you set the time interval of the checkpointing. That way you can restart from the last one.
— Marsha
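A sketch of what that looks like in a setrun.py, assuming the checkpt_style conventions documented for Clawpack (a negative style keeps only the two most recent checkpoint files, alternating between them); the times here are placeholders:

```python
import numpy as np
from clawpack.clawutil.data import ClawRunData

# Hypothetical excerpt in the style of a GeoClaw setrun.py.
rundata = ClawRunData('geoclaw', 2)
clawdata = rundata.clawdata

# Checkpoint at regular simulation times; the negative checkpt_style
# alternates between two checkpoint files instead of keeping all of them.
clawdata.checkpt_style = -2
clawdata.checkpt_times = list(np.arange(0.0, 48 * 3600.0, 3600.0))  # hourly (assumed)

# To restart from the surviving checkpoint in a later run:
# clawdata.restart = True
# clawdata.restart_file = 'fort.chkaaaaa'   # file name is an assumption
```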
I did finish the run on my laptop; it took nearly 48 hours on an 8-core i9. Looking at the output, it looks like the run is refining everywhere, even after the storm is well inland. One thing that I am somewhat concerned about is that the size of the domain relative to the storm is quite small, which can cause boundary-condition problems.
When you say "the size of the domain relative to the storm is quite small", what do you take as a rule of thumb to determine an appropriate domain size? Currently, I select the domain to accommodate the …

My experience with larger domain sizes is that it dramatically increases runtimes in most cases, doesn't change the results in the landfall area of the storm significantly, and sometimes causes GeoClaw to spend a lot of time computing complex interactions of the dynamics with topographic features that are a large distance from the region of interest.
I often run with large domains where the storm is at least 2-3 …
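The combination suggested later in this thread (a larger domain plus capped refinement far from the storm) can be expressed with GeoClaw's refinement regions. A hedged sketch of the relevant setrun.py lines; the coordinates and levels are placeholders, not the poster's actual setup:

```python
from clawpack.clawutil.data import ClawRunData

# Hypothetical excerpt in the style of a GeoClaw setrun.py.
rundata = ClawRunData('geoclaw', 2)
clawdata = rundata.clawdata

# Enlarge the computational domain well beyond the landfall region
# (placeholder coordinates).
clawdata.lower = [-80.0, 15.0]
clawdata.upper = [-50.0, 40.0]

# Region format: [minlevel, maxlevel, t1, t2, x1, x2, y1, y2].
# Cap refinement at level 2 over the whole enlarged domain...
rundata.regiondata.regions.append([1, 2, 0.0, 1e9, -80.0, -50.0, 15.0, 40.0])
# ...but allow the full 7 levels near the coast of interest.
rundata.regiondata.regions.append([1, 7, 0.0, 1e9, -72.0, -68.0, 20.0, 24.0])
```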
Thanks, Kyle, this is really very helpful! In fact, in this case, enlarging the domain to 2.5 …

Still, it's a bit puzzling that you don't seem to be able to reproduce the latest segfault I reported. 😞
Yes, that's still worrying, but I wonder now if it's somehow related to either the length of time or the amount of memory that the process is requesting.
For me, with the ifort compiler, it was the memory size. The code died in an unusual place, and it was able to run through when I enlarged the stack size limit.
— Marsha
Currently, I use the following settings: …

I will rerun with gfortran and a higher …
I definitely had upwards of 4-5 GB of memory dedicated to the run, so the stack size problem may well be the issue. The variables …
Sure, in my setup the threads also have 4 GB of memory each. But isn't the stack size a different thing, which will typically not be larger than a few megabytes? I also thought that a larger stack size reduces the amount of available "heap" memory (because memory is reserved for the stack), so that it's not advisable to increase the stack size indefinitely.
I can confirm that this runs through with the Intel compiler (even with STACKSIZE set to 500M, as above) after almost 19 hours (!). Furthermore, running with gfortran and a larger STACKSIZE (1000M), there is no segmentation fault. Still, the run fails after 68 minutes with:

**** Too many dt reductions ****
**** Stopping calculation ****

How is it possible that the behavior of GeoClaw depends on the compiler in such a fundamental way? It's not only the performance that is different; the numerical stability seems to be different as well.
So the new problem is that the ifort compiler works, and gfortran fails with too many dt reductions (a completely separate problem from the segfault)?

The compiler issue is even worse than you thought. Did you know that the Intel compilers generate code that is not repeatable unless you use "fp-model precise"? See this link:
https://www.intel.com/content/dam/develop/external/us/en/documents/fp-consistency-102511-326704.pdf
— Marsha
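A tiny illustration of why reordering alone changes results, and hence why an optimizing compiler that reassociates sums can produce different numbers from the same source: floating-point addition is not associative. This sketch is generic, not tied to GeoClaw:

```python
# Floating-point addition is not associative, so a compiler that
# reassociates sums under aggressive optimization can change results.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 1.0: the large terms cancel first
right = a + (b + c)  # 0.0: c is absorbed by b before cancellation

print(left, right, left == right)  # 1.0 0.0 False
```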
Optimization flags between compilers can be very different, and at times they can lead to incorrect mathematics being produced, enough so that it impacts us. It's unfortunately a bit of a moving target, as it changes between versions of compilers. My best guess as to why the stack size thing is happening is that gfortran is handling stack memory differently (and less efficiently) than ifort. As for the dt reductions, I would hazard that the run is blowing up. Not sure if you can plot the problem to see.
Yes, maybe let's agree that this issue can be closed once clawpack/amrclaw#272 is merged (thanks again for that!). As for the too many dt reductions, your remark about enlarging the domain while setting a maximum level of refinement far away from the storm already helped a lot.
Great to hear! As you suggested, once clawpack/amrclaw#272 is merged, let's close this and open up something else if needed.
Fixed bug reported in clawpack/geoclaw#528