Replies: 1 comment
-
Hello,
You can try to deactivate the validation of the restart file: set the variable However, the error message is not really clear. It looks like the program does not recognize the subroutine
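To get a clearer message than the bare runtime error, the rename step can be wrapped so that the status of the shell call is captured instead of aborting the run. The following is only a minimal sketch under stated assumptions: `safe_rename_mod` and `safe_rename` are hypothetical names, not the routine at src/tools.f90:976, and the command string simply mirrors the `mv checkpoint checkpoint.old` line from the log. It relies on the standard optional arguments `exitstat`, `cmdstat` and `cmdmsg` of `execute_command_line`.

```fortran
! Hypothetical helper for illustration; not the Incompact3D implementation.
module safe_rename_mod
  implicit none
contains

  ! Rename a file through the shell and report a failure to the caller
  ! instead of letting the runtime abort the whole MPI job.
  subroutine safe_rename(old_name, new_name, ok)
    character(len=*), intent(in)  :: old_name, new_name
    logical,          intent(out) :: ok
    integer            :: exit_status, cmd_status
    character(len=256) :: cmd_msg

    exit_status = 0
    cmd_status  = 0
    cmd_msg     = ''

    ! Because cmdstat is present, an error while launching the command is
    ! returned in cmd_status/cmd_msg rather than causing error termination.
    call execute_command_line('mv ' // trim(old_name) // ' ' // trim(new_name), &
                              wait=.true., exitstat=exit_status,                &
                              cmdstat=cmd_status, cmdmsg=cmd_msg)

    ok = (cmd_status == 0 .and. exit_status == 0)
    if (cmd_status /= 0) then
      ! The command could not be executed at all.
      write(*, '(a,i0,2a)') 'rename: could not run command, cmdstat=', &
                            cmd_status, ', message: ', trim(cmd_msg)
    else if (exit_status /= 0) then
      ! The command ran, but mv itself reported a failure.
      write(*, '(a,i0)') 'rename: mv returned exit status ', exit_status
    end if
  end subroutine safe_rename

end module safe_rename_mod
```

A caller would then use `call safe_rename('checkpoint', 'checkpoint.old', ok)` and decide what to do when `ok` is false, for example skip the backup and continue. This does not fix the underlying problem on InfiniBand; it only separates "the command could not be launched" from "mv failed", which should help narrow down the cause.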
-
Hello,
I hope this message finds you well.
I am currently simulating channel flow using Incompact3D. Below is a link to the input.i3d file for reference:
https://drive.google.com/file/d/1nxLQCpNr6y7GYLbLaxWbQJ_N0xpD9lsf/view?usp=drive_link
The simulation runs successfully on a single node, and it also works on multiple nodes when the server uses Ethernet. However, when I run it across several nodes over InfiniBand, the simulation crashes. I have tried different icheckpoint values, but the program still terminates every time, at seemingly random points.
I am attaching part of the screen output with the corresponding error message below:
...
===========================================================
Time step = 25/ 300000, Time unit = 0.1250
UT 0.66664693308142653 -1.9733585240100382E-005
Rotating turbulent channel at speed 0.12000000000000000
DIV U* max mean= 0.33152882513086468 3.6195093318096769E-002
DIV U max mean= 1.2412293415309250E-013 3.3384942753990172E-014
U,V,W min= -7.12380931E-02 -0.233213946 -0.219676018
U,V,W max= 1.20239806 0.241198793 0.227197528
CFL_x : 0.09619185
CFL_y : 0.04270758
CFL_z : 0.01817580
Phi1 min max= -1.36520198E-10 1.00000000
===========================================================<<<<<
Writing restart point restart0000025
Initialising IO for restart-io
Restart point restart0000025 saved successfully!
If necesseary restart from: 26
Time for this time step (s): 0.151510000
Remaining time: 2 h 11 min
Elapsed time: 0 h 0 min
Time step = 50/ 300000, Time unit = 0.2500
UT 0.66664622879943403 -2.0437867232603324E-005
Rotating turbulent channel at speed 0.12000000000000000
DIV U* max mean= 0.25515063013178074 2.8042547618158439E-002
DIV U max mean= 1.2734258092450546E-013 3.3297685492821250E-014
U,V,W min= -4.82805483E-02 -0.218854755 -0.198159263
U,V,W max= 1.24021733 0.233201653 0.205270544
CFL_x : 0.09921738
CFL_y : 0.03835384
CFL_z : 0.01642164
Phi1 min max= -4.22717833E-10 1.00000000
===========================================================<<<<<
Writing restart point restart0000050
Old Name:checkpoint
New Name:checkpoint.old
CMDLINE:mv checkpoint checkpoint.old
Fortran runtime error: EXECUTE_COMMAND_LINE: Invalid command line
Error termination. Backtrace:
#0 0x7f7ef39f4dfd in ???
#1 0x7f7ef39f5995 in ???
#2 0x7f7ef39f5c87 in ???
#3 0x7f7ef3c12be7 in ???
#4 0x7f7ef3c12dd9 in ???
#5 0x7f7ef3c12ef1 in ???
#6 0x62f798 in rename
at src/tools.f90:976
#7 0x62f798 in __tools_MOD_restart
at src/tools.f90:255
#8 0x403bbe in xcompact3d
at src/xcompact3d.f90:80
#9 0x403bbe in main
at src/xcompact3d.f90:7
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[5964,1],0]
Exit code: 2
...
At this point, I am not sure what the issue could be. Any help or insights would be greatly appreciated. Please let me know if you need further information.
Kind regards,
Patricio Canciani
Contact: [email protected]