Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmdLineTester_criu_jitPostRestore Test -Xnojit -Xnoaot hang #20663

Closed
pshipton opened this issue Nov 21, 2024 · 25 comments · Fixed by #20763
Closed

cmdLineTester_criu_jitPostRestore Test -Xnojit -Xnoaot hang #20663

pshipton opened this issue Nov 21, 2024 · 25 comments · Fixed by #20763
Assignees
Labels
comp:jit criu Used to track CRIU snapshot related work test failure

Comments

@pshipton
Copy link
Member

pshipton commented Nov 21, 2024

I've seen this multiple times in the last few days, always on zlinux.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/358 - rh8-390-2
cmdLineTester_criu_jitPostRestore_3

21:10:10  Testing: Test -Xnojit -Xnoaot
21:10:10  Test start time: 2024/11/20 21:10:08 Eastern Standard Time
21:10:10  Running command: bash /home/jenkins/workspace/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java " -XX:+DebugOnRestore -Xjit " org.openj9.criu.OptionsFileTest "JitOptionsTest -Xnojit -Xnoaot" 1 false false
21:10:10  Time spent starting: 4 milliseconds
21:15:16  ***[TEST INFO 2024/11/20 21:15:08] ProcessKiller detected a timeout after 300000 milliseconds!***
21:15:16  ***[TEST INFO 2024/11/20 21:15:08] executing /usr/bin/gdb -batch -x /tmp/debugger6889621406594086588.txt bash 440308***
21:15:16  GDB OUT 0x000003ffab44964a in waitpid () from /lib64/libc.so.6
21:15:16  GDB OUT From                To                  Syms Read   Shared Object Library
21:15:16  GDB OUT 0x000003ffab60e788  0x000003ffab61d140  Yes (*)     /lib64/libtinfo.so.6
21:15:16  GDB OUT 0x000003ffab580da8  0x000003ffab581af0  Yes (*)     /lib64/libdl.so.2
21:15:16  GDB OUT 0x000003ffab3a3120  0x000003ffab4c8008  Yes (*)     /lib64/libc.so.6
21:15:16  GDB OUT 0x000003ffab701120  0x000003ffab71ecb0  Yes         /lib/ld64.so.1
21:15:16  GDB OUT (*): Shared library is missing debugging information.
21:15:16  GDB OUT pswm           0x705000180000000   505810539591499776
21:15:16  GDB OUT pswa           0x3ffab44964a       4396624942666
21:15:16  GDB OUT r0             0x1                 1
21:15:16  GDB OUT r1             0x3ff00000000       4393751543808
21:15:16  GDB OUT r2             0xfffffffffffffe00  18446744073709551104
21:15:16  GDB OUT r3             0x3ffd39f84f8       4397301990648
21:15:16  GDB OUT r4             0x0                 0
21:15:16  GDB OUT r5             0x0                 0
21:15:16  GDB OUT r6             0x2aa04a4f818       2929245616152
21:15:16  GDB OUT r7             0x2aa0f4310e0       2929423749344
21:15:16  GDB OUT r8             0x2aa04a48dd8       2929245588952
21:15:16  GDB OUT r9             0x1                 1
21:15:16  GDB OUT r10            0x2aa04a48dd8       2929245588952
21:15:16  GDB OUT r11            0x3ff00000000       4393751543808
21:15:16  GDB OUT r12            0x3ffab72af70       4396627963760
21:15:16  GDB OUT r13            0x1                 1
21:15:16  GDB OUT r14            0x2aa04964fae       2929244655534
21:15:16  GDB OUT r15            0x3ffd39f8398       4397301990296
21:15:16  GDB OUT acr0           0x3ff               1023
21:15:16  GDB OUT acr1           0xab776720          2876729120
21:15:16  GDB OUT acr2           0x0                 0
21:15:16  GDB OUT acr3           0x0                 0
21:15:16  GDB OUT acr4           0x0                 0
21:15:16  GDB OUT acr5           0x0                 0
21:15:16  GDB OUT acr6           0x0                 0
21:15:16  GDB OUT acr7           0x0                 0
21:15:16  GDB OUT acr8           0x0                 0
21:15:16  GDB OUT acr9           0x0                 0
21:15:16  GDB OUT acr10          0x0                 0
21:15:16  GDB OUT acr11          0x0                 0
21:15:16  GDB OUT acr12          0x0                 0
21:15:16  GDB OUT acr13          0x0                 0
21:15:16  GDB OUT acr14          0x0                 0
21:15:16  GDB OUT acr15          0x0                 0
21:15:16  GDB OUT fpc            0x0                 0
21:15:16  GDB OUT orig_r2        0xffffffffffffffff  18446744073709551615
21:15:16  GDB OUT last_break     0x2aa0497fd4a       0x2aa0497fd4a <hash_search+130>
21:15:16  GDB OUT system_call    0x20072             131186
21:15:16  GDB OUT tdb0           <unavailable>
21:15:16  GDB OUT tac            <unavailable>
21:15:16  GDB OUT tct            <unavailable>
21:15:16  GDB OUT atia           <unavailable>
21:15:16  GDB OUT tr0            <unavailable>
21:15:16  GDB OUT tr1            <unavailable>
21:15:16  GDB OUT tr2            <unavailable>
21:15:16  GDB OUT tr3            <unavailable>
21:15:16  GDB OUT tr4            <unavailable>
21:15:16  GDB OUT tr5            <unavailable>
21:15:16  GDB OUT tr6            <unavailable>
21:15:16  GDB OUT tr7            <unavailable>
21:15:16  GDB OUT tr8            <unavailable>
21:15:16  GDB OUT tr9            <unavailable>
21:15:16  GDB OUT tr10           <unavailable>
21:15:16  GDB OUT tr11           <unavailable>
21:15:16  GDB OUT tr12           <unavailable>
21:15:16  GDB OUT tr13           <unavailable>
21:15:16  GDB OUT tr14           <unavailable>
21:15:16  GDB OUT tr15           <unavailable>
21:15:16  GDB OUT gsd            <unavailable>
21:15:16  GDB OUT gssm           <unavailable>
21:15:16  GDB OUT gsepla         <unavailable>
21:15:16  GDB OUT bc_gsd         <unavailable>
21:15:16  GDB OUT bc_gssm        <unavailable>
21:15:16  GDB OUT bc_gsepla      <unavailable>
21:15:16  GDB OUT pc             0x3ffab44964a       0x3ffab44964a <waitpid+34>
21:15:16  GDB OUT cc             0x0                 0
21:15:16  GDB OUT   Id   Target Id             Frame 
21:15:16  GDB OUT * 1    process 440308 "bash" 0x000003ffab44964a in waitpid () from /lib64/libc.so.6
21:15:16  GDB OUT 
21:15:16  GDB OUT Thread 1 (process 440308):
21:15:16  GDB OUT #0  0x000003ffab44964a in waitpid () from /lib64/libc.so.6
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #1  0x000002aa04964fae in waitchld.isra ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #2  0x000002aa04966a82 in wait_for ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #3  0x000002aa04950a66 in execute_command_internal ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #4  0x000002aa0494f9aa in execute_command_internal ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #5  0x000002aa0495157c in execute_command ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #6  0x000002aa0494f21c in execute_command_internal ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #7  0x000002aa0494f9aa in execute_command_internal ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #8  0x000002aa0495157c in execute_command ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #9  0x000002aa0494f0f2 in execute_command_internal ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #10 0x000002aa0495157c in execute_command ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #11 0x000002aa04934192 in reader_loop ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT #12 0x000002aa04931df2 in main ()
21:15:16  GDB OUT No symbol table info available.
21:15:16  GDB OUT Saved corefile core.440308.1
21:15:16  GDB OUT [Inferior 1 (process 440308) detached]
21:15:16  INFO: Running '/usr/bin/gdb' failed with rc = 1
21:15:16  GDB ERR warning: target file /proc/440308/cmdline contained unexpected null characters
21:15:16  GDB ERR /tmp/debugger6889621406594086588.txt:6: Error in sourced command file:
21:15:16  GDB ERR Requires argument (inferior id(s) to detach)
21:15:16  
21:15:16  INFO: Sleep for 60000 millis before next capture.
21:16:10  ***[TEST INFO 2024/11/20 21:16:08] executing /usr/bin/gdb -batch -x /tmp/debugger6889621406594086588.txt bash 440308***
21:16:10  GDB OUT 0x000003ffab44964a in waitpid () from /lib64/libc.so.6
21:16:10  GDB OUT From                To                  Syms Read   Shared Object Library
21:16:10  GDB OUT 0x000003ffab60e788  0x000003ffab61d140  Yes (*)     /lib64/libtinfo.so.6
21:16:10  GDB OUT 0x000003ffab580da8  0x000003ffab581af0  Yes (*)     /lib64/libdl.so.2
21:16:10  GDB OUT 0x000003ffab3a3120  0x000003ffab4c8008  Yes (*)     /lib64/libc.so.6
21:16:10  GDB OUT 0x000003ffab701120  0x000003ffab71ecb0  Yes         /lib/ld64.so.1
21:16:10  GDB OUT (*): Shared library is missing debugging information.
21:16:10  GDB OUT pswm           0x705000180000000   505810539591499776
21:16:10  GDB OUT pswa           0x3ffab44964a       4396624942666
21:16:10  GDB OUT r0             0x1                 1
21:16:10  GDB OUT r1             0x3ff00000000       4393751543808
21:16:10  GDB OUT r2             0xfffffffffffffe00  18446744073709551104
21:16:10  GDB OUT r3             0x3ffd39f84f8       4397301990648
21:16:10  GDB OUT r4             0x0                 0
21:16:10  GDB OUT r5             0x0                 0
21:16:10  GDB OUT r6             0x2aa04a4f818       2929245616152
21:16:10  GDB OUT r7             0x2aa0f4310e0       2929423749344
21:16:10  GDB OUT r8             0x2aa04a48dd8       2929245588952
21:16:10  GDB OUT r9             0x1                 1
21:16:10  GDB OUT r10            0x2aa04a48dd8       2929245588952
21:16:10  GDB OUT r11            0x3ff00000000       4393751543808
21:16:10  GDB OUT r12            0x3ffab72af70       4396627963760
21:16:10  GDB OUT r13            0x1                 1
21:16:10  GDB OUT r14            0x2aa04964fae       2929244655534
21:16:10  GDB OUT r15            0x3ffd39f8398       4397301990296
21:16:10  GDB OUT acr0           0x3ff               1023
21:16:10  GDB OUT acr1           0xab776720          2876729120
21:16:10  GDB OUT acr2           0x0                 0
21:16:10  GDB OUT acr3           0x0                 0
21:16:10  GDB OUT acr4           0x0                 0
21:16:10  GDB OUT acr5           0x0                 0
21:16:10  GDB OUT acr6           0x0                 0
21:16:10  GDB OUT acr7           0x0                 0
21:16:10  GDB OUT acr8           0x0                 0
21:16:10  GDB OUT acr9           0x0                 0
21:16:10  GDB OUT acr10          0x0                 0
21:16:10  GDB OUT acr11          0x0                 0
21:16:10  GDB OUT acr12          0x0                 0
21:16:10  GDB OUT acr13          0x0                 0
21:16:10  GDB OUT acr14          0x0                 0
21:16:10  GDB OUT acr15          0x0                 0
21:16:10  GDB OUT fpc            0x0                 0
21:16:10  GDB OUT orig_r2        0xffffffffffffffff  18446744073709551615
21:16:10  GDB OUT last_break     0x2aa0497fd4a       0x2aa0497fd4a <hash_search+130>
21:16:10  GDB OUT system_call    0x20072             131186
21:16:10  GDB OUT tdb0           <unavailable>
21:16:10  GDB OUT tac            <unavailable>
21:16:10  GDB OUT tct            <unavailable>
21:16:10  GDB OUT atia           <unavailable>
21:16:10  GDB OUT tr0            <unavailable>
21:16:10  GDB OUT tr1            <unavailable>
21:16:10  GDB OUT tr2            <unavailable>
21:16:10  GDB OUT tr3            <unavailable>
21:16:10  GDB OUT tr4            <unavailable>
21:16:10  GDB OUT tr5            <unavailable>
21:16:10  GDB OUT tr6            <unavailable>
21:16:10  GDB OUT tr7            <unavailable>
21:16:10  GDB OUT tr8            <unavailable>
21:16:10  GDB OUT tr9            <unavailable>
21:16:10  GDB OUT tr10           <unavailable>
21:16:10  GDB OUT tr11           <unavailable>
21:16:10  GDB OUT tr12           <unavailable>
21:16:10  GDB OUT tr13           <unavailable>
21:16:10  GDB OUT tr14           <unavailable>
21:16:10  GDB OUT tr15           <unavailable>
21:16:10  GDB OUT gsd            <unavailable>
21:16:10  GDB OUT gssm           <unavailable>
21:16:10  GDB OUT gsepla         <unavailable>
21:16:10  GDB OUT bc_gsd         <unavailable>
21:16:10  GDB OUT bc_gssm        <unavailable>
21:16:10  GDB OUT bc_gsepla      <unavailable>
21:16:10  GDB OUT pc             0x3ffab44964a       0x3ffab44964a <waitpid+34>
21:16:10  GDB OUT cc             0x0                 0
21:16:10  GDB OUT   Id   Target Id             Frame 
21:16:10  GDB OUT * 1    process 440308 "bash" 0x000003ffab44964a in waitpid () from /lib64/libc.so.6
21:16:10  GDB OUT 
21:16:10  GDB OUT Thread 1 (process 440308):
21:16:10  GDB OUT #0  0x000003ffab44964a in waitpid () from /lib64/libc.so.6
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #1  0x000002aa04964fae in waitchld.isra ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #2  0x000002aa04966a82 in wait_for ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #3  0x000002aa04950a66 in execute_command_internal ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #4  0x000002aa0494f9aa in execute_command_internal ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #5  0x000002aa0495157c in execute_command ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #6  0x000002aa0494f21c in execute_command_internal ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #7  0x000002aa0494f9aa in execute_command_internal ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #8  0x000002aa0495157c in execute_command ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #9  0x000002aa0494f0f2 in execute_command_internal ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #10 0x000002aa0495157c in execute_command ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #11 0x000002aa04934192 in reader_loop ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT #12 0x000002aa04931df2 in main ()
21:16:10  GDB OUT No symbol table info available.
21:16:10  GDB OUT Saved corefile core.440308.2
21:16:10  GDB OUT A debugging session is active.
21:16:10  GDB OUT 
21:16:10  GDB OUT 	Inferior 1 [process 440308] will be detached.
21:16:10  GDB OUT 
21:16:10  GDB OUT Quit anyway? (y or n) [answered Y; input not from terminal]
21:16:10  GDB OUT [Inferior 1 (process 440308) detached]
21:16:10  ***[TEST INFO 2024/11/20 21:16:08] executing kill -ABRT 440308***
21:16:10  ***[TEST INFO 2024/11/20 21:16:08] kill -ABRT signal sent***
21:16:10  ***[TEST INFO 2024/11/20 21:16:08] ABRT completed***
21:16:10  Output from test:
21:16:10   [OUT] start running script
21:16:10   [OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
21:16:10   [OUT] export LD_BIND_NOT=on
21:16:10   [OUT] /home/jenkins/workspace/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java -XX:+EnableCRIUSupport  -XX:+DebugOnRestore -Xjit  -cp /home/jenkins/workspace/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.OptionsFileTest JitOptionsTest -Xnojit -Xnoaot 1
21:16:10   [ERR] /home/jenkins/workspace/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh: line 41: 440309 Killed                  $2 -XX:+EnableCRIUSupport $3 -cp "$1/criu.jar" $4 $5 $6 > testOutput 2>&1
21:16:10  Time spent executing: 360309 milliseconds
21:16:10  Test result: FAILED
21:16:10  Output from test:
@pshipton pshipton added test failure criu Used to track CRIU snapshot related work labels Nov 21, 2024
@pshipton
Copy link
Member Author

@tajila

Copy link

Issue Number: 20663
Status: Open
Recommended Components: comp:test, comp:vm, comp:gc

@pshipton
Copy link
Member Author

pshipton commented Nov 21, 2024

@tajila
Copy link
Contributor

tajila commented Nov 21, 2024

@JasonFengJ9 Please take a look

@JasonFengJ9 JasonFengJ9 self-assigned this Nov 21, 2024
@JasonFengJ9
Copy link
Member

@JasonFengJ9
Copy link
Member

The grinder #20663 (comment) 40/50 passed in 4 machines, 10/10 failed at rh8-390-2 which also was the failing machine #20663 (comment) and #20663 (comment)

@JasonFengJ9
Copy link
Member

Can rh8-390-2 be reboot?

FYI @AdamBrousseau

@AdamBrousseau
Copy link
Contributor

Can rh8-390-2 be reboot?

Done

@JasonFengJ9
Copy link
Member

JasonFengJ9 commented Nov 21, 2024

3x grinder on the same machine - https://openj9-jenkins.osuosl.org/job/Grinder/3999/ - all failed

@pshipton
Copy link
Member Author

It also occurs on rh8-390-1 as per the latest failure, which I'm sure I've seen before, but didn't spend much time looking for it.
https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_s390x_linux_Nightly_testList_0/117/

@JasonFengJ9
Copy link
Member

Reproduced the hang at a fyre machine,

export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
export LD_BIND_NOT=on
jdk21-20241120-173702/bin/java -XX:+EnableCRIUSupport  -XX:+DebugOnRestore -Xjit  -cp /root/test/jdk21-1122/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.OptionsFileTest JitOptionsTest -Xnojit -Xnoaot 1
... checkpoint ...
criu restore -D ./cpData -v2 --shell-job

Attach gdb to the hang java process and the native stacktrace are

(gdb) info threads
  Id   Target Id                                           Frame 
* 1    Thread 0x3ff923f3720 (LWP 109897) "java"            0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  2    Thread 0x3ff923f2840 (LWP 109898) "main"            0x000003ff8bcc0098 in jitRetranslateMethod () at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/build/linux-s390x-server-release/vm/runtime/codert_vm/znathelp.s:8914
  3    Thread 0x3ff922ff840 (LWP 109899) "Signal Reporter" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  4    Thread 0x3ff901fe840 (LWP 109900) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  5    Thread 0x3ff6f47f840 (LWP 109901) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  6    Thread 0x3ff6f37e840 (LWP 109902) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  7    Thread 0x3ff6f27d840 (LWP 109903) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  8    Thread 0x3ff6f17c840 (LWP 109904) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  9    Thread 0x3ff6f07b840 (LWP 109905) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  10   Thread 0x3ff6ef7a840 (LWP 109906) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  11   Thread 0x3ff6ee79840 (LWP 109907) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  12   Thread 0x3ff6ed78840 (LWP 109908) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  13   Thread 0x3ff6ec77840 (LWP 109909) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  14   Thread 0x3ff6eb76840 (LWP 109910) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  15   Thread 0x3ff6ea75840 (LWP 109911) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  16   Thread 0x3ff6e974840 (LWP 109912) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  17   Thread 0x3ff6e873840 (LWP 109913) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  18   Thread 0x3ff6e772840 (LWP 109914) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  19   Thread 0x3ff6e671840 (LWP 109915) "JIT Compilation" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  20   Thread 0x3ff921ff840 (LWP 109916) "JIT Sampler"     0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  21   Thread 0x3ff921be840 (LWP 109917) "JIT IProfiler"   0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  22   Thread 0x3ff9217f840 (LWP 109918) "CR Runtime"      0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  23   Thread 0x3ff91a7f840 (LWP 109919) "Common-Cleaner"  0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  24   Thread 0x3ff919ff840 (LWP 109920) "Finalizer main"  0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  25   Thread 0x3ff916ff840 (LWP 109921) "Concurrent Mark" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  26   Thread 0x3ff915ff840 (LWP 109922) "GC Worker"       0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  27   Thread 0x3ff9157f840 (LWP 109923) "GC Worker"       0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  28   Thread 0x3ff914ff840 (LWP 109924) "GC Worker"       0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  29   Thread 0x3ff913ff840 (LWP 109926) "Attach API upda" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  30   Thread 0x3ff90fc1840 (LWP 109927) "Attach API wait" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
  31   Thread 0x3ff9147f840 (LWP 109928) "Finalizer threa" 0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6

(gdb) thread 2
[Switching to thread 2 (Thread 0x3ff923f2840 (LWP 109898))]
#0  0x000003ff8bcc0098 in jitRetranslateMethod () at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/build/linux-s390x-server-release/vm/runtime/codert_vm/znathelp.s:8914
8914	/home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/build/linux-s390x-server-release/vm/runtime/codert_vm/znathelp.s: No such file or directory.
(gdb) bt
#0  0x000003ff8bcc0098 in jitRetranslateMethod () at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/build/linux-s390x-server-release/vm/runtime/codert_vm/znathelp.s:8914
#1  0x000003ff9171e0ea in runStaticMethod (currentThread=0x3ff00000001, className=0x1 <error: Cannot access memory at address 0x1>, selector=0x3ff923efd08, argCount=0, arguments=<optimized out>)
    at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/callin.cpp:721
#2  0x000003ff917ae3e6 in jvmRestoreHooks (currentThread=currentThread@entry=0x318800) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/CRIUHelpers.cpp:119
#3  0x000003ff917b27c2 in criuCheckpointJVMImpl (env=<optimized out>, imagesDir=<optimized out>, leaveRunning=<optimized out>, shellJob=<optimized out>, extUnixSupport=<optimized out>, logLevel=4, logFile=0x0, fileLocks=1 '\001', workDir=0x0, 
    tcpEstablished=0 '\000', autoDedup=0 '\000', trackMemory=0 '\000', unprivileged=0 '\000', optionsFile=0x42c7b0, environmentFile=0x0, ghostFileLimit=-1) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/CRIUHelpers.cpp:1994
#4  0x000003ff907e5076 in Java_openj9_internal_criu_InternalCRIUSupport_checkpointJVMImpl (env=<optimized out>, unused=<optimized out>, imagesDir=<optimized out>, leaveRunning=<optimized out>, shellJob=<optimized out>, extUnixSupport=0 '\000', logLevel=4, 
    logFile=0x0, fileLocks=1 '\001', workDir=0x0, tcpEstablished=0 '\000', autoDedup=0 '\000', trackMemory=0 '\000', unprivileged=0 '\000', optionsFile=0x42c7b0, environmentFile=0x0, ghostFileLimit=-1)
    at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/jcl/common/criu.cpp:125
#5  0x000003ff91932d1c in ffi_call_SYSV () from /root/test-hang/openj9build/jdk21-20241120-173702/lib/default/libj9vm29.so
#6  0x000003ff91932772 in ffi_call (cif=<optimized out>, fn=<optimized out>, rvalue=<optimized out>, avalue=<optimized out>) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/libffi/s390/ffi.c:532
#7  0x000003ff918357ca in VM_DebugBytecodeInterpreterCompressed::cJNICallout (isStatic=true, 
    function=0x3ff907e4fc0 <Java_openj9_internal_criu_InternalCRIUSupport_checkpointJVMImpl(JNIEnv*, jclass, jstring, jboolean, jboolean, jboolean, jint, jstring, jboolean, jstring, jboolean, jboolean, jboolean, jboolean, jstring, jstring, jlong)>, 
    returnStorage=0x318920, returnType=<synthetic pointer>, javaArgs=<optimized out>, receiverAddress=<optimized out>, _pc=<synthetic pointer>: <optimized out>, _sp=<synthetic pointer>: <optimized out>, this=0x3ff923f1040)
    at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/BytecodeInterpreter.hpp:2612
#8  VM_DebugBytecodeInterpreterCompressed::callCFunction (returnType=<synthetic pointer>, isStatic=true, bp=<synthetic pointer>, javaArgs=<optimized out>, receiverAddress=<optimized out>, 
    jniMethodStartAddress=0x3ff907e4fc0 <Java_openj9_internal_criu_InternalCRIUSupport_checkpointJVMImpl(JNIEnv*, jclass, jstring, jboolean, jboolean, jboolean, jint, jstring, jboolean, jstring, jboolean, jboolean, jboolean, jboolean, jstring, jstring, jlong)>, 
    _pc=<synthetic pointer>: <optimized out>, _sp=<synthetic pointer>: <optimized out>, this=0x3ff923f1040) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/BytecodeInterpreter.hpp:2430
#9  VM_DebugBytecodeInterpreterCompressed::runJNINative (_pc=<synthetic pointer>: <optimized out>, _sp=<synthetic pointer>: <optimized out>, this=0x3ff923f1040) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/BytecodeInterpreter.hpp:2318
#10 VM_DebugBytecodeInterpreterCompressed::run (this=this@entry=0x3ff923f1188, vmThread=<optimized out>) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/BytecodeInterpreter.hpp:10970
#11 0x000003ff9180fa52 in debugBytecodeLoopCompressed (currentThread=<optimized out>) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/BytecodeInterpreter.inc:112
#12 0x000003ff918ab3f4 in c_cInterpreter () at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/build/linux-s390x-server-release/vm/runtime/vm/zcinterp.s:278
#13 0x000003ff91720fc2 in runCallInMethod (env=0x318800, receiver=<optimized out>, clazz=0x42c9d0, methodID=0x3ff8c165928, args=0x3ff923f1d58) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/callin.cpp:1174
#14 0x000003ff9174a7cc in gpProtectedRunCallInMethod (entryArg=0x3ff923f1c90) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/jnicsup.cpp:300
#15 0x000003ff91632c50 in omrsig_protect (portLibrary=0x3ff91ed2000 <j9portLibrary>, fn=0x3ff918bab78 <signalProtectAndRunGlue>, fn_arg=0x3ff923f1be0, handler=0x3ff91746ba8 <structuredSignalHandler>, handler_arg=0x318800, flags=506, result=0x3ff923f1bd8)
    at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/omr/port/unix/omrsignal.c:425
#16 0x000003ff918bac28 in gpProtectAndRun (function=0x3ff9174a738 <gpProtectedRunCallInMethod(void*)>, env=0x318800, args=0x3ff923f1c90) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/util/jniprotect.c:78
#17 0x000003ff9174c8d6 in gpCheckCallin (env=<optimized out>, receiver=receiver@entry=0x0, cls=<optimized out>, methodID=<optimized out>, args=args@entry=0x3ff923f1d58) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/jnicsup.cpp:488
#18 0x000003ff9174a0ba in callStaticVoidMethod (env=<optimized out>, cls=<optimized out>, methodID=<optimized out>) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/vm/jnicgen.c:384
#19 0x000003ff92206706 in JavaMain (_args=<optimized out>) at src/java.base/share/native/libjli/java.c:633
#20 0x000003ff92209480 in ThreadJavaMain (args=<optimized out>) at src/java.base/unix/native/libjli/java_md.c:691
#21 0x000003ff91f96296 in ?? () from /lib/s390x-linux-gnu/libc.so.6
#22 0x000003ff9200ff8e in ?? () from /lib/s390x-linux-gnu/libc.so.6

(gdb) thread 22
[Switching to thread 22 (Thread 0x3ff9217f840 (LWP 109918))]
#0  0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
(gdb) bt
#0  0x000003ff91f92ae6 in ?? () from /lib/s390x-linux-gnu/libc.so.6
#1  0x000003ff91f956fa in pthread_cond_wait () from /lib/s390x-linux-gnu/libc.so.6
#2  0x000003ff91587d56 in monitor_wait_original (self=0x3ff8c0ff5f0, monitor=0x3ff8c075fa8, millis=<optimized out>, nanos=<optimized out>, interruptible=0) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/omr/thread/common/omrthread.c:4686
#3  0x000003ff9158b8b0 in monitor_wait (interruptible=0, nanos=0, millis=0, monitor=0x3ff8c075fa8) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/omr/thread/common/omrthread.c:4531
#4  omrthread_monitor_wait (monitor=0x3ff8c075fa8) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/omr/thread/common/omrthread.c:4401
#5  0x000003ff8b658480 in TR::CRRuntime::process (this=this@entry=0x3ff901ff0c0) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/compiler/runtime/CRRuntime.cpp:836
#6  0x000003ff8b6585a0 in crRuntimeThreadProc (entryarg=<optimized out>) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/openj9/runtime/compiler/runtime/CRRuntime.cpp:895
#7  0x000003ff91588c8e in thread_wrapper (arg=0x3ff8c0ff5f0) at /home/jenkins/workspace/Build_JDK21_s390x_linux_Nightly/omr/thread/common/omrthread.c:1724
#8  0x000003ff91f96296 in ?? () from /lib/s390x-linux-gnu/libc.so.6
#9  0x000003ff9200ff8e in ?? () from /lib/s390x-linux-gnu/libc.so.6

FYI @dsouzai

@dsouzai
Copy link
Contributor

dsouzai commented Nov 25, 2024

Hm, I spent a good chunk of time trying to reason about this and I'm not sure what the root cause could be. Thread 22 above isn't really doing anything wrong. However, Thread 2 I don't know why it's calling jitRetranslateMethod, or rather, I may know but then the theories that I come up with don't explain why the issue only happens sometimes.

I'll have to repro the problem myself to get my hands on more context.

@dsouzai
Copy link
Contributor

dsouzai commented Nov 26, 2024

Set up my own fyre machine and was able to repro the issue similar to Jason.

I think I know what's going on. The hang, I believe, happens because the thread is stuck in a loop of trying to trigger a recompile (which does not succeed because of -Xnoaot -Xnojit post restore).

Normally what happens is, if the recompile request fails, the start PC gets patched to jump to the interpreter. In this case though, things are more complicated because of -XX:+DebugOnRestore. Order of events:

  1. Pre-checkpoint, foo gets compiled with FSD enabled.
  2. In the checkpoint hook, foo gets recompiled with FSD disabled; the startPC of the FSD body gets patched to jump to a helper that forwards to the new body.
  3. Post-restore, -Xnoaot -Xnojit results in all bodies getting invalidated - it is important (and relevant) to note that only the bodies specified by the extra field of the J9Method are considered; old bodies are not touched.
  4. Post-restore, the new body is invoked; this causes a recomp request which fails, and it ends up in the interpreter, which (because the extra field gets initialized to 0x1) queues the method for compilation, which eventually fails because -Xnoaot -Xnojit.
  5. The JIT sets the extra field to J9_JIT_NEVER_TRANSLATE and the startPC is patched to jump to the interpreter.
  6. Some other compiled method, bar, which has a direct call to the old body of foo invokes foo
  7. This results in a jump to the helper that is supposed to forward execution to the new body; however, because the extra field is now -3, I believe what happens is it thinks that the method is still interpreted and so tries to trigger a recompilation (haven't actually confirmed this but it seems reasonable and aligns with the observed behaviour during the hang)
  8. However, because this old body of foo was already recompiled, [1] goes into effect which means this old body never gets a chance to have its startPC updated to jump to the interpreter.
  9. The thread essentially returns back to the startPC, returning to step 7.

This scenario can never happen in a normal run because if a similar situation happened because of say HCR, no one would be able to invoke the old foo, because the Runtime Assumptions would have compensated those call sites.

To solve this, in J9::OptionsPostRestore::invalidateCompiledMethod, instead of calling TR::Recompilation::invalidateMethodBody, I may need to call J9::Recompilation::methodCannotBeRecompiled for all compiled bodies. However I'll have to think about it some more.

I suppose depending on what we want the semantics of -Xnojit -Xnoaot to be post-restore, decompiling in the Restore Hook would guarantee no JIT'd code will ever execute post-restore.

[1]

if (linkageInfo->isBeingCompiled())
{
TR_J9VMBase *fe = TR_J9VMBase::get(jitConfig, vmThread);
if (fe->isAsyncCompilation())
return 0; // early return because the method is queued for compilation
}

@dsouzai dsouzai assigned dsouzai and unassigned JasonFengJ9 Nov 26, 2024
@pshipton
Copy link
Member Author

pshipton commented Nov 26, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/658 - ubu22-ppc64le-5
cmdLineTester_criu_nonPortableRestore_2

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/658/functional_test_output.tar.gz

23:39:13  Testing: Create CRIU checkpoint image and restore once - testTimeCompensation
23:39:13  Test start time: 2024/11/23 04:39:13 Coordinated Universal Time
23:39:13  Running command: bash /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java " -XX:+JVMPortableRestoreMode  -Xtrace:print={j9jcl.533,j9vm.684-696,j9vm.699,j9vm.717-743} --add-exports java.base/openj9.internal.criu=ALL-UNNAMED" org.openj9.criu.TimeChangeTest testTimeCompensation 1 false false
23:39:13  Time spent starting: 1 milliseconds
23:44:20  ***[TEST INFO 2024/11/23 04:44:13] ProcessKiller detected a timeout after 300000 milliseconds!***
23:44:20  ***[TEST INFO 2024/11/23 04:44:13] executing /usr/bin/gdb -batch -x /tmp/debugger4681185359812964385.txt bash 3763082***
23:44:20  GDB OUT [Thread debugging using libthread_db enabled]
23:44:20  GDB OUT Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
23:44:20  GDB OUT 0x00007e8175f1edf8 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
23:44:20  GDB OUT From                To                  Syms Read   Shared Object Library
23:44:20  GDB OUT 0x00007e817609d620  0x00007e81760b8128  Yes (*)     /lib/powerpc64le-linux-gnu/libtinfo.so.6
23:44:20  GDB OUT 0x00007e8175e2cc00  0x00007e8175ff3c3c  Yes         /lib/powerpc64le-linux-gnu/libc.so.6
23:44:20  GDB OUT 0x00007e8176121f00  0x00007e81761635ac  Yes         /lib64/ld64.so.2
23:44:20  GDB OUT (*): Shared library is missing debugging information.
23:44:20  GDB OUT r0             0x72                114
23:44:20  GDB OUT r1             0x7fffe8b3ad90      140737097477520
23:44:20  GDB OUT r2             0x7e8176086e00      139094496144896
23:44:20  GDB OUT r3             0xfffffffffffffe00  18446744073709551104
23:44:20  GDB OUT r4             0x7fffe8b3ae90      140737097477776
23:44:20  GDB OUT r5             0x0                 0
23:44:20  GDB OUT r6             0x0                 0
23:44:20  GDB OUT r7             0x396b8e            3763086
23:44:20  GDB OUT r8             0x229062debb0       2375220587440
23:44:20  GDB OUT r9             0x0                 0
23:44:20  GDB OUT r10            0x0                 0
23:44:20  GDB OUT r11            0x0                 0
23:44:20  GDB OUT r12            0x0                 0
23:44:20  GDB OUT r13            0x7e817619b0e0      139094497276128
23:44:20  GDB OUT r14            0x228f47b8fb0       2374923685808
23:44:20  GDB OUT r15            0x228f47bc7e4       2374923700196
23:44:20  GDB OUT r16            0x0                 0
23:44:20  GDB OUT r17            0x0                 0
23:44:20  GDB OUT r18            0x7fffe8b3ae90      140737097477776
23:44:20  GDB OUT r19            0x228f47b8940       2374923684160
23:44:20  GDB OUT r20            0x228f47b1af0       2374923655920
23:44:20  GDB OUT r21            0x0                 0
23:44:20  GDB OUT r22            0x0                 0
23:44:20  GDB OUT r23            0x1                 1
23:44:20  GDB OUT r24            0x228f47bb400       2374923695104
23:44:20  GDB OUT r25            0x228f47bc540       2374923699520
23:44:20  GDB OUT r26            0x0                 0
23:44:20  GDB OUT r27            0xffff              65535
23:44:20  GDB OUT r28            0x228f47b04a8       2374923650216
23:44:20  GDB OUT r29            0x229062e42f0       2375220609776
23:44:20  GDB OUT r30            0x396b8e            3763086
23:44:20  GDB OUT r31            0x228f47b8e54       2374923685460
23:44:20  GDB OUT pc             0x7e8175f1edf8      0x7e8175f1edf8 <__GI___wait4+56>
23:44:20  GDB OUT msr            0x800000000000d033  9223372036854829107
23:44:20  GDB OUT cr             0x48224202          1210204674
23:44:20  GDB OUT lr             0x7e8175f1edf8      0x7e8175f1edf8 <__GI___wait4+56>
23:44:20  GDB OUT ctr            0x0                 0
23:44:20  GDB OUT xer            0x0                 0
23:44:20  GDB OUT fpscr          0x0                 0
23:44:20  GDB OUT vscr           0x0                 0
23:44:20  GDB OUT vrsave         0xffffffff          -1
23:44:20  GDB OUT ppr            0xc000000000000     3377699720527872
23:44:20  GDB OUT dscr           0x0                 0
23:44:20  GDB OUT tar            0x0                 0
23:44:20  GDB OUT bescr          <unavailable>
23:44:20  GDB OUT ebbhr          <unavailable>
23:44:20  GDB OUT ebbrr          <unavailable>
23:44:20  GDB OUT mmcr0          0x0                 0
23:44:20  GDB OUT mmcr2          0x0                 0
23:44:20  GDB OUT siar           0x0                 0
23:44:20  GDB OUT sdar           0x0                 0
23:44:20  GDB OUT sier           0x0                 0
23:44:20  GDB OUT orig_r3        0xffffffffffffffff  -1
23:44:20  GDB OUT trap           0x3000              12288
23:44:20  GDB OUT   Id   Target Id                                  Frame 
23:44:20  GDB OUT * 1    Thread 0x7e81761938a0 (LWP 3763082) "bash" 0x00007e8175f1edf8 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
23:44:20  GDB OUT 
23:44:20  GDB OUT Thread 1 (Thread 0x7e81761938a0 (LWP 3763082) "bash"):
23:44:20  GDB OUT #0  0x00007e8175f1edf8 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
23:44:20  GDB OUT         sc_ret = -512
23:44:20  GDB OUT         sc_ret = <optimized out>
23:44:20  GDB OUT #1  0x00007e8175f1ed4c in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at waitpid.c:38
23:44:20  GDB OUT No locals.
23:44:20  GDB OUT #2  0x00000228f462c1b4 in wait_for ()
23:44:20  GDB OUT No symbol table info available.
23:44:20  GDB OUT #3  0x00000228f45fba3c in execute_command_internal ()
23:44:20  GDB OUT No symbol table info available.
23:44:20  GDB OUT #4  0x00000228f45fc080 in execute_command ()
23:44:20  GDB OUT No symbol table info available.
23:44:20  GDB OUT #5  0x00000228f45e13c0 in reader_loop ()
23:44:20  GDB OUT No symbol table info available.
23:44:20  GDB OUT #6  0x00000228f45d0f6c in main ()
23:44:20  GDB OUT No symbol table info available.
23:44:20  GDB OUT Saved corefile core.3763082.1
23:44:20  GDB OUT [Inferior 1 (process 3763082) detached]
23:44:20  INFO: Running '/usr/bin/gdb' failed with rc = 1
23:44:20  GDB ERR 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
23:44:20  GDB ERR warning: target file /proc/3763082/cmdline contained unexpected null characters
23:44:20  GDB ERR /tmp/debugger4681185359812964385.txt:6: Error in sourced command file:
23:44:20  GDB ERR Requires argument (inferior id(s) to detach)
23:44:20  
23:44:20  INFO: Sleep for 60000 millis before next capture.
23:45:16  ***[TEST INFO 2024/11/23 04:45:14] executing /usr/bin/gdb -batch -x /tmp/debugger4681185359812964385.txt bash 3763082***
23:45:16  GDB OUT [Thread debugging using libthread_db enabled]
23:45:16  GDB OUT Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
23:45:16  GDB OUT 0x00007e8175f1edf8 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
23:45:16  GDB OUT From                To                  Syms Read   Shared Object Library
23:45:16  GDB OUT 0x00007e817609d620  0x00007e81760b8128  Yes (*)     /lib/powerpc64le-linux-gnu/libtinfo.so.6
23:45:16  GDB OUT 0x00007e8175e2cc00  0x00007e8175ff3c3c  Yes         /lib/powerpc64le-linux-gnu/libc.so.6
23:45:16  GDB OUT 0x00007e8176121f00  0x00007e81761635ac  Yes         /lib64/ld64.so.2
23:45:16  GDB OUT (*): Shared library is missing debugging information.
23:45:16  GDB OUT r0             0x72                114
23:45:16  GDB OUT r1             0x7fffe8b3ad90      140737097477520
23:45:16  GDB OUT r2             0x7e8176086e00      139094496144896
23:45:16  GDB OUT r3             0xfffffffffffffe00  18446744073709551104
23:45:16  GDB OUT r4             0x7fffe8b3ae90      140737097477776
23:45:16  GDB OUT r5             0x0                 0
23:45:16  GDB OUT r6             0x0                 0
23:45:16  GDB OUT r7             0x396b8e            3763086
23:45:16  GDB OUT r8             0x229062debb0       2375220587440
23:45:16  GDB OUT r9             0x0                 0
23:45:16  GDB OUT r10            0x0                 0
23:45:16  GDB OUT r11            0x0                 0
23:45:16  GDB OUT r12            0x0                 0
23:45:16  GDB OUT r13            0x7e817619b0e0      139094497276128
23:45:16  GDB OUT r14            0x228f47b8fb0       2374923685808
23:45:16  GDB OUT r15            0x228f47bc7e4       2374923700196
23:45:16  GDB OUT r16            0x0                 0
23:45:16  GDB OUT r17            0x0                 0
23:45:16  GDB OUT r18            0x7fffe8b3ae90      140737097477776
23:45:16  GDB OUT r19            0x228f47b8940       2374923684160
23:45:16  GDB OUT r20            0x228f47b1af0       2374923655920
23:45:16  GDB OUT r21            0x0                 0
23:45:16  GDB OUT r22            0x0                 0
23:45:16  GDB OUT r23            0x1                 1
23:45:16  GDB OUT r24            0x228f47bb400       2374923695104
23:45:16  GDB OUT r25            0x228f47bc540       2374923699520
23:45:16  GDB OUT r26            0x0                 0
23:45:16  GDB OUT r27            0xffff              65535
23:45:16  GDB OUT r28            0x228f47b04a8       2374923650216
23:45:16  GDB OUT r29            0x229062e42f0       2375220609776
23:45:16  GDB OUT r30            0x396b8e            3763086
23:45:16  GDB OUT r31            0x228f47b8e54       2374923685460
23:45:16  GDB OUT pc             0x7e8175f1edf8      0x7e8175f1edf8 <__GI___wait4+56>
23:45:16  GDB OUT msr            0x800000000000d033  9223372036854829107
23:45:16  GDB OUT cr             0x48224202          1210204674
23:45:16  GDB OUT lr             0x7e8175f1edf8      0x7e8175f1edf8 <__GI___wait4+56>
23:45:16  GDB OUT ctr            0x0                 0
23:45:16  GDB OUT xer            0x0                 0
23:45:16  GDB OUT fpscr          0x0                 0
23:45:16  GDB OUT vscr           0x0                 0
23:45:16  GDB OUT vrsave         0xffffffff          -1
23:45:16  GDB OUT ppr            0xc000000000000     3377699720527872
23:45:16  GDB OUT dscr           0x0                 0
23:45:16  GDB OUT tar            0x0                 0
23:45:16  GDB OUT bescr          <unavailable>
23:45:16  GDB OUT ebbhr          <unavailable>
23:45:16  GDB OUT ebbrr          <unavailable>
23:45:16  GDB OUT mmcr0          0x0                 0
23:45:16  GDB OUT mmcr2          0x0                 0
23:45:16  GDB OUT siar           0x0                 0
23:45:16  GDB OUT sdar           0x0                 0
23:45:16  GDB OUT sier           0x0                 0
23:45:16  GDB OUT orig_r3        0xffffffffffffffff  -1
23:45:16  GDB OUT trap           0x3000              12288
23:45:16  GDB OUT   Id   Target Id                                  Frame 
23:45:16  GDB OUT * 1    Thread 0x7e81761938a0 (LWP 3763082) "bash" 0x00007e8175f1edf8 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
23:45:16  GDB OUT 
23:45:16  GDB OUT Thread 1 (Thread 0x7e81761938a0 (LWP 3763082) "bash"):
23:45:16  GDB OUT #0  0x00007e8175f1edf8 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
23:45:16  GDB OUT         sc_ret = -512
23:45:16  GDB OUT         sc_ret = <optimized out>
23:45:16  GDB OUT #1  0x00007e8175f1ed4c in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at waitpid.c:38
23:45:16  GDB OUT No locals.
23:45:16  GDB OUT #2  0x00000228f462c1b4 in wait_for ()
23:45:16  GDB OUT No symbol table info available.
23:45:16  GDB OUT #3  0x00000228f45fba3c in execute_command_internal ()
23:45:16  GDB OUT No symbol table info available.
23:45:16  GDB OUT #4  0x00000228f45fc080 in execute_command ()
23:45:16  GDB OUT No symbol table info available.
23:45:16  GDB OUT #5  0x00000228f45e13c0 in reader_loop ()
23:45:16  GDB OUT No symbol table info available.
23:45:16  GDB OUT #6  0x00000228f45d0f6c in main ()
23:45:16  GDB OUT No symbol table info available.
23:45:16  GDB OUT Saved corefile core.3763082.2
23:45:16  GDB OUT A debugging session is active.
23:45:16  GDB OUT 
23:45:16  GDB OUT 	Inferior 1 [process 3763082] will be detached.
23:45:16  GDB OUT 
23:45:16  GDB OUT Quit anyway? (y or n) [answered Y; input not from terminal]
23:45:16  GDB OUT [Inferior 1 (process 3763082) detached]
23:45:16  ***[TEST INFO 2024/11/23 04:45:14] executing kill -ABRT 3763082***
23:45:16  Time spent executing: 360653 milliseconds
23:45:16  Test result: FAILED
23:45:16  Output from test:
23:45:16  ***[TEST INFO 2024/11/23 04:45:14] kill -ABRT signal sent***
23:45:16   [OUT] start running script
23:45:16   [OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
23:45:16   [OUT] export LD_BIND_NOT=on
23:45:16   [OUT] /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java -XX:+EnableCRIUSupport  -XX:+JVMPortableRestoreMode  -Xtrace:print={j9jcl.533,j9vm.684-696,j9vm.699,j9vm.717-743} --add-exports java.base/openj9.internal.criu=ALL-UNNAMED -cp /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.TimeChangeTest testTimeCompensation 1
23:45:16  
23:45:16  ***[TEST INFO 2024/11/23 04:45:14] ABRT completed***
23:45:16  Output from test:

@jdmpapin
Copy link
Contributor

jdmpapin commented Nov 26, 2024

This scenario can never happen in a normal run because if a similar situation happened because of say HCR, no one would be able to invoke the old foo, because the Runtime Assumptions would have compensated those call sites.

HCR guards only protect inlined code, not calls. When a class is redefined, we invalidate its JIT bodies:

// Step 2 invalidate methods that are already compiled and trigger a new compilation.
if (staleMethod && freshMethod && (startPC = compInfo->getPCIfCompiled(staleMethod)))
{
// Update the ram method information in PersistentMethodInfo
TR_PersistentJittedBodyInfo *bodyInfo = TR::Recompilation::getJittedBodyInfoFromPC(startPC);
if (bodyInfo)
{
reportHookDetail(currentThread, "jitClassesRedefined", " Invalidate method body stale=%p startPC=%p", staleMethod, startPC);
TR::Recompilation::invalidateMethodBody(startPC, fe);
bodyInfo->setDisableSampling(true);
TR_PersistentMethodInfo *pmi = bodyInfo->getMethodInfo();
if (pmi)
{
pmi->setHasBeenReplaced();
}
// Note: we don't do pmi->setMethodInfo here because (a) they'll be patched by runtime
// assumptions anyway, and (b) we can only get our hands on the latest PMI, not all the
// older ones.
}
// Same as TR_ResolvedMethod::isNative without allocating Scratch Memory
else if(_J9ROMMETHOD_J9MODIFIER_IS_SET((J9_ROM_METHOD_FROM_RAM_METHOD(staleMethod)), J9AccNative ))
{
reportHookDetail(currentThread, "jitClassesRedefined", "No need to invalidate native method stale=%p startPC=%p", staleMethod, startPC);
}
else
{
reportHookDetail(currentThread, "jitClassesRedefined", "WARNING! Cannot invalidate method body stale=%p startPC=%p", staleMethod, startPC);
TR_ASSERT(0,"JIT HCR should make all methods recompilable, so startPC=%p should have a persistentBodyInfo", startPC);
}
}

There could be direct calls to these bodies from others

Preexistence invalidation can do the same thing

I have changes in the works to make class unloading do the same thing as well, when a surviving JIT body has unloaded methods inlined into it (for #16616)

To solve this, in J9::OptionsPostRestore::invalidateCompiledMethod, instead of calling TR::Recompilation::invalidateMethodBody, I may need to call J9::Recompilation::methodCannotBeRecompiled for all compiled bodies. However I'll have to think about it some more.

Since a JIT body could be invalidated for other reasons, and there could be direct calls to it, I don't think this will work. Is it maybe possible to fix it in step 7? It seems to me that the problem is that the invalidated body fails to forward to the body that is supposed to have replaced it. If it did still forward, then we'd call the old body, which would jump to the new body, which would in turn jump to the interpreter

@JamesKingdon
Copy link
Contributor

JamesKingdon commented Nov 26, 2024

I happened to be looking at this recently, and I thought I'd mention that under FSD there's a completely separate path branched from the conditional at

if (!TR::Options::getCmdLineOptions()->getOption(TR_FullSpeedDebug))

@dsouzai
Copy link
Contributor

dsouzai commented Nov 27, 2024

To solve this, in J9::OptionsPostRestore::invalidateCompiledMethod, instead of calling TR::Recompilation::invalidateMethodBody, I may need to call J9::Recompilation::methodCannotBeRecompiled for all compiled bodies. However I'll have to think about it some more.

Since a JIT body could be invalidated for other reasons, and there could be direct calls to it, I don't think this will work.

Why wouldn't this work?

For context, this is related to Options Processing Post-Restore; if we see -Xnojit -Xnoaot, I've taken the meaning to be "prevent further compilation, minimize execution of existing code". The complication comes from the fact that we generated FSD code pre-checkpoint, but also in the pre-checkpoint hook when all threads are stopped, we recompile those FSD bodies proactively and update the extra field (and I guess patch the old body's startPC to jump to the helper to forward the call). It is worth noting that at this point, all threads only have FSD bodies on their stacks.

If post-restore some method makes a direct call to the old body AFTER

  1. the new recompiled body get patched to jump to the interpreter AND
  2. the interpreter makes a comp request (because someone invoked the new body) AND
  3. the comp request fails (because we don't allow any compilation) and the extra is now updated to -3

the forwarding helper of the old body will not be able to find the new body; it doesn't matter that the new body's startPC was appropriately patched to transition to the interpreter. This is due to the fact that calling TR::Recompilation::invalidateMethodBody does not immediately result in the startPC getting patched to transition to the interpreter.

So, in the case where I want to prevent execution to the extent possible in the Post-Restore Options Processing code when we see -Xnojit -Xnoaot, if I call J9::Recompilation::methodCannotBeRecompiled on ALL compiled bodies, then the startPC will always transition to the interpreter regardless of whether it was a direct call to the old body or an invocation of the new body.

I'm not suggesting to do this anywhere else; it only makes sese to do so in this very specific case; location [1] for context.

[1]

TR_PersistentMethodInfo *pmi = bodyInfo->getMethodInfo();
pmi->setIsExcludedPostRestore();
TR::Recompilation::invalidateMethodBody(startPC, fej9);

@jdmpapin
Copy link
Contributor

Sorry, "I don't think this will work" was too strong. I agree with your reasoning that your strategy would prevent this problem when compilation fails due to -Xnojit -Xnoaot post-restore:

when we see -Xnojit -Xnoaot, if I call J9::Recompilation::methodCannotBeRecompiled on ALL compiled bodies, then the startPC will always transition to the interpreter

at least assuming that by ALL compiled bodies you mean including the FSD bodies that have been replaced by the pre-checkpoint hook

What I should have said is that I thought it would still be possible to get stuck in a loop that's essentially the same as your steps 7-9 if the state were set up by other kinds of invalidation and by compilation failure for some reason other than the post-restore options -Xnojit -Xnoaot, e.g. by involving HCR and/or preexistence even outside of a CRIU scenario

However, I tried to observe such a thing this morning and I haven't been able to - my bad. I guess I thought that the statement I originally quoted was important for preventing it, i.e. I thought it would be a problem if you could run a direct call to an old body invalidated due to e.g. HCR, which you can. But it seems that that isn't a problem after all. Here are the differences in a normal run AFAICT:

  • It looks like we only set J9_JIT_NEVER_TRANSLATE when a first-time compilation fails. If a JIT body is invalidated due to preexistence (say), then any later failure to compile that method is a recompilation failure that does not set J9_JIT_NEVER_TRANSLATE
  • In the case of HCR invalidation, the new definition is actually a new J9Method. I think a compilation of the new method probably doesn't count as a recompilation of the original one for the purpose of your step 8

So... carry on then 😅

@dsouzai
Copy link
Contributor

dsouzai commented Nov 28, 2024

I spoke with Devin offline yesterday about a scenario I came up with where this hang could happen even outside of CRIU. Essentially I was wondering if a method getting invalidated because of prex could lead to the same set of circumstances where you have:

  1. Original compile
  2. Recompile (with prex)
  3. Prex invalidation triggers another recompilation, which fails
  4. Body in 2 is patched to revert to the interpreter, which resets the extra, which triggers a first time compile, which also fails, thus leading to the extra to get set to -3
  5. Some other method has a direct call to the body in 1, leading to the hang seen in this issue.

Devin had a test case that I was able to use to recreate the hang outside of CRIU.

The reason this has been happening recently is because of #20387 where initializeMethodRunAddressNoHook was updated with

method->extra = (void *) J9_STARTPC_NOT_TRANSLATED;

which had follow on consequences in this very niche set of circumstances.

Now, the only reason the hang is possible is because of

// Async compilation is disabled or recompilations triggered from jitted code
//
// Check if the method has already been scheduled for compilation and return
// abruptly if so. This will reduce contention on the optimizationPlanMonitor
// when a profiled compilation is followed by a very slow recompilation
// (until the recompilation is over, the java threads will try to trigger
// recompilations again and again. See defect 193193)
//
if (linkageInfo->isBeingCompiled())
{
TR_J9VMBase *fe = TR_J9VMBase::get(jitConfig, vmThread);
if (fe->isAsyncCompilation())
return 0; // early return because the method is queued for compilation
}

Essentially, once we set IsBeingRecompiled on the LinkageInfo, we never reset it, which is why we do the early return above; without the early return we would have attempted a compilation, which if it failed, would have patched the startPC to revert to the interpreter.

So, I think, rather than my previous suggestion of calling J9::Recompilation::methodCannotBeRecompiled, I should just call linkageInfo->resetIsBeingRecompiled() in J9::Recompilation::methodHasBeenRecompiled. That should fix both the hang seen in this issue and the hang in Devin's test case.

@dsouzai
Copy link
Contributor

dsouzai commented Dec 2, 2024

So, I think, rather than my previous suggestion of calling J9::Recompilation::methodCannotBeRecompiled, I should just call linkageInfo->resetIsBeingRecompiled() in J9::Recompilation::methodHasBeenRecompiled. That should fix both the hang seen in this issue and the hang in Devin's test case.

Marius had a point that currently the linkage info flags are not synchronized, so while adding a flag is generally safe, resetting a flag could lead to some race conditions. So, will need to think about this some more...

@jdmpapin
Copy link
Contributor

jdmpapin commented Dec 2, 2024

Surely setting a flag is no more atomic than clearing one?

@dsouzai
Copy link
Contributor

dsouzai commented Dec 2, 2024

I think he was referring to it in the context of how those flags are used with the existing recomp mechanism; essentially, it seems that the infra has been shown to be valid (mostly, considering this and the prex issue) when going from unset to set, but it's not clear whether going from set to unset plays well.

So, I guess by "think about this some more", resetting flag isn't off the table, just that I will need to prove that it does not have any unintended consequences; given that a lot of this is relevant in Recomp.cpp, it's fairly dangerous code to modify lightly.

@dsouzai
Copy link
Contributor

dsouzai commented Dec 4, 2024

Found a better solution, see #20751 Actual solution: #20763

Copy link

Issue Number: 20663
Status: Closed
Actual Components: comp:jit, test failure, criu
Actual Assignees: No one :(
PR Assignees: No one :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:jit criu Used to track CRIU snapshot related work test failure
Projects
None yet
7 participants