-
Notifications
You must be signed in to change notification settings - Fork 729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmdLineTester_criu_jitPostRestore Test -Xnojit -Xnoaot hang #20663
Comments
Issue Number: 20663 |
https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_s390x_linux_Nightly_testList_0/116 - rh8-390-2 |
@JasonFengJ9 Please take a look |
50x grinder - https://openj9-jenkins.osuosl.org/job/Grinder/3998/ |
The grinder #20663 (comment) 40/50 passed in 4 machines, 10/10 failed at |
Can rh8-390-2 be reboot? FYI @AdamBrousseau |
Done |
3x grinder on the same machine - https://openj9-jenkins.osuosl.org/job/Grinder/3999/ - all failed |
It also occurs on rh8-390-1 as per the latest failure, which I'm sure I've seen before, but didn't spend much time looking for it. |
Reproduced the hang at a fyre machine,
Attach gdb to the hang java process and the native stacktrace are
FYI @dsouzai |
Hm, I spent a good chunk of time trying to reason about this and I'm not sure what the root cause could be. Thread 22 above isn't really doing anything wrong. However, Thread 2 I don't know why it's calling I'll have to repro the problem myself to get my hands on more context. |
Set up my own fyre machine and was able to repro the issue similar to Jason. I think I know what's going on. The hang, I believe, happens because the thread is stuck in a loop of trying to trigger a recompile (which does not succeed because of Normally what happens is, if the recompile request fails, the start PC gets patched to jump to the interpreter. In this case though, things are more complicated because of
This scenario can never happen in a normal run because if a similar situation happened because of say HCR, no one would be able to invoke the old To solve this, in I suppose depending on what we want the semantics of [1] openj9/runtime/compiler/control/rossa.cpp Lines 326 to 331 in 395aab5
|
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_ppc64le_linux_Nightly_testList_0/658 - ubu22-ppc64le-5
|
HCR guards only protect inlined code, not calls. When a class is redefined, we invalidate its JIT bodies: openj9/runtime/compiler/control/HookedByTheJit.cpp Lines 2582 to 2611 in 395aab5
There could be direct calls to these bodies from others Preexistence invalidation can do the same thing I have changes in the works to make class unloading do the same thing as well, when a surviving JIT body has unloaded methods inlined into it (for #16616)
Since a JIT body could be invalidated for other reasons, and there could be direct calls to it, I don't think this will work. Is it maybe possible to fix it in step 7? It seems to me that the problem is that the invalidated body fails to forward to the body that is supposed to have replaced it. If it did still forward, then we'd call the old body, which would jump to the new body, which would in turn jump to the interpreter |
I happened to be looking at this recently, and I thought I'd mention that under FSD there's a completely separate path branched from the conditional at
|
Why wouldn't this work? For context, this is related to Options Processing Post-Restore; if we see If post-restore some method makes a direct call to the old body AFTER
the forwarding helper of the old body will not be able to find the new body; it doesn't matter that the new body's startPC was appropriately patched to transition to the interpreter. This is due to the fact that calling So, in the case where I want to prevent execution to the extent possible in the Post-Restore Options Processing code when we see I'm not suggesting to do this anywhere else; it only makes sese to do so in this very specific case; location [1] for context. [1] openj9/runtime/compiler/control/OptionsPostRestore.cpp Lines 439 to 442 in 395aab5
|
Sorry, "I don't think this will work" was too strong. I agree with your reasoning that your strategy would prevent this problem when compilation fails due to
at least assuming that by ALL compiled bodies you mean including the FSD bodies that have been replaced by the pre-checkpoint hook What I should have said is that I thought it would still be possible to get stuck in a loop that's essentially the same as your steps 7-9 if the state were set up by other kinds of invalidation and by compilation failure for some reason other than the post-restore options However, I tried to observe such a thing this morning and I haven't been able to - my bad. I guess I thought that the statement I originally quoted was important for preventing it, i.e. I thought it would be a problem if you could run a direct call to an old body invalidated due to e.g. HCR, which you can. But it seems that that isn't a problem after all. Here are the differences in a normal run AFAICT:
So... carry on then 😅 |
I spoke with Devin offline yesterday about a scenario I came up with where this hang could happen even outside of CRIU. Essentially I was wondering if a method getting invalidated because of prex could lead to the same set of circumstances where you have:
Devin had a test case that I was able to use to recreate the hang outside of CRIU. The reason this has been happening recently is because of #20387 where
which had follow on consequences in this very niche set of circumstances. Now, the only reason the hang is possible is because of openj9/runtime/compiler/control/rossa.cpp Lines 318 to 331 in 3817e9b
Essentially, once we set So, I think, rather than my previous suggestion of calling |
Marius had a point that currently the linkage info flags are not synchronized, so while adding a flag is generally safe, resetting a flag could lead to some race conditions. So, will need to think about this some more... |
Surely setting a flag is no more atomic than clearing one? |
I think he was referring to it in the context of how those flags are used with the existing recomp mechanism; essentially, it seems that the infra has been shown to be valid (mostly, considering this and the prex issue) when going from unset to set, but it's not clear whether going from set to unset plays well. So, I guess by "think about this some more", resetting flag isn't off the table, just that I will need to prove that it does not have any unintended consequences; given that a lot of this is relevant in Recomp.cpp, it's fairly dangerous code to modify lightly. |
Issue Number: 20663 |
I've seen this multiple times in the last few days, always on zlinux.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_s390x_linux_Nightly_testList_0/358 - rh8-390-2
cmdLineTester_criu_jitPostRestore_3
The text was updated successfully, but these errors were encountered: