debug what happened with 14031 #14042

Open
rmuir opened this issue Dec 4, 2024 · 6 comments

Comments

@rmuir
Member

rmuir commented Dec 4, 2024

Description

PR #14031 made the float vector functions inlinable, but caused a drop in the nightly benchmark.
We reverted the PR in #14041, but we should try to figure out what happened, as we don't want unpredictable performance.

Maybe the problem is specific to AMD; we should test that. It would also be great if we could safely add some debugging to the nightly benchmark to understand what happens during compilation (e.g. -XX:+PrintCompilation or -XX:+LogCompilation).
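A low-risk way to capture that would be to append the diagnostic flags to the forked benchmark JVM, for example through JMH's @Fork annotation. A minimal sketch, assuming a JMH setup similar to VectorUtilBenchmark (the class and field names here are made up for illustration; how the nightly luceneutil run actually forks its JVMs may differ):

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class CompilationDebugBenchmark {
  float[] a = new float[1024];
  float[] b = new float[1024];

  // The forked JVM prints every JIT compilation event; -XX:+LogCompilation
  // additionally writes an XML log that tools such as JITWatch can read.
  // LogCompilation is a diagnostic option, hence UnlockDiagnosticVMOptions.
  @Fork(value = 1, jvmArgsAppend = {
      "-XX:+PrintCompilation",
      "-XX:+UnlockDiagnosticVMOptions",
      "-XX:+LogCompilation"})
  @Benchmark
  public float dotProduct() {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i]; // stand-in for the real VectorUtil call
    }
    return sum;
  }
}
```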

@rmuir rmuir added the type:task label Dec 4, 2024
@jpountz
Contributor

jpountz commented Dec 6, 2024

I have a computer with a AMD Ryzen 9 3900X 12-Core Processor in case you want me to run some benchmarks.

@ChrisHegarty
Contributor

I've yet to do any real debugging, but my sense here is that the aggressive unrolling that we do (4x with float dot) is hurting us when we inline. I'll try to reproduce on my Intel box; this might not be AMD specific (it might just hurt more there). I want to run luceneutil benchmarks rather than the micro-benchmarks.
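For context, the 4x hand-unrolling under discussion looks roughly like the sketch below. This is a simplified illustration using the Panama Vector API, not the actual Lucene PanamaVectorUtilSupport code; it assumes the caller passes a limit that is a multiple of 4 * SPECIES.length():

```java
// Requires --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class DotProductSketch {
  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // 4x hand-unrolled main loop: four independent accumulators so that four
  // FMA dependency chains can run in parallel on the CPU's FMA units.
  static float dotProductBody(float[] a, float[] b, int limit) {
    FloatVector acc1 = FloatVector.zero(SPECIES);
    FloatVector acc2 = FloatVector.zero(SPECIES);
    FloatVector acc3 = FloatVector.zero(SPECIES);
    FloatVector acc4 = FloatVector.zero(SPECIES);
    int step = SPECIES.length();
    for (int i = 0; i < limit; i += 4 * step) {
      acc1 = FloatVector.fromArray(SPECIES, a, i)
          .fma(FloatVector.fromArray(SPECIES, b, i), acc1);
      acc2 = FloatVector.fromArray(SPECIES, a, i + step)
          .fma(FloatVector.fromArray(SPECIES, b, i + step), acc2);
      acc3 = FloatVector.fromArray(SPECIES, a, i + 2 * step)
          .fma(FloatVector.fromArray(SPECIES, b, i + 2 * step), acc3);
      acc4 = FloatVector.fromArray(SPECIES, a, i + 3 * step)
          .fma(FloatVector.fromArray(SPECIES, b, i + 3 * step), acc4);
    }
    // Reduce the four partial sums to a scalar.
    return acc1.add(acc2).add(acc3.add(acc4)).reduceLanes(VectorOperators.ADD);
  }
}
```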

@rmuir
Member Author

rmuir commented Dec 11, 2024

> I've yet to do any real debugging, but my sense here is that the aggressive unrolling that we do (4x with float dot) is hurting us when we inline. I'll try to reproduce on my Intel box; this might not be AMD specific (it might just hurt more there). I want to run luceneutil benchmarks rather than the micro-benchmarks.

What else are we to do, though? CPUs have multiple FMA units, and the JVM won't unroll this way on its own, as it would change the floating-point results.

@ChrisHegarty
Contributor

ChrisHegarty commented Dec 16, 2024

HotSpot will unroll loops that use the Vector API to do floating-point arithmetic. On my Intel box dotProductBody gets unrolled 4x, and since it is already hand-unrolled 4x, we get effectively 16x unrolling, e.g.:

;; B49: #      out( B49 B50 ) <- in( B48 B49 ) Loop( B49-B49 inner main of N191 strip mined) Freq: 272.018
0x000079609ff2fa11:   vmovdqu32   zmm0,ZMMWORD PTR [rdx+rax*4+0x310]
0x000079609ff2fa1c:   vmovdqu32   zmm1,ZMMWORD PTR [rdx+rax*4+0x210]
0x000079609ff2fa27:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0xd0]
0x000079609ff2fa32:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0xd0]
0x000079609ff2fa3d:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x1d0]
0x000079609ff2fa48:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x1d0]
0x000079609ff2fa53:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x2d0]
0x000079609ff2fa5e:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x2d0]
0x000079609ff2fa69:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x3d0]
0x000079609ff2fa74:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x3d0]
0x000079609ff2fa7f:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x90]
0x000079609ff2fa8a:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x90]
0x000079609ff2fa95:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x190]
0x000079609ff2faa0:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x190]
0x000079609ff2faab:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x290]
0x000079609ff2fab6:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x290]
0x000079609ff2fac1:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x390]
0x000079609ff2facc:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x390]
0x000079609ff2fad7:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x50]
0x000079609ff2fae2:   vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x50]
0x000079609ff2faed:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x150]
0x000079609ff2faf8:   vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x150]
0x000079609ff2fb03:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x250]
0x000079609ff2fb0e:   vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x250]
0x000079609ff2fb19:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x350]
0x000079609ff2fb2f:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x10]
0x000079609ff2fb3a:   vfmadd231ps zmm2,zmm4,ZMMWORD PTR [rcx+rax*4+0x10]
0x000079609ff2fb45:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x110]
0x000079609ff2fb50:   vfmadd231ps zmm2,zmm4,ZMMWORD PTR [rcx+rax*4+0x110]
0x000079609ff2fb5b:   vfmadd231ps zmm2,zmm1,ZMMWORD PTR [rcx+rax*4+0x210]
0x000079609ff2fb66:   vfmadd231ps zmm2,zmm0,ZMMWORD PTR [rcx+rax*4+0x310]
0x000079609ff2fb71:   add    eax,0x100
0x000079609ff2fb76:   cmp    eax,ebp
0x000079609ff2fb78:   jl     0x000079609ff2fa11
;; B50: #      out( B48 B51 ) <- in( B49 )  Freq: 16.0002

Reducing the unrolling of dotProductBody to 2x (e.g. a draft PR) gives me a bit of an improvement (a rough sketch of the variant follows the numbers below).

Linux

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
main
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  31.888 ± 0.812  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  26.240 ± 0.550  ops/us

reduce unroll to x2
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  35.129 ± 0.749  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  28.060 ± 0.619  ops/us

reduce unroll to x2 AND first is mul (rather than FMA)
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  37.100 ± 0.726  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  29.172 ± 0.514  ops/us
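The "reduce unroll to x2 AND first is mul" variant corresponds roughly to the following sketch (an illustration of the idea, reusing SPECIES from the sketch earlier in the thread, not the actual draft PR; limit is assumed to be a positive multiple of 2 * SPECIES.length()):

```java
// 2x unrolled; each accumulator starts with a plain multiply instead of an
// FMA into a zero vector, saving one dependent operation per chain.
static float dotProductBody2x(float[] a, float[] b, int limit) {
  int step = SPECIES.length();
  FloatVector acc1 = FloatVector.fromArray(SPECIES, a, 0)
      .mul(FloatVector.fromArray(SPECIES, b, 0));
  FloatVector acc2 = FloatVector.fromArray(SPECIES, a, step)
      .mul(FloatVector.fromArray(SPECIES, b, step));
  for (int i = 2 * step; i < limit; i += 2 * step) {
    acc1 = FloatVector.fromArray(SPECIES, a, i)
        .fma(FloatVector.fromArray(SPECIES, b, i), acc1);
    acc2 = FloatVector.fromArray(SPECIES, a, i + step)
        .fma(FloatVector.fromArray(SPECIES, b, i + step), acc2);
  }
  return acc1.add(acc2).reduceLanes(VectorOperators.ADD);
}
```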

@rmuir
Member Author

rmuir commented Dec 16, 2024

Do you know why we see blocks of 4 FMA instructions such as vfmadd231ps zmm6,zmm4 using the same registers? This isn't helpful and seems to create unnecessary data dependencies. So if that's "JVM unrolling", then it isn't good enough to count for our purposes?

@rmuir
Member Author

rmuir commented Dec 17, 2024

I think we are having a communication issue over terminology. I don't care about unrolling, I care about superscalar execution. The JVM won't create it on its own, which means the hardware sits there idle and wasted, so we unroll the code ourselves to make this parallelism possible.

Does that make sense? The "unrolling" the JVM does, where it reuses the same registers over and over and forces serial execution, is irrelevant for this purpose.
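To make the distinction concrete, here is a scalar sketch (illustrative only): both loops do the same multiply-adds, but the first forms one long dependency chain on sum and runs at FMA latency no matter how many FMA units the CPU has, while the second keeps four independent chains that superscalar hardware can execute in parallel. The JIT cannot turn the first into the second on its own, because reassociating the additions changes the floating-point result:

```java
// One accumulator: every fma depends on the previous result.
static float dotSerial(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum = Math.fma(a[i], b[i], sum);
  }
  return sum;
}

// Four accumulators: four independent chains keep multiple FMA units busy.
// The result may differ slightly from dotSerial() because the additions are
// reassociated -- which is exactly why the JIT won't do this by itself.
static float dotUnrolled(float[] a, float[] b) {
  float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
  int i = 0;
  for (; i + 3 < a.length; i += 4) {
    s0 = Math.fma(a[i], b[i], s0);
    s1 = Math.fma(a[i + 1], b[i + 1], s1);
    s2 = Math.fma(a[i + 2], b[i + 2], s2);
    s3 = Math.fma(a[i + 3], b[i + 3], s3);
  }
  for (; i < a.length; i++) { // scalar tail
    s0 = Math.fma(a[i], b[i], s0);
  }
  return (s0 + s1) + (s2 + s3);
}
```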

I totally see why they don't do this: besides the fact that there is no "fast math"-type flag that I am aware of, it would be insanely confusing in general to start getting different answers after JIT compilation "changed".
