debug what happened with 14031 #14042

Open
rmuir opened this issue Dec 4, 2024 · 6 comments

Comments

@rmuir
Member

rmuir commented Dec 4, 2024

Description

PR #14031 made the float vector functions inlinable, but caused a drop in the nightly benchmark.
We reverted the PR in #14041, but we should try to figure out what happened, as we don't want unpredictable performance.

Maybe the problem is specific to AMD; we should test that. It would also be great if we could safely add some debugging to the nightly benchmark to understand what happens during compilation (e.g. -XX:+PrintCompilation or -XX:+LogCompilation).
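A low-risk way to capture that would be to append the diagnostic flags to the forked benchmark JVM, for example through JMH's @Fork annotation. A minimal sketch, assuming a JMH setup similar to VectorUtilBenchmark (the class and field names here are made up for illustration; how the nightly luceneutil run actually forks its JVMs may differ):

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class CompilationDebugBenchmark {
  float[] a = new float[1024];
  float[] b = new float[1024];

  // The forked JVM prints every JIT compilation event; -XX:+LogCompilation
  // additionally writes an XML log that tools such as JITWatch can read.
  // LogCompilation is a diagnostic option, hence UnlockDiagnosticVMOptions.
  @Fork(value = 1, jvmArgsAppend = {
      "-XX:+PrintCompilation",
      "-XX:+UnlockDiagnosticVMOptions",
      "-XX:+LogCompilation"})
  @Benchmark
  public float dotProduct() {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i]; // stand-in for the real VectorUtil call
    }
    return sum;
  }
}
```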

@rmuir rmuir added the type:task label Dec 4, 2024
@jpountz
Contributor

jpountz commented Dec 6, 2024

I have a computer with a AMD Ryzen 9 3900X 12-Core Processor in case you want me to run some benchmarks.

@ChrisHegarty
Contributor

I've yet to do any real debugging, but my sense here is that the aggressive unrolling that we do (4x with float dot) is hurting us when we inline. I'll try to reproduce on my Intel box; this might not be AMD specific (it might just hurt more there). I want to run luceneutil benchmarks rather than the micro-benchmarks.
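For context, the 4x hand-unrolling under discussion looks roughly like the sketch below. This is a simplified illustration using the Panama Vector API, not the actual Lucene PanamaVectorUtilSupport code; it assumes the caller passes a limit that is a multiple of 4 * SPECIES.length():

```java
// Requires --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class DotProductSketch {
  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // 4x hand-unrolled main loop: four independent accumulators so that four
  // FMA dependency chains can run in parallel on the CPU's FMA units.
  static float dotProductBody(float[] a, float[] b, int limit) {
    FloatVector acc1 = FloatVector.zero(SPECIES);
    FloatVector acc2 = FloatVector.zero(SPECIES);
    FloatVector acc3 = FloatVector.zero(SPECIES);
    FloatVector acc4 = FloatVector.zero(SPECIES);
    int step = SPECIES.length();
    for (int i = 0; i < limit; i += 4 * step) {
      acc1 = FloatVector.fromArray(SPECIES, a, i)
          .fma(FloatVector.fromArray(SPECIES, b, i), acc1);
      acc2 = FloatVector.fromArray(SPECIES, a, i + step)
          .fma(FloatVector.fromArray(SPECIES, b, i + step), acc2);
      acc3 = FloatVector.fromArray(SPECIES, a, i + 2 * step)
          .fma(FloatVector.fromArray(SPECIES, b, i + 2 * step), acc3);
      acc4 = FloatVector.fromArray(SPECIES, a, i + 3 * step)
          .fma(FloatVector.fromArray(SPECIES, b, i + 3 * step), acc4);
    }
    // Reduce the four partial sums to a scalar.
    return acc1.add(acc2).add(acc3.add(acc4)).reduceLanes(VectorOperators.ADD);
  }
}
```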

@rmuir
Member Author

rmuir commented Dec 11, 2024

> I've yet to do any real debugging, but my sense here is that the aggressive unrolling that we do (4x with float dot) is hurting us when we inline. I'll try to reproduce on my Intel box; this might not be AMD specific (it might just hurt more there). I want to run luceneutil benchmarks rather than the micro-benchmarks.

What else are we to do, though? CPUs have multiple FMA units, and the JVM won't unroll this way on its own, as it would change the floating-point results.

@ChrisHegarty
Contributor

ChrisHegarty commented Dec 16, 2024

HotSpot will unroll loops that use the Vector API to do floating-point arithmetic. On my Intel box dotProductBody gets unrolled 4x, and since it is already hand-unrolled 4x, we get effectively 16x unrolling, e.g.:

;; B49: #      out( B49 B50 ) <- in( B48 B49 ) Loop( B49-B49 inner main of N191 strip mined) Freq: 272.018
0x000079609ff2fa11:   vmovdqu32   zmm0,ZMMWORD PTR [rdx+rax*4+0x310]
0x000079609ff2fa1c:   vmovdqu32   zmm1,ZMMWORD PTR [rdx+rax*4+0x210]
0x000079609ff2fa27:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0xd0]
0x000079609ff2fa32:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0xd0]
0x000079609ff2fa3d:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x1d0]
0x000079609ff2fa48:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x1d0]
0x000079609ff2fa53:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x2d0]
0x000079609ff2fa5e:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x2d0]
0x000079609ff2fa69:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x3d0]
0x000079609ff2fa74:   vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x3d0]
0x000079609ff2fa7f:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x90]
0x000079609ff2fa8a:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x90]
0x000079609ff2fa95:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x190]
0x000079609ff2faa0:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x190]
0x000079609ff2faab:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x290]
0x000079609ff2fab6:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x290]
0x000079609ff2fac1:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x390]
0x000079609ff2facc:   vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x390]
0x000079609ff2fad7:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x50]
0x000079609ff2fae2:   vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x50]
0x000079609ff2faed:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x150]
0x000079609ff2faf8:   vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x150]
0x000079609ff2fb03:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x250]
0x000079609ff2fb0e:   vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x250]
0x000079609ff2fb19:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x350]
0x000079609ff2fb2f:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x10]
0x000079609ff2fb3a:   vfmadd231ps zmm2,zmm4,ZMMWORD PTR [rcx+rax*4+0x10]
0x000079609ff2fb45:   vmovdqu32   zmm4,ZMMWORD PTR [rdx+rax*4+0x110]
0x000079609ff2fb50:   vfmadd231ps zmm2,zmm4,ZMMWORD PTR [rcx+rax*4+0x110]
0x000079609ff2fb5b:   vfmadd231ps zmm2,zmm1,ZMMWORD PTR [rcx+rax*4+0x210]
0x000079609ff2fb66:   vfmadd231ps zmm2,zmm0,ZMMWORD PTR [rcx+rax*4+0x310]
0x000079609ff2fb71:   add    eax,0x100
0x000079609ff2fb76:   cmp    eax,ebp
0x000079609ff2fb78:   jl     0x000079609ff2fa11
;; B50: #      out( B48 B51 ) <- in( B49 )  Freq: 16.0002

Reducing the unrolling of dotProductBody to 2x (e.g. a draft PR) gives me a bit of an improvement (a rough sketch of the variant follows the numbers below).

Linux

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
main
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  31.888 ± 0.812  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  26.240 ± 0.550  ops/us

reduce unroll to x2
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  35.129 ± 0.749  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  28.060 ± 0.619  ops/us

reduce unroll to x2 AND first is mul (rather than FMA)
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  37.100 ± 0.726  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  29.172 ± 0.514  ops/us
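The "reduce unroll to x2 AND first is mul" variant corresponds roughly to the following sketch (an illustration of the idea, reusing SPECIES from the sketch earlier in the thread, not the actual draft PR; limit is assumed to be a positive multiple of 2 * SPECIES.length()):

```java
// 2x unrolled; each accumulator starts with a plain multiply instead of an
// FMA into a zero vector, saving one dependent operation per chain.
static float dotProductBody2x(float[] a, float[] b, int limit) {
  int step = SPECIES.length();
  FloatVector acc1 = FloatVector.fromArray(SPECIES, a, 0)
      .mul(FloatVector.fromArray(SPECIES, b, 0));
  FloatVector acc2 = FloatVector.fromArray(SPECIES, a, step)
      .mul(FloatVector.fromArray(SPECIES, b, step));
  for (int i = 2 * step; i < limit; i += 2 * step) {
    acc1 = FloatVector.fromArray(SPECIES, a, i)
        .fma(FloatVector.fromArray(SPECIES, b, i), acc1);
    acc2 = FloatVector.fromArray(SPECIES, a, i + step)
        .fma(FloatVector.fromArray(SPECIES, b, i + step), acc2);
  }
  return acc1.add(acc2).reduceLanes(VectorOperators.ADD);
}
```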

@rmuir
Member Author

rmuir commented Dec 16, 2024

Do you know why we see blocks of 4 FMA instructions such as vfmadd231ps zmm6,zmm4 using the same registers? This isn't helpful and seems to create unnecessary data dependencies. So if that's "JVM unrolling", then it isn't good enough to count for our purposes?

@rmuir
Member Author

rmuir commented Dec 17, 2024

I think we are having a communication issue over terminology. I don't care about unrolling, I care about superscalar execution. The JVM won't create it on its own, which means the hardware sits there idle and wasted, so we unroll the code ourselves to make this parallelism possible.

Does that make sense? The "unrolling" the JVM does, where it reuses the same registers over and over and forces serial execution, is irrelevant for this purpose.
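To make the distinction concrete, here is a scalar sketch (illustrative only): both loops do the same multiply-adds, but the first forms one long dependency chain on sum and runs at FMA latency no matter how many FMA units the CPU has, while the second keeps four independent chains that superscalar hardware can execute in parallel. The JIT cannot turn the first into the second on its own, because reassociating the additions changes the floating-point result:

```java
// One accumulator: every fma depends on the previous result.
static float dotSerial(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum = Math.fma(a[i], b[i], sum);
  }
  return sum;
}

// Four accumulators: four independent chains keep multiple FMA units busy.
// The result may differ slightly from dotSerial() because the additions are
// reassociated -- which is exactly why the JIT won't do this by itself.
static float dotUnrolled(float[] a, float[] b) {
  float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
  int i = 0;
  for (; i + 3 < a.length; i += 4) {
    s0 = Math.fma(a[i], b[i], s0);
    s1 = Math.fma(a[i + 1], b[i + 1], s1);
    s2 = Math.fma(a[i + 2], b[i + 2], s2);
    s3 = Math.fma(a[i + 3], b[i + 3], s3);
  }
  for (; i < a.length; i++) { // scalar tail
    s0 = Math.fma(a[i], b[i], s0);
  }
  return (s0 + s1) + (s2 + s3);
}
```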

I totally see why they don't do this: besides the fact that there is no "fast math"-type flag that I am aware of, it would be insanely confusing in general to start getting different answers after JIT compilation "changed".
