Use LoopVectorization in julia stencil / transpose #543
LoopVectorization.jl usually does a better job than the Julia compiler + LLVM at unrolling and vectorization. You might want to use it for some of the benchmarks. For instance, on Zen 2:
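(The kernels themselves aren't quoted in this thread; the following is a minimal sketch modeled on LoopVectorization's documented 2-D filtering example, with hypothetical names `filter2d!` and `filter2davx!` — not the actual PRK stencil code. At the time of this thread the macro was spelled `@avx`; newer LoopVectorization releases call it `@turbo`.)

```julia
using LoopVectorization, OffsetArrays  # @avx comes from LoopVectorization

# Plain nested-loop 2-D stencil: for every interior point J, accumulate the
# weighted neighborhood. The Julia compiler + LLVM may or may not vectorize it.
function filter2d!(out, A, kern)
    for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    return out
end

# Identical loop nest behind @avx: LoopVectorization unrolls, vectorizes, and
# contracts the multiply-adds into fma instructions where the target has them.
function filter2davx!(out, A, kern)
    @avx for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    return out
end

# Example driver for a 7×7 kernel (radius r = 3), staying inside A's edges:
r    = 3
kern = OffsetArray(rand(2r + 1, 2r + 1), -r:r, -r:r)  # weights indexed -r:r
A    = rand(200, 200)
out  = OffsetArray(similar(A, size(A) .- 2r), r, r)   # interior points of A
filter2davx!(out, A, kern)
```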
Tested it on an Intel Xeon Gold 6154 CPU @ 3.00GHz with AVX-512 too, and the results are more dramatic:

```
julia> @benchmark do_stencil($A, $W, $B, $r, $n)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.800 ms (0.00% GC)
  median time:      46.831 ms (0.00% GC)
  mean time:        46.843 ms (0.00% GC)
  maximum time:     47.040 ms (0.00% GC)
  --------------
  samples:          107
  evals/sample:     1
```

versus

```
julia> @benchmark do_stencil_avx($A, $W, $B, $r, $n)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.192 ms (0.00% GC)
  median time:      2.196 ms (0.00% GC)
  mean time:        2.197 ms (0.00% GC)
  maximum time:     2.443 ms (0.00% GC)
  --------------
  samples:          2275
  evals/sample:     1
```

That's a 21x speedup over the scalar version (due to contracting `*` and `+` to fma, unrolling, and using 512-bit registers for almost everything; I think it uses masked operations if the convolution kernel size isn't a nice multiple of the vector width — in this case, with a 7×7 kernel, everything is probably masked ops).
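On the fma point: base Julia only fuses `*` and `+` when you opt in, e.g. via `muladd`; a standalone illustration (not from the PRK sources):

```julia
x, y, z = 1.0, 2.0, 3.0

x * y + z        # two separately rounded operations; LLVM won't fuse by default
muladd(x, y, z)  # may lower to a single vfmadd where the target supports fma
fma(x, y, z)     # guaranteed fused multiply-add (single rounding)
```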
Why do I need to specify this? I hate making the PRK codes nonportable, especially in interpreted languages, where there is no excuse for such things to be necessary. Is there a portable version, like `@simd`? I've got an Apple M1 laptop now, so `@avx` is going to break, is it not? I can merge this, but I'm curious what is wrong in the Julia implementation that it can't JIT AVX without such attributes.
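For reference, the portable annotation mentioned here is built into Julia — a minimal sketch of its use, unrelated to the PRK sources:

```julia
# @simd promises the loop's iterations may be reordered, and @inbounds drops
# the bounds checks that would otherwise block vectorization.
function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end
```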
Julia can SIMD just fine without `@avx`; LoopVectorization just usually does a better job. Regarding the Apple M1: the macro may be called `@avx`, but any cases of it not being portable are a bug — I'm looking forward to trying it out with SVE someday, for example. I'm also working on adding automatic threading support and better handling of non-contiguous memory accesses, so some of the benchmarks should notably improve soon.
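Until that lands, a portable baseline for the threading he mentions already exists in base Julia — again an illustrative sketch, not the PRK code:

```julia
using Base.Threads: @threads

# Spreads iterations across the threads Julia was started with (julia -t auto).
function scale_threads!(y, a, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] = a * x[i]
    end
    return y
end
```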
I just checked on two non-x86 machines. On a Fujitsu A64FX (not really supported by LLVM 11), it seems like it still unrolls and generates fma instructions, but does not vectorize. And on POWER8 it seems to vectorize, but I have a hard time getting it to show the generated assembly.
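One way to try to get at that assembly is `InteractiveUtils.code_native` — a sketch reusing the hypothetical `filter2davx!` and arrays from the earlier sketch:

```julia
using InteractiveUtils  # code_native / @code_native live here

# Print native assembly for this method instance; on AVX-512 hardware look for
# zmm registers and vfmadd instructions, on A64FX for SVE instructions.
code_native(stdout, filter2davx!, typeof.((out, A, kern)); debuginfo = :none)
```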
Okay, if it's portable, I'll merge it as soon as I can test locally. I'm currently without power or heat in Oregon, so it might be a day or so.
@haampie Is it possible for you to make a pull request for this?