Make CPU loops simd & ivdep #436

Draft
wants to merge 4 commits into
base: main

vchuravy (Member) commented Nov 9, 2023

No description provided.

github-actions bot (Contributor) commented Aug 7, 2024

Benchmark Results

| Benchmark | main | 1d50404... | main/1d504044826026... |
|---|---|---|---|
| saxpy/default/Float16/1024 | 2.76 ± 0.19 μs | 2.76 ± 0.19 μs | 0.999 |
| saxpy/default/Float16/1048576 | 2.07 ± 0.013 ms | 2.07 ± 0.013 ms | 0.999 |
| saxpy/default/Float16/16384 | 0.0327 ± 0.00013 ms | 0.0327 ± 0.00013 ms | 1 |
| saxpy/default/Float16/2048 | 5.18 ± 0.035 μs | 5.16 ± 0.037 μs | 1 |
| saxpy/default/Float16/256 | 0.971 ± 0.12 μs | 0.969 ± 0.073 μs | 1 |
| saxpy/default/Float16/262144 | 0.515 ± 0.0011 ms | 0.516 ± 0.0013 ms | 0.999 |
| saxpy/default/Float16/32768 | 0.0649 ± 0.00016 ms | 0.0649 ± 0.00017 ms | 1 |
| saxpy/default/Float16/4096 | 10 ± 0.05 μs | 10 ± 0.05 μs | 0.999 |
| saxpy/default/Float16/512 | 1.56 ± 0.07 μs | 1.57 ± 0.077 μs | 0.988 |
| saxpy/default/Float16/64 | 0.624 ± 0.014 μs | 0.615 ± 0.014 μs | 1.01 |
| saxpy/default/Float16/65536 | 0.129 ± 0.00026 ms | 0.129 ± 0.00025 ms | 1 |
| saxpy/default/Float32/1024 | 1.04 ± 0.014 μs | 1.02 ± 0.011 μs | 1.02 |
| saxpy/default/Float32/1048576 | 0.886 ± 0.016 ms | 0.889 ± 0.018 ms | 0.997 |
| saxpy/default/Float32/16384 | 14.5 ± 0.12 μs | 14.5 ± 0.12 μs | 0.999 |
| saxpy/default/Float32/2048 | 1.73 ± 0.02 μs | 1.72 ± 0.021 μs | 1 |
| saxpy/default/Float32/256 | 0.531 ± 0.12 μs | 0.522 ± 0.13 μs | 1.02 |
| saxpy/default/Float32/262144 | 0.221 ± 0.00071 ms | 0.221 ± 0.00091 ms | 1 |
| saxpy/default/Float32/32768 | 28.3 ± 0.17 μs | 28.4 ± 0.2 μs | 0.997 |
| saxpy/default/Float32/4096 | 3.03 ± 0.028 μs | 3.03 ± 0.024 μs | 1 |
| saxpy/default/Float32/512 | 0.7 ± 0.11 μs | 0.696 ± 0.11 μs | 1.01 |
| saxpy/default/Float32/64 | 0.414 ± 0.0036 μs | 0.408 ± 0.0035 μs | 1.02 |
| saxpy/default/Float32/65536 | 0.0562 ± 0.00045 ms | 0.0563 ± 0.00043 ms | 0.998 |
| saxpy/default/Float64/1024 | 1.07 ± 0.024 μs | 1.08 ± 0.032 μs | 0.992 |
| saxpy/default/Float64/1048576 | 0.945 ± 0.052 ms | 0.97 ± 0.053 ms | 0.975 |
| saxpy/default/Float64/16384 | 15.8 ± 1 μs | 15.7 ± 0.81 μs | 1.01 |
| saxpy/default/Float64/2048 | 1.76 ± 0.027 μs | 1.77 ± 0.029 μs | 0.997 |
| saxpy/default/Float64/256 | 0.533 ± 0.008 μs | 0.516 ± 0.0086 μs | 1.03 |
| saxpy/default/Float64/262144 | 0.228 ± 0.0089 ms | 0.227 ± 0.0037 ms | 1 |
| saxpy/default/Float64/32768 | 31 ± 2.3 μs | 30.6 ± 1.5 μs | 1.01 |
| saxpy/default/Float64/4096 | 3.05 ± 0.037 μs | 3.04 ± 0.038 μs | 1 |
| saxpy/default/Float64/512 | 0.703 ± 0.11 μs | 0.698 ± 0.11 μs | 1.01 |
| saxpy/default/Float64/64 | 0.406 ± 0.006 μs | 0.407 ± 0.0056 μs | 0.998 |
| saxpy/default/Float64/65536 | 0.0615 ± 0.0045 ms | 0.058 ± 0.0018 ms | 1.06 |
| saxpy/static workgroup=(1024,)/Float16/1024 | 2.07 ± 0.2 μs | 2.09 ± 0.21 μs | 0.991 |
| saxpy/static workgroup=(1024,)/Float16/1048576 | 0.168 ± 0.011 ms | 0.17 ± 0.0085 ms | 0.986 |
| saxpy/static workgroup=(1024,)/Float16/16384 | 4.34 ± 0.22 μs | 4.32 ± 0.2 μs | 1 |
| saxpy/static workgroup=(1024,)/Float16/2048 | 2.1 ± 0.21 μs | 2.12 ± 0.2 μs | 0.993 |
| saxpy/static workgroup=(1024,)/Float16/256 | 2.62 ± 0.035 μs | 2.62 ± 0.04 μs | 1 |
| saxpy/static workgroup=(1024,)/Float16/262144 | 0.0437 ± 0.0027 ms | 0.0459 ± 0.0036 ms | 0.952 |
| saxpy/static workgroup=(1024,)/Float16/32768 | 6.54 ± 0.31 μs | 6.7 ± 0.21 μs | 0.975 |
| saxpy/static workgroup=(1024,)/Float16/4096 | 2.41 ± 0.031 μs | 2.42 ± 0.034 μs | 0.995 |
| saxpy/static workgroup=(1024,)/Float16/512 | 3.14 ± 0.066 μs | 3.14 ± 0.068 μs | 1 |
| saxpy/static workgroup=(1024,)/Float16/64 | 2.24 ± 0.023 μs | 2.24 ± 0.022 μs | 1 |
| saxpy/static workgroup=(1024,)/Float16/65536 | 12.6 ± 0.73 μs | 12.2 ± 0.35 μs | 1.03 |
| saxpy/static workgroup=(1024,)/Float32/1024 | 1.93 ± 0.025 μs | 1.94 ± 0.024 μs | 0.998 |
| saxpy/static workgroup=(1024,)/Float32/1048576 | 0.273 ± 0.025 ms | 0.263 ± 0.024 ms | 1.04 |
| saxpy/static workgroup=(1024,)/Float32/16384 | 4.53 ± 0.88 μs | 4.78 ± 0.42 μs | 0.949 |
| saxpy/static workgroup=(1024,)/Float32/2048 | 2.24 ± 0.21 μs | 2.25 ± 0.21 μs | 0.996 |
| saxpy/static workgroup=(1024,)/Float32/256 | 2.81 ± 1.8 μs | 2.8 ± 1.6 μs | 1.01 |
| saxpy/static workgroup=(1024,)/Float32/262144 | 0.0646 ± 0.0052 ms | 0.0644 ± 0.0051 ms | 1 |
| saxpy/static workgroup=(1024,)/Float32/32768 | 7.54 ± 1.4 μs | 7.88 ± 1.1 μs | 0.957 |
| saxpy/static workgroup=(1024,)/Float32/4096 | 2.54 ± 0.088 μs | 2.53 ± 0.19 μs | 1 |
| saxpy/static workgroup=(1024,)/Float32/512 | 2.47 ± 0.21 μs | 2.49 ± 0.21 μs | 0.993 |
| saxpy/static workgroup=(1024,)/Float32/64 | 2.44 ± 0.051 μs | 2.43 ± 0.05 μs | 1 |
| saxpy/static workgroup=(1024,)/Float32/65536 | 17.8 ± 1.8 μs | 17.3 ± 1.8 μs | 1.03 |
| saxpy/static workgroup=(1024,)/Float64/1024 | 2.04 ± 0.028 μs | 2.04 ± 0.032 μs | 1 |
| saxpy/static workgroup=(1024,)/Float64/1048576 | 0.6 ± 0.076 ms | 0.566 ± 0.051 ms | 1.06 |
| saxpy/static workgroup=(1024,)/Float64/16384 | 7.73 ± 1.4 μs | 7.8 ± 1.2 μs | 0.99 |
| saxpy/static workgroup=(1024,)/Float64/2048 | 2.51 ± 0.23 μs | 2.51 ± 0.3 μs | 0.999 |
| saxpy/static workgroup=(1024,)/Float64/256 | 2.39 ± 0.051 μs | 2.39 ± 0.056 μs | 0.999 |
| saxpy/static workgroup=(1024,)/Float64/262144 | 0.122 ± 0.0099 ms | 0.123 ± 0.0096 ms | 0.988 |
| saxpy/static workgroup=(1024,)/Float64/32768 | 17.5 ± 2 μs | 16.6 ± 1.9 μs | 1.06 |
| saxpy/static workgroup=(1024,)/Float64/4096 | 3.1 ± 0.32 μs | 3.08 ± 0.32 μs | 1.01 |
| saxpy/static workgroup=(1024,)/Float64/512 | 2.37 ± 0.041 μs | 2.37 ± 0.039 μs | 0.997 |
| saxpy/static workgroup=(1024,)/Float64/64 | 2.36 ± 0.087 μs | 2.38 ± 0.083 μs | 0.993 |
| saxpy/static workgroup=(1024,)/Float64/65536 | 0.0349 ± 0.0024 ms | 0.0329 ± 0.0028 ms | 1.06 |
| time_to_load | 0.462 ± 0.0041 s | 0.465 ± 0.0025 s | 0.995 |
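
Going by its header, the last column appears to be the `main` timing divided by the timing for commit `1d50404...`, so a value above 1 would mean the PR commit ran faster in that measurement. The following Python sketch, under that assumption, recomputes the ratio for two rows copied from the results above and checks whether the two timings differ by more than their combined ± spread. The helper name `ratio_and_overlap` is hypothetical and not part of the benchmark tooling, and the meaning of the ± value (for example, a standard deviation) is not stated in the report.

```python
def ratio_and_overlap(main, main_err, pr, pr_err):
    """Return (main / pr, True if the two '±' intervals overlap).

    Assumes both timings are given in the same unit.
    """
    ratio = main / pr
    # The intervals overlap when the gap between the central values
    # is no larger than the sum of the two reported spreads.
    overlap = abs(main - pr) <= (main_err + pr_err)
    return ratio, overlap


# saxpy/default/Float64/65536: 0.0615 ± 0.0045 ms vs 0.058 ± 0.0018 ms
r1, o1 = ratio_and_overlap(0.0615, 0.0045, 0.058, 0.0018)

# saxpy/static workgroup=(1024,)/Float16/262144:
# 0.0437 ± 0.0027 ms vs 0.0459 ± 0.0036 ms
r2, o2 = ratio_and_overlap(0.0437, 0.0027, 0.0459, 0.0036)

print(f"{r1:.2f} overlap={o1}")
print(f"{r2:.3f} overlap={o2}")
```

In both rows the intervals overlap, so under this reading even the largest ratios reported here (about 0.95 to 1.06) sit within the reported spread.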

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
