perf: adds avx512 vector ops for koalabear and babybear fields #568

Open
wants to merge 63 commits into
base: master

gbotrel
Collaborator

@gbotrel gbotrel commented Dec 8, 2024

Description

The assembly is readable and a breeze to work with after doing the same thing for multi-word moduli.

There are probably a couple of performance aberrations to correct and some optimizations still to do.

Still need to compare vec::mul with the awesome (and well-documented :) ) work in Plonky3: https://github.com/Plonky3/Plonky3/blob/20256720b683897b634393dadcf8afab43101cb7/monty-31/src/x86_64_avx512/packing.rs#L319
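
When comparing a vectorized mul against another implementation, a scalar reference that every lane must agree with can be useful. The following is a minimal sketch, not the code in this PR. It assumes elements are kept in Montgomery form with R = 2^32 and uses the KoalaBear modulus 2^31 - 2^24 + 1. The names montMul, toMont, fromMont and negInv32 are hypothetical.

```go
package main

import "fmt"

// q is the KoalaBear modulus 2^31 - 2^24 + 1.
const q uint32 = 2130706433

// qInvNeg is -q^{-1} mod 2^32, computed at start-up rather than hard-coded.
var qInvNeg = negInv32(q)

// negInv32 returns -q^{-1} mod 2^32 for odd q using Newton iteration.
// Starting from inv = q gives 3 correct low bits (q*q = 1 mod 8), and
// each step doubles the number of correct bits.
func negInv32(q uint32) uint32 {
	inv := q
	for i := 0; i < 5; i++ {
		inv *= 2 - q*inv
	}
	return -inv
}

// montMul returns a*b*R^{-1} mod q for a, b < q, with R = 2^32.
func montMul(a, b uint32) uint32 {
	t := uint64(a) * uint64(b)
	m := uint32(t) * qInvNeg              // m = -t * q^{-1} mod 2^32
	u := (t + uint64(m)*uint64(q)) >> 32 // low 32 bits cancel exactly; u < 2q
	if u >= uint64(q) {
		u -= uint64(q)
	}
	return uint32(u)
}

// toMont maps a to a*R mod q.
func toMont(a uint32) uint32 {
	return uint32((uint64(a) << 32) % uint64(q))
}

// fromMont maps a*R mod q back to a.
func fromMont(a uint32) uint32 {
	return montMul(a, 1)
}

func main() {
	x, y := toMont(3), toMont(5)
	fmt.Println(fromMont(montMul(x, y)))
}
```

A vector kernel can then be checked lane by lane against montMul on random inputs.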

Benchmark example

benchmark                                      old ns/op     new ns/op     delta
BenchmarkVectorOps/add_256-32                  141           14.7          -89.58%
BenchmarkVectorOps/sub_256-32                  282           14.8          -94.75%
BenchmarkVectorOps/scalarMul_256-32            299           62.8          -79.02%
BenchmarkVectorOps/sum_256-32                  224           30.5          -86.36%
BenchmarkVectorOps/innerProduct_256-32         496           74.9          -84.91%
BenchmarkVectorOps/mul_256-32                  299           66.8          -77.66%
BenchmarkVectorOps/add_512-32                  285           25.8          -90.96%
BenchmarkVectorOps/sub_512-32                  568           26.0          -95.43%
BenchmarkVectorOps/scalarMul_512-32            601           125           -79.23%
BenchmarkVectorOps/sum_512-32                  479           41.3          -91.39%
BenchmarkVectorOps/innerProduct_512-32         993           128           -87.12%
BenchmarkVectorOps/mul_512-32                  601           137           -77.22%
BenchmarkVectorOps/add_65536-32                39938         7742          -80.61%
BenchmarkVectorOps/sub_65536-32                80858         7730          -90.44%
BenchmarkVectorOps/scalarMul_65536-32          83196         16021         -80.74%
BenchmarkVectorOps/sum_65536-32                58330         2849          -95.12%
BenchmarkVectorOps/innerProduct_65536-32       133302        13518         -89.86%
BenchmarkVectorOps/mul_65536-32                86508         16781         -80.60%
BenchmarkVectorOps/add_524288-32               318606        68041         -78.64%
BenchmarkVectorOps/sub_524288-32               639760        68476         -89.30%
BenchmarkVectorOps/scalarMul_524288-32         664143        127940        -80.74%
BenchmarkVectorOps/sum_524288-32               464263        23079         -95.03%
BenchmarkVectorOps/innerProduct_524288-32      1068978       108061        -89.89%
BenchmarkVectorOps/mul_524288-32               689172        133001        -80.70%
BenchmarkVectorOps/add_1048576-32              638737        138234        -78.36%
BenchmarkVectorOps/sub_1048576-32              1282222       138241        -89.22%
BenchmarkVectorOps/scalarMul_1048576-32        1331820       256007        -80.78%
BenchmarkVectorOps/sum_1048576-32              924642        46221         -95.00%
BenchmarkVectorOps/innerProduct_1048576-32     2138163       215971        -89.90%
BenchmarkVectorOps/mul_1048576-32              1379577       266948        -80.65%
BenchmarkVectorOps/add_2097152-32              1298483       403597        -68.92%
BenchmarkVectorOps/sub_2097152-32              2599115       398351        -84.67%
BenchmarkVectorOps/scalarMul_2097152-32        2675315       514020        -80.79%
BenchmarkVectorOps/sum_2097152-32              1846965       92459         -94.99%
BenchmarkVectorOps/innerProduct_2097152-32     4279783       433708        -89.87%
BenchmarkVectorOps/mul_2097152-32              2801868       557975        -80.09%

@gbotrel gbotrel requested review from Tabaie and yelhousni December 8, 2024 17:55
Base automatically changed from experiment/31bits to master December 10, 2024 19:00
#include "funcdata.h"
#include "go_asm.h"

// addVec(res, a, b *Element, n uint64) res[0...n] = a[0...n] + b[0...n]
Contributor

If n is len(slice)/16, that should be documented.

Contributor

Or better yet, just shift it right by 4 in the assembly instead of on the Go side.
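
To make the convention under discussion concrete, here is a pure-Go sketch. It assumes the kernel consumes 16 elements per iteration (one 512-bit register of uint32 lanes) and that n counts those blocks. addBlocks is a hypothetical stand-in for the assembly routine, and Add is a hypothetical wrapper; neither is the PR's actual code.

```go
package main

import "fmt"

// q is the KoalaBear modulus 2^31 - 2^24 + 1.
const q uint32 = 2130706433

// lanes is the number of uint32 lanes in one 512-bit register.
const lanes = 16

// addMod returns x + y mod q for x, y < q.
// The sum cannot overflow uint32 because q < 2^31.
func addMod(x, y uint32) uint32 {
	s := x + y
	if s >= q {
		s -= q
	}
	return s
}

// addBlocks stands in for the assembly kernel (hypothetical).
// nBlocks is the number of 16-element blocks, NOT the element count:
// it writes res[0 : nBlocks*16].
func addBlocks(res, a, b []uint32, nBlocks int) {
	for i := 0; i < nBlocks*lanes; i++ {
		res[i] = addMod(a[i], b[i])
	}
}

// Add takes plain slices, converts the length to a block count in one
// place, and handles the tail that does not fill a whole block.
func Add(res, a, b []uint32) {
	n := len(res)
	if len(a) != n || len(b) != n {
		panic("vector: mismatched lengths")
	}
	nBlocks := n >> 4 // n / 16
	addBlocks(res, a, b, nBlocks)
	for i := nBlocks * lanes; i < n; i++ {
		res[i] = addMod(a[i], b[i])
	}
}

func main() {
	a := make([]uint32, 37)
	b := make([]uint32, 37)
	res := make([]uint32, 37)
	for i := range a {
		a[i] = q - 1
		b[i] = uint32(i)
	}
	Add(res, a, b)
	fmt.Println(res[0], res[1], res[36])
}
```

If the shift moves into the assembly as suggested, the kernel would take the element count directly and the n >> 4 in the wrapper would go away; the tail loop would stay unless the kernel also handles partial blocks.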


2 participants