perf: adds avx512 vector ops for koalabear and babybear fields #568

Open
wants to merge 63 commits into
base: master
539e90b
checkpoint
gbotrel Oct 15, 2024
c60ce4c
checkpoint
gbotrel Oct 15, 2024
5832e15
build: update bavard
gbotrel Oct 15, 2024
ec77f9c
checkpoint
gbotrel Oct 15, 2024
c9363b5
checkpoint
gbotrel Oct 16, 2024
0015ca0
checkpoint
gbotrel Oct 16, 2024
e38579f
checkpoint
gbotrel Oct 16, 2024
a334387
checkpoint
gbotrel Oct 16, 2024
637a68a
feat: added butterfly asm, experiment
gbotrel Oct 16, 2024
222036a
checkpoint
gbotrel Oct 16, 2024
6fbfb48
checkpoint
gbotrel Oct 16, 2024
6f4d15b
checkpoint
gbotrel Oct 18, 2024
8bc8267
checkpoint
gbotrel Oct 18, 2024
13521a9
checkpoint
gbotrel Oct 19, 2024
a0e023e
code cleaning
gbotrel Oct 19, 2024
ea8340b
feat: refactor code generation to allow space for arm64
gbotrel Oct 21, 2024
99283df
style: code cleaning
gbotrel Oct 21, 2024
08a7afa
feat: add reduce arm64 for test purposes
gbotrel Oct 21, 2024
771ce54
feat: restore vectors.json mimc
gbotrel Oct 22, 2024
e303fc2
style: add trace in mimc generate
gbotrel Oct 22, 2024
548dfd1
feat: update build tags for 32 bit target
gbotrel Oct 22, 2024
147a62e
feat: generalize arm64 mul for larger modulus
gbotrel Oct 22, 2024
3638f44
feat: add missing files
gbotrel Oct 22, 2024
ed06e35
checkpoint
gbotrel Nov 22, 2024
a170cb1
test passing
gbotrel Nov 23, 2024
dddb22d
feat: add babybear and koalabear
gbotrel Nov 25, 2024
6ae75a9
fix: restore line return to minimize diff
gbotrel Nov 25, 2024
18b7374
feat: restore non-big int inverse
gbotrel Nov 25, 2024
f86ef72
test: fix field config test to take word size into account
gbotrel Nov 25, 2024
2576e72
feat: cleanup add template for field element
gbotrel Dec 4, 2024
93b6669
feat: less ops in mul generic 31bits
gbotrel Dec 4, 2024
f276812
style: cleaning PR
gbotrel Dec 6, 2024
db78c7b
style: more cleaning
gbotrel Dec 6, 2024
5c569d0
feat: cleaner mont mul, slower
gbotrel Dec 6, 2024
0a20412
fix integration test
gbotrel Dec 6, 2024
ac9720a
style: more cleaning
gbotrel Dec 6, 2024
725a476
test: fix failing generator test
gbotrel Dec 6, 2024
5b11602
test: fix field config test to use bitsize
gbotrel Dec 6, 2024
8b3b4d1
feat: on 31bit field better branch-less add and sub
gbotrel Dec 6, 2024
d022a64
feat: skeletton for vec assembly on F31
gbotrel Dec 7, 2024
9216305
refactor: rename asm generation code for 4 words
gbotrel Dec 7, 2024
deb1d7f
refactor: rename asm generation code for 4 words
gbotrel Dec 7, 2024
8f230c2
feat: added F31 avx512 add
gbotrel Dec 8, 2024
d5a6b4d
feat: add avx512 sub for f31
gbotrel Dec 8, 2024
cf8370d
feat: added avx512 sum for f31
gbotrel Dec 8, 2024
8289d3d
feat: working version of the mul, optims to come
gbotrel Dec 8, 2024
f5e93c6
feat: clean up mul avx f31
gbotrel Dec 8, 2024
17beb83
feat: add avx512 scalarMul vec for F31
gbotrel Dec 8, 2024
c1b06b4
feat: add innerProdVec avx512 for f31
gbotrel Dec 8, 2024
9af3aed
style: code cleaning
gbotrel Dec 8, 2024
5514b84
style: more cleaning
gbotrel Dec 8, 2024
ccb7ad1
refactor: give nb bits to asm generation
gbotrel Dec 8, 2024
9651c06
refactor: distinguish nb of bits in file name for generated assembly
gbotrel Dec 8, 2024
31b74f2
feat: add missing file
gbotrel Dec 8, 2024
0f549b5
test: fix broken integration test
gbotrel Dec 9, 2024
58ffe06
Merge branch 'master' into experiment/31bits
gbotrel Dec 10, 2024
1a6e1bc
Merge branch 'experiment/31bits' into perf/f31_avx
gbotrel Dec 10, 2024
34fdd6e
chore: run go mod tidy
gbotrel Dec 10, 2024
e704847
Merge branch 'master' into perf/f31_avx
gbotrel Dec 10, 2024
6dd8803
Merge branch 'master' into perf/f31_avx
gbotrel Dec 10, 2024
21b9b80
chore: re run go generate to update doc
gbotrel Dec 10, 2024
e8a14ad
Merge branch 'master' into perf/f31_avx
gbotrel Dec 13, 2024
69dd02d
test: fix integration test
gbotrel Dec 13, 2024
222 changes: 222 additions & 0 deletions field/asm/element_31b_amd64.s
@@ -0,0 +1,222 @@
// Code generated by gnark-crypto/generator. DO NOT EDIT.
#include "textflag.h"
#include "funcdata.h"
#include "go_asm.h"

// addVec(res, a, b *Element, n uint64) res[0...n] = a[0...n] + b[0...n]
Review comment (Contributor):

If n is len(slice)/16, it should be documented.

Review comment (Contributor):

Or better yet, just shift it by 4 in the assembly instead of on the Go side.
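
To illustrate the point raised in this thread: the addVec loop advances each pointer by 64 bytes per iteration, so one iteration covers 16 uint32 elements and n counts blocks of 16, not elements. The sketch below is a pure-Go model of how a caller might deal with that convention. The names `Element`, `Add`, `addVecBlocks` and `addMod` are hypothetical and not taken from the PR, and the KoalaBear modulus is used only as an example value for q.

```go
package main

import "fmt"

// Element is a stand-in for a 31-bit field element stored in one uint32 word.
type Element uint32

// q is the KoalaBear modulus 2^31 - 2^24 + 1 (example value, an assumption here).
const q = 2130706433

// addMod is the scalar fallback: (x + y) mod q for x, y < q.
func addMod(x, y Element) Element {
	s := uint32(x) + uint32(y) // cannot overflow: x, y < q < 2^31
	if s >= q {
		s -= q
	}
	return Element(s)
}

// addVecBlocks models the assembly kernel: it handles nBlocks blocks of
// 16 elements each, mirroring the 64-byte pointer increments in the loop.
func addVecBlocks(res, a, b []Element, nBlocks uint64) {
	for i := uint64(0); i < nBlocks*16; i++ {
		res[i] = addMod(a[i], b[i])
	}
}

// Add is a hypothetical wrapper: it converts the element count into a block
// count (len/16, i.e. a right shift by 4) and finishes the tail in scalar code.
func Add(res, a, b []Element) {
	if len(a) != len(b) || len(a) != len(res) {
		panic("vector lengths mismatch")
	}
	const blockSize = 16
	nBlocks := uint64(len(a)) / blockSize
	addVecBlocks(res, a, b, nBlocks)
	for i := int(nBlocks) * blockSize; i < len(a); i++ {
		res[i] = addMod(a[i], b[i])
	}
}

func main() {
	a := []Element{q - 1, 1, 2}
	b := []Element{1, 1, 2}
	res := make([]Element, len(a))
	Add(res, a, b)
	fmt.Println(res)
}
```

Doing the shift inside the assembly, as suggested above, would let the Go side pass the element count directly; the tail of fewer than 16 elements would still need separate handling.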

TEXT ·addVec(SB), NOSPLIT, $0-32
MOVD $const_q, AX
VPBROADCASTD AX, Z3
MOVQ res+0(FP), CX
MOVQ a+8(FP), AX
MOVQ b+16(FP), DX
MOVQ n+24(FP), BX

loop_1:
TESTQ BX, BX
JEQ done_2 // n == 0, we are done
VMOVDQU32 0(AX), Z0
VMOVDQU32 0(DX), Z1
VPADDD Z0, Z1, Z0 // a = a + b
VPSUBD Z3, Z0, Z2 // t = a - q
VPMINUD Z0, Z2, Z1 // b = min(t, a)
VMOVDQU32 Z1, 0(CX) // res = b

// increment pointers to visit next element
ADDQ $64, AX
ADDQ $64, DX
ADDQ $64, CX
DECQ BX // decrement n
JMP loop_1

done_2:
RET

// subVec(res, a, b *Element, n uint64) res[0...n] = a[0...n] - b[0...n]
TEXT ·subVec(SB), NOSPLIT, $0-32
MOVD $const_q, AX
VPBROADCASTD AX, Z3
MOVQ res+0(FP), CX
MOVQ a+8(FP), AX
MOVQ b+16(FP), DX
MOVQ n+24(FP), BX

loop_3:
TESTQ BX, BX
JEQ done_4 // n == 0, we are done
VMOVDQU32 0(AX), Z0
VMOVDQU32 0(DX), Z1
VPSUBD Z1, Z0, Z0 // a = a - b
VPADDD Z3, Z0, Z2 // t = a + q
VPMINUD Z0, Z2, Z1 // b = min(t, a)
VMOVDQU32 Z1, 0(CX) // res = b

// increment pointers to visit next element
ADDQ $64, AX
ADDQ $64, DX
ADDQ $64, CX
DECQ BX // decrement n
JMP loop_3

done_4:
RET
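
Note on the reduction used in addVec and subVec: both avoid a branch by computing the candidate result and a corrected value, then keeping the unsigned minimum (VPMINUD). A scalar Go model of one lane may make the reasoning easier to check. The function names are hypothetical, and the KoalaBear modulus is assumed as an example value for q; the argument only needs q < 2^31.

```go
package main

import "fmt"

// q is the KoalaBear modulus 2^31 - 2^24 + 1 (example value, an assumption here).
const q uint32 = 2130706433

// minU32 models VPMINUD on a single lane.
func minU32(x, y uint32) uint32 {
	if x < y {
		return x
	}
	return y
}

// addLane models one lane of addVec for inputs a, b < q.
// s = a + b fits in 32 bits. If s < q, then s - q wraps around to a value
// above s, so the minimum is s. If s >= q, then s - q < s and is the result.
func addLane(a, b uint32) uint32 {
	s := a + b
	t := s - q
	return minU32(s, t)
}

// subLane models one lane of subVec for inputs a, b < q.
// If a >= b, d = a - b is already reduced and d + q is larger.
// If a < b, d wraps to a value above q and d + q wraps back into [0, q).
func subLane(a, b uint32) uint32 {
	d := a - b
	t := d + q
	return minU32(d, t)
}

func main() {
	fmt.Println(addLane(q-1, q-1), subLane(0, 1))
}
```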

// sumVec(res *uint64, a *[]uint32, n uint64) res = sum(a[0...n])
TEXT ·sumVec(SB), NOSPLIT, $0-24

// We load 8 31bits values at a time and accumulate them into an accumulator of
// 8 quadwords (64bits). The caller then needs to reduce the result mod q.
// We can safely accumulate ~2**33 31bits values into a single accumulator.
// That gives us a maximum of 2**33 * 8 = 2**36 31bits values to sum safely.

MOVQ t+0(FP), R15
MOVQ a+8(FP), R14
MOVQ n+16(FP), CX
VXORPS Z2, Z2, Z2 // acc1 = 0
VMOVDQA64 Z2, Z3 // acc2 = 0

loop_5:
TESTQ CX, CX
JEQ done_6 // n == 0, we are done
VPMOVZXDQ 0(R14), Z0 // load 8 31bits values in a1
VPMOVZXDQ 32(R14), Z1 // load 8 31bits values in a2
VPADDQ Z0, Z2, Z2 // acc1 += a1
VPADDQ Z1, Z3, Z3 // acc2 += a2

// increment pointers to visit next element
ADDQ $64, R14
DECQ CX // decrement n
JMP loop_5

done_6:
VPADDQ Z2, Z3, Z2 // acc1 += acc2
VMOVDQU64 Z2, 0(R15) // res = acc1
RET
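
Note on sumVec: the kernel writes eight unreduced 64-bit partial sums and leaves the reduction mod q to the caller. The following is a rough Go model of that contract under stated assumptions: `sumVecModel` and `reduceSum` are hypothetical names, the input length is assumed to be a multiple of 16, and the KoalaBear modulus stands in for q.

```go
package main

import "fmt"

// q is the KoalaBear modulus 2^31 - 2^24 + 1 (example value, an assumption here).
const q uint64 = 2130706433

// sumVecModel mirrors the kernel: each iteration zero-extends 16 values into
// two groups of eight 64-bit lanes and adds them to two accumulators.
// len(a) is assumed to be a multiple of 16.
func sumVecModel(a []uint32) [8]uint64 {
	var acc1, acc2 [8]uint64
	for len(a) >= 16 {
		for i := 0; i < 8; i++ {
			acc1[i] += uint64(a[i])
			acc2[i] += uint64(a[8+i])
		}
		a = a[16:]
	}
	for i := range acc1 {
		acc1[i] += acc2[i] // acc1 += acc2, as in the epilogue
	}
	return acc1
}

// reduceSum is the caller-side step: fold the eight lanes and reduce mod q.
// Each lane is reduced first so the running total stays far below 2^64.
func reduceSum(t [8]uint64) uint32 {
	var s uint64
	for _, v := range t {
		s += v % q
	}
	return uint32(s % q)
}

func main() {
	a := make([]uint32, 32)
	for i := range a {
		a[i] = uint32(q - 1)
	}
	fmt.Println(reduceSum(sumVecModel(a)))
}
```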

// mulVec(res, a, b *Element, n uint64) res[0...n] = a[0...n] * b[0...n]
TEXT ·mulVec(SB), NOSPLIT, $0-32
MOVD $const_q, AX
VPBROADCASTQ AX, Z3
MOVD $const_qInvNeg, AX
VPBROADCASTQ AX, Z4

// Create mask for low dword in each qword
VPCMPEQB Y0, Y0, Y0
VPMOVZXDQ Y0, Z6
MOVQ res+0(FP), CX
MOVQ a+8(FP), AX
MOVQ b+16(FP), DX
MOVQ n+24(FP), BX

loop_7:
TESTQ BX, BX
JEQ done_8 // n == 0, we are done
VPMOVZXDQ 0(AX), Z0
VPMOVZXDQ 0(DX), Z1
VPMULUDQ Z0, Z1, Z2 // P = a * b
VPANDQ Z6, Z2, Z5 // m = uint32(P)
VPMULUDQ Z5, Z4, Z5 // m = m * qInvNeg
VPANDQ Z6, Z5, Z5 // m = uint32(m)
VPMULUDQ Z5, Z3, Z5 // m = m * q
VPADDQ Z2, Z5, Z2 // P = P + m
VPSRLQ $32, Z2, Z2 // P = P >> 32
VPSUBD Z3, Z2, Z5 // PL = P - q
VPMINUD Z2, Z5, Z2 // P = min(P, PL)
VPMOVQD Z2, 0(CX) // res = P

// increment pointers to visit next element
ADDQ $32, AX
ADDQ $32, DX
ADDQ $32, CX
DECQ BX // decrement n
JMP loop_7

done_8:
RET

// scalarMulVec(res, a, b *Element, n uint64) res[0...n] = a[0...n] * b
TEXT ·scalarMulVec(SB), NOSPLIT, $0-32
MOVD $const_q, AX
VPBROADCASTQ AX, Z3
MOVD $const_qInvNeg, AX
VPBROADCASTQ AX, Z4

// Create mask for low dword in each qword
VPCMPEQB Y0, Y0, Y0
VPMOVZXDQ Y0, Z6
MOVQ res+0(FP), CX
MOVQ a+8(FP), AX
MOVQ b+16(FP), DX
MOVQ n+24(FP), BX
VPBROADCASTD 0(DX), Z1

loop_9:
TESTQ BX, BX
JEQ done_10 // n == 0, we are done
VPMOVZXDQ 0(AX), Z0
VPMULUDQ Z0, Z1, Z2 // P = a * b
VPANDQ Z6, Z2, Z5 // m = uint32(P)
VPMULUDQ Z5, Z4, Z5 // m = m * qInvNeg
VPANDQ Z6, Z5, Z5 // m = uint32(m)
VPMULUDQ Z5, Z3, Z5 // m = m * q
VPADDQ Z2, Z5, Z2 // P = P + m
VPSRLQ $32, Z2, Z2 // P = P >> 32
VPSUBD Z3, Z2, Z5 // PL = P - q
VPMINUD Z2, Z5, Z2 // P = min(P, PL)
VPMOVQD Z2, 0(CX) // res = P

// increment pointers to visit next element
ADDQ $32, AX
ADDQ $32, CX
DECQ BX // decrement n
JMP loop_9

done_10:
RET

// innerProdVec(t *uint64, a,b *[]uint32, n uint64) res = sum(a[0...n] * b[0...n])
TEXT ·innerProdVec(SB), NOSPLIT, $0-32

// Similar to mulVec; we do most of the montgomery multiplication but don't do
// the final reduction. We accumulate the result like in sumVec and let the caller
// reduce mod q.

MOVD $const_q, AX
VPBROADCASTQ AX, Z3
MOVD $const_qInvNeg, AX
VPBROADCASTQ AX, Z4

// Create mask for low dword in each qword
VPCMPEQB Y0, Y0, Y0
VPMOVZXDQ Y0, Z6
VXORPS Z2, Z2, Z2 // acc = 0
MOVQ t+0(FP), CX
MOVQ a+8(FP), R14
MOVQ b+16(FP), R15
MOVQ n+24(FP), BX

loop_11:
TESTQ BX, BX
JEQ done_12 // n == 0, we are done
VPMOVZXDQ 0(R14), Z0
VPMOVZXDQ 0(R15), Z1
VPMULUDQ Z0, Z1, Z7 // P = a * b
VPANDQ Z6, Z7, Z5 // m = uint32(P)
VPMULUDQ Z5, Z4, Z5 // m = m * qInvNeg
VPANDQ Z6, Z5, Z5 // m = uint32(m)
VPMULUDQ Z5, Z3, Z5 // m = m * q
VPADDQ Z7, Z5, Z7 // P = P + m
VPSRLQ $32, Z7, Z7 // P = P >> 32

// accumulate P into acc, P is in [0, 2q] on 32bits max
VPADDQ Z7, Z2, Z2 // acc += P

// increment pointers to visit next element
ADDQ $32, R14
ADDQ $32, R15
DECQ BX // decrement n
JMP loop_11

done_12:
VMOVDQU64 Z2, 0(CX) // res = acc
RET
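
Note on mulVec, scalarMulVec and innerProdVec: all three follow the same per-lane Montgomery multiplication with R = 2^32, as described by the inline comments (P = a * b, m = uint32(P) * qInvNeg, P = (P + m * q) >> 32, then min(P, P - q)). These kernels advance by 32 bytes per iteration, so they process 8 elements at a time. A scalar Go sketch of one lane follows. `montMul` and `computeQInvNeg` are hypothetical names, the KoalaBear modulus is assumed as an example for const_q, and qInvNeg is derived in the code instead of being copied from the generated constants.

```go
package main

import "fmt"

// q is the KoalaBear modulus 2^31 - 2^24 + 1 (example value, an assumption here).
const q uint64 = 2130706433

// computeQInvNeg returns -q^{-1} mod 2^32 using Newton iteration.
// For odd q, q*q = 1 mod 8, so q is its own inverse to 3 bits; each
// iteration doubles the number of correct bits.
func computeQInvNeg() uint32 {
	inv := uint32(q)
	for i := 0; i < 5; i++ {
		inv *= 2 - uint32(q)*inv
	}
	return -inv
}

// montMul returns a * b * 2^{-32} mod q for a, b < q.
func montMul(a, b, qInvNeg uint32) uint32 {
	p := uint64(a) * uint64(b)       // P = a * b
	m := uint64(uint32(p) * qInvNeg) // m = uint32(uint32(P) * qInvNeg)
	p = (p + m*q) >> 32              // low 32 bits of P + m*q are zero
	r := uint32(p)                   // r is in [0, 2q)
	if t := r - uint32(q); t < r {   // min(r, r - q) with unsigned wraparound
		r = t
	}
	return r
}

func main() {
	inv := computeQInvNeg()
	toMont := func(x uint64) uint32 { return uint32((x << 32) % q) }
	c := montMul(toMont(3), toMont(5), inv) // Montgomery form of 15
	fmt.Println(montMul(c, 1, inv))         // back to the regular form
}
```

innerProdVec stops before the final min step and accumulates the value in [0, 2q) into 64-bit lanes, which is why its caller has to reduce mod q afterwards.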
49 changes: 0 additions & 49 deletions field/babybear/arith.go

This file was deleted.

15 changes: 15 additions & 0 deletions field/babybear/asm_avx.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions field/babybear/asm_noavx.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion field/babybear/doc.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions field/babybear/element_amd64.s
@@ -0,0 +1,10 @@
//go:build !purego

// Copyright 2020-2024 Consensys Software Inc.
// Licensed under the Apache License, Version 2.0. See the LICENSE file for details.

// Code generated by consensys/gnark-crypto DO NOT EDIT

// We include the hash to force the Go compiler to recompile: 11172894854395138580
#include "../asm/element_31b_amd64.s"
