AVX512 support, new PR to replace old branch from contributor. We need CI to run so moving it here. #72

Draft
wants to merge 21 commits into base: main

Conversation

JonathanHenson
Contributor

@JonathanHenson JonathanHenson commented Jul 19, 2023

scratch space for this:

#68

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* length must be at least 256, and a multiple of 64. Based on:
*
* "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
* V. Gopal, E. Ozturk, et al., 2009, http://intel.ly/2ySEwL0
Contributor

This citation is a broken link; do we have an updated one? There are a few copies floating around, but I think we'd prefer an authoritative source if possible.

Comment on lines 38 to 45
static zalign_8 k1k2[8] =
{0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};

static zalign_8 k3k4[8] =
{0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8};
static zalign_2 k5k6[2] = {0xf20c0dfe, 0x14cd00bd6};
static zalign_2 k7k8[2] = {0xdd45aab8, 0x000000000};
static zalign_2 poly[2] = {0x105ec76f1, 0xdea713f1};
Contributor

These should probably be static const.
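For illustration, the constants above would then read roughly as follows (a sketch; the zalign_8/zalign_2 typedefs are unchanged, and the loads below would cast to const-qualified pointers accordingly):

static const zalign_8 k1k2[8] =
    {0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
static const zalign_2 k5k6[2] = {0xf20c0dfe, 0x14cd00bd6};
/* ...and likewise for k3k4, k7k8, and poly. */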

Comment on lines +79 to +82
y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
Contributor
@bdonlan Jul 19, 2023

Would it be better to move the loads up to the start of the loop, since they have potentially longer latency than clmul? (depending on whether the data is cached)

Edit: I see now this is effectively front-loading the loads from L53-L56, never mind

Contributor

Also, was prefetching considered?
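If it were worth trying, a minimal sketch of software prefetching inside this 256-byte loop might look like the following (the distance of one iteration ahead is an assumption that would need benchmarking on the target microarchitecture):

/* Hypothetical: prefetch the next iteration's four cache lines while the
 * current 256-byte block is being folded. */
_mm_prefetch((const char *)(input + 0x100), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x140), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x180), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x1C0), _MM_HINT_T0);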

Comment on lines 133 to 177
/*
* Fold 512-bits to 384-bits.
*/
a0 = _mm_load_si128((__m128i *)k5k6);

a1 = _mm512_extracti32x4_epi32(x1, 0);
a2 = _mm512_extracti32x4_epi32(x1, 1);

a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);

a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
* Fold 384-bits to 256-bits.
*/
a2 = _mm512_extracti32x4_epi32(x1, 2);
a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);
a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
* Fold 256-bits to 128-bits.
*/
a2 = _mm512_extracti32x4_epi32(x1, 3);
a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);
a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
* Fold 128-bits to 64-bits.
*/
a2 = _mm_clmulepi64_si128(a1, a0, 0x10);
a3 = _mm_setr_epi32(~0, 0, ~0, 0);
a1 = _mm_srli_si128(a1, 8);
a1 = _mm_xor_si128(a1, a2);

a0 = _mm_loadl_epi64((__m128i *)k7k8);
a2 = _mm_srli_si128(a1, 4);
a1 = _mm_and_si128(a1, a3);
a1 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_xor_si128(a1, a2);
Contributor
@bdonlan Jul 19, 2023

I wonder if this sequence can be improved. This sequence contends a lot on port 5 on Alder Lake architectures (which handles CLMUL as well as EXTRACTI32X4) and also has a lot of interdependent instructions. In particular, clmul and extract operations both run on port 5 with latency 3 (on Alder Lake), so avoiding serial sequences of these instructions would be preferable.

One way to reduce this latency would be to shift all of the 64-bit lanes into their final position in one go, before doing any XORs to merge them together. This means that we'd have one phase of CLMULs and XORs to shift into a series of 128-bit lanes within our 512-bit register, then a phase of further EXTRACT/XOR reduction to 128 bits before going into Barrett reduction.

Sketch would look a bit like:

x0 = _mm512_load_si512((__m512i *)ksplat); // computing ksplat is left as an exercise for the
                                           // reader
// Shift all 128-bit lanes into their final positions
x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // p5, lat=3, 1/tp=1
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);

// Now merge down to 128 bits. Here, again, EXTRACT (p5) and XOR (p05) should run in parallel
// as much as possible.
a0 = _mm512_extracti32x4_epi32(x1, 0); // p5, lat=3, 1/tp=1
a1 = _mm512_extracti32x4_epi32(x1, 1);
a2 = _mm512_extracti32x4_epi32(x1, 2);
a3 = _mm512_extracti32x4_epi32(x1, 3);
a0 = _mm_xor_si128(a0, a1); // p015, lat=1, 1/tp=0.33
a0 = _mm_xor_si128(a0, a2);
a0 = _mm_xor_si128(a0, a3);

// continue with Barrett reduction

While this would need to be benchmarked, I'd expect it to have significantly lower latency due to a reduction of dependencies between high-latency instructions. The main downside is that ksplat will be larger, resulting in a larger cache footprint (but we save on I-cache size, so maybe it's a wash?)

it is definitely possible to fold from 512 to 256, then from 256 to 128 bits (fewer steps than above) - an example fold from 256 to 128 is here: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm#L136

here is an example of folding 8x128 bits into 1x128 bits using different constants in one step:
https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm#L184C1-L184C1

Comment on lines 98 to 116
/*
* Fold into 512-bits.
*/
x0 = _mm512_load_si512((__m512i *)k3k4);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x3);
x1 = _mm512_xor_si512(x1, x5);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x4);
x1 = _mm512_xor_si512(x1, x5);
Contributor
@bdonlan Jul 19, 2023

We might want to have a path that skips forward to the loop at L121 without doing this reduction for inputs smaller than 256 bytes. This should be free as we have to compare the input length anyway before the loop at L68

yep, note that the initial crc value needs to be handled separately then
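A rough structural sketch of that shortcut (hypothetical; variable names follow the snippet above, and the exact seeding of the initial crc would need to be checked against the rest of the function):

if (length < 256) {
    /* Skip the 4-lane fold: seed a single 512-bit lane with the first 64
     * bytes, XOR the initial CRC into its low 32 bits, then fall through
     * to the 64-byte folding loop. */
    x1 = _mm512_loadu_si512((const __m512i *)input);
    x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128((int)crc)));
    input += 64;
    length -= 64;
}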


/* For small input, forget about alignment checks - simply compute the CRC32c one byte at a time */
if (length < (int)sizeof(slice_ptr_int_type)) {
while (length-- > 0) {
Contributor

There are 16/32/64-bit variants of the CRC32 op that we should probably take advantage of.
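For reference, a generic sketch of what using the wider variants looks like (this is not the PR's code; it just illustrates the _mm_crc32_u8/u32/u64 family):

#include <nmmintrin.h> /* SSE4.2 CRC32 intrinsics */
#include <stdint.h>
#include <string.h>

static uint32_t crc32c_tail_sketch(const uint8_t *buf, size_t len, uint32_t crc) {
#if INTPTR_MAX == INT64_MAX
    uint64_t c = crc;
    while (len >= 8) { /* 64-bit CRC32 op on 64-bit targets */
        uint64_t v;
        memcpy(&v, buf, sizeof(v));
        c = (uint64_t)_mm_crc32_u64(c, v);
        buf += 8;
        len -= 8;
    }
    crc = (uint32_t)c;
#endif
    while (len >= 4) { /* 32-bit CRC32 op */
        uint32_t v;
        memcpy(&v, buf, sizeof(v));
        crc = _mm_crc32_u32(crc, v);
        buf += 4;
        len -= 4;
    }
    while (len--) { /* byte-at-a-time remainder */
        crc = _mm_crc32_u8(crc, *buf++);
    }
    return crc;
}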

Contributor Author

This will use the 64-bit version on x64 and the 32-bit one on x86. That's what those typedefs and defines in the private header are for.

Contributor Author

This will use the 64-bit version on x64 and the 32-bit one on x86. That's what those typedefs and defines in the private header are for.

Never mind, I was in the wrong place in the file. This branch is for tiny inputs, and it doesn't seem worth optimizing to me.

Contributor

I would be interested in performance for small inputs as well - but I'd agree that's more likely to be dwarfed by surrounding code for sure.

Comment on lines 121 to 131
while (length >= 64) {
x2 = _mm512_loadu_si512((__m512i *)input);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);

input += 64;
length -= 64;
}
Contributor

I wonder if there's any opportunity to use mask registers to perform a sub-64-byte CRC operation? This would remove the need for a byte-by-byte loop, replacing it with a _mm512_maskz_loadu_epi8 operation, an extra round of CLMUL, and some ancillary scalar logic to compute the right mask value and select the right set of shift constants from a static table.

For this to work, we'd need to prepare a table of multiplication constants for various shift lengths, which would increase our cache footprint. There's probably a tradeoff here - maybe have a table for each 64-bit increment and use the CRC32 instruction for the remaining bytes?
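For the masked-load piece of that idea, a sketch might look like the following (the per-length fold constants are omitted, since they depend on the table design being discussed):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load the 1..63 trailing bytes into a zmm register
 * under a k-mask, leaving the remaining bytes zeroed. */
static __m512i load_tail_sketch(const uint8_t *input, size_t remaining) {
    __mmask64 mask = (remaining >= 64) ? ~(__mmask64)0
                                       : (((__mmask64)1 << remaining) - 1);
    return _mm512_maskz_loadu_epi8(mask, input);
}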

This solution may be possible; the loaded message has to match the right constants as well. The ISA-L and ipsec-mb CRC implementations have a special code path for fewer than 256 bytes (with multiple cases handled there) - essentially it processes the message in 16-byte chunks.

tests/crc_test.c Outdated
Comment on lines 105 to 115

struct aws_byte_buf avx_buf;
/* enough for two avx512 runs */
aws_byte_buf_init(&avx_buf, allocator, 512);
aws_device_random_buffer(&avx_buf);

uint32_t crc = aws_checksums_crc32c_sw(avx_buf.buffer, (int)avx_buf.len, 0);
uint32_t hw_crc = aws_checksums_crc32c_hw(avx_buf.buffer, (int)avx_buf.len, 0);

aws_byte_buf_clean_up(&avx_buf);
ASSERT_UINT_EQUALS(hw_crc, crc);
Contributor

This should test various sizes of unaligned buffers as well. I'd prefer to test every buffer size from 0 to 512+1 bytes, and have some tests that deliberately unalign the buffer and/or put it just before a page boundary where the next page is inaccessible.
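A sketch of that kind of sweep, reusing the helpers already in the test above (loop bounds and the slack for misaligned offsets are illustrative):

struct aws_byte_buf buf;
/* 512 + 1 for the largest length, plus slack for misaligned starting offsets */
aws_byte_buf_init(&buf, allocator, 512 + 1 + 16);
aws_device_random_buffer(&buf);

for (size_t offset = 0; offset < 16; ++offset) {
    for (size_t len = 0; len <= 512 + 1; ++len) {
        uint32_t sw_crc = aws_checksums_crc32c_sw(buf.buffer + offset, (int)len, 0);
        uint32_t hw_crc = aws_checksums_crc32c_hw(buf.buffer + offset, (int)len, 0);
        ASSERT_UINT_EQUALS(sw_crc, hw_crc);
    }
}
aws_byte_buf_clean_up(&buf);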

Comment on lines 101 to 106
x0 = _mm512_load_si512((__m512i *)k3k4);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);
Contributor

The values that start out in x1 are effectively being multiplied three times by k3k4. (Similarly, x2 is multiplied twice, x3 once). Would we be able to reduce the dependency chain here by having three premultiplied constants instead of one, and XORing at the end?

Contributor

Note that having three premultiplied constants does increase cache usage - we could avoid this by multiplying k3k4 multiple times into each lane instead. That is:

x4 = MUL128(k3k4, k3k4) // CLMUL * 2, XOR; lat=5, 1/tp=2 (contends on port 5 for CLMUL)
x1 = MUL128(x1, k3k4)
x3 = MUL128(x3, k3k4)
x2 = MUL128(x2, x4) // waits 2 cycles for CLMUL latency
x1 = MUL128(x1, x4) // waits 2 cycles for CLMUL latency

x1 = (x2 ^ x3) ^ x1 // the first XOR can execute in parallel with the last MUL128

While this requires more multiplications (10 CLMULs vs 6), because CLMULs have a high latency, this approach might end up better utilizing the CLMUL capacity of the CPU.
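For clarity, the MUL128 shorthand in that sketch is the usual per-lane fold step - two carry-less multiplies plus an XOR - which for the 512-bit registers used here would look roughly like this (an illustrative helper, not the PR's code):

#include <immintrin.h>

static inline __m512i mul128_fold512(__m512i a, __m512i k) {
    /* Multiply the low and high 64-bit halves of each 128-bit lane of a by
     * the matching halves of k, then combine the partial products. */
    __m512i lo = _mm512_clmulepi64_epi128(a, k, 0x00);
    __m512i hi = _mm512_clmulepi64_epi128(a, k, 0x11);
    return _mm512_xor_si512(lo, hi);
}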

Contributor
@bdonlan Jul 19, 2023

On further thought, the premultiplication trick might not work, since we're not doing a real 128-bit multiplication operation. It's probably simpler to just load multiple constants... though it would be interesting to see what it would cost to compute this constant using a round of Barrett reduction, and whether this could run in parallel with the memory loads going on in the first part of this function.

Comment on lines +74 to +76
if (detected_sse42 && detected_clmul) {
return aws_checksums_crc32c_sse42(input, length, crc);
}
Contributor

Will the sse42 impl be worth invoking after we've finished processing the lead portion using the avx512 implementation?

Contributor

Thank you for the comments. I will review them closely.

For full disclosure - I re-used the avx512 intrinsic implementation from https://chromium.googlesource.com/chromium/src/third_party/zlib/+/b890619bc2b193b8fbe9c1c053f4cd19a9791d92/crc32_simd.c

but recomputed the constants for the crc32c polynomial :)

Contributor Author

@pbadari I've got the build surgery done and tests passing if you'd like to work on the avx512 comments from bdonlan@

Contributor Author

you'll need the AVX512 branch from aws-c-common until we merge it

Contributor

Yes. I am reviewing the avx512 comments from bdonlan

Contributor

Jonathan, most of the review comments (from Donlan) for the AVX512 code are further performance improvements, which require careful rewriting/prototyping and performance analysis. I reached out to our area expert for his input. I am wondering if we can merge the current patch and update it further when we have the new code ready? Please let me know.

Contributor Author
@JonathanHenson Jul 24, 2023

Unfortunately, AVX512 has a history of being an incredibly risky addition to an already functioning (possibly already running really hot) system. Various chipset versions introduce timing issues for side-channel attacks as well as side effects on other processes sharing the CPU. So we're going to have to run a lot of tests before we can just run this in production anyway, and we'd like the code to be close to its final structure and algorithm before running that analysis.

This particular code runs for a lot of S3 PUT and GET operations across multiple SDKs, so any side-effects would most likely be felt across a large blast radius.

Contributor

We are submitting an AVX-512 based implementation of crc32 to zlib-chromium as well and want to make sure that both code contributions are licensed appropriately. For now, can we withdraw the patch/submission? I will resolve the issue and resubmit.

@tkanteck left a comment

Great work on CRC performance improvements using the new instructions. I am adding a few pointers and comments based on my previous experience with CRC implementations.

Comment on lines 89 to 97
x1 = _mm512_xor_si512(x1, x5);
x2 = _mm512_xor_si512(x2, x6);
x3 = _mm512_xor_si512(x3, x7);
x4 = _mm512_xor_si512(x4, x8);

x1 = _mm512_xor_si512(x1, y5);
x2 = _mm512_xor_si512(x2, y6);
x3 = _mm512_xor_si512(x3, y7);
x4 = _mm512_xor_si512(x4, y8);

The ternary logic operation code 0x96 (3-way XOR) can be used to reduce the number of XORs here (and in other places below).
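Applied to the pairs above, that would look something like the following sketch (variable names from the snippet; 0x96 is the truth table for a ^ b ^ c):

/* Each pair of XORs collapses into one ternary-logic op. */
x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);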


Recent compiler versions seem to come up with some ternary logic instructions on their own. For what it's worth, today I came up with an intuitive way of defining the 0x96:

class ternary
{
  static constexpr uint8_t A= 0b11110000;
  static constexpr uint8_t B= 0b11001100;
  static constexpr uint8_t C= 0b10101010;
public:
  static constexpr uint8_t XOR3= A ^ B ^ C;
  static constexpr uint8_t XNOR3= uint8_t(~(A ^ B ^ C));
  static constexpr uint8_t XOR2_AND= (A ^ B) & C;
};

/** @return a^b^c */
static inline __m128i xor3_128(__m128i a, __m128i b, __m128i c)
{
  return _mm_ternarylogic_epi64(a, b, c, ternary::XOR3);
}

In last-millennium C, you might write something like the following:

#define TERNARY_A 0xf0
#define TERNARY_B 0xcc
#define TERNARY_C 0xaa
#define TERNARY_XOR3 (TERNARY_A ^ TERNARY_B ^ TERNARY_C)
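Completing that C analogue with the helper itself (a sketch; the 128-bit form of the intrinsic additionally requires AVX-512VL):

/* returns a ^ b ^ c */
static inline __m128i xor3_128(__m128i a, __m128i b, __m128i c) {
    return _mm_ternarylogic_epi64(a, b, c, TERNARY_XOR3);
}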

#if defined(AWS_HAVE_AVX512_INTRINSICS) && (INTPTR_MAX == INT64_MAX)
int chunk_size = length & ~63;

if (detected_avx512 && detected_vpclmulqdq && detected_clmul) {

I'd say detected_clmul is not needed, as it will always be true if detected_vpclmulqdq is true.


/*
* crc32c_avx512(): compute the crc32c of the buffer, where the buffer
* length must be at least 256, and a multiple of 64. Based on:

Forcing the length to be a multiple of 64 bytes looks too restrictive. K-masks can be used to deal with buffer lengths that are not a multiple of 64 bytes. Any reason for this condition?

@pbadari
Contributor

pbadari commented Aug 8, 2023

Addressed most of the above comments and created a new PR to merge/review changes:

#73

@JonathanHenson
Contributor Author

JonathanHenson commented Aug 8, 2023

Addressed most of the above comments and created a new PR to merge/review changes:

#73

Internally, this PR inspired a lot of conversation and resulted in some more code being contributed here for AVX512 crc32 as well. If you created the PR from your own fork it won't work, because the CI needs credentials from someone authorized for our AWS account to run. If so, I'll move it to a branch and apply the new additional changes on top as appropriate. Thanks for this contribution!

@pbadari
Contributor

pbadari commented Aug 8, 2023

Addressed most of the above comments and created a new PR to merge/review changes:
#73

Internally, this PR inspired lots of convo and resulted in some more code being sent to contribute here for avx512 crc32 as well. If you created the PR from your own fork it won’t work, because the CI needs creds from someone authorized for our AWS account to run. If so I’ll move it to a branch and apply the new additional changes on top as appropriate. Thanks for this contribution!

Changes in #73 are for the AVX512 branch.

Also, looking at the code, the crc32 version is NOT (Intel) optimized. Is it not used heavily? Do we need to provide an optimized version for it too? If so, we may want to have a common implementation for both (crc32 and crc32c), extracting the constants out. Please let me know.
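One way a shared implementation could be structured (purely illustrative; the names and layout below are assumptions, not existing code): parameterize the folding kernel by a table of constants and instantiate it once per polynomial.

#include <stdint.h>

/* Hypothetical: only the fold/Barrett constants differ between CRC-32 and
 * CRC-32C, so a single AVX-512 kernel could take them as a parameter. */
typedef struct crc32_avx512_constants {
    uint64_t k1k2[8];
    uint64_t k3k4[8];
    uint64_t k5k6[2];
    uint64_t k7k8[2];
    uint64_t poly[2];
} crc32_avx512_constants;

uint32_t crc32_avx512_common(const uint8_t *input, int length, uint32_t crc,
                             const crc32_avx512_constants *constants);

/* crc32c and crc32 would each call crc32_avx512_common() with their own
 * constant table. */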

@pbadari
Contributor

pbadari commented Aug 11, 2023

Jonathan, any update on the new branch with all the changes for us to review/analyze?

@JonathanHenson
Contributor Author

Jonathan, any update on the new branch with all the changes for us to review/analyze?

It's the next item on my list. Hopefully I'll be able to pick it back up later today

@BadariP

BadariP commented Aug 28, 2023

I now have common code with AVX512 support for crc32 and crc32c. I will wait for the previous changes to be merged to the branch before submitting the new code.

@BadariP

BadariP commented Sep 5, 2023

Jonathan, any update?

@JonathanHenson
Contributor Author

Jonathan, any update?

Yes, sorry, I'm still finishing up some other high-priority work and will switch back to this as soon as I possibly can.

@BadariP

BadariP commented Oct 21, 2023

Jonathan, any update?

@BadariP

BadariP commented Dec 1, 2023

Hi Jonathan,

Can you please let me know next steps? Anything I can do to help?

@JonathanHenson
Contributor Author

Sorry, just now getting back to this to try and finish up my work for the year, and I noticed you just wanted PR #73 merged. I'm sorry for the delay. I got pulled off into other work. I've merged it over to this branch.

@raviagiri

raviagiri commented Jan 1, 2024 via email

@pbadari
Contributor

pbadari commented Apr 16, 2024

Jonathan, please let me know what I can do to help to push this work along.

@dr-m

dr-m commented May 3, 2024

This week, I finished translating some hand-written NASM code from https://github.com/intel/intel-ipsec-mb/ into C++11 and created https://github.com/dr-m/crc32_simd/. There are no alignment or buffer size restrictions in that implementation. The main loop processes 256 bytes at a time. Any leftovers from the smaller special cases (multiples of 128, 64, 32, and 16 bytes) will be handled by a masked vmovdqu that will load 1 to 15 bytes. I made some optimizations in my translation, such as making a negation of the crc value part of the final vternlogq operation. When compiled with GCC 13.2, the "reflected" polynomial variant crc32_avx512<true> is more than 200 bytes shorter than the hand-written crc32_refl_by16_vclmul_avx512. The compiler seems to be doing a good job at fusing machine instructions.

On a system where I tested the crc32_avx512<true> today, it computed a CRC-32C on a 1-gigabyte buffer in 52 milliseconds of user CPU time. Our previous fastest x86-64 implementation of my_crc32c in https://github.com/MariaDB/server/ would use 160 milliseconds. That code is using AVX xmm registers, 64×64→128 pclmul, and SSE 4.2 crc32 instructions.

Should I create a separate pull request for this?

@pbadari
Contributor

pbadari commented May 3, 2024

This week, I finished translating some hand-written NASM code from https://github.com/intel/intel-ipsec-mb/ into C++11 and created https://github.com/dr-m/crc32_simd/. There are no alignment or buffer size restrictions in that implementation. The main loop processes 256 bytes at a time. Any leftovers from the smaller special cases (multiples of 128, 64, 32, 64, 32, 16 bytes) will be handled by a masked vmovdqu that will load 1 to 15 bytes. I made some optimizations in my translation, such as making a negation of the crc value part of the final vternlogq operation. When compiled with GCC 13.2, the "reflected" polynomial variant crc32_avx512<true> is more than 200 bytes shorter than the hand-written crc32_refl_by16_vclmul_avx512. The compiler seems to be doing a good job at fusing machine instructions.

On a system where I tested the crc32_avx512<true> today, it computed a CRC-32C on a 1-gigabyte buffer in 52 milliseconds of user CPU time. Our previous fastest x86-64 implementation of my_crc32c in https://github.com/MariaDB/server/ would use 160 milliseconds. That code is using AVX xmm registers, 64×64→128 pclmul, and SSE 4.2 crc32 instructions.

Should I create a separate pull request for this?

I have not reviewed your changes, but did a quick test and ran into this.

# ./test/test_crc32
Testing AVX512+VPCLMULQDQ: CRC-32CSegmentation fault (core dumped)

@dr-m

dr-m commented May 9, 2024

I have not reviewed your changes, but did a quick test and ran into this.

# ./test/test_crc32
Testing AVX512+VPCLMULQDQ: CRC-32CSegmentation fault (core dumped)

Thank you for taking a look at it, @pbadari. Can you provide some more details, such as the CMAKE_CXX_COMPILER and CMAKE_CXX_FLAGS from the CMakeCache.txt, and a disassembly of the function around the crash? I had tested the code generated with GCC 11, 13, and a few versions of clang between 8 and 18, using two different systems. I only had time limited access to those systems. Intel SDE could be an option for further debugging.

@pbadari
Contributor

pbadari commented May 9, 2024

I have not reviewed your changes, but did a quick test and ran into this.

# ./test/test_crc32
Testing AVX512+VPCLMULQDQ: CRC-32CSegmentation fault (core dumped)

Thank you for taking a look at it, @pbadari. Can you provide some more details, such as the CMAKE_CXX_COMPILER and CMAKE_CXX_FLAGS from the CMakeCache.txt, and a disassembly of the function around the crash? I had tested the code generated with GCC 11, 13, and a few versions of clang between 8 and 18, using two different systems. I only had time limited access to those systems. Intel SDE could be an option for further debugging.

I was able to get the test working with this change:

     for (size_t i = sizeof funcs / sizeof(*funcs); i--; ) {
       fputs(funcs[i].name, stderr);
-      if (size_t s = test_buf(buf, sizeof buf, funcs[i].c[1], funcs[i].c[2])) {
+      if (size_t s = test_buf(buf, sizeof buf, funcs[i].c[0], funcs[i].c[1])) {
         fprintf(stderr, "(failed at %zu)", s);
         status = EXIT_FAILURE;
       }

I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. The good thing about your code is that it's common code for crc32 and crc32c. Are you planning to clean it up and submit it to the aws-checksums project?

@DmitriyMusatkin
Contributor

Unfortunately, Jonathan is no longer with the CRT team. We'll sync up internally and see if we can find someone else to push those changes along. I spoke with Jonathan briefly and he mentioned that all the latest changes are in the following 2 PRs and this PR can be closed.
#81
#79

@dr-m

dr-m commented May 10, 2024

I was able to get the test working with this change:

Thank you. I made this off-by-one mistake in dr-m/crc32_simd@ac55c31 when I implemented an option to invoke the CRC functions on input files. Apparently I forgot to rerun the test program without any arguments, on an AVX512 capable machine where that part of the code would run. I remember that I did run it with some arguments.

I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. Good thing about your code is that, its common code for crc32 and crc32c. Are you planning to clean up and submit for aws-checksums project?

Yes, I would love to. I looked around and did not find any other SIMD-optimized CRC library for C or C++ that targets multiple ISAs (other than as part of a bigger project, such as the mysys/crc32 subdirectory of MariaDB Server). For example, boost::crc is SISD only. I see that this library is C, so some minor refactoring would be needed.

Do you have any use for the non-reflected polynomials? If not, the conversion should be rather trivial.

@pbadari
Contributor

pbadari commented May 10, 2024

Unfortunately, Jonathan is no longer with the CRT team. We'll sync up internally and see if we can find someone else to push those changes along. I spoke with Jonathan briefly and he mentioned that all the latest changes are in following 2 prs and this pr can be close. #81 #79

Along with those PRs, we need the following to be merged:

#88: incorrect checksum due to extra bit flip.
#89: add support for crc32 also

@pbadari
Contributor

pbadari commented May 10, 2024

I was able to get the test working with this change:

Thank you. I made this off-by-one mistake in dr-m/crc32_simd@ac55c31 when I implemented an option to invoke the CRC functions on input files. Apparently I forgot to rerun the test program without any arguments, on an AVX512 capable machine where that part of the code would run. I remember that I did run it with some arguments.

I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. Good thing about your code is that, its common code for crc32 and crc32c. Are you planning to clean up and submit for aws-checksums project?

Yes, I would love to. I looked around and did not find any other SIMD optimized CRC library for C or C++ that would target multiple ISA (other than as part of a bigger project, such as the mysys/crc32 subdirectory of MariaDB Server). For example, boost::crc is SISD only. I see that this library is C, so some minor refactoring would be needed.

Yes. Refactoring to C would be nice.

Do you have any use for the non-reflected polynomials? If not, the conversion should be rather trivial.

aws-checksums polynomials are reflected. So, if you can provide only the reflected version, that would be nice.

dr-m added a commit to dr-m/aws-checksums that referenced this pull request May 11, 2024
This implementation is based on crc32_refl_by16_vclmul_avx512
in https://github.com/intel/intel-ipsec-mb/ with some optimizations.

Some of the code is based on awslabs#72.
@dr-m

dr-m commented May 11, 2024

I filed #90 for my version.

dr-m added a commit to dr-m/aws-checksums that referenced this pull request May 25, 2024
This implementation is based on crc32_refl_by16_vclmul_avx512
in https://github.com/intel/intel-ipsec-mb/ with some optimizations.

Changes to CMakeLists.txt and source/intel/asm/crc32c_sse42_asm.c
are based on awslabs#72.

This also fixes a bug in aws_checksums_crc32c_hw() when 128-bit pclmul
is not available. crc_intrin_fn was being invoked on bytes instead
of 32-bit or 64-bit words. The aws-checksums-tests was extended to cover
all SIMD implementations.

Note: The availability of the Intel CRC-32C instructions is checked
as part of testing AWS_CPU_FEATURE_SSE_4_2. Both ISA extensions were
introduced in the Intel Nehalem microarchitecture.

For compiling this, https://github.com/awslabs/aws-c-common must be
installed and CMAKE_MODULE_PATH must point to it, e.g.:
cmake -DCMAKE_MODULE_PATH=/usr/local/lib/cmake.

The AWS_CPU_FEATURE_AVX512 currently only checks for AVX512F and not
other features that this implementation depends on:
AVX512VL, AVX512BW, AVX512DQ. According to
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
there currently exist no CPUs that would support VPCLMULQDQ without
supporting all those AVX512 features.

The architecture target evex512 is something that was introduced as
mandatory in GCC 14 and clang 18 as part of introducing the AVX10.1-512
target, which basically is a new name for a number of AVX512 features.
Older compilers do not recognize this target, but they do emit EVEX
encoded instructions.
dr-m added a commit to dr-m/aws-checksums that referenced this pull request Jul 8, 2024
(Same commit message as the May 25 commit above.)
Successfully merging this pull request may close these issues.

s_has_vpclmulqdq() is not checking correct bit to detect VPCLMULQDQ