Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX512 and VPCLMULQDQ based CRC-32 and CRC-32C #90

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dr-m
Copy link

@dr-m dr-m commented May 11, 2024

Issue #, if available: #72

Description of changes: This implementation in crc32_avx512.c is based on crc32_refl_by16_vclmul_avx512 in https://github.com/intel/intel-ipsec-mb/ with some optimizations.

Some of the code is based on #72. Because this repository appears to depend on definitions in other repositories, I was not able to compile and test this locally. I merely tested that gcc -O2 -std=c11 -c crc32_avx512.c produces reasonable looking output.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@pbadari
Copy link
Contributor

pbadari commented May 14, 2024

  1. Bunch of compile failures on my avx512 enabled system.
  2. Instead of implementing aws_checksums_hw_detect(), we should use the one provided in aws-c-common:
    eg.
+    if (AWS_UNLIKELY(!detection_performed)) {
+        detected_avx512 = aws_cpu_has_feature(AWS_CPU_FEATURE_AVX512);
+        detected_vpclmulqdq = aws_cpu_has_feature(AWS_CPU_FEATURE_VPCLMULQDQ);
+        detection_performed = true;

@dr-m
Copy link
Author

dr-m commented May 15, 2024

  1. Bunch of compile failures on my avx512 enabled system.

Thank you, I was aware that this repository depends on something else, and now I know that it is https://github.com/awslabs/aws-c-common/. After make install of that and

cmake -DCMAKE_MODULE_PATH=/usr/local/lib/cmake /path/to/aws-checksums

I got the CMake logic to work as apparently intended, and fixed the compilation failures in a0a8785.

While working on this, I found the following:

/usr/bin/cc -DHAVE_SYSCONF -DINTEL_NO_ITTNOTIFY_API -I/path/to/aws-checksums/include -std=gnu99 -Wall -Wstrict-prototypes -fno-omit-frame-pointer -Wextra -pedantic -Wno-long-long -fPIC  -mavx512f -mvpclmulqdq -mpclmul -mavx -mavx2 -msse4.2 -MD -MT CMakeFiles/aws-checksums.dir/source/intel/crc_hw.c.o -MF CMakeFiles/aws-checksums.dir/source/intel/crc_hw.c.o.d -o CMakeFiles/aws-checksums.dir/source/intel/crc_hw.c.o -c /path/to/aws-checksums/source/intel/crc_hw.c
/path/to/aws-checksums/source/intel/crc_hw.c:12:8: error: unknown type name ‘bool’
   12 | static bool detection_performed;

The -std=gnu99 apparently comes from the C_STANDARD 99 specified in include(AwsCFlags). In C99, #include <stdbool.h> defines bool.

This compiler invocation indicates a problem with include(AwsSIMD): it apparently is enabling all ISA extensions in all source files. That is risky, because the compiler could make use of some ISA extensions in all code, including code that is not covered by any cpuid detection. A real-world consequence of such unsafe logic (SIGILL on some CPUs) was addressed in MariaDB/server@58f184a.

I think that it is clearer to define target attributes for those functions that make use of ISA extensions. The only limitation that I know of is that with GCC 4, you can only use the Intel intrinsic headers if you first enable the ISA extensions with -m flags. This problem was fixed in GCC 5. An alternative would be to use one compilation unit per a set of ISA extensions, and enable the extensions for each compilation unit individually.

2. Instead of implementing aws_checksums_hw_detect(), we should use the one provided in aws-c-common:
   eg.
+    if (AWS_UNLIKELY(!detection_performed)) {
+        detected_avx512 = aws_cpu_has_feature(AWS_CPU_FEATURE_AVX512);
+        detected_vpclmulqdq = aws_cpu_has_feature(AWS_CPU_FEATURE_VPCLMULQDQ);
+        detection_performed = true;

As far as I understand, aws_cpu_has_feature(AWS_CPU_FEATURE_AVX512) is not a sufficient condition, because it only refers to the AVX512F feature flag. The implementation depends on AVX512DQ, AVX512BW and AVX512VL as well.

I see that there are some branches in aws-c-common that might be more appropriate to use here, such as avx512runtimedetection. Maybe this has already been addressed in some branch, or something needs to be revised further. I did not revise the CPU detection yet.

@pbadari
Copy link
Contributor

pbadari commented May 16, 2024

  1. I still see lot of warning when compiling

warning: ISO C99 does not support ‘_Alignas’ [-Wpedantic]
34 | _Alignas(64) const uint64_t b896[6]; /* includes b786, b640 */
| ^~~~~~~~

etc..

warning: pointer targets in passing argument 1 of ‘load512’ differ in signedness [-Wpointer-sign]
162 | lo = xor512(load512(buf), crc_in);
| ^~~
| |
| const uint8_t * {aka const unsigned char *}

etc..

  1. Currently the new code doesn't get executed and needs something like..
--- a/source/crc.c
+++ b/source/crc.c
@@ -12,7 +12,7 @@ static uint32_t (*s_crc32_fn_ptr)(const uint8_t *input, int length, uint32_t pre
 
 uint32_t aws_checksums_crc32(const uint8_t *input, int length, uint32_t previousCrc32) {
     if (AWS_UNLIKELY(!s_crc32_fn_ptr)) {
-        if (aws_cpu_has_feature(AWS_CPU_FEATURE_ARM_CRC)) {
+        if (aws_cpu_has_feature(AWS_CPU_FEATURE_AVX512) || aws_cpu_has_feature(AWS_CPU_FEATURE_ARM_CRC)) {
             s_crc32_fn_ptr = aws_checksums_crc32_hw;
         } else {
             s_crc32_fn_ptr = aws_checksums_crc32_sw;

@dr-m
Copy link
Author

dr-m commented May 17, 2024

  1. I still see lot of warning when compiling

Sorry, I somehow overlooked those warnings. Fixed in c69772c.

2. Currently the new code doesn't get executed and needs something like..

Right, we would need that for the zlib crc32() compatible function. The CRC-32C acceleration should work without any changes, because anything that supports AVX512F would also support SSE4.2.

I will try to refactor the ISA extension detection. I think that s_crc32_fn_ptr and s_crc32c_fn_ptr should point to one specific implementation, so there will be no single entry point aws_checksums_crc32_hw or aws_checksums_crc32c_hw but specific implementations, such as aws_checksums_crc32_sse42 or aws_checksums_crc32_avx512. The ISA extension detection functions would return a pointer to a specific implementation.

@dr-m
Copy link
Author

dr-m commented May 17, 2024

While refactoring the ISA extension detection in 7c6a87a, I found a bug in your existing implementation, in case the carry-less multiplication is not available:

diff --git a/source/intel/crc_hw.c b/source/intel/crc_hw.c
--- a/source/intel/crc_hw.c
+++ b/source/intel/crc_hw.c
@@ -95,13 +95,13 @@ uint32_t aws_checksums_crc32c_hw(const uint8_t *input, int length, uint32_t prev
         crc = (uint32_t)_mm_crc32_u8(crc, *input++);
     }
 
-    if (detected_sse42 && detected_clmul) {
+    if (0) {
         return aws_checksums_crc32c_sse42(input, length, crc);
     }
 
     /* Spin through remaining (aligned) 8-byte chunks using the CRC32Q quad word instruction */
     while (length >= (int)sizeof(slice_ptr_int_type)) {
-        crc = (uint32_t)crc_intrin_fn(crc, *input);
+        crc = (uint32_t)crc_intrin_fn(crc, *(const slice_ptr_int_type*)input);
         input += sizeof(slice_ptr_int_type);
         length -= (int)sizeof(slice_ptr_int_type);
     }

The crc32 instructions that this function is using are already kind of part of SSE 4.2; they were introduced in the Intel Nehalem microarchitecture. So, the check detected_sse42 is already kind of late here. With the above fix, also the generated code looks saner: crc32q is being invoked on a memory argument, instead of on a register that was movzbl’d from memory.

Note: my CPU detection depends on the following patch to awslabs/aws-c-common@6ebf3bc:

diff --git a/source/arch/intel/cpuid.c b/source/arch/intel/cpuid.c
index 465fccd1..03c52993 100644
--- a/source/arch/intel/cpuid.c
+++ b/source/arch/intel/cpuid.c
@@ -88,9 +88,10 @@ static bool s_has_avx2(void) {
 static bool s_has_avx512(void) {
     uint32_t abcd[4];
 
-    /* Check AVX512F:
-     * CPUID.(EAX=07H, ECX=0H):EBX.AVX512[bit 16]==1 */
-    uint32_t avx512_mask = (1 << 16);
+    /* Check AVX512 flags.
+     * CPUID.(EAX=07H, ECX=0H):EBX */
+    uint32_t avx512_mask = 1 << 16 /* AVX512F */ | 1 << 17 /* AVX512DQ */ |
+      1U << 30 /* AVX512BW */ | 1U << 31 /* AVX512VL */;
     aws_run_cpuid(7, 0, abcd);
     if ((abcd[1] & avx512_mask) != avx512_mask) {
         return false;

It could be clearer to rename AWS_CPU_FEATURE_AVX512 to something like AWS_CPU_FEATURE_AVX10_1_512 and make it cover at least these 4 AVX512 flags. To my understanding, avx10.1-512 is just a new name for a bunch of existing AVX512 ISA extensions, and avx10.1-256 is a name for a 256-bit subset. Any further extensions avx10.2-256 and possibly avx10.2-512 would introduce new cpuid flags.

By the way, I see that source/crc.c is being compiled without any ISA extension flags. But source/intel/crc_hw.c and source/intel/asm/crc32c_sse42_asm.c are in danger: you are enabling among others AVX2, AVX512F (but not other AVX512 flags) while compiling them. As far as I can tell, it would only be safe to enable -msse4.2 for the first file, and -msse4.2 -mpclmul for the second one.

@pbadari
Copy link
Contributor

pbadari commented May 17, 2024

One more change needed:

--- a/source/crc.c
+++ b/source/crc.c
@@ -43,7 +43,7 @@ uint32_t aws_checksums_crc32c(const uint8_t *input, int length, uint32_t previou
 # ifdef AWS_HAVE_AVX512_INTRINSICS
         if (aws_cpu_has_feature(AWS_CPU_FEATURE_AVX512) &&
             aws_cpu_has_feature(AWS_CPU_FEATURE_VPCLMULQDQ))
-            s_crc32_fn_ptr = aws_checksums_crc32c_avx512;
+            s_crc32c_fn_ptr = aws_checksums_crc32c_avx512;
        else
 # endif
        if (aws_cpu_has_feature(AWS_CPU_FEATURE_SSE_4_2)) {

@dr-m
Copy link
Author

dr-m commented May 18, 2024

One more change needed:

Thank you. My bad, I do not currently have access to AVX512 capable hardware. Fixed in b06f512. I also fixed the ISA extension trouble in 5095452. I do not yet have a solution for skipping the compilation of the AVX512 code on older compilers (clang 7 or earlier, GCC 10 or older). The simplest solution that I can come up with would be to merge the AVX512 code to source/intel/crc_hw.c and then use #if magic to enable the AVX512 only if the compiler is recent enough. (I’d bravely assume that there are no other compilers than GCC, MSVC, and clang and its derivatives.) That is what I ended up doing in MariaDB/server@9ec7819.

The inaccurate AWS_CPU_FEATURE_AVX512 still needs to be addressed.

@dr-m
Copy link
Author

dr-m commented May 23, 2024

I do not yet have a solution for skipping the compilation of the AVX512 code on older compilers (clang 7 or earlier, GCC 10 or older).

Well, I guess that this is actually covered by the AWS_HAVE_AVX512_INTRINSICS check in the top-level CMakeCache.txt as well as the corresponding #ifdef in the code. I would love to see some CI results.

The inaccurate AWS_CPU_FEATURE_AVX512 still needs to be addressed.

I would like to hear your opinion on this. We can also ignore this and assume that the combination of checking for AVX512F and VPCLMULQDQ is adequate, that is, all processors that implement those will also implement AVX512DQ, AVX512BW, and AVX512VL.

@dr-m
Copy link
Author

dr-m commented May 25, 2024

If https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 can be trusted, VPCLMULQDQ implies not only AVX512F but also AVX512VL, AVX512BW, AVX512DQ. Hence, testing for AWS_CPU_FEATURE_AVX512 and AWS_CPU_FEATURE_VPCLMULQDQ could be acceptable in practice.

I squashed all commits to 2e84a92.

This implementation is based on crc32_refl_by16_vclmul_avx512
in https://github.com/intel/intel-ipsec-mb/ with some optimizations.

Changes to CMakeLists.txt and source/intel/asm/crc32c_sse42_asm.c
are based on awslabs#72.

This also fixes a bug in aws_checksums_crc32c_hw() when 128-bit pclmul
is not available. crc_intrin_fn was being invoked on bytes instead
of 32-bit or 64-bit words. The aws-checksums-tests was extended to cover
all SIMD implementations.

Note: The availability of the Intel CRC-32C instructions is checked
as part of testing AWS_CPU_FEATURE_SSE_4_2. Both ISA extensions were
introduced in the Intel Nehalem microarchitecture.

For compiling this, https://github.com/awslabs/aws-c-common must be
installed and CMAKE_MODULE_PATH must point to it, e.g.:
cmake -DCMAKE_MODULE_PATH=/usr/local/lib/cmake.

The AWS_CPU_FEATURE_AVX512 currently only checks for AVX512F and not
other features that this implementation depends on:
AVX512VL, AVX512BW, AVX512DQ. According to
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
there currently exist no CPUs that would support VPCLMULQDQ without
supporting all those AVX512 features.

The architecture target evex512 is something that was introduced as
mandatory in GCC 14 and clang 18 as part of introducing the AVX10.1-512
target, which basically is a new name for a number of AVX512 features.
Older compilers do not recognize this target, but they do emit EVEX
encoded instructions.
@dr-m
Copy link
Author

dr-m commented Aug 20, 2024

I think that you should test your AVX512 detection logic by starting the Linux kernel with the noxsave option, so that any attempt to use AVX registers would result in SIGILL. I implemented this in dr-m/crc32_simd@e9e8c24 and revised some incorrect logic in MariaDB/server#3471.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants