Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AVX512 support for CRC32c implementation (on Intel platforms) #68

Closed
wants to merge 6 commits into from

Conversation

pbadari
Copy link
Contributor

@pbadari pbadari commented Jun 28, 2023

Issue #, if available:

Description of changes:
Add AVX512 optimization for CRC32c implementation when available.

AVX512 optimized CRC32c implementation provides 15-70% performance boost over the SSE42 implementation.

Performance Data:

Following are the performance run results current and new crc32c implementation of various buffer sizes.
(Each implementation is called 200 times on each buffer)

                      Buffersize                        SSE42( Current)                      AVX512 (new)
                                                            #of sec/cycle count              #of sec/cycle count    %improvement

Buffer size(K): 0.50 sse42-time: 5 (13786) avx512-time: 4 (8380) (39.21%)
Buffer size(K): 1.00 sse42-time: 8 (26364) avx512-time: 5 (11790) (55.28%)
Buffer size(K): 4.00 sse42-time: 29 (103973) avx512-time: 10 (33036) (68.23%)
Buffer size(K): 16.00 sse42-time: 111 (415911) avx512-time: 34 (124540) (70.06%)
Buffer size(K): 64.00 sse42-time: 448 (1674641) avx512-time: 143 (505714) (69.80%)
Buffer size(K): 128.00 sse42-time: 891 (3364668) avx512-time: 250 (913173) (72.86%)
Buffer size(K): 256.00 sse42-time: 1781 (6743055) avx512-time: 498 (1826869) (72.91%)
Buffer size(K): 512.00 sse42-time: 3566 (13507649) avx512-time: 992 (3656864) (72.93%)
Buffer size(K): 1024.00 sse42-time: 7134 (27031239) avx512-time: 2031 (7319223) (72.92%)
Buffer size(K): 4096.00 sse42-time: 38736 (146790024) avx512-time: 32525 (123236655) (16.05%)
Buffer size(K): 8192.00 sse42-time: 108094 (409707156) avx512-time: 73586 (278907186) (31.93%)
Buffer size(K): 16384.00 sse42-time: 146284 (554442499) avx512-time: 97824 (370768645) (33.13%)
Buffer size(K): 32768.00 sse42-time: 251401 (952912617) avx512-time: 183591 (695846812) (26.98%)
Buffer size(K): 65536.00 sse42-time: 522123 (1979099475) avx512-time: 400916 (1519591618) (23.22%)
Buffer size(K): 131072.00 sse42-time: 1586697 (6013914985) avx512-time: 1193713 (4524356415) (24.77%)
Buffer size(K): 262144.00 sse42-time: 3971185 (15052170297) avx512-time: 3205870 (12151303651) (19.27%)
Buffer size(K): 524288.00 sse42-time: 8858393 (33576763915) avx512-time: 7368324 (27929180421) (16.82%)
Buffer size(K): 1048576.00 sse42-time: 18236703 (69127153546) avx512-time: 15343947 (58161536028) (15.86%)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@TingDaoK TingDaoK requested a review from JonathanHenson July 10, 2023 18:03
@JonathanHenson
Copy link
Contributor

Hi @pbadari thanks for this! I'm having a look and will provide comments and review as soon as possible.

@@ -66,9 +66,11 @@ if (USE_CPU_EXTENSIONS)
source_group("Source Files\\intel\\visualc" FILES ${AWS_ARCH_SRC})

elseif(AWS_ARCH_INTEL AND AWS_HAVE_GCC_INLINE_ASM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the inline check correct now? this pulled the inline asm and we really want to check that the intrinsics are available. IIRC there's instances where the assembler installed in the path is newer than the gcc version, would that drift slip through here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Work as expected. We tested on CentOS 8 using gcc 13.1 as 2.36 and Ubuntu 23.04 using gcc 12.2 as 2.4

-- Performing Test AWS_ARCH_INTEL - Success

-- Performing Test AWS_ARCH_ARM64

-- Performing Test AWS_ARCH_ARM64 - Failed

-- Performing Test AWS_ARCH_ARM32

-- Performing Test AWS_ARCH_ARM32 - Failed

-- Performing Test AWS_HAVE_GCC_INLINE_ASM

-- Performing Test AWS_HAVE_GCC_INLINE_ASM - Success

Regarding you question the answer yes, but we can add some checks for intrinsics.

    check_c_source_runs(“

        #include <nmmintrin.h>

        int main() {

            __m128i a = _mm_setzero_si128();

            return 0;

        }

    “ SSE42_SUPPORTED)

    check_c_source_runs(“

        #include <immintrin.h>

        int main() {

            __m512 a = _mm512_setzero_ps();

            return 0;

        }

    “ AVX512_SUPPORTED)

If this doesn't answer your question, please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not at all. We will need support for SSE42 and AVX512 intrinsics since AVX2 only supports 256-bit wide vector registers. Should we add sse42 and avx512 intrinsic support to that cmake file or directly to aws checksum CMakeLists.txt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to add it to aws-c-common so it can be used more broadly, and so we don't wind up with these scatter shot everywhere. Unfortunately that means we need to coordinate merges now. I'll go ahead and file the PR for aws-c-common and post it here. In the meantime we have some build-chain wiring to sort out anyways and we can get going on that.

For future readers of this, I suspect this might cause some Rust build issues as I doubt AVX512 is turned on by default in cargo and those build flags are usually global to the build process. So if your linker is complaining about it, try making sure AVX512 is turned on in cargo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re rust build questions: AVX512 should only be a codegen-time thing, I wouldn't expect it to depend on anything configured on the linker.

@@ -6,6 +6,12 @@
#include <aws/checksums/private/crc_priv.h>

#include <aws/common/cpuid.h>
#include <emmintrin.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mv source/intel/asm source/intel/intrin

then we need to see if the visualc code will actually work because i think that's just using the other intrinsics. Also, I think we can just delete the visualc code if we're not using any asm. Can we delete visualc and have cmake use this file for both and see what happens?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results moving the asm folder to intrin
mv source/intel/asm source/intel/intrin
[root@aws-spr4 build]# make
[ 16%] Building C object CMakeFiles/aws-checksums.dir/source/crc.c.o
[ 33%] Building C object CMakeFiles/aws-checksums.dir/source/crc_sw.c.o
[ 50%] Linking C static library libaws-checksums.a
[ 50%] Built target aws-checksums
[ 66%] Building C object tests/CMakeFiles/aws-checksums-tests.dir/test_runner.c.o
[ 83%] Building C object tests/CMakeFiles/aws-checksums-tests.dir/crc_test.c.o
[100%] Linking C executable aws-checksums-tests
/usr/bin/ld: ../libaws-checksums.a(crc.c.o): in function aws_checksums_crc32': crc.c:(.text+0x41): undefined reference to aws_checksums_crc32_hw'
/usr/bin/ld: ../libaws-checksums.a(crc.c.o): in function aws_checksums_crc32c': crc.c:(.text+0xa1): undefined reference to aws_checksums_crc32c_hw'
collect2: error: ld returned 1 exit status
make[2]: *** [tests/CMakeFiles/aws-checksums-tests.dir/build.make:115: tests/aws-checksums-tests] Error 1
make[1]: *** [CMakeFiles/Makefile2:882: tests/CMakeFiles/aws-checksums-tests.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

Removing source/intel/visualc folder

[root@aws-spr4 build]# make
[ 14%] Building C object CMakeFiles/aws-checksums.dir/source/crc.c.o
[ 28%] Building C object CMakeFiles/aws-checksums.dir/source/crc_sw.c.o
[ 42%] Building C object CMakeFiles/aws-checksums.dir/source/intel/asm/crc32c_sse42_asm.c.o
[ 57%] Linking C static library libaws-checksums.a
[ 57%] Built target aws-checksums
[ 71%] Building C object tests/CMakeFiles/aws-checksums-tests.dir/test_runner.c.o
[ 85%] Building C object tests/CMakeFiles/aws-checksums-tests.dir/crc_test.c.o
[100%] Linking C executable aws-checksums-tests
[100%] Built target aws-checksums-tests
[root@aws-spr4 build]# make install
Consolidate compiler generated dependencies of target aws-checksums
[ 57%] Built target aws-checksums
Consolidate compiler generated dependencies of target aws-checksums-tests
[100%] Built target aws-checksums-tests
Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/lib64/libaws-checksums.a
-- Up-to-date: /usr/local/include/aws/checksums/crc.h
-- Up-to-date: /usr/local/include/aws/checksums/exports.h
-- Installing: /usr/local/lib64/aws-checksums/cmake/static/aws-checksums-targets.cmake
-- Installing: /usr/local/lib64/aws-checksums/cmake/static/aws-checksums-targets-release.cmake
-- Installing: /usr/local/lib64/aws-checksums/cmake/aws-checksums-config.cmake

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my suspicion on the build failure by moving asm to intrin and updating cmake to find it and it still not building is the cmake cache. I usually just wipe out the build and re-run cmake to make sure it's not that.

The visualc stuff would have to be run on an msvc compiler to check. I have my windows machine up and i'll give it a look and report back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding move asm to intrin. Clean it again and rebuild it with no issues.

@JonathanHenson
Copy link
Contributor

@AGSaidi any chance you could have a look?

@AGSaidi
Copy link
Collaborator

AGSaidi commented Jul 11, 2023

@JonathanHenson I don't know i'm going to be a lot of help with the x86 intrinsics.

@JonathanHenson
Copy link
Contributor

@AGSaidi any chance you could have a look?

thanks, @bdonlan how about you, could you put a second set of eyes on the new asm replacement/AVX512 code?

@JonathanHenson
Copy link
Contributor

added the aws-c-common functionality here:

awslabs/aws-c-common#1041

@JonathanHenson
Copy link
Contributor

I'm working on this one now. I need to do some build surgery to handle that the SSE4.2 implementation on windows can't use asm, but the intrinsics version of AVX512 can be used. I'm going to split the files up and do some extra indirection.

@JonathanHenson
Copy link
Contributor

I just noticed we don't have any test cases with inputs large enough to trigger the avx512 path, i'll add some tests for it.

@jmklix
Copy link
Member

jmklix commented Feb 6, 2024

Closing in favor of new PR: #72

@jmklix jmklix closed this Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants