[WIP] Arm®v9-A architecture SME2 SGEMM kernels #5011

AymenQ · 2024-12-09T17:47:51Z

Add implementation of SGEMM based on the Arm®v9-A architecture Scalable Matrix Extension (SME), using the Arm C Language Extensions (ACLE).

Includes addition of a new target, ARMV9SME, for generic SME2 targets. This new target inherits existing ARMV8SVE settings by default. It can only be built using an SME-capable toolchain such as GCC 14 or LLVM 19.

The SME2 kernel performs outer products on panels of A and B, accumulating into 2x2 inner blocks of C via the SME two-dimensional architectural register, ZA.
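To make the accumulation pattern concrete, here is a minimal scalar sketch in plain C. It is not the kernel from this PR and uses no SME intrinsics; the function name `sgemm_outer_product_ref` and the packed panel layout are assumptions chosen for illustration. It models a single accumulator block, where each step adds one rank-1 update, which is the role an outer-product instruction plays when accumulating into ZA.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference model of outer-product accumulation (not SME code).
 * a_panel: k columns of length m, stored contiguously (assumed packed A).
 * b_panel: k rows of length n, stored contiguously (assumed packed B).
 * c:       m x n accumulator, column-major with leading dimension ldc.
 * Step p adds the rank-1 update a_p * b_p^T to c. */
static void sgemm_outer_product_ref(size_t m, size_t n, size_t k,
                                    const float *a_panel,
                                    const float *b_panel,
                                    float *c, size_t ldc)
{
    assert(ldc >= m);
    for (size_t p = 0; p < k; ++p) {
        const float *a_col = a_panel + p * m;  /* column p of A */
        const float *b_row = b_panel + p * n;  /* row p of B */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i)
                c[i + j * ldc] += a_col[i] * b_row[j];
    }
}
```

In an SME kernel the two inner loops would collapse into a single outer-product instruction per step, with the accumulator held in ZA rather than in memory.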

Note: this is a WIP target. It is functional for SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not been updated to match the larger kernel size, so SYMM/TRMM tests are currently expected to fail in this WIP state.

martin-frbg · 2024-12-10T16:31:27Z

Thank you - I started working on this in assembly (based on the example in the developer docs) but predictably got lost :/
My initial code introduced a HAVE_SME flag that would get set during cpu (auto)detection in addition to ARMV8SVE or ARMV9, but I guess there is no practical difference either way. (The only place where the corresponding HAVE_SVE flag is used is in the DOT kernel.)
There will be no functional test failures in CI, as the best we are fielding at the moment is NeoverseV1 (I guess V2 is only fractionally more expensive now in AWS, so we could probably upgrade the Cirun job). The compile failures seen are with Apple's own version of Clang, which needs some help (via explicit casts) to disambiguate intrinsics that are available in 32- and 64-bit forms.

Add a new target, ARMV9SME, for Arm®v9-A architecture systems that support the Scalable Matrix Extension (SME) [1]. Initially inherits ARMV8SVE settings with updated compiler flags. This target can only be built with an SME-capable toolchain such as GCC 14 or LLVM 19.

Includes some initial FEAT_SME2 feature detection on Linux targets via hwcaps. Target is disabled in DYNAMIC_ARCH builds by default. This is intended as a base target for SME2 kernels.

[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
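For readers unfamiliar with hwcaps-based detection, the following is a rough sketch of the general approach, not the detection code added by this commit. The helper names are hypothetical, and the bit position 37 is assumed to match `HWCAP2_SME2` in the Linux arm64 uapi header `<asm/hwcap.h>`; it should be checked against the headers actually in use.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#endif

/* Assumed bit position of FEAT_SME2 in AT_HWCAP2 on arm64 Linux.
 * Verify against HWCAP2_SME2 in the kernel headers before relying on it. */
#define SKETCH_HWCAP2_SME2_BIT 37u

/* Return 1 if the given bit is set in a hwcap word, else 0. */
static int hwcap_has_bit(uint64_t caps, unsigned bit)
{
    assert(bit < 64);
    return (int)((caps >> bit) & 1u);
}

/* Hypothetical helper: query the auxiliary vector where available,
 * and report "not supported" on every other platform. */
static int sketch_cpu_has_sme2(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(AT_HWCAP2)
    return hwcap_has_bit((uint64_t)getauxval(AT_HWCAP2),
                         SKETCH_HWCAP2_SME2_BIT);
#else
    return 0;
#endif
}
```

Keeping the bit test separate from the `getauxval` call makes the logic checkable on machines without SME hardware, which matches the situation in the current CI.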

Add implementation of SGEMM based on the Arm®v9-A architecture Scalable Matrix Extension (SME) [1], using the Arm C Language Extensions (ACLE) [2]. Add SME2 compute & packing kernels for SGEMM and enable them under the ARMV9SME target.

The compute kernel performs outer products on panels of A and B, accumulating into 2x2 inner blocks of C via the SME two-dimensional architectural register, ZA.

The non-transpose packing kernel performs a copy into a contiguous buffer using SVE loads & stores in Streaming SVE mode. Streaming SVE is an execution mode introduced by SME that supports execution of SVE code with the SME defined vector length, known as the Streaming SVE vector length (SVL).

The transpose packing kernel performs on-the-fly transposition by utilizing horizontal & vertical tile slice access to the SME ZA register.

Includes an update to the driver to account for expanded inner block.

Note: this places the ARMV9SME target in WIP state. It is functional for SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not been updated to match the larger kernel size, so SYMM/TRMM tests are currently expected to fail in this WIP state.

[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
[2] https://arm-software.github.io/acle/main/acle.html
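As an illustration of what the transpose packing step has to achieve, here is a scalar sketch in plain C. It is not the packing kernel from this commit: the name `pack_transpose_ref`, the column-major source and the destination layout are assumptions. Where the SME kernel writes tile slices in one direction and reads them back in the other, this model performs the same data movement one element at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a transposing pack (not SME code).
 * src: rows x cols block, column-major, leading dimension lda.
 * dst: contiguous buffer of rows * cols floats. Element (i, j) of src
 *      is stored at dst[j + i * cols], so each source row becomes a
 *      contiguous run in the packed buffer. */
static void pack_transpose_ref(size_t rows, size_t cols,
                               const float *src, size_t lda, float *dst)
{
    assert(lda >= rows);
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            dst[j + i * cols] = src[i + j * lda];
}
```

A scalar version like this is also handy as a reference when testing a vectorized packing routine, since the two should produce identical buffers.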

AymenQ · 2024-12-11T19:44:36Z

@martin-frbg Thanks! The CI failures on macOS should hopefully be sorted now, though there might just be a timeout on the Windows run that needs re-running. Indeed, I don't think there are any systems available in the current CI setup that can functionally test this, but it's good to see the target build.

With regards to a HAVE_SME flag: I'm happy to change this PR's approach, though I thought for now it might be best to keep this implementation isolated under its own target.

Out of interest: once it's been through review, are we able to merge this code in as a WIP target (disabled by default) with some routines left non-functional, or would you rather it be left as an open PR for the time being? If we do merge this as-is, later PRs can fill out the missing functionality and improve performance.

martin-frbg · 2024-12-11T20:30:59Z

Thank you for fixing the Apple jobs. The one Windows job timing out is a semi-regular annoyance caused by the heterogeneity of the Azure cloud - sometimes the CI job gets scheduled on some old hardware that cannot complete the compilation within the allocated hour. I'm all for merging your work as soon as possible, and I'm currently trying to see if it is possible to separate the SSYMM and STRMM implementations from SGEMM, redirecting them to existing implementations. (There is ample precedent for TRMM, but I'm not sure if it can be made to work for SYMM too.)

AymenQ · 2024-12-12T15:39:12Z

Thanks, that makes sense!
I wasn't too sure about a clean way to isolate the SGEMM implementation. My understanding is that the usage of GEMM_KERNEL_[N,L,R,B] is widespread across the level 3 drivers, so it seemed difficult without introducing additional kernel types. How would you suggest separating out TRMM (since you mentioned there's precedent)? I'd be happy to give it a go.

martin-frbg · 2024-12-12T16:53:36Z

The idea is to set USE_TRMM=1 for your new target core in kernel/Makefile.L3 and specify a separate source file for the STRMMKERNEL in KERNEL.ARMV9SME, with the limitation that the TRMM kernel needs to use the same GEMM_UNROLL_M and GEMM_UNROLL_N parameters as its GEMM companion. There is a 1x8 strmm_kernel_sve, but I gather that it is not trivially easy to use it in streaming SVE mode without also changing its register use (?), so kernel/generic/trmmkernel4x8.c would be the one to try.

martin-frbg · 2024-12-13T15:14:51Z

Hmm, on second thought, I'm not sure if we can get at the non-SME copy routines we'd need for that generic TRMM kernel to work alongside the SME GEMM.

AymenQ force-pushed the armv9-a-sme branch from 36d73d7 to 356d03c on December 11, 2024 11:58

AymenQ added 2 commits on December 11, 2024 17:54

AymenQ force-pushed the armv9-a-sme branch from 356d03c to 3d282c9 on December 11, 2024 17:54
