[WIP] Arm®v9-A architecture SME2 SGEMM kernels #5011

Draft
wants to merge 2 commits into
base: develop

@AymenQ AymenQ commented Dec 9, 2024

Add implementation of SGEMM based on the Arm®v9-A architecture Scalable Matrix Extension (SME), using the Arm C Language Extensions (ACLE).

Includes the addition of a new target, ARMV9SME, for generic SME2 targets. This new target inherits the existing ARMV8SVE settings by default. It can only be built using an SME-capable toolchain such as GCC 14 or LLVM 19.

The SME2 kernel performs outer products on panels of A and B, accumulating into 2x2 inner blocks of C via the SME two-dimensional architectural register, ZA.

Note: this is a WIP target. It is functional for SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not been updated to match the larger kernel size, so SYMM/TRMM tests are currently expected to fail in this WIP state.

@martin-frbg
Collaborator

Thank you - I started working on this myself, but in assembly (based on the example in the developer docs), and predictably got lost :/
My initial code introduced a HAVE_SME flag that would get set during CPU (auto)detection in addition to ARMV8SVE or ARMV9, but I guess there is no practical difference either way. (The only place where the corresponding HAVE_SVE flag is used is in the DOT kernel.)
There will be no functional test failures in CI, as the best we are fielding at the moment is NeoverseV1 (I guess V2 is only fractionally more expensive now in AWS, so we could probably upgrade the Cirun job). The compile failures seen are with Apple's own version of Clang, which needs some help (via explicit casts) to disambiguate intrinsics that are available in 32- and 64-bit forms.

Add a new target, ARMV9SME, for Arm®v9-A architecture systems that
support the Scalable Matrix Extension (SME) [1].

Initially inherits ARMV8SVE settings with updated compiler flags. This
target can only be built with an SME-capable toolchain such as GCC 14 or
LLVM 19.

Includes some initial FEAT_SME2 feature detection on Linux targets via
hwcaps. Target is disabled in DYNAMIC_ARCH builds by default.
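
As an illustration of what hwcaps-based detection on Linux can look like, here is a minimal sketch. It is not the code from this PR. The `sketch_*` names are hypothetical, and the bit position used for FEAT_SME2 (bit 37 of `AT_HWCAP2`) is an assumption based on the Linux arm64 uapi header; check it against the kernel headers you build with.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#endif

/* Assumption: FEAT_SME2 is reported in bit 37 of AT_HWCAP2, as in the
 * Linux arm64 uapi header <asm/hwcap.h>. Verify against your headers. */
#define SKETCH_HWCAP2_SME2 (UINT64_C(1) << 37)

/* Pure helper so the bit test can be exercised on any host. */
static int sketch_hwcap2_has_sme2(uint64_t hwcap2) {
    return (hwcap2 & SKETCH_HWCAP2_SME2) != 0;
}

/* Returns 1 if the kernel reports SME2 for this CPU, 0 otherwise. */
static int sketch_detect_sme2(void) {
#if defined(__linux__) && defined(__aarch64__)
    return sketch_hwcap2_has_sme2((uint64_t)getauxval(AT_HWCAP2));
#else
    return 0; /* not an AArch64 Linux host: report "no SME2" */
#endif
}
```

Keeping the bit test separate from the `getauxval` call lets the logic be checked on hosts without SME2 hardware.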

This is intended as a base target for SME2 kernels.

[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2

Add implementation of SGEMM based on the Arm®v9-A architecture Scalable
Matrix Extension (SME) [1], using the Arm C Language Extensions (ACLE)
[2].

Add SME2 compute & packing kernels for SGEMM and enable them under the
ARMV9SME target.

The compute kernel performs outer products on panels of A and B,
accumulating into 2x2 inner blocks of C via the SME two-dimensional
architectural register, ZA.
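
The outer-product formulation can be sketched in portable C. This is a scalar model only, not the ACLE kernel from this commit. It assumes the 2x2 inner block means a 2x2 arrangement of the four 32-bit ZA tiles; `SK_TILE` is a hypothetical stand-in for the number of floats per streaming vector (SVL/32), and the panel layout is an assumption.

```c
#include <stddef.h>

enum { SK_TILE = 4, SK_BLK = 2 * SK_TILE }; /* SK_TILE stands in for SVL/32 */

/* C[SK_BLK x SK_BLK] (column-major, leading dimension ldc) +=
 *     alpha * A_panel * B_panel.
 * a_panel: k slices of SK_BLK floats (one column of the A panel per step).
 * b_panel: k slices of SK_BLK floats (one row of the B panel per step). */
static void sketch_sgemm_block(size_t k, float alpha,
                               const float *a_panel, const float *b_panel,
                               float *c, size_t ldc) {
    float acc[SK_BLK][SK_BLK] = {{0}}; /* models four ZA tiles as a 2x2 grid */

    for (size_t p = 0; p < k; ++p) {
        const float *a = a_panel + p * SK_BLK;
        const float *b = b_panel + p * SK_BLK;
        /* One rank-1 update of the whole block. On SME hardware this would
         * be one outer-product instruction per tile. */
        for (size_t i = 0; i < SK_BLK; ++i)
            for (size_t j = 0; j < SK_BLK; ++j)
                acc[i][j] += a[i] * b[j];
    }

    /* Scale the accumulator and add it into C. */
    for (size_t j = 0; j < SK_BLK; ++j)
        for (size_t i = 0; i < SK_BLK; ++i)
            c[i + j * ldc] += alpha * acc[i][j];
}
```

The point of the formulation is that each step along K touches one vector of A and one vector of B but updates every element of the accumulator, so the accumulator stays resident for the whole K loop.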

The non-transpose packing kernel performs a copy into a contiguous
buffer using SVE loads & stores in Streaming SVE mode. Streaming SVE is
an execution mode introduced by SME that supports execution of SVE code
with the SME defined vector length, known as the Streaming SVE vector
length (SVL).
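
A scalar model of such a packing copy might look like the following. It is a sketch under assumptions, not the kernel from this commit: `SK_PANEL_W` is a hypothetical panel width, the source is assumed column-major, and zero-filling past the matrix edge is an assumption that mimics how a predicated vector load treats inactive lanes.

```c
#include <stddef.h>

enum { SK_PANEL_W = 8 }; /* hypothetical panel width in floats */

/* Pack `rows` rows by k columns of a column-major matrix (leading
 * dimension lda) into dst as k contiguous slices of SK_PANEL_W floats. */
static void sketch_pack_panel(size_t rows, size_t k,
                              const float *src, size_t lda, float *dst) {
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < SK_PANEL_W; ++i) {
            /* Lanes past the matrix edge are zero-filled (assumption). */
            dst[p * SK_PANEL_W + i] = (i < rows) ? src[i + p * lda] : 0.0f;
        }
    }
}
```

Because each slice is already contiguous in the source, the inner loop corresponds to plain vector loads and stores, which is why no ZA access is needed for the non-transpose case.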

The transpose packing kernel performs on-the-fly transposition by
utilizing horizontal & vertical tile slice access to the SME ZA
register.
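
The idea of transposing through a tile can be modelled in portable C as below. This is only an illustration of the write-horizontal, read-vertical pattern, not the ACLE code from this commit; `SK_TDIM` is a hypothetical stand-in for SVL/32 and the plain array stands in for one 32-bit ZA tile.

```c
#include <stddef.h>

enum { SK_TDIM = 4 }; /* stands in for SVL/32 */

/* Transpose one SK_TDIM x SK_TDIM row-major block.
 * Source rows have stride lds, destination rows have stride ldd. */
static void sketch_transpose_tile(const float *src, size_t lds,
                                  float *dst, size_t ldd) {
    float tile[SK_TDIM][SK_TDIM]; /* models one 32-bit ZA tile */

    /* Write phase: each contiguous source row becomes a horizontal slice. */
    for (size_t r = 0; r < SK_TDIM; ++r)
        for (size_t c = 0; c < SK_TDIM; ++c)
            tile[r][c] = src[r * lds + c];

    /* Read phase: each vertical slice (a tile column) is stored
     * contiguously, which yields the transposed block. */
    for (size_t c = 0; c < SK_TDIM; ++c)
        for (size_t r = 0; r < SK_TDIM; ++r)
            dst[c * ldd + r] = tile[r][c];
}
```

Both memory accesses stay contiguous; the transposition happens entirely in the choice of slice direction.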

Includes an update to the driver to account for expanded inner block.

Note: this places the ARMV9SME target in WIP state. It is functional for
SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not
been updated to match the larger kernel size, so SYMM/TRMM tests are
currently expected to fail in this WIP state.

[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
[2] https://arm-software.github.io/acle/main/acle.html
@AymenQ
Author

AymenQ commented Dec 11, 2024

@martin-frbg Thanks! The CI failures on macOS should hopefully be sorted now, though there might just be a timeout on the Windows run that may need re-running. Indeed, I don't think there are any systems available in the current CI setup that can functionally test this, but it's good to see the target built.

With regards to a HAVE_SME flag: I'm happy to change this PR's approach, though I thought for now it might be best to keep this implementation isolated under its own target.

Out of interest: once it's been through review, are we able to merge this code in as a WIP target (disabled by default) with some routines left non-functional, or would you rather it be left as an open PR for the time being? If we do merge this as-is, later PRs can fill out the missing functionality and improve performance.

@martin-frbg
Collaborator

Thank you for fixing the Apple jobs. The one Windows job timing out is a semi-regular annoyance caused by the heterogeneity of the Azure cloud - sometimes the CI job gets scheduled on some old hardware that cannot complete the compilation within the allocated hour. I'm all for merging your work as soon as possible, and I'm currently trying to see if it is possible to separate the SSYMM and STRMM implementations from SGEMM, redirecting them to existing implementations. (There is ample precedent for TRMM, but I'm not sure if it can be made to work for SYMM too.)

@AymenQ
Author

AymenQ commented Dec 12, 2024

Thanks, that makes sense!
I wasn't too sure about a clean way to isolate the SGEMM implementation. My understanding is that the usage of GEMM_KERNEL_[N,L,R,B] is widespread across the level 3 drivers, so it seemed difficult without introducing additional kernel types. How would you suggest separating out TRMM (since you mentioned there's precedent)? I'd be happy to give it a go.

@martin-frbg
Collaborator

The idea is to set USE_TRMM=1 for your new target core in kernel/Makefile.L3 and specify a separate source file for the STRMMKERNEL in KERNEL.ARMV9SME, with the limitation that the TRMM one needs to use the same GEMM_UNROLL_M and N parameters as its GEMM companion. There is a 1x8 strmm_kernel_sve, but I gather that it is not trivially easy to use it in streaming SVE mode without also changing its register use (?), so kernel/generic/trmmkernel4x8.c would be the one to try.

@martin-frbg
Collaborator

Hmm, on second thought, I'm not sure if we can get at the non-SME copy routines we'd need for that generic TRMM kernel to work alongside the SME GEMM.
