Canonicalization of GEPs to i8 #1292

alan-baker · 2024-01-29T20:15:57Z

LLVM now canonicalizes some (constant) GEPs to perform address calculations in terms of bytes. This is part of larger change of moving towards ptradd instead of getelementptr. This will require changes in clspv to keep pace.

At the moment I'm seeing different failures:

unit tests with incorrect codegen expectations (~115 per arch)
cts failures that look functionally incorrect

For example, in test/CommonBuiltins/min/half2_fmin.cl the loads are broken down into 4 1-byte loads, a vector creation, and a bitcast from v4uchar to v2half. This is problematic for a few reasons:

Not all devices support byte addressing and clspv now requires it.
It relies on downstream drivers to properly coalesce memory accesses. This might be ok, but I don't have performance numbers that are easily accessible one way or the other.
It muddies the IR. This isn't just an aesthetic concern. There is a binary size increase in these cases due to more instructions getting used.

CTS failures:

basic/vstore_local (fails char2 and uchar2)
relationals/shuffle_copy
relationals/shuffle_function_call
relationals/shuffle_built_in
relationals/shuffle_built_in_dual_input
relationals/shuffle_array_cast

The likely solution is a change in type inference that can find a better type than i8 by looking at the GEP's users (right now it stops at any GEP).

The text was updated successfully, but these errors were encountered:

alan-baker · 2024-01-30T15:17:22Z

I have a local hack that undoes this in certain circumstances. That makes the CTS pass, but this is likely uncovering a bug where we are generating the wrong addressing logic already that is just being exposed by the new code generation.

- fix GetIdxsForTyFromOffset when called on unsized types - rework runOnUpgradeableConstantCasts to be more versatile/general - add a case in runOnImplicitGEP for implicit cast between clspvResourceOrLocal and GEP - fix runOnAllocaNotAliasing to avoid lowering alloca that do not need to be lowered - Update tests - Some tests need more work and are marked FAIL Ref google#1292

rjodinchr · 2024-02-01T16:01:56Z

This change in LLVM is negatively impacting performance on ChromeOS. Between 10% to 30% regression depending on the various benchmark we have.

* update LLVM & rework runOnUpgradeableConstantCasts - fix GetIdxsForTyFromOffset when called on unsized types - rework runOnUpgradeableConstantCasts to be more versatile/general - add a case in runOnImplicitGEP for implicit cast between clspvResourceOrLocal and GEP - fix runOnAllocaNotAliasing to avoid lowering alloca that do not need to be lowered - Update tests - Some tests need more work and are marked FAIL Ref #1292 * remove debug code and add dce to clean a test

The new code is correct. The constant indexing into an array of arrays of floats collapses into a constant index of an array of floats. Ref google#1292

…cl (#1298) The new code is correct. The constant indexing into an array of arrays of floats collapses into a constant index of an array of floats. Ref #1292

alan-baker · 2024-02-05T15:27:18Z

Clspv will generate invalid SPIR-V some cases due to this issue. Take test/Coherent/frexp.cl for example. When compiled with -cl-native-math the resulting SPIR-V generates the following validation error:

error: line 78: GLSL.std.450 Frexp: expected operand Exp data type to be a 32-bit int scalar or vector type
  %50 = OpExtInst %float %49 Frexp %45 %48

The data + 1 results in gep canonicalization happening and clspv pick i8 as the underlying type for data. This is a pre-existing issue that is triggered more frequently now.

I plan to update the coherent tests so here is the source for reference:

kernel void foo(global int *data, global float* x) {
  float y = data[0];
  barrier(CLK_GLOBAL_MEM_FENCE);
  *x = frexp(y, data + 1);
}

refs google#1292 * Convert Coherent tests from .cl to .ll * No longer tests the whole flow and instead is targeted at the SPIRVProducer only * more robust since it is more specific

refs #1292 * Convert Coherent tests from .cl to .ll * No longer tests the whole flow and instead is targeted at the SPIRVProducer only * more robust since it is more specific

alan-baker · 2024-02-05T19:52:16Z

Thinking more about a general solution I think clspv needs to distinguish between the two needs for type inference:

Data type of a particular instruction
Data type of the SPIR-V interface

Type inference can't just look past geps in the first case since we need to know how the type is interpreted for address calculations, but for the second case it's clear we need to have a more holistic view of how the interface is used.

refs google#1292 * Rewrite as more targeted tests to reduce potential perturbations

refs #1292 * Rewrite as more targeted tests to reduce potential perturbations

As the last update of llvm introduced the canonicalization of GEPs to i8, we have a case where the user of a phi nodes is now a i8. This lead the inferering of the phi type to be lowered to i8 while it could have a higher type. To avoid that, track possible upgrade of GEPs of PHI and upgrade them if all the user of the PHI can use the new higher type. This is fixing the performance regression for the ChromeOS workloads. Ref google#1292

refs google#1292 refs google#1303 * Fixed a bug in remove unused arguments * attributes on parameters weren't handled parameters * rewrite tests to be more targeted at the actual feature

refs #1292 refs #1303 * Fixed a bug in remove unused arguments * attributes on parameters weren't handled parameters * rewrite tests to be more targeted at the actual feature

* upgrade GEPs of PHI when possible As the last update of llvm introduced the canonicalization of GEPs to i8, we have a case where the user of a phi nodes is now a i8. This lead the inferering of the phi type to be lowered to i8 while it could have a higher type. To avoid that, track possible upgrade of GEPs of PHI and upgrade them if all the user of the PHI can use the new higher type. This is fixing the performance regression for the ChromeOS workloads. Ref #1292 * simplify test * remove 1292-phi_users.cl --------- Co-authored-by: Alan Baker <[email protected]>

ref google#1292 * Update tests to remove XFAILs

ref #1292 * Update tests to remove XFAILs

dneto0 · 2024-02-09T12:33:14Z

#1306 has an interesting line here: https://github.com/google/clspv/pull/1306/files#diff-d82932682c3cce860c8c728b57e2a68e6b04593c1edb206934c340665fb71f29L16

kernel void bar(global int *A, global int *B, int n) { apple(A + 1, B, n); }

That PR changes the argument from A+1 to A+n, where n is non-constant.

This is an interesting distinction from the perspective of a general solution to the type inference problem. (Because the constant multiple of the element size is folded into the constant offset, and eventually has to be disentangled from a member offset at a deeper level of the type hierarchy. The folding loses information that has to be recovered somehow, in general.)

Any general solution will need to have test cases that do add a constant to an decayed array in this way.

alan-baker · 2024-02-09T14:43:57Z

#1306 has an interesting line here: https://github.com/google/clspv/pull/1306/files#diff-d82932682c3cce860c8c728b57e2a68e6b04593c1edb206934c340665fb71f29L16
kernel void bar(global int *A, global int *B, int n) { apple(A + 1, B, n); }
That PR changes the argument from A+1 to A+n, where n is non-constant.

This is an interesting distinction from the perspective of a general solution to the type inference problem. (Because the constant multiple of the element size is folded into the constant offset, and eventually has to be disentangled from a member offset at a deeper level of the type hierarchy. The folding loses information that has to be recovered somehow, in general.)

Any general solution will need to have test cases that do add a constant to an decayed array in this way.

I expect in the future that the solution applied to that test will also fall over and LLVM will instead produce something akin to:

%mul = mul i32 %n, %size_of_A
%gep = getelementptr i8, ptr addrspace(1) %A, %mul

BukeBeyond · 2024-03-16T17:42:05Z

At the beginning of this year, we implemented the experimental i8 Canonicalization of GEPs in our project. Initially, this feature simply converted offsets to i8, maintaining the original sizes of loads and stores without imposing a speed penalty. Additionally, we incorporated the fix from Issue #1269 and a custom optimization sequence, particularly running Clspv-opt with lower-addrspacecast and replace-llvm-intrinsics passes prior to Clspv.

The impact has been remarkable! Since these changes, Clspv has achieved unprecedented stability. We've successfully compiled a wide array of algorithms and higher-level language constructs from c++26 and b, including custom vectors, templated tensors, lambda chains, recursive filters, and ray tracing, all with almost no crashes from Clspv.

However, the main branches of Clspv and LLVM have seen significant changes recently, which initially prevented us from merging the latest commits without issues. Fortunately, this situation has improved, and we are now able to compile most of our projects using the current and unmodified version of Clspv.

Despite these advances, there is a new challenge: a great number of load operations have become fragmented into bytes, leading to an average 15% performance decrease, even on modern Nvidia hardware. This suggests a possible larger impact were it not for the mitigating effects of the drivers' compiler optimizations or hardwares' memory latency reduction mechanisms.

As engineers, we should strive to restore the original sizes of these loads and stores to minimize their instruction sequences. This will yield more succinct Spirv, and faster loading times as well.

Given the talent and dedication in this project, I am confident we will address this issue soon.

rjodinchr · 2024-03-17T09:42:27Z

@BukeBeyond At the moment it feels like all identified regressions have been addressed from ChromeOS point-of-view. Meaning that we do not have regressions on the benchmarks we run, nor on the OpenCL-CTS.

If you have some small examples using bytes access while it's clear it could coalesce, I'll be happy to have a look at them to try to solve any potential regression.

BukeBeyond · 2024-03-17T17:26:29Z

Fortunately, Romaric fixed this #1318 case of load fragmentation right away. He was able to do it within minutes, whereas the closer investigation took me a couple of hours. :-)

It was a specific case of i32 loads, where we use them extensively for Unified Addressing (and Pointer) simulation under Vulkan. Now, our ray tracer is back to full speed, and with a slightly smaller Spirv size to boot. :-)

There remains only a handful of cases of load fragmentation on our end, but they will take a bit more time to investigate and simplify.

It seems the problems of i8 GEPs and the upcoming ptradd have been mostly solved. Yet the great benefits of this evolution are already here.

We program very interactively, hundreds of compiles per hour.
When first experimenting with Clspv, we would get a few crashes per Hour.
After Romaric's bug fixes last August, Clspv's crash rate dropped to a few a Day.
After the Canonicalization of GEPs to i8, the crashes reduced to just a Few a Month!

So things are looking quite good for the future of Clspv!

Happy Saint Patrick's Day! May luck's gentle hand guide ye, and, of far greater worth, may the warmth of good kin and comrades be ever by your side! :-) 🍀🌈 🥇☘️

BukeBeyond · 2024-03-18T13:40:26Z

And here is the latest fragmentation find #1322.

This was referenced Feb 1, 2024

fix GetIdxsForTyFromOffset with local type #1294

Merged

make sure a gep before a load has matching type #1296

Merged

rjodinchr mentioned this issue Feb 1, 2024

update LLVM & rework runOnUpgradeableConstantCasts #1297

Merged

dneto0 mentioned this issue Feb 1, 2024

Fix expectations for PointerAccessChains/pointer_index_is_constant_1.cl #1298

Merged

alan-baker mentioned this issue Feb 5, 2024

Convert coherent tests to SPIRVProducer tests #1299

Merged

alan-baker added a commit to alan-baker/clspv that referenced this issue Feb 6, 2024

Rewrite variable pointers tests as SPIRVProducer tests

0bcd4f3

refs google#1292 * Rewrite as more targeted tests to reduce potential perturbations

alan-baker mentioned this issue Feb 6, 2024

Rewrite variable pointers tests as SPIRVProducer tests #1300

Merged

rjodinchr pushed a commit that referenced this issue Feb 6, 2024

Rewrite variable pointers tests as SPIRVProducer tests (#1300)

54e49d8

refs #1292 * Rewrite as more targeted tests to reduce potential perturbations

rjodinchr mentioned this issue Feb 7, 2024

upgrade GEPs of PHI when possible #1302

Merged

alan-baker mentioned this issue Feb 7, 2024

Rewrite UBO tests #1304

Merged

rjodinchr pushed a commit that referenced this issue Feb 8, 2024

Rewrite UBO tests (#1304)

5bc0eef

refs #1292 refs #1303 * Fixed a bug in remove unused arguments * attributes on parameters weren't handled parameters * rewrite tests to be more targeted at the actual feature

alan-baker added a commit to alan-baker/clspv that referenced this issue Feb 8, 2024

Fix remaining XFAILs

53787fd

ref google#1292 * Update tests to remove XFAILs

alan-baker mentioned this issue Feb 8, 2024

Fix remaining XFAILs for #1292 #1306

Merged

rjodinchr pushed a commit that referenced this issue Feb 9, 2024

Fix remaining XFAILs (#1306)

e7a0c9b

ref #1292 * Update tests to remove XFAILs

BukeBeyond mentioned this issue Mar 16, 2024

ReplacePointerBitcastPass introduces invalid IR with struct field of a struct #1270

Closed

BukeBeyond mentioned this issue Mar 21, 2024

Clspv Fragments access to global memory by the Smallest access size #1329

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Canonicalization of GEPs to i8 #1292

Canonicalization of GEPs to i8 #1292

alan-baker commented Jan 29, 2024 •

edited

Loading

alan-baker commented Jan 30, 2024

rjodinchr commented Feb 1, 2024

alan-baker commented Feb 5, 2024 •

edited

Loading

alan-baker commented Feb 5, 2024

dneto0 commented Feb 9, 2024

alan-baker commented Feb 9, 2024

BukeBeyond commented Mar 16, 2024 •

edited

Loading

rjodinchr commented Mar 17, 2024

BukeBeyond commented Mar 17, 2024 •

edited

Loading

BukeBeyond commented Mar 18, 2024

Canonicalization of GEPs to i8 #1292

Canonicalization of GEPs to i8 #1292

Comments

alan-baker commented Jan 29, 2024 • edited Loading

alan-baker commented Jan 30, 2024

rjodinchr commented Feb 1, 2024

alan-baker commented Feb 5, 2024 • edited Loading

alan-baker commented Feb 5, 2024

dneto0 commented Feb 9, 2024

alan-baker commented Feb 9, 2024

BukeBeyond commented Mar 16, 2024 • edited Loading

rjodinchr commented Mar 17, 2024

BukeBeyond commented Mar 17, 2024 • edited Loading

BukeBeyond commented Mar 18, 2024

alan-baker commented Jan 29, 2024 •

edited

Loading

alan-baker commented Feb 5, 2024 •

edited

Loading

BukeBeyond commented Mar 16, 2024 •

edited

Loading

BukeBeyond commented Mar 17, 2024 •

edited

Loading