[GPU] Shapeless convolution: groundwork for Stream-K support #2247

echeresh · 2024-12-11T02:10:08Z

The PR includes a number of changes to prepare for Stream-K support:

Added an IR statement for a "while" loop
Added SLM-based reorder pass
Added inject_dangling_let_stmts() pass to allow imperative IR generation (statement nesting is done automatically)
Extended v2 convolution loop nest to support dynamic bounds
Ported the zero-out kernel to use the kernel descriptor/params so that it can be reused
Introduced kernel_iface_t and var_manager_t to simplify access to kernel arguments (such as problem sizes and "magic" for integer division)
Introduced "primitive execution/creation plan" to create/execute kernels in a unified manner
Added handling of epilogue tiles in a separate IR builder to reduce register usage

echeresh · 2024-12-12T20:37:10Z

make test
disable device_cpu
enable device_gpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_reorder
enable benchdnn_sum
enable arch_xe2-lpg
enable arch_xe-hpg
enable arch_xe-hpc

echeresh · 2024-12-13T01:16:51Z

make test
disable device_cpu
enable device_gpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_reorder
enable benchdnn_sum
enable arch_xe2-lpg
enable arch_xe-hpg
enable arch_xe-hpc

rjoursler · 2024-12-13T20:54:56Z

src/gpu/intel/jit/ir/ir.hpp

+//       let y = (x + 1) {
+//         let z = (y + 1) {
+//           store(..., z)
+//         }


I don't think this is the right abstraction to use. The core issue is that the required variable scopes do not overlap in this nicely nested pattern, especially once we take common subexpression elimination into consideration. For example, consider the following structure:

let tmp1 = x;
let tmp2 = y;
let tmp3 = x + y
// Only tmp3 after this point.

I think there is a more general method that we can use to accomplish a similar task. What if we add an unlet operation to explicitly deallocate variables after their last use? Writing a pass to inject these operations should be fairly straightforward via Linear Scan. This would allow us to transform your example to

let x = 1
let y = x + 1
unlet(x)
let z = y + 1
unlet(y)
store(..., z)
unlet(z)

We would also be able to handle the example I provided, turning it into

let tmp1 = x;
let tmp2 = y;
let tmp3 = x + y
unlet(tmp1)
unlet(tmp2)
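The unlet injection proposed above can be sketched for the simplest case, a straight-line statement sequence with no loops or branches. This is only an illustration of the last-use idea, not the oneDNN IR: `flat_stmt_t` and `inject_unlets()` are hypothetical names, statements are plain strings, and each variable is assumed to be defined once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical flat statement for illustration; not an actual oneDNN IR type.
struct flat_stmt_t {
    std::string def; // variable introduced by a let ("" if none)
    std::vector<std::string> uses; // variables read by this statement
    std::string text; // printable form of the statement
};

// Emits "unlet(v)" right after the last statement that defines or reads v.
// Only variables defined inside the sequence are released.
std::vector<std::string> inject_unlets(const std::vector<flat_stmt_t> &stmts) {
    std::unordered_map<std::string, std::size_t> last_use;
    std::vector<std::string> def_order;
    for (std::size_t i = 0; i < stmts.size(); i++) {
        // A read extends the lifetime of an already defined variable.
        for (const auto &u : stmts[i].uses)
            if (last_use.count(u)) last_use[u] = i;
        if (!stmts[i].def.empty()) {
            if (!last_use.count(stmts[i].def))
                def_order.push_back(stmts[i].def);
            last_use[stmts[i].def] = i;
        }
    }
    std::vector<std::string> out;
    for (std::size_t i = 0; i < stmts.size(); i++) {
        out.push_back(stmts[i].text);
        // Release variables in definition order to keep the output stable.
        for (const auto &v : def_order)
            if (last_use[v] == i) out.push_back("unlet(" + v + ")");
    }
    return out;
}
```

Fed the `x`, `y`, `z` chain from the comment, this yields the same sequence as the transformed example: each `unlet` follows the last statement that reads the variable. A real pass would also have to handle loops and conditionals, where a single last-use index is not enough.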

echeresh · 2024-12-13

I agree with you on this - though I think it's not really related. What you describe is an IR limitation - we don't have a construct like unlet to discard a variable at an arbitrary point.

The main benefit of the change in this PR is that it allows switching to a more imperative style of JITting, similar to a high-level language. Otherwise we have to juggle statements, adding nesting here and there.
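To illustrate what "statement nesting is done automatically" could mean, here is a minimal sketch of turning a flat, imperatively built statement list into nested let scopes. It is not the actual `inject_dangling_let_stmts()` implementation: `sketch_stmt_t` and `nest_dangling_lets()` are hypothetical, and a "dangling" let is assumed to be a let with an empty body whose scope should extend to the end of the enclosing sequence.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical statement node for illustration; not an actual oneDNN IR type.
struct sketch_stmt_t {
    std::string text;
    bool is_let;
    std::vector<sketch_stmt_t> body; // filled only for let statements
};

// Walks the flat list back to front. Every let with an empty body (a
// "dangling" let) adopts all statements that follow it as its body.
std::vector<sketch_stmt_t> nest_dangling_lets(
        const std::vector<sketch_stmt_t> &flat) {
    std::vector<sketch_stmt_t> tail;
    for (auto it = flat.rbegin(); it != flat.rend(); ++it) {
        sketch_stmt_t s = *it;
        if (s.is_let && s.body.empty()) {
            s.body = std::move(tail);
            tail.clear();
            tail.push_back(std::move(s));
        } else {
            tail.insert(tail.begin(), std::move(s));
        }
    }
    return tail;
}

// Prints the tree in the brace-nested form used in the ir.hpp comment.
std::string to_string(const std::vector<sketch_stmt_t> &seq, int depth = 0) {
    std::string out;
    const std::string pad(2 * depth, ' ');
    for (const auto &s : seq) {
        if (s.is_let)
            out += pad + s.text + " {\n" + to_string(s.body, depth + 1) + pad
                    + "}\n";
        else
            out += pad + s.text + "\n";
    }
    return out;
}
```

The caller only appends statements in program order; the nesting seen in the `ir.hpp` comment is produced by the pass afterwards.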

As for the above limitation, I think it's good to keep it in mind - to know that we have a way to reduce register pressure. But since this behavior has been there from the very beginning and has not been fixed yet, maybe it's not that restrictive.

rjoursler · 2024-12-16

Ok, I see what you are trying to accomplish, although I still don't think this is the right method. To some extent, I think our IR is unnecessarily complicated because we create a scope with every statement object. This is unnecessary and causes complications, such as making it difficult to reorder or inject statements, which is ultimately the reason for this injector.

As an alternative, we can represent scopes as an explicit IR object, and, moreover, this object already exists as stmt_seq_t. If we did this, we would remove the need for injectors like this one and remove a mechanism for representing the equivalent IR tree in different ways. As a result, it would be simpler to create and transform the IR statements. The main reason we cannot do this is that our register resource management is tied to the implicit scopes. By introducing a deallocation statement, and a pass to inject these deallocations, we can remove this restriction.

I don't think doing this will be significantly more work than implementing an injector like this one. It mostly boils down to implementing a (naive?) deallocation injector, and removing all the body elements of the various statement objects. Since our IR already contains stmt_seq_t, most IR passes should need minimal modification. We can even delay actually injecting deallocations until the final IR pass for simplicity. There might be a couple of nuances around counting registers for CSE, but I think it would be straightforward to generate a naive solution by doing something like:

get_grf_count(stmt) {
    stmt = inject_dealloc(stmt)
    grf_count_visitor(stmt)
}

although a fused injector/counter would be marginally faster.
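The naive two-step counter outlined above could look roughly like the following for a straight-line sequence. This is a sketch under stated assumptions, not oneDNN code: `op_t` and `op_kind_t` are hypothetical, the function names simply mirror the pseudocode, register sizes are supplied by the caller, and each variable is assumed to be defined once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical flat operation for illustration; not an actual oneDNN IR type.
enum class op_kind_t { let, unlet, other };

struct op_t {
    op_kind_t kind;
    std::string var; // variable defined (let) or released (unlet)
    int regs; // registers occupied by var
    std::vector<std::string> uses; // variables read by this operation
};

// Step 1: insert an unlet after the last operation that touches a variable.
std::vector<op_t> inject_dealloc(const std::vector<op_t> &ops) {
    std::unordered_map<std::string, std::size_t> last_use;
    std::unordered_map<std::string, int> regs;
    std::vector<std::string> def_order;
    for (std::size_t i = 0; i < ops.size(); i++) {
        for (const auto &u : ops[i].uses)
            if (last_use.count(u)) last_use[u] = i;
        if (ops[i].kind == op_kind_t::let) {
            def_order.push_back(ops[i].var);
            last_use[ops[i].var] = i;
            regs[ops[i].var] = ops[i].regs;
        }
    }
    std::vector<op_t> out;
    for (std::size_t i = 0; i < ops.size(); i++) {
        out.push_back(ops[i]);
        for (const auto &v : def_order)
            if (last_use[v] == i)
                out.push_back(op_t {op_kind_t::unlet, v, regs[v], {}});
    }
    return out;
}

// Step 2: walk the sequence and track the peak number of live registers.
int grf_count_visitor(const std::vector<op_t> &ops) {
    int live = 0, peak = 0;
    for (const auto &op : ops) {
        if (op.kind == op_kind_t::let) {
            live += op.regs;
            peak = std::max(peak, live);
        } else if (op.kind == op_kind_t::unlet) {
            live -= op.regs;
        }
    }
    return peak;
}

int get_grf_count(const std::vector<op_t> &ops) {
    return grf_count_visitor(inject_dealloc(ops));
}
```

For the `x`, `y`, `z` chain with one register per variable, the peak is 2, because `x` is released before `z` is defined. Under purely nested scopes all three variables would still be live at the store. A fused version would track `live` while computing last uses instead of materializing the unlet operations.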

echeresh · 2024-12-18

@rjoursler I agree: ideally this should be implemented in a different way. I'm documenting an IR update/redesign plan (including for future architectures) and have added the alloc/let update there.
Looking at it locally, though, this modification follows the injection mechanism already used for alloc statements, so it's not new. As we already have one for allocations, having another for let statements looks like a simple way to address the same limitation in the IR. Redesigning the IR in this regard is a more significant effort, IMO better done separately.

echeresh added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Dec 11, 2024

echeresh requested a review from a team as a code owner December 11, 2024 02:10

echeresh force-pushed the echeresh/streamk-groundwork branch from ba3d3ea to 0efc119 Compare December 11, 2024 18:45

echeresh requested a review from a team as a code owner December 11, 2024 18:45

echeresh changed the base branch from echeresh/metadata to main December 11, 2024 18:45

echeresh force-pushed the echeresh/streamk-groundwork branch from 0efc119 to c3b1677 Compare December 12, 2024 00:08

echeresh changed the title ~~WIP: [GPU] Shapeless convolution: groundwork for Stream-K support~~ [GPU] Shapeless convolution: groundwork for Stream-K support Dec 12, 2024

echeresh force-pushed the echeresh/streamk-groundwork branch from c3b1677 to d38c36e Compare December 12, 2024 20:35

echeresh added 10 commits December 12, 2024 12:36

xe: conv: remove unused resource

6cb33e6

xe: ir: simplify: skip ternary ops in nary op form

0746d9a

xe: ir: support initial offsets in split_to_linear()

8ec6035

xe: ir: add missing type handling

9cbc794

xe: ir: add min/max expression functions

2ce1fb9

xe: jit: do not inject send headers if already present

cfeec7d

xe: jit: introduce while IR statement

6f04543

xe: jit: codegen: allow unused kernel arguments

a839a10

xe: conv_v2: bridge: add layout/grid convertors

f850e93

xe: jit_v2: add atomic_fadd support

da65df2

echeresh force-pushed the echeresh/streamk-groundwork branch from d38c36e to e5206fc Compare December 12, 2024 20:37

echeresh added 10 commits December 12, 2024 16:52

xe: jit: refactor kernel_info

7c6d0d1

xe: ir: introduce inject_dangling_let_stmts()

ed7907d

xe: conv_v2: refactor loop_nest

bb4a7ab

xe: conv: port zero_out kernel to reusable abstractions

0cbe99b

xe: conv_v2: introduce primitive plan and var manager

bb76582

xe: conv_v2: reduce GRF usage with epilogue

2bc97e9

xe: conv_v2: styling: bia -> bias for consistency

095d787

xe: conv_v2: add GRF reorder via SLM support

41a1d45

xe: conv_v2: fix loop order for backward by weights

c863a15

xe: conv_v2: add post-op layout check

fd8a1db

echeresh force-pushed the echeresh/streamk-groundwork branch from e5206fc to fd8a1db Compare December 13, 2024 01:16

rjoursler reviewed Dec 13, 2024

echeresh commented Dec 11, 2024 (edited)

echeresh commented Dec 12, 2024

echeresh commented Dec 13, 2024

rjoursler Dec 13, 2024 (edited)

echeresh Dec 13, 2024

rjoursler Dec 16, 2024 (edited)

echeresh Dec 18, 2024
