Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GPU] Shapeless convolution: groundwork for Stream-K support #2247

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

echeresh
Copy link
Contributor

@echeresh echeresh commented Dec 11, 2024

The PR includes a number of changes to prepare for Stream-K support:

  • Added an IR statement for a "while" loop
  • Added SLM-based reorder pass
  • Added inject_dangling_let_stmts() pass to allow imperative IR generation (statement nesting is done automatically)
  • Extended v2 convolution loop nest to support dynamic bounds
  • Ported zero-out kernel to use kernel descriptor/params to allow to reuse it
  • Introduced kernel_iface_t and var_manager_t to simplify access to kernel arguments (such as problem sizes and "magic" for integer division)
  • Introduced "primitive execution/creation plan" to create/execute kernels in a unified manner
  • Added handling of epilogue tiles in a separate IR builder to reduce register usage

@echeresh echeresh added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Dec 11, 2024
@echeresh echeresh requested a review from a team as a code owner December 11, 2024 02:10
@echeresh echeresh force-pushed the echeresh/streamk-groundwork branch from ba3d3ea to 0efc119 Compare December 11, 2024 18:45
@echeresh echeresh requested a review from a team as a code owner December 11, 2024 18:45
@echeresh echeresh changed the base branch from echeresh/metadata to main December 11, 2024 18:45
@echeresh echeresh force-pushed the echeresh/streamk-groundwork branch from 0efc119 to c3b1677 Compare December 12, 2024 00:08
@echeresh echeresh changed the title WIP: [GPU] Shapeless convolution: groundwork for Stream-K support [GPU] Shapeless convolution: groundwork for Stream-K support Dec 12, 2024
@echeresh echeresh force-pushed the echeresh/streamk-groundwork branch from c3b1677 to d38c36e Compare December 12, 2024 20:35
@echeresh echeresh force-pushed the echeresh/streamk-groundwork branch from d38c36e to e5206fc Compare December 12, 2024 20:37
@echeresh
Copy link
Contributor Author

make test
disable device_cpu
enable device_gpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_reorder
enable benchdnn_sum
enable arch_xe2-lpg
enable arch_xe-hpg
enable arch_xe-hpc

@echeresh echeresh force-pushed the echeresh/streamk-groundwork branch from e5206fc to fd8a1db Compare December 13, 2024 01:16
@echeresh
Copy link
Contributor Author

make test
disable device_cpu
enable device_gpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_reorder
enable benchdnn_sum
enable arch_xe2-lpg
enable arch_xe-hpg
enable arch_xe-hpc

// let y = (x + 1) }
// let z = (y + 1) {
// store(..., z)
// }
Copy link
Contributor

@rjoursler rjoursler Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right abstraction to use. The core issue is that the required variable scopes do not overlap in this nicely nested pattern, especially once we take into consider common subexpression elimination. For example, consider the following structure:

let tmp1 = x;
let tmp2 = y;
let tmp3 = x + y
// Only tmp3 after this point.

I think there is a more general method that we can use to accomplish a similar task. What if we add an unlet operation to explicitly deallocate variables after their last use? Writing a pass to inject these operations should be fairly straightforward via Linear Scan. This would allow us to transform your example to

let x = 1
let y = x + 1
unlet(x)
let z = y + 1
unlet(y)
store(...,z)
unlet(z) 

And now be able to handle the example I provided to

let tmp1 = x;
let tmp2 = y;
let tmp3 = x + y
unlet(tmp1)
unlet(tmp2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on this - though I think it's not really related. What you describe is an IR limitation - we don't have a construct like unlet to discard a variable at an arbitrary point.

The main benefit of the change in this PR is allowing to switch to more imperative JITting, similar to a high-level language. Otherwise we have to juggle with statements adding nesting here and there.

As for the above limitation, I think it's good to keep it in mind - to know that we have a way to reduce register pressure. But since this behavior has been there from the very beginning and it's not fixed yet - then maybe it's not that restricting.

Copy link
Contributor

@rjoursler rjoursler Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see what you are trying to accomplish, although I still don't think this is the right method. To some extent, I think our IR is unnecessarily complicated because we create a scope with every statement object. This is unnecessary and causes complications such is it being difficult to to reorder or inject statements and which is ultimately the reason for this injector.

As an alternative, we can represent scopes as as explicit IR object, and, moreover, this object already exists as stmt_seq_t. If we did this, we remove the need for injectors like this and remove a mechanism for representing the equivalent IR tree in different ways. As a result, it will be simpler to create and transform the IR statements. The main reason we cannot do this is that our register resource management is tied to the implicit scopes. By introducing a deallocation statement, and a pass to inject these deallocations, we can remove this restriction.

I don't think doing this will be significantly more work than implementing this injector like this as well. It mostly boils down to implementing a (naive?) deallocation injector, and removing all the body elements of the various statement objects. Since our IR already contains stmt_seq_t, most IR passes should need minimal modification, We can even delay actually injecting deallocations until the final IR pass for simplicity. There might be a couple of nuances around counting registers for CSE, but I think it would be straightforward to generate a naive solution by doing something like:

get_grf_count(stmt) {
     stmt = inject_dealloc(stmt)
     grf_count_visitor(stmt)
}

although a fused injector/counter would be marginally faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjoursler I agree: perfectly this should be implemented in a different way. I'm documenting IR update/redesign plan (including for future architectures), added alloc/let update there.
But looking locally, this modification follows the injection mechanism done for alloc statements so it's not new. As we already have one for allocations, having another for let statements looks like a simple way to address the same limitation in IR. Redesigning IR in this regard is a more significant effort, IMO better done separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants