-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GPU] Shapeless convolution: groundwork for Stream-K support #2247
base: main
Are you sure you want to change the base?
Conversation
ba3d3ea
to
0efc119
Compare
0efc119
to
c3b1677
Compare
c3b1677
to
d38c36e
Compare
d38c36e
to
e5206fc
Compare
make test |
e5206fc
to
fd8a1db
Compare
make test |
// let y = (x + 1) } | ||
// let z = (y + 1) { | ||
// store(..., z) | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right abstraction to use. The core issue is that the required variable scopes do not overlap in this nicely nested pattern, especially once we take into consider common subexpression elimination. For example, consider the following structure:
let tmp1 = x;
let tmp2 = y;
let tmp3 = x + y
// Only tmp3 after this point.
I think there is a more general method that we can use to accomplish a similar task. What if we add an unlet
operation to explicitly deallocate variables after their last use? Writing a pass to inject these operations should be fairly straightforward via Linear Scan. This would allow us to transform your example to
let x = 1
let y = x + 1
unlet(x)
let z = y + 1
unlet(y)
store(...,z)
unlet(z)
And now be able to handle the example I provided to
let tmp1 = x;
let tmp2 = y;
let tmp3 = x + y
unlet(tmp1)
unlet(tmp2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you on this - though I think it's not really related. What you describe is an IR limitation - we don't have a construct like unlet
to discard a variable at an arbitrary point.
The main benefit of the change in this PR is allowing to switch to more imperative JITting, similar to a high-level language. Otherwise we have to juggle with statements adding nesting here and there.
As for the above limitation, I think it's good to keep it in mind - to know that we have a way to reduce register pressure. But since this behavior has been there from the very beginning and it's not fixed yet - then maybe it's not that restricting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see what you are trying to accomplish, although I still don't think this is the right method. To some extent, I think our IR is unnecessarily complicated because we create a scope with every statement object. This is unnecessary and causes complications such is it being difficult to to reorder or inject statements and which is ultimately the reason for this injector.
As an alternative, we can represent scopes as as explicit IR object, and, moreover, this object already exists as stmt_seq_t
. If we did this, we remove the need for injectors like this and remove a mechanism for representing the equivalent IR tree in different ways. As a result, it will be simpler to create and transform the IR statements. The main reason we cannot do this is that our register resource management is tied to the implicit scopes. By introducing a deallocation statement, and a pass to inject these deallocations, we can remove this restriction.
I don't think doing this will be significantly more work than implementing this injector like this as well. It mostly boils down to implementing a (naive?) deallocation injector, and removing all the body
elements of the various statement objects. Since our IR already contains stmt_seq_t
, most IR passes should need minimal modification, We can even delay actually injecting deallocations until the final IR pass for simplicity. There might be a couple of nuances around counting registers for CSE, but I think it would be straightforward to generate a naive solution by doing something like:
get_grf_count(stmt) {
stmt = inject_dealloc(stmt)
grf_count_visitor(stmt)
}
although a fused injector/counter would be marginally faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjoursler I agree: perfectly this should be implemented in a different way. I'm documenting IR update/redesign plan (including for future architectures), added alloc/let update there.
But looking locally, this modification follows the injection mechanism done for alloc statements so it's not new. As we already have one for allocations, having another for let statements looks like a simple way to address the same limitation in IR. Redesigning IR in this regard is a more significant effort, IMO better done separately.
The PR includes a number of changes to prepare for Stream-K support:
inject_dangling_let_stmts()
pass to allow imperative IR generation (statement nesting is done automatically)kernel_iface_t
andvar_manager_t
to simplify access to kernel arguments (such as problem sizes and "magic" for integer division)