-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generic: sycl: Introduce spec constants for sycl matmul #2250
base: main
Are you sure you want to change the base?
Conversation
#ifdef DNNL_SYCL_GENERIC | ||
// skip unimplemented configs for sycl impl | ||
|| is_gpu() | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed the long term goal. As the generic backend is still a work in progress and new features are being added to oneDNN it is useful to be able to skip some configurations that are not supported yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct, this has been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it a bit prematurely, its added back in until a conclusion is reached
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any unimplemented hits in current validation. It makes me think if there are any new unimplemented cases, they must be local to these changes. I can provide more guidance if there's a pointer on or a snapshot of what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change running the oneDNN compiled with generic vendor (sycl ref matmul used) causes benchdnn to fail with test unimplemented error when testing with runtime dims, spec constants require kernel to be compiled for the specific spec constants (specific sizes for src/dst/weights) so current approach does now support runtime dims.
We are wondering if this can should be reported as skipped by benchdnn, as it is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds to me like a matmul-specific skip.
I propose to introduce is_generic_gpu()
method identically to is_nvidia_gpu()
, and put a check here (before the if() {} section).
The change in this PR will silence ALL drivers with ANY unimplemented hit which is not the desired behavior in a long term.
The practice to mark unimplemented as skipped is to figure out the minimal scope all of them fit and skip only for it. In your case is sounds like any runtime dimensions involved. You may relate to this statement when crafting it
src/xpu/sycl/types.hpp
Outdated
@@ -146,6 +177,23 @@ struct md_t { | |||
#undef CHECK_AND_ASSIGN | |||
} | |||
|
|||
// This constructor is to be called inside the kernel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you control this copy constructor will be called only from the inside the kernel and not from other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this comment as this seems to be more misleading than helpful, the idea of the new constructor is to allow the spec constant struct to be convertible to the sycl version of the memory descriptor md_t
, not to discourage use outside of the kernel, which would be fine.
src/xpu/sycl/types.hpp
Outdated
@@ -146,6 +177,23 @@ struct md_t { | |||
#undef CHECK_AND_ASSIGN | |||
} | |||
|
|||
// This constructor is to be called inside the kernel | |||
md_t(md_t_spec_const &mimicker) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
md_t(md_t_spec_const &mimicker) { | |
md_t(const md_t_spec_const &mimicker) : data_type_(...), ... { | |
for () { | |
... | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the suggested change
dim32_t inner_nblks_; | ||
dims32_t inner_blks_; | ||
dims32_t inner_idxs_; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This struct cannot have a non trivial constructor, or any non trivial types, because...
// why exactly?
static_assert(std::is_trivial_v<md_t_spec_const>, "md_t_spec_const must be trivial");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into this a bit more DPC++ defines SYCL_DEVICE_COPYABLE=1
so we may be able to avoid the need of md_t_spec_const
if we specialize is_device_copyable_v<md_t>
to return true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the knowledge not available from reading the code better be put in a comment as a reference with exact spot to look or as a summary.
a7eed22
to
55e83bd
Compare
Address comments Aaand its back!
55e83bd
to
ac9c4aa
Compare
@@ -52,16 +55,56 @@ void ref_matmul_t::pd_t::init_conf() { | |||
for (const auto &mdw : {src_d, weights_d, dst_d, bias_d}) { | |||
if (mdw.has_runtime_dims()) { | |||
any_runtime_params_ = true; | |||
return; | |||
return status::unimplemented; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) please don't hesitate to use verbose macros for this kind of checks.
memory_desc_wrapper mdw(md); | ||
|
||
assert(mdw.format_kind() == format_kind::blocked); | ||
assert(mdw.ndims() <= max_dims); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that this function can fail, I would advocate to return a status instead of using asserts, and propagate by it in init_conf.
parallel_for(ctx, kernel_, [&](::sycl::handler &cgh) { | ||
matmul_kernel_fwd_t matmul_kernel(conf, cgh, ctx); | ||
matmul_kernel_fwd_t matmul_kernel(pd()->conf_, cgh, ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, using specialization constants will trigger recompilation during execute function.
TBH, I am a bit bothered by the precedent this would set, as multiple users rely on the property that after primitive creation, there should be no recurrent jitting overheads. Are we sure there is no other venue to avoid the argument size limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions so in the current patch the kernel compilation should only happen in the init
function.
From what I gathered there are no great solutions in the short term:
- Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.
- Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.
The long term solution is to use SYCL-RTC but this will be a large task. The spec constants are useful to be able to run more tests with the SYCL backend until SYCL-RTC is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions
I guess I missed that part. Given that this is a reference implementation, I would expect that at some point it would support runtime dimensions. Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?
Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.
IIUC, sycl_matmul_conf_t is used to pass argument shapes to the kernel right?
If so, it seems to use sycl_post_ops_t which is a union type, for which the biggest members are prelu_post_op with ~100 Bytes and binary_post_op with ~160 Bytes. So 5 post_ops gets us to ~800B, do I get that right?
(which is about 1/4 of argument size limit).
In any case, here are a few suggestions:
- for prelu_post_op, using dim_t[xpu::sycl::md_t::max_dims] instead of dims_t for the strides array would half that size.
- for binary_post_op, we could use a couple of arrays for dims and strides instead of using md_t?
- In sycl::md_t, we have about 50B related to blocking (inner_block, here). Do we actually use blocked layouts in generic sycl kernels? If not that would shape about 250B from post_ops, and ~500B from matmul_conf.
- separate post-ops that require extra md from the others, so that we can reduce the number of post-ops that require passing md to kernel (e.g. binary/prelu), while not restricting others (e.g. eltwise).
Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.
Correct 5 is already low. However it seems we are using a union for all postops. Splitting eltwise from binary/prelu postop could help limit only the number of binary/prelu post-ops (e.g. to 2 binary/prelu post-ops), while not restricting the number of eltwise post-ops. That could be an option as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed suggestion. From previous investigations it did not seem possible to reduce enough the arguments size but we could investigate again. I'm not confident we can remove enough members from the binary post op.
Just one clarification, the argument size limit is 2kB so the 5 post-ops would represents about 40% of the total size. We need to shave off a total of 282B or at least 57B per post-op. I haven't looked into this myself but this seems difficult just for the binary post op which is apparently the biggest issue here.
Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?
This sounds fine to me. From what I understand runtime dimensions are not that important if they are just meant for optimization purposes but the SYCL backend is not optimized yet. At this point it is easier to use spec-constant. This is just a workaround for us to be able to run tests and have a clearer picture of what operations are working as expected.
make test |
@@ -409,16 +410,23 @@ struct matmul_kernel_fwd_t { | |||
CTX_IN_SYCL_KERNEL_MEMORY(DNNL_ARG_ATTR_DROPOUT_PROBABILITY)) | |||
, po_args_(cgh, ctx, conf_.post_ops) {} | |||
|
|||
void operator()(::sycl::nd_item<1> item) const { | |||
void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const { | |
void operator()(::sycl::nd_item<1> item, const ::sycl::kernel_handler &kh) const { |
?
auto md_t_spec_const_pod_val = kh.get_specialization_constant< | ||
detail::matmul::md_t_spec_const_id>(); | ||
auto data_md = md_t_spec_const_pod_val.data_md_t; | ||
auto weights_md = md_t_spec_const_pod_val.weights_md_t; | ||
auto dst_md = md_t_spec_const_pod_val.dst_md_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto &
to remove unnecessary copies?
auto weights_md = md_t_spec_const_pod_val.weights_md_t; | ||
auto dst_md = md_t_spec_const_pod_val.dst_md_t; | ||
|
||
memory_tensor_t data_mem(data_, data_md); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: for matmul it's called src
, not data
.
init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d, | ||
weights_d, dst_d, bias_d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d, | |
weights_d, dst_d, bias_d); | |
CHECK(init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d, | |
weights_d, dst_d, bias_d)); |
auto init_status = init_conf(); | ||
|
||
return init_status; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto init_status = init_conf(); | |
return init_status; | |
return init_conf(); |
@@ -146,6 +177,21 @@ struct md_t { | |||
#undef CHECK_AND_ASSIGN | |||
} | |||
|
|||
md_t(const md_t_spec_const &mimicker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
md_t(const md_t_spec_const &mimicker) | |
md_t(const md_t_spec_const &other) |
Description
Currently the sycl implementation of matmul has a kernel argument size error when run on intel hardware, the size of the kernel arguments could not be trivially reduced, requiring a more complex solution. This PR introduces a fix for this by introducing spec constants for the source, weights and destination memory descriptors. This fix should be temporary while the best solution would be to leverage SYCL RTC to handle the argument size error, when available.
Additionally, due to the kernels being compiled for the spec constants we cannot have the values of source, weights and destination md change. This would mean that the sycl implementation cannot support runtime dimensions currently.
Checklist
General
make test
andmake test_benchdnn_*
) pass locally for each commit?