generic: sycl: Introduce spec constants for sycl matmul #2250

Open · wants to merge 1 commit into main from atharva/sycl_matmul_conf_size

Conversation

ShanoToni (Contributor)

Description

Currently, the SYCL implementation of matmul hits a kernel argument size error when run on Intel hardware. The size of the kernel arguments could not be trivially reduced, so a more involved solution was required. This PR introduces a fix by using spec constants for the source, weights, and destination memory descriptors. The fix is intended to be temporary; the better long-term solution is to leverage SYCL RTC to handle the argument size error, once it is available.

Additionally, because the kernels are compiled for specific spec constant values, the source, weights, and destination memory descriptors cannot change. This means the SYCL implementation currently cannot support runtime dimensions.
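For reference, a minimal standalone illustration of the mechanism (generic SYCL 2020 with simplified stand-in types, not the actual oneDNN code): the descriptor values are baked into the kernel through a specialization constant instead of being passed as kernel arguments.

#include <sycl/sycl.hpp>

// Simplified stand-in for the real md_t_spec_const POD (the actual struct carries the
// src/weights/dst memory descriptors).
struct md_t_spec_const {
    int ndims;
    long dims[6];
};

constexpr sycl::specialization_id<md_t_spec_const> md_t_spec_const_id;

int main() {
    sycl::queue q;
    md_t_spec_const host_val{2, {128, 256}};
    long result = 0;
    {
        sycl::buffer<long, 1> buf(&result, 1);
        q.submit([&](sycl::handler &cgh) {
            // Host side: bake the descriptor values into the kernel via the spec constant
            // instead of passing them as (size-limited) kernel arguments.
            cgh.set_specialization_constant<md_t_spec_const_id>(host_val);
            auto acc = buf.get_access<sycl::access_mode::write>(cgh);
            cgh.single_task([=](sycl::kernel_handler kh) {
                // Device side: read the values back through the kernel_handler.
                auto mds = kh.get_specialization_constant<md_t_spec_const_id>();
                acc[0] = mds.dims[0] * mds.dims[1];
            });
        });
    }
    return result == 128 * 256 ? 0 : 1;
}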

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

@ShanoToni ShanoToni requested review from a team as code owners December 11, 2024 17:11
@github-actions github-actions bot added the platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic label Dec 11, 2024
#ifdef DNNL_SYCL_GENERIC
// skip unimplemented configs for sycl impl
|| is_gpu()
#endif
Contributor

IIRC, the expectation was that generic kernels cover oneDNN features completely; thus, all unimplemented cases should be properly reported by benchdnn and addressed by developers. Has this changed?

@mgouicem, @vpirogov


This is indeed the long-term goal. Since the generic backend is still a work in progress and new features are being added to oneDNN, it is useful to be able to skip some configurations that are not supported yet.

Contributor Author

That is correct, this has been removed.

Contributor Author

Removed it a bit prematurely; it's added back in until a conclusion is reached.

Contributor

I don't see any unimplemented hits in current validation. It makes me think that if there are any new unimplemented cases, they must be local to these changes. I can provide more guidance if there's a pointer to, or a snapshot of, what's going on.

Contributor Author

Without this change, running oneDNN compiled for the generic vendor (SYCL ref matmul used) causes benchdnn to fail with an unimplemented error when testing with runtime dims. Spec constants require the kernel to be compiled for specific values (specific sizes for src/dst/weights), so the current approach does not support runtime dims.
We are wondering if this should be reported as skipped by benchdnn, as it is expected.

Contributor

This sounds to me like a matmul-specific skip.
I propose to introduce an is_generic_gpu() method, analogous to is_nvidia_gpu(), and put a check here (before the if() {} section).

The change in this PR will silence ALL drivers with ANY unimplemented hit, which is not the desired behavior in the long term.
The practice for marking unimplemented cases as skipped is to figure out the minimal scope they all fit in and skip only that. In your case it sounds like any case with runtime dimensions involved. You may refer to this statement when crafting it.
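For illustration, a rough sketch of what such a matmul-specific skip could look like in the matmul driver's skip_unimplemented_prb(); is_generic_gpu() and prb_has_runtime_dims() are hypothetical names here, modeled on the existing is_nvidia_gpu() helper:

// Hypothetical sketch, not actual benchdnn code: skip runtime-dim cases only for the
// generic GPU backend instead of silencing every unimplemented hit.
if (is_generic_gpu() && prb_has_runtime_dims(prb)) {
    res->state = SKIPPED; // mark as skipped with the usual "case not supported" reason
    res->reason = skip_reason::case_not_supported;
    return;
}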

@@ -146,6 +177,23 @@ struct md_t {
#undef CHECK_AND_ASSIGN
}

// This constructor is to be called inside the kernel
Contributor

How do you control that this copy constructor will be called only from inside the kernel and not from other places?

Contributor Author

I have removed this comment as it seems more misleading than helpful. The idea of the new constructor is to allow the spec-constant struct to be convertible to the SYCL version of the memory descriptor, md_t, not to discourage use outside of the kernel, which would be fine.
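In other words, something along these lines becomes possible wherever the POD is at hand (variable names hypothetical):

// Convert the spec-constant POD back into the regular sycl md_t and use it as usual.
md_t data_md(spec_const_pod.data_md_t);
memory_tensor_t data_mem(data_, data_md);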

@@ -146,6 +177,23 @@ struct md_t {
#undef CHECK_AND_ASSIGN
}

// This constructor is to be called inside the kernel
md_t(md_t_spec_const &mimicker) {
Contributor

Suggested change
-    md_t(md_t_spec_const &mimicker) {
+    md_t(const md_t_spec_const &mimicker) : data_type_(...), ... {
+        for () {
+            ...
+        }
+    }

Contributor Author

Made the suggested change

dim32_t inner_nblks_;
dims32_t inner_blks_;
dims32_t inner_idxs_;
};
Contributor

// This struct cannot have a non trivial constructor, or any non trivial types, because...
// why exactly?
static_assert(std::is_trivial_v<md_t_spec_const>, "md_t_spec_const must be trivial");

Contributor Author

From the SYCL specification, we need the spec constant type to be device copyable (see the specification sections on specialization constants and device-copyable types).

I can clarify this in the comment if you think that makes sense, or remove it altogether?
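If the comment stays, one option would be to back it with the trait the requirement actually comes from, for example:

// Spec constants must be device copyable (SYCL 2020); document and enforce that here.
static_assert(::sycl::is_device_copyable_v<md_t_spec_const>,
        "md_t_spec_const is used as a specialization constant and must be device copyable");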


Looking into this a bit more, DPC++ defines SYCL_DEVICE_COPYABLE=1, so we may be able to avoid the need for md_t_spec_const if we specialize is_device_copyable_v<md_t> to return true.
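A sketch of that alternative (assuming md_t really is safe to byte-copy to the device; the namespace qualification would need to match wherever md_t actually lives):

// Opt md_t into SYCL's device-copyable trait so it could be used directly as a
// specialization constant, without the separate md_t_spec_const POD.
// (Requires <sycl/sycl.hpp> and <type_traits>.)
template <>
struct sycl::is_device_copyable<xpu::sycl::md_t> : std::true_type {};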

Contributor

Any knowledge that is not available from reading the code is better put in a comment, either as a reference to the exact spot to look at or as a summary.

@ShanoToni ShanoToni force-pushed the atharva/sycl_matmul_conf_size branch from a7eed22 to 55e83bd Compare December 12, 2024 15:23
@ShanoToni ShanoToni force-pushed the atharva/sycl_matmul_conf_size branch from 55e83bd to ac9c4aa Compare December 12, 2024 15:55
@@ -52,16 +55,56 @@ void ref_matmul_t::pd_t::init_conf() {
for (const auto &mdw : {src_d, weights_d, dst_d, bias_d}) {
if (mdw.has_runtime_dims()) {
any_runtime_params_ = true;
return;
return status::unimplemented;
Contributor

(minor) Please don't hesitate to use the verbose macros for this kind of check.
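For instance, if the usual dispatch macros are usable in this spot, the runtime-dims check above could become something like (sketch, assuming VDISPATCH_MATMUL and VERBOSE_RUNTIMEDIM_UNSUPPORTED are available here):

// Return unimplemented and report the reason through verbose output in one go.
VDISPATCH_MATMUL(!mdw.has_runtime_dims(), VERBOSE_RUNTIMEDIM_UNSUPPORTED);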

memory_desc_wrapper mdw(md);

assert(mdw.format_kind() == format_kind::blocked);
assert(mdw.ndims() <= max_dims);
Contributor

Given that this function can fail, I would advocate returning a status instead of using asserts, and propagating it in init_conf.
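Something along these lines (a sketch using the names from the snippet above):

memory_desc_wrapper mdw(md);

// Fail gracefully instead of asserting, and let init_conf() propagate the status.
if (mdw.format_kind() != format_kind::blocked) return status::unimplemented;
if (mdw.ndims() > max_dims) return status::unimplemented;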

parallel_for(ctx, kernel_, [&](::sycl::handler &cgh) {
matmul_kernel_fwd_t matmul_kernel(conf, cgh, ctx);
matmul_kernel_fwd_t matmul_kernel(pd()->conf_, cgh, ctx);
Contributor

As we discussed, using specialization constants will trigger recompilation during the execute function.
TBH, I am a bit bothered by the precedent this would set, as multiple users rely on the property that after primitive creation there are no recurrent jitting overheads. Are we sure there is no other avenue to avoid the argument size limitation?


I don't follow; we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We also discussed that it does not make sense to support runtime dimensions, so in the current patch the kernel compilation should only happen in the init function.
From what I gathered, there are no great solutions in the short term:

  • Reducing the number of arguments given to the kernel: we already use a struct as an argument in order to pass only the arguments that may be used in the SYCL kernels.
  • Reducing the number of post-ops supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further, but it may disable more tests. My understanding was that we didn't want to reduce this further.

The long-term solution is to use SYCL-RTC, but that will be a large task. The spec constants are useful for running more tests with the SYCL backend until SYCL-RTC is implemented.
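For reference, the generic SYCL 2020 way to make the "no JIT at execute" point explicit is to fix the spec-constant values on a kernel bundle and build it up front, so that execution only reuses the prebuilt binary. A simplified sketch (illustration only with stand-in names, not the oneDNN code):

#include <sycl/sycl.hpp>

struct md_pod { long m, n, k; };                 // stand-in for md_t_spec_const
constexpr sycl::specialization_id<md_pod> md_id; // stand-in for md_t_spec_const_id
class matmul_kernel;                             // kernel name

// Called once at primitive creation: the spec-constant values are fixed here and the
// kernel is JIT-compiled here, not at execution time.
sycl::kernel_bundle<sycl::bundle_state::executable> build_at_creation(
        sycl::queue &q, const md_pod &value) {
    auto input = sycl::get_kernel_bundle<sycl::bundle_state::input>(
            q.get_context(), {sycl::get_kernel_id<matmul_kernel>()});
    input.set_specialization_constant<md_id>(value);
    return sycl::build(input);
}

// Called at execute(): reuse the prebuilt bundle, no recompilation.
void run(sycl::queue &q,
        const sycl::kernel_bundle<sycl::bundle_state::executable> &exec) {
    q.submit([&](sycl::handler &cgh) {
        cgh.use_kernel_bundle(exec);
        cgh.single_task<matmul_kernel>([=](sycl::kernel_handler kh) {
            auto md = kh.get_specialization_constant<md_id>();
            (void)md; // ... compute with md.m, md.n, md.k ...
        });
    });
}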

@mgouicem (Contributor), Dec 17, 2024

> I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions

I guess I missed that part. Given that this is a reference implementation, I would expect that at some point it would support runtime dimensions. Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?

> Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.

IIUC, sycl_matmul_conf_t is used to pass argument shapes to the kernel, right?
If so, it seems to use sycl_post_ops_t, which is a union type whose biggest members are prelu_post_op at ~100 bytes and binary_post_op at ~160 bytes. So 5 post-ops gets us to ~800B, do I get that right?
(which is about 1/4 of the argument size limit)

In any case, here are a few suggestions:

  • for prelu_post_op, using dim_t[xpu::sycl::md_t::max_dims] instead of dims_t for the strides array would halve that size.
  • for binary_post_op, we could use a couple of arrays for dims and strides instead of using md_t?
  • in sycl::md_t, we have about 50B related to blocking (inner_block). Do we actually use blocked layouts in generic sycl kernels? If not, that would shave about 250B from post_ops, and ~500B from matmul_conf.
  • separate post-ops that require an extra md from the others, so that we can reduce the number of post-ops that require passing an md to the kernel (e.g. binary/prelu), while not restricting the others (e.g. eltwise).

> Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.

Correct, 5 is already low. However, it seems we are using a union for all post-ops. Splitting eltwise from the binary/prelu post-ops could help limit only the number of binary/prelu post-ops (e.g. to 2), while not restricting the number of eltwise post-ops. That could be an option as well.
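To make that last idea concrete, a hypothetical sketch of the split (names and limits are illustrative, not the current sycl_post_ops_t layout):

// Keep cheap eltwise post-ops in their own, larger array and cap only the post-ops that
// carry an md (binary/prelu), which dominate the kernel argument size.
struct sycl_eltwise_post_op_t { /* alg kind, alpha, beta: a few bytes */ };
struct sycl_md_post_op_t { /* binary/prelu: carries dims/strides, ~100-160 B */ };

struct sycl_post_ops_t {
    static constexpr int max_eltwise_post_ops = 5; // cheap, keep the higher cap
    static constexpr int max_md_post_ops = 2;      // expensive, cap these
    sycl_eltwise_post_op_t eltwise_ops[max_eltwise_post_ops];
    sycl_md_post_op_t md_ops[max_md_post_ops];
    int n_eltwise;
    int n_md;
};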


Thanks for the detailed suggestion. From previous investigations it did not seem possible to reduce the argument size enough, but we could investigate again. I'm not confident we can remove enough members from the binary post-op.
Just one clarification: the argument size limit is 2kB, so the 5 post-ops would represent about 40% of the total size. We need to shave off a total of 282B, or at least 57B per post-op. I haven't looked into this myself, but it seems difficult just for the binary post-op, which is apparently the biggest issue here.

> Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?

This sounds fine to me. From what I understand, runtime dimensions are not that important since they are mainly meant for optimization purposes, and the SYCL backend is not optimized yet. At this point it is easier to use spec constants. This is just a workaround for us to be able to run tests and get a clearer picture of which operations are working as expected.

@mgouicem (Contributor)

make test
disable device_cpu
enable device_gpu
enable thr_cuda
enable thr_generic
enable arch_rtx

@@ -409,16 +410,23 @@ struct matmul_kernel_fwd_t {
CTX_IN_SYCL_KERNEL_MEMORY(DNNL_ARG_ATTR_DROPOUT_PROBABILITY))
, po_args_(cgh, ctx, conf_.post_ops) {}

void operator()(::sycl::nd_item<1> item) const {
void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const {
Contributor

Suggested change
-    void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const {
+    void operator()(::sycl::nd_item<1> item, const ::sycl::kernel_handler &kh) const {

?

Comment on lines +420 to +424
auto md_t_spec_const_pod_val = kh.get_specialization_constant<
detail::matmul::md_t_spec_const_id>();
auto data_md = md_t_spec_const_pod_val.data_md_t;
auto weights_md = md_t_spec_const_pod_val.weights_md_t;
auto dst_md = md_t_spec_const_pod_val.dst_md_t;
Contributor

const auto & to remove unnecessary copies?
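Applied to the snippet above, the suggestion would read roughly:

// Bind by const reference so the md_t values are not copied out of the spec-constant POD.
const auto &md_t_spec_const_pod_val = kh.get_specialization_constant<
        detail::matmul::md_t_spec_const_id>();
const auto &data_md = md_t_spec_const_pod_val.data_md_t;
const auto &weights_md = md_t_spec_const_pod_val.weights_md_t;
const auto &dst_md = md_t_spec_const_pod_val.dst_md_t;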

auto weights_md = md_t_spec_const_pod_val.weights_md_t;
auto dst_md = md_t_spec_const_pod_val.dst_md_t;

memory_tensor_t data_mem(data_, data_md);
Contributor

Nit: for matmul it's called src, not data.

Comment on lines +61 to +62
init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d,
weights_d, dst_d, bias_d);
Contributor

Suggested change
-        init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d,
-                weights_d, dst_d, bias_d);
+        CHECK(init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d,
+                weights_d, dst_d, bias_d));

Comment on lines +69 to +71
auto init_status = init_conf();

return init_status;
Contributor

Suggested change
-        auto init_status = init_conf();
-        return init_status;
+        return init_conf();

@@ -146,6 +177,21 @@ struct md_t {
#undef CHECK_AND_ASSIGN
}

md_t(const md_t_spec_const &mimicker)
Contributor

Nit:

Suggested change
-    md_t(const md_t_spec_const &mimicker)
+    md_t(const md_t_spec_const &other)
