GPU autoscheduling with Mullapdui2016: the reference implementation #7787
Conversation
Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.
Hi @abadams,
Thank you for reviewing the code and dotting the i's and crossing the t's. I concur that GPU scheduling is an experimental feature, and it should be highlighted as such in the `user_warning`. Could you please show me where to warn the user?
I am also open to an additional option, `bool ArchParams::emit_gpu_schedules = false;`, parsable in the generator command-line interface. Though I highly doubt anyone would go through the hassle of setting `target=host-cuda-cuda_capability_??` just to disable the GPU auto-scheduler.
My primary goal is to get this PR upstreamed, so that everybody can benefit from the auto-scheduler comparison and other studies. The generated `demo.schedule.h` can be sub-optimal; we all expect the end users will tweak it for their use cases.
As this is an attempted reconstruction of his GPU autoscheduler, I should probably tag @ravi-teja-mullapudi to see if this looks sane, because this will affect how people cite and compare to his work in the future.
Several bot failures with:
Done removing the offending line. I also rebased the changes on top of the latest upstream.

Update: perhaps we need a separate PR to check for unused variables in the CMake configs:

```diff
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 47e90864d..83ded47a1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -587,6 +587,8 @@ target_compile_options(
     $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-function>
     $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-macros>
     $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-parameter>
+    $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-variable>
+    $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-const-variable>
     $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat-pedantic>
     $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat>
```
@steven-johnson and @abadams, thank you for testing the PR on the CI. Yes, the failure is triggered by the CMake build option. There are two types of generator failures:
But yeah, we should have a better exception-handling mechanism for this actionable error. I need help to improve the user experience. Another generator failure:
Updated to the main branch to fix the OSX WebGPU failures.
Update: The GPU scheduling extension for Mullapudi2016 passes all Buildbot tests except for
@abadams Yeah, I agree the Buildbot CI jobs ensure production-quality auto-schedulers, which was not the original goal of Mullapudi2016's GPU extensions. I will switch this PR to a draft and work on issue 2 later next week.
This PR is a year old at this point -- is it defunct?
Hi @steven-johnson, this branch is still active. I actively use it to generate working schedules for image processing pipelines on GPUs like Jetson and RTX cards. I also rebase the branch onto upstream's top of tree on a monthly basis. I recall that we cannot merge it because it doesn't pass a few CI tests. There are test cases on GitHub Actions where the following combinations result in an autoscheduler exception:
Could you show me a way to skip these combos in GitHub Actions? Mullapudi2016 is designed for camera ISP use cases, so it makes sense for the GPU reference implementation to report exceptions for deep-learning algorithm pipelines.
```diff
-int tile_inner_index = dims.size() - outer_dims.size() - 1;
+internal_assert(dims.size() >= outer_dims.size());
+const auto tile_inner_index = std::max(int(dims.size() - outer_dims.size()) - 1, 0);
```
For tensor multiplications, the assertion `dims.size() > outer_dims.size()` fails because the GPU schedule wants to map outer dimensions to `gpu_blocks`. If the assertion is ignored, the integer `tile_inner_index` becomes invalid (i.e., a value less than zero).
Here, I am trying to clamp `tile_inner_index` to zero. This may break more test cases. Help wanted here.
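The clamping idea can be sketched in isolation. Note that `clamped_tile_inner_index` is a hypothetical standalone helper, not a function from this PR; the key detail is casting to `int` before subtracting, since `size_t` arithmetic would wrap around when `outer_dims` has more entries than `dims`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the clamped index computation discussed
// above. dims_size and outer_dims_size stand in for dims.size() and
// outer_dims.size(). The casts happen before the subtraction so the
// difference can legitimately go negative; std::max then clamps it to 0.
int clamped_tile_inner_index(std::size_t dims_size, std::size_t outer_dims_size) {
    const int diff = static_cast<int>(dims_size) - static_cast<int>(outer_dims_size);
    return std::max(diff - 1, 0);
}
```

With this form, a stage whose dimensions are all consumed by `gpu_blocks` simply yields index 0 instead of tripping the assertion.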
Reverse engineer the GPU scheduling feature as stated in Section 5.4 of Mullapudi's article:

Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, 1–11. https://doi.org/10.1145/2897824.2925952

When `target=cuda` is detected in the code generator command-line arguments, intercept all `vectorize` and `parallel` scheduling calls requested by the auto-vectorization algorithm and the auto-parallelization algorithm with the class `GPUTilingDedup` for deferred execution.

Implement the class `GPUTilingDedup` to ensure all Halide GPU schedule calls are idempotent: no matter how many times the Stage is vectorized, reordered, and then repeatedly vectorized, `gpu_threads()` is called exactly once.

Also, intercept all `split` and `reorder` scheduling calls by Mullapudi's auto-splitting algorithm. Implement the class `GPUTileHelper` to enforce atomic transactions of the GPU schedules. If the current stage is `compute_root`, mark all auto-split inner dimensions as `gpu_threads`, and outer dimensions as `gpu_blocks`. If the Stage is `compute_at` another Stage, mark all `vectorize` dimensions as `gpu_threads`.

If auto-splitting of the current stage does not result in any tile, implement a rudimentary tiling with tile size = vector_length x parallel_factor.

If Mullapudi does not call any split, vectorize, or parallel schedules, assume a scalar reduction routine. Implement it on the GPU via `single_thread`.
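The deferred, deduplicated execution described above can be illustrated with a minimal sketch. `ScheduleDedup` here is a hypothetical stand-in for the idea behind `GPUTilingDedup`, not the actual Halide class: directives are recorded as they are requested, and each unique directive is applied at most once when flushed.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of the deduplication idea: buffer schedule
// directives instead of applying them eagerly, then apply each distinct
// directive exactly once, no matter how often it was requested.
class ScheduleDedup {
public:
    // Record a directive (e.g. "gpu_threads") for deferred execution.
    void request(const std::string &directive) { pending_.insert(directive); }

    // Apply all pending directives that have not been applied before.
    // A second flush of the same directives is a no-op, which is what
    // makes the overall scheduling sequence idempotent.
    std::vector<std::string> flush() {
        std::vector<std::string> applied;
        for (const auto &d : pending_) {
            if (done_.insert(d).second) {
                applied.push_back(d);
            }
        }
        pending_.clear();
        return applied;
    }

private:
    std::set<std::string> pending_;
    std::set<std::string> done_;
};
```

The real class additionally has to track loop variables and ordering, but the "record, then apply once" transaction shape is the core of the idempotency guarantee.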
@steven-johnson Update: state of the CI jobs
I suspect Mullapudi2016 is not designed for algorithms that are not chained stencil pipelines. Is there a way to skip benchmarks on the CI runners based on the CMake option?
Details of the failure when
And when
Rationale:

- To compare the GPU auto-scheduling performance of Mullapudi2016 against Li2018 and Anderson2021.
- To reduce the following claims to practice, quoting the original Mullapudi2016 article:
- Mullapudi2016 and Sioutas2020 algorithms, according to the findings in the Anderson2021 paper:

Change summary:
Reverse engineer the GPU scheduling feature as stated in Section 5.4 of Mullapudi's article:

Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, 1–11. https://doi.org/10.1145/2897824.2925952

- When `target=cuda` is detected in the code generator command-line arguments, intercept all `vectorize` and `parallel` scheduling calls requested by the auto-vectorization algorithm and the auto-parallelization algorithm with the class `GPUTilingDedup` for deferred execution.
- Implement the class `GPUTilingDedup` to ensure all Halide GPU schedule calls are idempotent: no matter how many times the Stage is `vectorize`d, `reorder`ed, `parallel`ized, and then `reorder`ed again, the `reorder` and `gpu_threads()` schedules are called exactly once.
- Also, intercept all `split` and `reorder` scheduling calls by Mullapudi's auto-splitting algorithm.
- Implement the class `GPUTileHelper` to enforce atomic transactions of the GPU schedules. If the current stage is `compute_root`, mark all auto-split inner dimensions as `gpu_threads`, and outer dimensions as `gpu_blocks`. If the Stage is `compute_at` another Stage, mark all `vectorize` dimensions as `gpu_threads`.
- If auto-splitting of the current stage does not result in any tile, implement a rudimentary tiling with tile size = vector_length x parallel_factor.
- If Mullapudi does not call any split, vectorize, or parallel schedules, assume a scalar reduction routine and implement it on the GPU via `single_thread`.

cc'ed @aekul, @jrk, @abadams.

See also: #7491
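The rudimentary tiling fallback mentioned in the change summary can be sketched numerically. `rudimentary_tile` is a hypothetical illustration (not code from the PR) of deriving a tile when auto-splitting produced none, using the stated tile size of vector_length x parallel_factor:

```cpp
#include <cassert>

// Hypothetical sketch of the fallback tiling described above: when
// Mullapudi's auto-splitting yields no tile, derive one from the
// vectorization and parallelism factors so the stage still gets an
// outer (gpu_blocks) and inner (gpu_threads) extent.
struct FallbackTile {
    int block_extent;   // outer extent, mapped to gpu_blocks
    int thread_extent;  // inner extent, mapped to gpu_threads
};

FallbackTile rudimentary_tile(int dim_extent, int vector_length, int parallel_factor) {
    const int tile = vector_length * parallel_factor;   // tile size per the summary
    const int blocks = (dim_extent + tile - 1) / tile;  // ceil-divide to cover the extent
    return FallbackTile{blocks, tile};
}
```

For example, a 1000-wide dimension with vector_length 8 and parallel_factor 16 gives 128 threads per block and 8 blocks, the last block handling the partial tile via Halide's usual split tail strategies.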