GPU autoscheduling with Mullapudi2016: the reference implementation
Reverse-engineer the GPU scheduling feature described in Section 5.4 of
Mullapudi's article:

Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically
scheduling Halide image processing pipelines.
ACM Transactions on Graphics, 35(4), Article 83, pp. 1–11.
https://doi.org/10.1145/2897824.2925952

When `target=cuda` is detected in the code generator command-line
arguments, intercept all `vectorize` and `parallel` scheduling calls
requested by the auto-vectorization and auto-parallelization algorithms
with the class `GPUTilingDedup` for deferred execution.

Implement the class `GPUTilingDedup` to ensure all Halide GPU schedule
calls are idempotent: no matter how many times the Stage is vectorized,
reordered, and then vectorized again, `gpu_threads()` is called exactly once.

Also, intercept all `split` and `reorder` scheduling calls by
Mullapudi's auto-splitting algorithm.

Implement the class `GPUTileHelper` to enforce an atomic transaction of
the GPU schedules. If the current Stage is `compute_root`, mark all
auto-split inner dimensions as `gpu_threads` and outer dimensions as
`gpu_blocks`. If the Stage is `compute_at` another Stage, mark all
`vectorize` dimensions as `gpu_threads`.

If auto-splitting of the current Stage does not produce any tile,
implement a rudimentary tiling with tile size = vector_length x
parallel_factor.

If Mullapudi's algorithm does not request any split, vectorize, or
parallel schedules, assume a scalar reduction routine. Implement it on
the GPU via `single_thread`.
antonysigma committed Aug 21, 2023
1 parent f11e80d commit 8683250
Showing 1 changed file with 505 additions and 53 deletions.