GPU autoscheduling with Mullapudi2016: the reference implementation
Reverse-engineer the GPU scheduling feature described in Section 5.4 of Mullapudi's paper:

> Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, 1–11. https://doi.org/10.1145/2897824.2925952

When `target=cuda` is detected in the code-generator command-line arguments, intercept all `vectorize` and `parallel` scheduling calls requested by the auto-vectorization and auto-parallelization algorithms, and hand them to the class `GPUTilingDedup` for deferred execution. Implement `GPUTilingDedup` so that all Halide GPU schedule calls are idempotent: no matter how many times the Stage is vectorized, reordered, and then `vectorize`d again, `gpu_threads()` is called exactly once (see the first sketch below).

Also intercept all `split` and `reorder` scheduling calls issued by Mullapudi's auto-splitting algorithm. Implement the class `GPUTileHelper` to enforce an atomic transaction for the GPU schedules:

- If the current Stage is `compute_root`, mark all auto-split inner dimensions as `gpu_threads` and the outer dimensions as `gpu_blocks`.
- If the Stage is `compute_at` another Stage, mark all `vectorize` dimensions as `gpu_threads`.
- If auto-splitting of the current Stage does not produce any tile, fall back to a rudimentary tiling with tile size = vector_length x parallel_factor (see the second sketch below).
- If Mullapudi issues no `split`, `vectorize`, or `parallel` schedule at all, assume a scalar reduction routine and run it on the GPU via `single_thread`.
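For illustration, here is a minimal C++ sketch of the idempotency guard, assuming the Halide C++ API. It is not the actual implementation; the method names (`record_vectorize`, `record_parallel`, `commit`) are hypothetical:

```cpp
// Minimal sketch of the GPUTilingDedup idea, assuming the Halide C++ API.
// Method names (record_vectorize, record_parallel, commit) are hypothetical.
#include "Halide.h"

#include <optional>
#include <utility>

using namespace Halide;

class GPUTilingDedup {
    Stage stage;                          // the Stage being scheduled
    bool committed = false;               // ensures commit() runs only once
    std::optional<VarOrRVar> thread_var;  // deferred vectorize dimension
    std::optional<VarOrRVar> block_var;   // deferred parallel dimension

public:
    explicit GPUTilingDedup(Stage s) : stage(std::move(s)) {}

    // Intercept the auto-vectorization algorithm's vectorize() request and
    // defer it instead of applying it immediately.
    void record_vectorize(const VarOrRVar &v) { thread_var = v; }

    // Intercept the auto-parallelization algorithm's parallel() request.
    void record_parallel(const VarOrRVar &v) { block_var = v; }

    // Flush the deferred schedule. Idempotent: however many times the Stage
    // was vectorized or reordered, gpu_threads() is emitted exactly once.
    void commit() {
        if (committed) return;
        committed = true;
        if (block_var && thread_var) {
            stage.gpu_blocks(*block_var).gpu_threads(*thread_var);
        } else if (thread_var) {
            stage.gpu_threads(*thread_var);
        } else {
            // No split/vectorize/parallel was requested: treat the Stage as
            // a scalar reduction and run it in a single GPU thread.
            stage.gpu_single_thread();
        }
    }
};
```

Repeated `record_vectorize` calls simply overwrite the pending dimension, so `commit()` can be invoked from every code path that might finalize the schedule.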
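And a hedged sketch of the fallback tiling for a `compute_root` Stage, again with a hypothetical helper name (`rudimentary_gpu_tile`); it shows the inner/outer dimension mapping and the vector_length x parallel_factor tile described above:

```cpp
// Hypothetical sketch of GPUTileHelper's fallback path, assuming the Halide
// C++ API: when the auto-splitter produced no tile, split the dimension by
// vector_length * parallel_factor, then map outer -> gpu_blocks and
// inner -> gpu_threads (the compute_root rule above).
#include "Halide.h"

using namespace Halide;

Stage rudimentary_gpu_tile(Stage s, const VarOrRVar &x,
                           int vector_length, int parallel_factor) {
    VarOrRVar outer("outer", x.is_rvar), inner("inner", x.is_rvar);
    return s.split(x, outer, inner, vector_length * parallel_factor)
            .gpu_blocks(outer)    // outer dimension -> GPU blocks
            .gpu_threads(inner);  // inner dimension -> GPU threads
}
```

A `compute_at` Stage would instead mark only its `vectorize` dimension with `gpu_threads`, since its outer loops already nest inside the consumer's GPU blocks.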