trying out halide with matrix multiplication .. how did you ever come up with a good schedule? #6254
-
Hi everyone, as a basis for my first experiments I chose matrix multiplication, because I have seen some efforts to optimize it on the GPU with CUDA and have a feeling for how much potential there is.
My baseline is a naive implementation which takes about 2.2 s on my (virtual) machine for the multiplication of two 1024×1024 matrices (compiled with -O2). From the tutorials, videos etc. I was under the impression that I could start easily with just the algorithm (i.e. the default schedule) and then try out different scheduling operations on it. When I run the default schedule in Halide I get very similar runtimes to the naive implementation, and the code printed with HL_DEBUG_CODEGEN=1 is:
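For concreteness, my naive baseline is essentially the textbook triple loop. This is a sketch with illustrative names (row-major flat vectors), not my exact source:

```cpp
#include <vector>
#include <cstddef>

// Naive O(N^3) matrix multiplication: the baseline the timings refer to.
// Matrices are stored row-major in flat vectors; names here are illustrative.
std::vector<float> naive_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            for (std::size_t k = 0; k < N; ++k)   // innermost reduction loop
                C[y * N + x] += A[y * N + k] * B[k * N + x];
    return C;
}
```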
Now I wanted to start with the scheduling operations, first vectorizing in k (since I would have expected k to be the innermost loop, which is also how I read the generated code above).
But when I run this I get the error:
I am not sure what the error message really means. Yes, k is not a dimension but a reduction domain, as far as I understand the nomenclature, but it still seems reasonable to vectorize over it, since it is the index of the innermost loop. Since I understood that vectorize as used above is shorthand for split + vectorize, I also tried to split explicitly in k, but then I get the slightly more specific error:
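As an aside, the reason k is special can be seen even without Halide: the reduction loop carries a dependence through the accumulator, so vectorizing it means re-associating floating-point additions. A plain C++ sketch of that idea (my own illustration, not Halide-generated code):

```cpp
#include <vector>
#include <cstddef>

// A reduction has a loop-carried dependence: each step reads the previous sum.
float dot_in_order(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k)
        s += a[k] * b[k];                 // depends on s from iteration k-1
    return s;
}

// What vectorizing over k would mean: several independent partial sums that
// are combined at the end. This changes the order of floating-point
// additions, which is why Halide will not do it implicitly (rfactor makes
// the re-association explicit and puts it under the programmer's control).
float dot_partial_sums(const std::vector<float>& a, const std::vector<float>& b) {
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < a.size(); ++k)
        s[k % 4] += a[k] * b[k];
    return (s[0] + s[1]) + (s[2] + s[3]);
}
```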
My next thought was that tiling might be a nice way to go: it could improve cache usage, and I also wanted to try to make use of parallelization over tiles in the next step. But when I tried the tiling operation
and inspected the generated code, I could only see that the initialization (set to zero) is processed in tiled fashion, whereas the actual computation is still exactly as before:
Execution time even got a bit worse (2.4 s); in any case it does not look right.
Somewhat consistently, parallel also only affects the initialization:
Without much conviction I also tried to vectorize in the x dimension:
It at least does not give an error, but it also does not affect performance, and again it only seems to influence the initialization phase:
Now at this point I am a bit crestfallen, since I am not sure what is going on, and I wonder how anybody ever comes up with a good schedule :-) My last resort was to try out the performance-test schedule from https://github.com/halide/Halide/blob/master/test/performance/matrix_multiplication.cpp, and indeed this one works and leads to astonishing performance (0.04 s). But this schedule is really obscure to me:
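For comparison, the kind of loop nest such a tiled schedule corresponds to can be written by hand. The following is my own rough plain-C++ approximation of the idea (tile the output and walk k in blocks so a small tile of C stays hot in cache; the tile size T is a free parameter), not the code Halide actually emits:

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Hand-tiled matmul: compute C in TxT output tiles, walking k in blocks of T,
// so each tile of C and the corresponding strips of A and B stay in cache.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N, std::size_t T) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t yo = 0; yo < N; yo += T)
        for (std::size_t xo = 0; xo < N; xo += T)
            for (std::size_t ko = 0; ko < N; ko += T)
                // One tile: the same arithmetic as the naive loop, reordered.
                for (std::size_t y = yo; y < std::min(yo + T, N); ++y)
                    for (std::size_t k = ko; k < std::min(ko + T, N); ++k) {
                        float a = A[y * N + k];
                        // Innermost loop runs over x, a non-reduction
                        // dimension with unit stride: this is the loop a
                        // compiler (or Halide) can vectorize safely.
                        for (std::size_t x = xo; x < std::min(xo + T, N); ++x)
                            C[y * N + x] += a * B[k * N + x];
                    }
    return C;
}
```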
I attach my source code, where all the trials are separate commented sections; maybe someone is willing to take a look.
Replies: 3 comments
-
Specifically about this: you need to refer to the update stage while scheduling.
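Concretely, "refer to the update stage" means going through update(0). A hedged sketch, assuming a Func matmul defined with an RDom k over the shared dimension (this requires Halide, and the tile and vector sizes here are placeholders I picked, not tuned values):

```cpp
Var x("x"), y("y"), xi("xi"), yi("yi");

matmul(x, y) = 0.0f;                // stage 0: pure/init definition
matmul(x, y) += A(k, y) * B(x, k);  // stage 1: update over the RDom k

// Scheduling calls on matmul itself only transform stage 0 (the zero-init).
// The += loop nest is reached via update(0):
matmul.update(0)
      .tile(x, y, xi, yi, 32, 32)   // illustrative tile sizes
      .parallel(y)                  // parallelize over rows of tiles
      .vectorize(xi, 8);            // vectorize a pure dimension, not k
```

Note that the vectorized variable is a pure dimension of the update; vectorizing the RDom variable k itself would need rfactor.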
-
Ashish answered above about having to schedule the `update(N)` stage. Reductions have an initialization or "pure" stage and then some number of updates that mutably change the buffer backing the result. Each of these stages can be scheduled independently; the downside is that scheduling must be applied to each separately. This is likely the main stumbling block here, and I expect things will make more sense knowing about this.

In the performance example, vectorization within `matrix_mul` is accomplished by reordering a non-reduction dimension into the innermost loop. (To vectorize reductions, `rfactor` must be used. Reductions are specified as in-order loops, and thus to reorder operations, it must be pr…

The performance example uses …

It is probably better to think of Halide as a collection of tools for automating transformations that would be very tedious by hand rather than as a very high level of abstraction.
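To picture the two stages: a reduction lowers to two separate loop nests over the output, and each nest gets its own schedule, which is why tiling only the pure stage leaves the += loops untouched. A plain C++ illustration (my sketch, not actual Halide output):

```cpp
#include <vector>
#include <cstddef>

// A Halide reduction lowers to (at least) two loop nests over the output:
// the pure/init stage and the update stage. Each one is scheduled separately.
std::vector<float> two_stage_matmul(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t N) {
    std::vector<float> C(N * N);

    // Stage 0 (pure): matmul(x, y) = 0.0f. Scheduling calls on the Func
    // itself only transform this nest.
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            C[y * N + x] = 0.0f;

    // Stage 1 (update): matmul(x, y) += ..., scheduled separately in Halide
    // via matmul.update(0). (Indexing here is row-major C++, not Halide's
    // (x, y) coordinate convention.)
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            for (std::size_t k = 0; k < N; ++k)
                C[y * N + x] += A[y * N + k] * B[k * N + x];
    return C;
}
```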
-
Thanks a lot for your answers, and thanks @zvookin for your detailed explanations. The update stage was the part I was missing, and with your answer I understand why. I have now actually added an explicit initialization to 0.0f in my code, just to make clear that there are two stages. It's nice that Halide added it automatically when it "spotted" my += operator; on the other hand that is a bit too auto-magical for me, and even a warning like "uninitialized buffer used" would maybe have made sense. Now, with the update(0) call, the scheduling operations have an effect: very simple tiling + parallel + vectorize(x) already brings a 10x improvement, which is extremely nice for the little amount of work required from me. I still have trouble understanding the performance example; "it sort of dances around the reduction loop in the middle" is a lovely description, and I guess I have to meditate on it a bit to let it sink in with all the details :-D
I agree, and I am totally looking for such tools. But on the other hand it is much the same thing: "convenient tools to automate" are basically an abstraction, and they operate on abstractions like the Func objects. But that is just my current impression; I have only been looking into Halide for a few days and definitely do not have enough insight for a well-founded opinion.