trying out halide with matrix multiplication .. how did you ever come up with a good schedule? #6254
-
Hi everyone, as a basis for my first experiments I chose matrix multiplication, because I have seen some efforts to optimize it on the GPU with CUDA and have a feeling for how much potential there is.
My baseline is a naive implementation which takes about 2.2 s on my (virtual) machine for the multiplication of two 1024×1024 matrices (compiled with -O2). From the tutorials, videos etc. I was under the impression that I could start easily with just the algorithm (i.e. the default schedule) and then try out different scheduling operations on it. When I run the default schedule in Halide I get very similar runtimes to the naive implementation, and the code printed with HL_DEBUG_CODEGEN=1 is:
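For concreteness, my naive baseline is essentially the textbook triple loop. This is a sketch with illustrative names (row-major flat vectors), not my exact source:

```cpp
#include <vector>
#include <cstddef>

// Naive O(N^3) matrix multiplication: the baseline the timings refer to.
// Matrices are stored row-major in flat vectors; names here are illustrative.
std::vector<float> naive_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            for (std::size_t k = 0; k < N; ++k)   // innermost reduction loop
                C[y * N + x] += A[y * N + k] * B[k * N + x];
    return C;
}
```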
Now I wanted to start with the scheduling operations, first vectorizing in k (since I would have expected k to be the innermost loop, which is also how I read the generated code above).
But when I run this I get the error:
I am not sure what the error message really means. Yes, k is not a dimension but a reduction domain, as far as I understand the nomenclature, but it still seems reasonable to vectorize over it, since it is the index of the innermost loop. Since I understood that vectorize as used above is shorthand for split + vectorize, I also tried to split explicitly in k, but then I get the slightly more specific error:
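As an aside, the reason k is special can be seen even without Halide: the reduction loop carries a dependence through the accumulator, so vectorizing it means re-associating floating-point additions. A plain C++ sketch of that idea (my own illustration, not Halide-generated code):

```cpp
#include <vector>
#include <cstddef>

// A reduction has a loop-carried dependence: each step reads the previous sum.
float dot_in_order(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k)
        s += a[k] * b[k];                 // depends on s from iteration k-1
    return s;
}

// What vectorizing over k would mean: several independent partial sums that
// are combined at the end. This changes the order of floating-point
// additions, which is why Halide will not do it implicitly (rfactor makes
// the re-association explicit and puts it under the programmer's control).
float dot_partial_sums(const std::vector<float>& a, const std::vector<float>& b) {
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < a.size(); ++k)
        s[k % 4] += a[k] * b[k];
    return (s[0] + s[1]) + (s[2] + s[3]);
}
```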
My next thought was that tiling might be a nice way to go: it could improve cache usage, and I also wanted to try to make use of parallelization over tiles in the next step. But when I tried the tiling operation
and inspected the generated code, I could only see that the initialization (set to zero) is processed in tiled fashion, whereas the actual computation is still exactly as before:
Execution time even got a bit worse (2.4 s); in any case it does not look right.
Somewhat consistently, parallel also only affects the initialization:
Without much conviction I also tried to vectorize in the x dimension:
It at least does not give an error, but it also does not affect performance, and again it only seems to influence the initialization phase:
Now at this point I am a bit crestfallen, since I am not sure what is going on, and I wonder how anybody ever comes up with a good schedule :-) My last resort was to try out the performance-test schedule from https://github.com/halide/Halide/blob/master/test/performance/matrix_multiplication.cpp, and indeed this one works and leads to astonishing performance (0.04 s). But this schedule is really obscure to me:
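For comparison, the kind of loop nest such a tiled schedule corresponds to can be written by hand. The following is my own rough plain-C++ approximation of the idea (tile the output and walk k in blocks so a small tile of C stays hot in cache; the tile size T is a free parameter), not the code Halide actually emits:

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Hand-tiled matmul: compute C in TxT output tiles, walking k in blocks of T,
// so each tile of C and the corresponding strips of A and B stay in cache.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N, std::size_t T) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t yo = 0; yo < N; yo += T)
        for (std::size_t xo = 0; xo < N; xo += T)
            for (std::size_t ko = 0; ko < N; ko += T)
                // One tile: the same arithmetic as the naive loop, reordered.
                for (std::size_t y = yo; y < std::min(yo + T, N); ++y)
                    for (std::size_t k = ko; k < std::min(ko + T, N); ++k) {
                        float a = A[y * N + k];
                        // Innermost loop runs over x, a non-reduction
                        // dimension with unit stride: this is the loop a
                        // compiler (or Halide) can vectorize safely.
                        for (std::size_t x = xo; x < std::min(xo + T, N); ++x)
                            C[y * N + x] += a * B[k * N + x];
                    }
    return C;
}
```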
I attach my source code, where all the trials are separate commented sections; maybe someone is willing to take a look.
Replies: 3 comments
-
Specifically about this: you need to refer to the update stage while scheduling.
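Concretely, "refer to the update stage" means going through update(0). A hedged sketch, assuming a Func matmul defined with an RDom k over the shared dimension (this requires Halide, and the tile and vector sizes here are placeholders I picked, not tuned values):

```cpp
Var x("x"), y("y"), xi("xi"), yi("yi");

matmul(x, y) = 0.0f;                // stage 0: pure/init definition
matmul(x, y) += A(k, y) * B(x, k);  // stage 1: update over the RDom k

// Scheduling calls on matmul itself only transform stage 0 (the zero-init).
// The += loop nest is reached via update(0):
matmul.update(0)
      .tile(x, y, xi, yi, 32, 32)   // illustrative tile sizes
      .parallel(y)                  // parallelize over rows of tiles
      .vectorize(xi, 8);            // vectorize a pure dimension, not k
```

Note that the vectorized variable is a pure dimension of the update; vectorizing the RDom variable k itself would need rfactor.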
-
Ashish answered above about having to schedule the `update(N)` stage. Reductions have an initialization or "pure" stage and then some number of updates that mutably change the buffer backing the result. Each of these stages can be scheduled independently; the downside is that scheduling must be applied to each separately. This is likely the main stumbling block here, and I expect things will make more sense knowing about this.

In the performance example, vectorization within `matrix_mul` is accomplished by reordering a non-reduction dimension into the innermost loop. (To vectorize reductions, `rfactor` must be used. Reductions are specified as in-order loops, and thus to reorder operations, it must be pr…

The performance example uses …

It is probably better to think of Halide as a collection of tools for automating transformations that would be very tedious by hand rather than as a very high level of abstraction.
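To picture the two stages: a reduction lowers to two separate loop nests over the output, and each nest gets its own schedule, which is why tiling only the pure stage leaves the += loops untouched. A plain C++ illustration (my sketch, not actual Halide output):

```cpp
#include <vector>
#include <cstddef>

// A Halide reduction lowers to (at least) two loop nests over the output:
// the pure/init stage and the update stage. Each one is scheduled separately.
std::vector<float> two_stage_matmul(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t N) {
    std::vector<float> C(N * N);

    // Stage 0 (pure): matmul(x, y) = 0.0f. Scheduling calls on the Func
    // itself only transform this nest.
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            C[y * N + x] = 0.0f;

    // Stage 1 (update): matmul(x, y) += ..., scheduled separately in Halide
    // via matmul.update(0). (Indexing here is row-major C++, not Halide's
    // (x, y) coordinate convention.)
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            for (std::size_t k = 0; k < N; ++k)
                C[y * N + x] += A[y * N + k] * B[k * N + x];
    return C;
}
```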
-
Thanks a lot for your answers, and thanks @zvookin for your detailed explanations. The update stage was the part I was missing, and with your answer I understand why. I have now actually added an explicit initialization to 0.0f in my code, just to make clear that there are two stages. It's nice that Halide added it automatically when it "spotted" my += operator; on the other hand that is a bit too auto-magical for me, and even a warning like "uninitialized buffer used" would maybe have made sense. Now, with the update(0) call, the scheduling operations have an effect: very simple tiling + parallel + vectorize(x) already brings a 10x improvement, which is extremely nice for the little amount of work required from me. I still have trouble understanding the performance example; "it sort of dances around the reduction loop in the middle" is a lovely description, and I guess I have to meditate on it a bit to let it sink in with all the details :-D
I agree, and I am totally looking for such tools. But on the other hand it is much the same thing: "convenient tools to automate" are basically an abstraction, and they operate on abstractions like the Func objects. But that is just my current impression; I have only been looking into Halide for a few days and definitely do not have enough insight for a well-founded opinion.