
Will setting different lmul result in obvious performance difference? #237

Open · xlinsist opened this issue Jun 15, 2023 · 5 comments

xlinsist commented Jun 15, 2023

In RVV, lmul stands for "vector register group multiplier": it specifies how many vector registers are grouped together and operated on by a single instruction. Different lmul settings therefore generate different numbers of instructions for the same procedure; for example, setting lmul=4 requires twice as many instructions as setting lmul=8, because each instruction covers half as many elements. Although the additional instructions cause extra fetch and decode overhead, in theory this should not have much impact on overall performance.
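To make the instruction-count argument concrete, here is a minimal hypothetical C sketch (not part of the benchmark; it assumes an RVV 1.0 toolchain providing `<riscv_vector.h>`) that counts how many strip-mining iterations the same workload needs at two lmul settings:

```c
/*
 * Hypothetical sketch: count strip-mining iterations for the same
 * workload at LMUL=8 vs. LMUL=1, assuming an RVV 1.0 toolchain with
 * <riscv_vector.h>. At SEW=32 with VLEN=1024, vsetvl grants at most
 * 8 * 1024 / 32 = 256 elements per iteration at LMUL=8, but only 32
 * at LMUL=1.
 */
#include <riscv_vector.h>
#include <stdio.h>

int main(void) {
  const size_t n = 262145; // input length used in the issue
  size_t iters_m8 = 0, iters_m1 = 0;

  for (size_t avl = n; avl > 0; iters_m8++)
    avl -= __riscv_vsetvl_e32m8(avl); // vl <= 8 * VLEN / SEW

  for (size_t avl = n; avl > 0; iters_m1++)
    avl -= __riscv_vsetvl_e32m1(avl); // vl <= VLEN / SEW

  // Assuming vsetvl returns min(avl, VLMAX) and VLEN=1024:
  // 1025 iterations at LMUL=8 vs. 8193 at LMUL=1.
  printf("LMUL=8: %zu iterations, LMUL=1: %zu iterations\n",
         iters_m8, iters_m1);
  return 0;
}
```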

However, experiments on an AXPY case based on the strip-mining method demonstrate that different lmul settings do make a difference. As shown in the table below, non-computing overheads such as fetching and decoding account for about 4.0% ((663,647 - 638,047) / 638,047) of the total cycles when going from lmul=8 to lmul=4, and the performance gap between lmul=8 and lmul=1 reaches 28.4% ((819,295 - 638,047) / 638,047). That is almost exactly seven times the per-doubling overhead, which matches the instruction counts: lmul=1 issues 8x as many instructions as lmul=8, i.e. 7x additional instructions, and 7 * 4.0% ≈ 28.4%.

Will setting different lmul values result in an obvious performance difference? Do we need further analysis on that?

| AXPY implementation | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles | 638,047 | 663,647 | 716,895 | 819,295 |
| delta between total cycles | - | 25,600 | 53,248 | 102,400 |

The AXPY case is as follows (SEW=32, VLEN=1024, and input length = 262145):

axpy.mlir:

```mlir
// BUDDY-OPT
// --lower-affine --convert-scf-to-cf --convert-math-to-llvm
// --lower-vector-exp --lower-rvv=rv32
// --convert-vector-to-llvm --finalize-memref-to-llvm
// --convert-arith-to-llvm --convert-func-to-llvm
// --reconcile-unrealized-casts
// BUDDY-OPT-END

memref.global "private" @gv_i32 : memref<262145xi32> // 262145 = 256 * 1024 + 1

func.func @test() -> i32 {

  %input1 = memref.get_global @gv_i32 : memref<262145xi32>
  %input2 = memref.get_global @gv_i32 : memref<262145xi32>
  %output = memref.get_global @gv_i32 : memref<262145xi32>

  %c0 = arith.constant 0 : index
  %c0_i32 = arith.constant 0 : i32
  %dim = memref.dim %input1, %c0 : memref<262145xi32>
  %dim_i32 = arith.index_cast %dim : index to i32

  // Configure the vector unit (vtype settings).
  // SEW = 32 (vsew encoding: SEW = 8 * 2^vsew)
  %sew = arith.constant 2 : i32
  // LMUL = 8 (vlmul encoding: LMUL = 2^vlmul)
  %lmul = arith.constant 3 : i32

  // Constant mask configuration.
  %mask = arith.constant dense<1> : vector<[16]xi1>
  %a_element = affine.load %input1[%c0] : memref<262145xi32>

  // While loop for strip-mining.
  %tmp_avl, %tmp_idx = scf.while (%avl = %dim_i32, %idx = %c0) : (i32, index) -> (i32, index) {
    // If avl greater than zero.
    %cond = arith.cmpi sgt, %avl, %c0_i32 : i32
    // Pass avl, idx to the after region.
    scf.condition(%cond) %avl, %idx : i32, index
  } do {
  ^bb0(%avl : i32, %idx : index):
    // Perform the calculation according to the vl.
    %vl = rvv.setvl %avl, %sew, %lmul : i32
    %x_vector = vector_exp.predication %mask, %vl : vector<[16]xi1>, i32 {
      %ele = vector.load %input1[%idx] : memref<262145xi32>, vector<[16]xi32>
      vector.yield %ele : vector<[16]xi32>
    } : vector<[16]xi32>
    %y_vector = vector_exp.predication %mask, %vl : vector<[16]xi1>, i32 {
      %ele = vector.load %input2[%idx] : memref<262145xi32>, vector<[16]xi32>
      vector.yield %ele : vector<[16]xi32>
    } : vector<[16]xi32>
    %mul_vector = rvv.mul %x_vector, %a_element, %vl : vector<[16]xi32>, i32, i32
    %result_vector = rvv.add %mul_vector, %y_vector, %vl : vector<[16]xi32>, vector<[16]xi32>, i32
    vector_exp.predication %mask, %vl : vector<[16]xi1>, i32 {
      vector.store %result_vector, %output[%idx] : memref<262145xi32>, vector<[16]xi32>
      vector.yield
    } : () -> ()
    // Update idx and avl.
    %vl_ind = arith.index_cast %vl : i32 to index
    %new_idx = arith.addi %idx, %vl_ind : index
    %new_avl = arith.subi %avl, %vl : i32
    scf.yield %new_avl, %new_idx : i32, index
  }

  %result = vector.load %output[%c0] : memref<262145xi32>, vector<8xi32>

  %mask_res = arith.constant dense<1> : vector<8xi1>
  %c1_i32 = arith.constant 1 : i32
  %evl = arith.constant 8 : i32
  %res_reduce_add_mask_driven = "llvm.intr.vp.reduce.add" (%c1_i32, %result, %mask_res, %evl) :
        (i32, vector<8xi32>, vector<8xi1>, i32) -> i32

  return %res_reduce_add_mask_driven : i32
}
```
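For reference, here is a rough C equivalent of the kernel above at LMUL=8, written with standard RVV 1.0 intrinsics. This is only a sketch under the same SEW=32 setting, not the code actually benchmarked here (which goes through the buddy-mlir pipeline):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Sketch of the AXPY kernel (out = a * x + y) with strip-mining at
// SEW=32, LMUL=8; mirrors the rvv.setvl / load / mul / add / store
// structure of the MLIR above.
void axpy_i32_m8(const int32_t *x, const int32_t *y, int32_t *out,
                 int32_t a, size_t n) {
  for (size_t avl = n; avl > 0;) {
    size_t vl = __riscv_vsetvl_e32m8(avl);              // vl for this strip
    vint32m8_t vx = __riscv_vle32_v_i32m8(x, vl);       // load x strip
    vint32m8_t vy = __riscv_vle32_v_i32m8(y, vl);       // load y strip
    vint32m8_t vm = __riscv_vmul_vx_i32m8(vx, a, vl);   // a * x
    vint32m8_t vr = __riscv_vadd_vv_i32m8(vm, vy, vl);  // a * x + y
    __riscv_vse32_v_i32m8(out, vr, vl);                 // store result strip
    x += vl; y += vl; out += vl; avl -= vl;
  }
}
```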

@sequencer (Member) commented:
Should be fixed by #240

xlinsist commented Jun 24, 2023

After retesting, it is verified that:

  1. With different lmul settings, workloads that do more computation between loads and stores (e.g. Conv2d) show smaller performance gaps than a simple AXPY.
  2. After merging the PRs, different lmul settings still result in noticeable performance differences (over 10%), and the percentage does not seem better than before. Do you think this is normal?
| Conv2d with input size = 260 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles | 2,247,794 | 2,284,946 | 2,373,182 | 2,531,078 |

ratio of difference (more computation between load & store, Conv2d) = 12.6% < ratio of difference (simple AXPY) = 34.6%

| AXPY with input length = 262145 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles (before merging PRs) | 638,047 | 663,647 | 716,895 | 819,295 |
| total cycles (after merging PRs) | 645,228 | 675,948 | 737,388 | 868,460 |

ratio of difference (before) = (total_cycles(lmul=1) - total_cycles(lmul=8)) / total_cycles(lmul=8) * 100% = 28.4%
ratio of difference (after) = (868,460 - 645,228) / 645,228 * 100% = 34.6% > ratio of difference (before) = 28.4%

@sequencer (Member) commented:
Should be fixed by #245. Can you have another try? @xlinsist

xlinsist commented Jul 5, 2023

The retest results seem promising. The procedures now run faster than before, and the gap between the total cycles of different lmul settings has narrowed (except for lmul=1). As for lmul=1, do you have any insight into it? @sequencer

| Conv2d with input size = 260 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles (previous) | 2,247,794 | 2,284,946 | 2,373,182 | 2,531,078 |
| total cycles (current) | 2,215,286 | 2,212,964 | 2,208,320 | 2,496,248 |

ratio of difference (excluding lmul=1) = (total_cycles(lmul=8) - total_cycles(lmul=2)) / total_cycles(lmul=2) * 100% = 0.3%, which can be safely ignored.

ratio of difference = (total_cycles(lmul=1) - total_cycles(lmul=2)) / total_cycles(lmul=2) * 100% = 13.0%, which is relatively noticeable.

| AXPY with input length = 262145 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles (previous) | 645,228 | 675,948 | 737,388 | 868,460 |
| total cycles (current) | 654,437 | 645,221 | 684,133 | 811,109 |

ratio of difference (excluding lmul=1) = (total_cycles(lmul=2) - total_cycles(lmul=4)) / total_cycles(lmul=4) * 100% = 6.0%, which I assume is reasonable given the kernel's low computational demand.

@sequencer (Member) commented:
It is possibly caused by a VFU hazard; let's ask @SharzyL to make sure.
