
Will setting different lmul result in obvious performance difference? #237

Open · xlinsist opened this issue Jun 15, 2023 · 5 comments

xlinsist commented Jun 15, 2023

In RVV, lmul stands for "vector register group multiplier": it specifies how many vector registers are grouped together and operated on by a single instruction. Different lmul settings therefore generate different numbers of instructions for the same procedure; for example, setting lmul=4 requires twice as many instructions as setting lmul=8, because each instruction covers half as many elements. Although the additional instructions cause extra fetch and decode overhead, in theory this should not have much impact on overall performance.
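To make the instruction-count argument concrete, here is a minimal hypothetical C sketch (not part of the benchmark; it assumes an RVV 1.0 toolchain providing `<riscv_vector.h>`) that counts how many strip-mining iterations the same workload needs at two lmul settings:

```c
/*
 * Hypothetical sketch: count strip-mining iterations for the same
 * workload at LMUL=8 vs. LMUL=1, assuming an RVV 1.0 toolchain with
 * <riscv_vector.h>. At SEW=32 with VLEN=1024, vsetvl grants at most
 * 8 * 1024 / 32 = 256 elements per iteration at LMUL=8, but only 32
 * at LMUL=1.
 */
#include <riscv_vector.h>
#include <stdio.h>

int main(void) {
  const size_t n = 262145; // input length used in the issue
  size_t iters_m8 = 0, iters_m1 = 0;

  for (size_t avl = n; avl > 0; iters_m8++)
    avl -= __riscv_vsetvl_e32m8(avl); // vl <= 8 * VLEN / SEW

  for (size_t avl = n; avl > 0; iters_m1++)
    avl -= __riscv_vsetvl_e32m1(avl); // vl <= VLEN / SEW

  // Assuming vsetvl returns min(avl, VLMAX) and VLEN=1024:
  // 1025 iterations at LMUL=8 vs. 8193 at LMUL=1.
  printf("LMUL=8: %zu iterations, LMUL=1: %zu iterations\n",
         iters_m8, iters_m1);
  return 0;
}
```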

However, experiments on an AXPY case based on the strip-mining method demonstrate that different lmul settings do make a difference. As shown in the table below, non-computing overheads such as fetching and decoding account for about 4.0% ((663,647 - 638,047) / 638,047) of the total cycles when going from lmul=8 to lmul=4, and the performance gap between lmul=8 and lmul=1 reaches 28.4% ((819,295 - 638,047) / 638,047). That is almost exactly seven times the per-doubling overhead, which matches the instruction counts: lmul=1 issues 8x as many instructions as lmul=8, i.e. 7x additional instructions, and 7 * 4.0% ≈ 28.4%.

Will setting different lmul values result in an obvious performance difference? Do we need further analysis on that?

| AXPY implementation | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles | 638,047 | 663,647 | 716,895 | 819,295 |
| delta between total cycles | - | 25,600 | 53,248 | 102,400 |

The AXPY case is as follows (SEW=32, VLEN=1024, and input length = 262145):

axpy.mlir:

```mlir
// BUDDY-OPT
// --lower-affine --convert-scf-to-cf --convert-math-to-llvm
// --lower-vector-exp --lower-rvv=rv32
// --convert-vector-to-llvm --finalize-memref-to-llvm
// --convert-arith-to-llvm --convert-func-to-llvm
// --reconcile-unrealized-casts
// BUDDY-OPT-END

memref.global "private" @gv_i32 : memref<262145xi32> // 262145 = 256 * 1024 + 1

func.func @test() -> i32 {

  %input1 = memref.get_global @gv_i32 : memref<262145xi32>
  %input2 = memref.get_global @gv_i32 : memref<262145xi32>
  %output = memref.get_global @gv_i32 : memref<262145xi32>

  %c0 = arith.constant 0 : index
  %c0_i32 = arith.constant 0 : i32
  %dim = memref.dim %input1, %c0 : memref<262145xi32>
  %dim_i32 = arith.index_cast %dim : index to i32

  // Configure the vector unit (vtype settings).
  // SEW = 32 (vsew encoding: SEW = 8 * 2^vsew)
  %sew = arith.constant 2 : i32
  // LMUL = 8 (vlmul encoding: LMUL = 2^vlmul)
  %lmul = arith.constant 3 : i32

  // Constant mask configuration.
  %mask = arith.constant dense<1> : vector<[16]xi1>
  %a_element = affine.load %input1[%c0] : memref<262145xi32>

  // While loop for strip-mining.
  %tmp_avl, %tmp_idx = scf.while (%avl = %dim_i32, %idx = %c0) : (i32, index) -> (i32, index) {
    // If avl greater than zero.
    %cond = arith.cmpi sgt, %avl, %c0_i32 : i32
    // Pass avl, idx to the after region.
    scf.condition(%cond) %avl, %idx : i32, index
  } do {
  ^bb0(%avl : i32, %idx : index):
    // Perform the calculation according to the vl.
    %vl = rvv.setvl %avl, %sew, %lmul : i32
    %x_vector = vector_exp.predication %mask, %vl : vector<[16]xi1>, i32 {
      %ele = vector.load %input1[%idx] : memref<262145xi32>, vector<[16]xi32>
      vector.yield %ele : vector<[16]xi32>
    } : vector<[16]xi32>
    %y_vector = vector_exp.predication %mask, %vl : vector<[16]xi1>, i32 {
      %ele = vector.load %input2[%idx] : memref<262145xi32>, vector<[16]xi32>
      vector.yield %ele : vector<[16]xi32>
    } : vector<[16]xi32>
    %mul_vector = rvv.mul %x_vector, %a_element, %vl : vector<[16]xi32>, i32, i32
    %result_vector = rvv.add %mul_vector, %y_vector, %vl : vector<[16]xi32>, vector<[16]xi32>, i32
    vector_exp.predication %mask, %vl : vector<[16]xi1>, i32 {
      vector.store %result_vector, %output[%idx] : memref<262145xi32>, vector<[16]xi32>
      vector.yield
    } : () -> ()
    // Update idx and avl.
    %vl_ind = arith.index_cast %vl : i32 to index
    %new_idx = arith.addi %idx, %vl_ind : index
    %new_avl = arith.subi %avl, %vl : i32
    scf.yield %new_avl, %new_idx : i32, index
  }

  %result = vector.load %output[%c0] : memref<262145xi32>, vector<8xi32>

  %mask_res = arith.constant dense<1> : vector<8xi1>
  %c1_i32 = arith.constant 1 : i32
  %evl = arith.constant 8 : i32
  %res_reduce_add_mask_driven = "llvm.intr.vp.reduce.add" (%c1_i32, %result, %mask_res, %evl) :
        (i32, vector<8xi32>, vector<8xi1>, i32) -> i32

  return %res_reduce_add_mask_driven : i32
}
```
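For reference, here is a rough C equivalent of the kernel above at LMUL=8, written with standard RVV 1.0 intrinsics. This is only a sketch under the same SEW=32 setting, not the code actually benchmarked here (which goes through the buddy-mlir pipeline):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Sketch of the AXPY kernel (out = a * x + y) with strip-mining at
// SEW=32, LMUL=8; mirrors the rvv.setvl / load / mul / add / store
// structure of the MLIR above.
void axpy_i32_m8(const int32_t *x, const int32_t *y, int32_t *out,
                 int32_t a, size_t n) {
  for (size_t avl = n; avl > 0;) {
    size_t vl = __riscv_vsetvl_e32m8(avl);              // vl for this strip
    vint32m8_t vx = __riscv_vle32_v_i32m8(x, vl);       // load x strip
    vint32m8_t vy = __riscv_vle32_v_i32m8(y, vl);       // load y strip
    vint32m8_t vm = __riscv_vmul_vx_i32m8(vx, a, vl);   // a * x
    vint32m8_t vr = __riscv_vadd_vv_i32m8(vm, vy, vl);  // a * x + y
    __riscv_vse32_v_i32m8(out, vr, vl);                 // store result strip
    x += vl; y += vl; out += vl; avl -= vl;
  }
}
```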

@sequencer (Member) commented:
Should be fixed by #240

xlinsist commented Jun 24, 2023

After retesting, it is verified that:

  1. With different lmul settings, workloads that do more computation between loads and stores (e.g. Conv2d) show smaller performance gaps than a simple AXPY.
  2. After merging the PRs, different lmul settings still result in noticeable performance differences (over 10%), and the percentage does not seem better than before. Do you think this is normal?
| Conv2d with input size = 260 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles | 2,247,794 | 2,284,946 | 2,373,182 | 2,531,078 |

ratio of difference (more computation between load & store, Conv2d) = 12.6% < ratio of difference (simple AXPY) = 34.6%

| AXPY with input length = 262145 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles (before merging PRs) | 638,047 | 663,647 | 716,895 | 819,295 |
| total cycles (after merging PRs) | 645,228 | 675,948 | 737,388 | 868,460 |

ratio of difference (before) = (total_cycles(lmul=1) - total_cycles(lmul=8)) / total_cycles(lmul=8) * 100% = 28.4%
ratio of difference (after) = (868,460 - 645,228) / 645,228 * 100% = 34.6% > ratio of difference (before) = 28.4%

@sequencer (Member) commented:
Should be fixed by #245. Can you have another try? @xlinsist

xlinsist commented Jul 5, 2023

The retest results seem promising. The procedures now run faster than before, and the gap between the total cycles of different lmul settings has narrowed (except for lmul=1). As for lmul=1, do you have any insight into it? @sequencer

| Conv2d with input size = 260 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles (previous) | 2,247,794 | 2,284,946 | 2,373,182 | 2,531,078 |
| total cycles (current) | 2,215,286 | 2,212,964 | 2,208,320 | 2,496,248 |

ratio of difference (excluding lmul=1) = (total_cycles(lmul=8) - total_cycles(lmul=2)) / total_cycles(lmul=2) * 100% = 0.3%, which can be safely ignored.

ratio of difference = (total_cycles(lmul=1) - total_cycles(lmul=2)) / total_cycles(lmul=2) * 100% = 13.0%, which is relatively noticeable.

| AXPY with input length = 262145 | lmul=8 | lmul=4 | lmul=2 | lmul=1 |
| --- | --- | --- | --- | --- |
| total cycles (previous) | 645,228 | 675,948 | 737,388 | 868,460 |
| total cycles (current) | 654,437 | 645,221 | 684,133 | 811,109 |

ratio of difference (excluding lmul=1) = (total_cycles(lmul=2) - total_cycles(lmul=4)) / total_cycles(lmul=4) * 100% = 6.0%, which I assume is reasonable given the kernel's low computational demand.

@sequencer (Member) commented:
It is possibly caused by a VFU hazard; let's ask @SharzyL to make sure.
