ObjectFifo Matmul + Elementwise #644

Open
jtuyls opened this issue Aug 5, 2024 · 2 comments

jtuyls commented Aug 5, 2024

The main failure in matmul + elementwise appears to be caused by too many connections being created. There are several ways to solve this:

  1. Packet routing, so that streams can be reused for multiple connections.
  2. Reprogramming DMAs and routes after the matmul data has been moved in, to create new routes for elementwise data movement.
  3. Combining connections whenever possible. This could mean either combining the A and B connections, so that the same streams are used for both, or combining the elementwise constant data connection with the A or B connection.

Since there is no e2e support yet for the more general approaches 1 and 2, the current plan is to implement approach 3. Approach 3 also has potential long-term benefits: depending on how it is implemented and which operations are targeted, it could give very good performance as well.

For approach 3, we need the following fixes and new transformations in the flow:

  1. Fix access ops not being placed at the right location (@jtuyls)
    Fixed by:
  2. Split temp memrefs in the distribute pass (one 4x4x4x4x4 memref -> 4 separate 4x4x4x4 memrefs) (@Abhishek-Varma)
    Fixed by:
  3. Linearize logical objectfifo memrefs (@yzhang93)
    Fixed by:
  4. Combine DMAs and insert additional reads so that both cores perform the same number of accesses (@jtuyls)
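
Item 4 can be pictured with a small simulation. The sketch below is only an illustration in Python, not code from the compiler: the names `build_read_schedule`, `stream`, and `needs`, and the object labels `B`, `C0`, and `C1`, are all hypothetical. It assumes that every core attached to a shared connection receives every object sent over it, so each core has to issue one read per object and ignore the objects it does not need.

```python
def build_read_schedule(stream, needs):
    """Return, per core, one (object, is_used) entry for every object in the stream.

    stream: ordered object names delivered over the shared connection.
    needs:  core name -> set of object names that core actually consumes.
    """
    return {
        core: [(obj, obj in wanted) for obj in stream]
        for core, wanted in needs.items()
    }


# Hypothetical labels: "B" is matmul data used by both cores, while
# "C0" and "C1" are elementwise tiles meant for core_0 and core_1.
stream = ["B", "C0", "C1"]
needs = {"core_0": {"B", "C0"}, "core_1": {"B", "C1"}}

schedule = build_read_schedule(stream, needs)
for core, reads in schedule.items():
    print(core, reads)
```

Both cores end up with three reads each; the entries flagged `False` correspond to reads that are issued only to stay in step with the shared stream.
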
yzhang93 added a commit that referenced this issue Aug 6, 2024
Abhishek-Varma added a commit that referenced this issue Aug 7, 2024

-- In Matmul+Elemwise we get to see the intermediate L1 buffers for
matmuls :-
```
        alloc -> subview -> access (within amdaie.core)
```
-- We should replace the subview with a narrowed alloc itself for this
case as well.
-- This commit therefore addresses that as part of
`--iree-amdaie-distribute-cores-and-objectfifos` pass.
-- Addresses sub-action item `2` from
#644

Signed-off-by: Abhishek Varma <[email protected]>
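
The effect of replacing a subview of one large temporary with its own narrowed allocation (sub-action item 2) can be sketched in plain Python. This is an illustration only: it assumes a contiguous row-major layout and a split along the outermost dimension, neither of which is stated in this thread, and `linear_offset` and the other names are hypothetical helpers.

```python
from itertools import product
from math import prod


def linear_offset(index, shape):
    """Row-major linear offset of `index` within a buffer of the given shape."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


full_shape = (4, 4, 4, 4, 4)  # the single temporary buffer
sub_shape = full_shape[1:]    # each narrowed buffer: 4x4x4x4
sub_size = prod(sub_shape)    # elements per narrowed buffer

# For every element, compare its offset in the large buffer with its offset
# inside narrowed buffer number index[0], shifted by that buffer's start.
rebased_ok = all(
    linear_offset(index, full_shape)
    == index[0] * sub_size + linear_offset(index[1:], sub_shape)
    for index in product(*(range(n) for n in full_shape))
)
print(sub_size, rebased_ok)
```

Under these assumptions each of the four narrowed buffers holds one contiguous 256-element slice of the original temporary, so accesses only need their leading index dropped.
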
Abhishek-Varma added a commit that referenced this issue Aug 8, 2024
-- This commit introduces a new pass `--iree-amdaie-split-buffers`
   to split L2 buffers for dealing with Matmul+Elementwise.
-- It addresses sub-action 2 as well from #644

Signed-off-by: Abhishek Varma <[email protected]>
Abhishek-Varma added a commit that referenced this issue Aug 9, 2024

jtuyls commented Aug 12, 2024

@Abhishek-Varma

For item 4, see the snippet below for a sample input and expected output.

NOTE: The circular DMA objectfifos need to be decoupled from the actual underlying memref argument, so that a single circular DMA can be reused for multiple different memref arguments (see ARG_NEW in the expected output). I will look into how to update the ops to accomplish this.

Input:

%tile_0_2 = amdaie.tile(%c0, %c2)
%tile_1_2 = amdaie.tile(%c1, %c2)
%tile_0_3 = amdaie.tile(%c0, %c3)
%tile_1_3 = amdaie.tile(%c1, %c3)

%0 = amdaie.circular_dma_cpy_nd(%arg1[] [] [], %arg0[] [] []) : (!amdaie.logicalobjectfifo<memref<2x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<2x16xi32>>)
%1 = amdaie.circular_dma_cpy_nd(%arg2[] [] [], %arg1[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)
%2 = amdaie.circular_dma_cpy_nd(%arg3[] [] [], %arg1[16] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)

%3 = amdaie.circular_dma_cpy_nd(%arg5[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%4 = amdaie.circular_dma_cpy_nd(%arg6[] [] [], %arg5[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)

%5 = amdaie.circular_dma_cpy_nd(%arg7[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%6 = amdaie.circular_dma_cpy_nd(%arg8[] [] [], %arg7[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)

%7 = amdaie.circular_dma_cpy_nd(%arg9[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%8 = amdaie.circular_dma_cpy_nd(%arg10[] [] [], %arg9[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)

%9 = amdaie.circular_dma_cpy_nd(%arg11[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%10 = amdaie.circular_dma_cpy_nd(%arg12[] [] [], %arg11[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)


%core_0_2 = amdaie.core(%tile_0_2, in : [%1, %4], out : []) {
    %access_0 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_0: memref<1x16xi32, 2>)
    %access_1 = amdaie.logicalobjectfifo.access(%arg6, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_1: memref<1x16xi32, 2>)
    amdaie.end
}
%core_1_2 = amdaie.core(%tile_1_2, in : [%1, %6], out : []) {
    %access_2 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_2: memref<1x16xi32, 2>)
    %access_3 = amdaie.logicalobjectfifo.access(%arg8, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_3: memref<1x16xi32, 2>)
    amdaie.end
}
%core_0_3 = amdaie.core(%tile_0_3, in : [%2, %8], out : []) {
    %access_4 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_4: memref<1x16xi32, 2>)
    %access_5 = amdaie.logicalobjectfifo.access(%arg10, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_5: memref<1x16xi32, 2>)
    amdaie.end
}
%core_1_3 = amdaie.core(%tile_1_3, in : [%2, %10], out : []) {
    %access_6 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_6: memref<1x16xi32, 2>)
    %access_7 = amdaie.logicalobjectfifo.access(%arg12, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_7: memref<1x16xi32, 2>)
    amdaie.end
}

amdaie.controlcode {
    %npu0 = amdaie.npu.dma_cpy_nd %0([] [] [], [0] [32] [1])
    %npu1 = amdaie.npu.dma_cpy_nd %3([] [] [], [0] [16] [1])
    %npu2 = amdaie.npu.dma_cpy_nd %5([] [] [], [16] [16] [1])
    %npu3 = amdaie.npu.dma_cpy_nd %7([] [] [], [32] [16] [1])
    %npu4 = amdaie.npu.dma_cpy_nd %9([] [] [], [48] [16] [1])
}

Expected output:

%tile_0_2 = amdaie.tile(%c0, %c2)
%tile_1_2 = amdaie.tile(%c1, %c2)
%tile_0_3 = amdaie.tile(%c0, %c3)
%tile_1_3 = amdaie.tile(%c1, %c3)

%0 = amdaie.circular_dma_cpy_nd(%arg1[] [] [], %ARG_NEW[] [] []) : (!amdaie.logicalobjectfifo<memref<2x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<2x16xi32>>)
%1 = amdaie.circular_dma_cpy_nd(%arg2[] [] [], %arg1[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)
%2 = amdaie.circular_dma_cpy_nd(%arg3[] [] [], %arg1[16] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)

%core_0_2 = amdaie.core(%tile_0_2, in : [%1], out : []) {
    %access_0 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_0: memref<1x16xi32, 2>)
    %access_1 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_1: memref<1x16xi32, 2>)
    // Read access objectFifo again, but don't use it as data is not needed in this core (but in core_1_2).
    %access_2 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    amdaie.end
}
%core_1_2 = amdaie.core(%tile_1_2, in : [%1], out : []) {
    %access_3 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_3: memref<1x16xi32, 2>)
    // Read access objectFifo, but don't use it as data is not needed in this core (but in core_0_2).
    %access_4 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    // Read access objectFifo again, but now use the data.
    %access_5 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_5: memref<1x16xi32, 2>)
    amdaie.end
}
%core_0_3 = amdaie.core(%tile_0_3, in : [%2], out : []) {
    %access_6 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_6: memref<1x16xi32, 2>)
    %access_7 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_7: memref<1x16xi32, 2>)
    // Read access objectFifo again, but don't use it as data is not needed in this core (but in core_1_3).
    %access_8 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    amdaie.end
}
%core_1_3 = amdaie.core(%tile_1_3, in : [%2], out : []) {
    %access_9 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_9: memref<1x16xi32, 2>)
    // Read access objectFifo, but don't use it as data is not needed in this core (but in core_0_3).
    %access_10 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    // Read access objectFifo again, but now use the data.
    %access_11 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_11: memref<1x16xi32, 2>)
    amdaie.end
}

amdaie.controlcode {
    %npu0 = amdaie.npu.dma_cpy_nd %0([] [] [], [0] [32] [1])
    %npu1 = amdaie.npu.dma_cpy_nd %0([] [] [], [0, 0] [2, 16] [32, 1])
    %npu2 = amdaie.npu.dma_cpy_nd %0([] [] [], [0, 16] [2, 16] [32, 1])
}
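
As a sanity check on the control code above, the access patterns can be expanded into linear element indices. The Python sketch below is an illustration only: `access_indices` is a hypothetical helper, and it assumes the usual offsets/sizes/strides convention in which the linear index is the sum of (offset + position) * stride over all dimensions. Under that assumption, the two strided transfers in the expected output touch the same elements as the four 16-element transfers in the input.

```python
from itertools import product


def access_indices(offsets, sizes, strides):
    """Linear element indices touched by an offsets/sizes/strides pattern."""
    return [
        sum((o + i) * s for o, i, s in zip(offsets, idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]


# Input control code: four separate 16-element transfers (%npu1 .. %npu4).
separate = [access_indices([off], [16], [1]) for off in (0, 16, 32, 48)]

# Expected output: two strided 2x16 transfers on the shared DMA (%npu1, %npu2).
combined_a = access_indices([0, 0], [2, 16], [32, 1])
combined_b = access_indices([0, 16], [2, 16], [32, 1])

# Each combined transfer carries one 16-element row per L1 destination.
print(combined_a[:16] == separate[0], combined_a[16:] == separate[2])
print(combined_b[:16] == separate[1], combined_b[16:] == separate[3])
```

Both lines print `True True`: the first row of each combined transfer matches the data originally sent to a core on tile row 2, and the second row matches the data originally sent to a core on tile row 3.
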

Abhishek-Varma added a commit that referenced this issue Aug 12, 2024
Abhishek-Varma added a commit that referenced this issue Aug 14, 2024
Abhishek-Varma added a commit that referenced this issue Aug 29, 2024

jtuyls commented Sep 2, 2024

@Abhishek-Varma Here is an example of the input and expected output for item 4 above, showing how the C DMAs are combined with the B ones and how additional read accesses are inserted to accommodate the broadcast data.

Input:

%15 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%16 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%17 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%18 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%19 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%20 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%21 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%22 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%23 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%24 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%43 = amdaie.dma_cpy_nd(%6[0, 0, 0, 0] [2, 1, 32, 32] [1024, 1024, 32, 1], %8[0, 0, %24, 224] [2, 1, 32, 32] [8192, 32, 256, 1]) : (!amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x256xi32>>)
%44 = amdaie.dma_cpy_nd(%5[0, 0, 0, 0] [1, 2, 32, 32] [2048, 1024, 32, 1], %10[0, 0, 224, %19] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<256x128xi32>>)
%45 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %20, %15] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%46 = amdaie.dma_cpy_nd(%2[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %21, %16] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%47 = amdaie.dma_cpy_nd(%3[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %22, %17] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%48 = amdaie.dma_cpy_nd(%4[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %23, %18] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%57 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%58 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%59 = amdaie.dma_cpy_nd(%30[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
%60 = amdaie.dma_cpy_nd(%52[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %1[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%61 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %56[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%62 = amdaie.core(%tile_15, in : [%59, %57, %60], out : [%61]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%34, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%52, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%56, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%63 = amdaie.dma_cpy_nd(%50[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %2[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%64 = amdaie.dma_cpy_nd(%0[0, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %54[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%65 = amdaie.core(%tile_13, in : [%59, %58, %63], out : [%64]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%36, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%50, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%54, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%66 = amdaie.dma_cpy_nd(%29[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[1, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
%67 = amdaie.dma_cpy_nd(%51[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %3[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%68 = amdaie.dma_cpy_nd(%0[1, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %55[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%69 = amdaie.core(%tile_14, in : [%66, %57, %67], out : [%68]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%39, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%51, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%55, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%70 = amdaie.dma_cpy_nd(%49[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %4[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%71 = amdaie.dma_cpy_nd(%0[1, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %53[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%72 = amdaie.core(%tile_12, in : [%66, %58, %70], out : [%71]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%41, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%49, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%53, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%73 = amdaie.dma_cpy_nd(%14[%24, %19] [64, 64] [128, 1], %0[0, 0, 0, 0] [2, 32, 2, 32] [2048, 32, 1024, 1]) : (!amdaie.logicalobjectfifo<memref<128x128xi32>>, !amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>)

Expected output:

%15 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%16 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%17 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%18 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%19 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%20 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%21 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%22 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%23 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%24 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%43 = amdaie.dma_cpy_nd(%6[0, 0, 0, 0] [2, 1, 32, 32] [1024, 1024, 32, 1], %8[0, 0, %24, 224] [2, 1, 32, 32] [8192, 32, 256, 1]) : (!amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x256xi32>>)
%44 = amdaie.dma_cpy_nd(%5[0, 0, 0, 0] [1, 2, 32, 32] [2048, 1024, 32, 1], %10[0, 0, 224, %19] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<256x128xi32>>)
// [OLD] %45 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %20, %15] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
// [OLD] %46 = amdaie.dma_cpy_nd(%2[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %21, %16] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
// [OLD] %47 = amdaie.dma_cpy_nd(%3[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %22, %17] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
// [OLD] %48 = amdaie.dma_cpy_nd(%4[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %23, %18] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%45_46 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %20, 0] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%47_48 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %22, 0] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%57 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%58 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%59 = amdaie.dma_cpy_nd(%30[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
// [OLD] %60 = amdaie.dma_cpy_nd(%52[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %1[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%60 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%61 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %56[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(0, 2)
%62 = amdaie.core(%tile_15, in : [%59, %57, %60], out : [%61]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%34, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Operate on the first read from `%28` (broadcasted to this core and core(0, 3))
  %77 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%56, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  // Perform another read of `%28` because the data is broadcasted and core(0, 3) will operate on it
  %77_new = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  amdaie.end
}
// [OLD] %63 = amdaie.dma_cpy_nd(%50[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %2[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%63 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%64 = amdaie.dma_cpy_nd(%0[0, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %54[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(1, 2)
%65 = amdaie.core(%tile_13, in : [%59, %58, %63], out : [%64]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%36, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Operate on the first read from `%27` (broadcasted to this core and core(1, 3))
  %77 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%54, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  // Perform another read of `%27` because the data is broadcasted and core(1, 3) will operate on it
  %77_new = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  amdaie.end
}
%66 = amdaie.dma_cpy_nd(%29[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[1, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
// [OLD] %67 = amdaie.dma_cpy_nd(%51[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %3[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%67 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%68 = amdaie.dma_cpy_nd(%0[1, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %55[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(0, 3)
%69 = amdaie.core(%tile_14, in : [%66, %57, %67], out : [%68]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%39, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Perform a first read of `%28` because the data is broadcasted and core(0, 2) will operate on it
  %77 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  // Perform another read from `%28` because this core will operate on the second read
  %77_new = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%55, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77_new : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
// [OLD] %70 = amdaie.dma_cpy_nd(%49[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %4[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%70 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%71 = amdaie.dma_cpy_nd(%0[1, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %53[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(1, 3)
%72 = amdaie.core(%tile_12, in : [%66, %58, %70], out : [%71]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%41, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Perform a first read of `%27` because the data is broadcasted and core(1, 2) will operate on it
  %77 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  // Perform another read from `%27` because this core will operate on the second read
  %77_new = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%53, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77_new : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%73 = amdaie.dma_cpy_nd(%14[%24, %19] [64, 64] [128, 1], %0[0, 0, 0, 0] [2, 32, 2, 32] [2048, 32, 1024, 1]) : (!amdaie.logicalobjectfifo<memref<128x128xi32>>, !amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>)
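
The read pattern in the cores above can be illustrated with a small toy model. This is only a sketch under assumptions, not AMDAIE code: `BroadcastFifo`, `run_core` and the tile names are hypothetical, and the tile order in the stream (matmul operand first, then one elementwise tile per core) is inferred from the comments in the IR. It shows why each core that shares a broadcast objectFifo performs the same number of `Read` accesses and simply ignores the reads that belong to the other core.

```python
from dataclasses import dataclass, field


@dataclass
class BroadcastFifo:
    """Toy model of one producer stream that is broadcast to several cores."""

    tiles: list
    cursors: dict = field(default_factory=dict)

    def read(self, core):
        """Model one `access(..., Read)`: hand `core` its next tile."""
        index = self.cursors.get(core, 0)
        self.cursors[core] = index + 1
        return self.tiles[index]


def run_core(fifo, core, keep):
    """Read every tile in the stream, but only operate on indices in `keep`."""
    kept = []
    for index in range(len(fifo.tiles)):
        tile = fifo.read(core)  # every core consumes every broadcast tile
        if index in keep:
            kept.append(tile)
    return kept


# Assumed stream order: matmul operand, then the elementwise tile per core.
fifo = BroadcastFifo(["A", "C_for_core_0_2", "C_for_core_0_3"])
core_0_2 = run_core(fifo, "core_0_2", keep={0, 1})  # uses the first extra read
core_0_3 = run_core(fifo, "core_0_3", keep={0, 2})  # uses the second extra read
print(core_0_2, core_0_3, fifo.cursors)
```

Both cores end with the same read count, which is the property point 4 of the plan asks for ("insert additional reads after same number of accesses in both cores").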

Abhishek-Varma added a commit that referenced this issue Sep 2, 2024
-- This commit introduces a new pass `--iree-amdaie-split-buffers`
   to split L2 buffers for dealing with Matmul+Elementwise.
-- It addresses sub-action 2 as well from #644

Signed-off-by: Abhishek Varma <[email protected]>
Abhishek-Varma added a commit that referenced this issue Sep 2, 2024
-- This commit introduces a new pass
`--iree-amdaie-split-logical-objectfifos-for-connection-reuse` to
split logical objectFifos for dealing with Matmul+Elementwise.
-- Also contains a utility to check whether splitting can be performed.
-- It addresses sub-action 2 as well from
#644

Signed-off-by: Abhishek Varma <[email protected]>
Abhishek-Varma added a commit that referenced this issue Sep 10, 2024
-- This commit adds a new pass
   `--iree-amdaie-logical-objectfifos-for-connection-reuse`.
-- Essentially follows the narrative after splitting of logical objectFifos
   and is aimed to address point 4 of #644.

Signed-off-by: Abhishek Varma <[email protected]>
Abhishek-Varma added a commit that referenced this issue Sep 11, 2024
-- This commit adds a new pass
   `--iree-amdaie-logical-objectfifos-for-connection-reuse`.
-- Essentially follows the narrative after splitting of logical objectFifos
   and is aimed to address point 4 of #644.

Signed-off-by: Abhishek Varma <[email protected]>