[CK Tile] Generic attention masking support for FMHA fwd and bwd #1340

cameronshinn · 2024-06-14T07:37:57Z

These changes add support for generic attention masks to the FMHA operator in CK Tile. The motivation is to support a variety of masking strategies from existing research, beyond just the existing causal masking. The two that I aim to support are:

Big Bird

Longformer

With the existing SimplifiedGenericAttentionMask, masks can only be expressed as a diagonal window, which limits us to windowed attention or causal attention. Additionally, the FMHA fwd/bwd pipelines assume that the masked tiles in a tile row (tile column for backward) are contiguous. Neither restriction can accommodate the Big Bird and Longformer masks.

These changes instead let the main mask interface, GenericAttentionMask, accept a mask definition object, which is where the mask-specific details are contained. A different mask definition can be passed in for different kinds of masks. For example, DiagonalMask mimics the previous method of windowed masking. The required signature of a mask definition is laid out in the MaskDefABC struct.
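As a rough illustration of the mask definition idea (not the code in this PR), here is a minimal standalone sketch. The names DiagonalMaskSketch, GenericMaskSketch, make_causal, and is_kept are hypothetical, and the predicate-style interface is an assumption; the actual signature required by MaskDefABC may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mask definition: a predicate that says whether attention
// entry (row, col) is kept. It mimics a diagonal window; the real
// DiagonalMask in the PR may be shaped differently.
struct DiagonalMaskSketch
{
    std::int32_t left;  // columns kept to the left of the diagonal
    std::int32_t right; // columns kept to the right of the diagonal

    constexpr bool operator()(std::int32_t row, std::int32_t col) const
    {
        return col >= row - left && col <= row + right;
    }
};

// Hypothetical generic wrapper: it only handles bounds, and all
// mask-specific logic lives in the MaskDef type passed in.
template <typename MaskDef>
struct GenericMaskSketch
{
    MaskDef def;
    std::int32_t rows;
    std::int32_t cols;

    constexpr bool is_kept(std::int32_t row, std::int32_t col) const
    {
        return row >= 0 && row < rows && col >= 0 && col < cols && def(row, col);
    }
};

// Causal attention as a special case of the window: unlimited left
// context (left = rows covers every column) and no right context.
constexpr GenericMaskSketch<DiagonalMaskSketch> make_causal(std::int32_t rows, std::int32_t cols)
{
    return {DiagonalMaskSketch{rows, 0}, rows, cols};
}

inline void causal_example()
{
    const auto mask = make_causal(4, 4);
    assert(mask.is_kept(2, 1));  // below the diagonal: kept
    assert(!mask.is_kept(1, 2)); // above the diagonal: masked out
}
```

Swapping in a different mask then only means supplying a different definition type; the wrapper stays unchanged.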

Masks can also be defined at a tile granularity instead of a per-element granularity, signified with IsTileMask. This is helpful since Big Bird uses block sparsity. Tile sizes need to be members of the struct somehow, and I found it easier to make them template parameters (x_tile, y_tile).
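A tile-granularity definition could look roughly like the sketch below. Only IsTileMask, x_tile, and y_tile come from the description above; the struct name, the XTile/YTile members, the element_is_kept helper, and the choice that x_tile spans columns while y_tile spans rows are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tile-granularity mask definition. The predicate receives
// tile coordinates rather than element coordinates.
template <std::int32_t x_tile, std::int32_t y_tile>
struct BlockDiagonalTileMaskSketch
{
    static constexpr bool IsTileMask    = true;
    static constexpr std::int32_t XTile = x_tile; // assumed: tile width in columns
    static constexpr std::int32_t YTile = y_tile; // assumed: tile height in rows

    // Keep the tile on the block diagonal and its immediate neighbors.
    constexpr bool operator()(std::int32_t tile_row, std::int32_t tile_col) const
    {
        return tile_col >= tile_row - 1 && tile_col <= tile_row + 1;
    }
};

// Element-level query: map the element to its tile, then ask the tile predicate.
template <typename TileMask>
constexpr bool element_is_kept(const TileMask& mask, std::int32_t row, std::int32_t col)
{
    static_assert(TileMask::IsTileMask, "expects a tile-granularity mask");
    return mask(row / TileMask::YTile, col / TileMask::XTile);
}

inline void tile_mask_example()
{
    BlockDiagonalTileMaskSketch<64, 128> mask{};
    assert(element_is_kept(mask, 0, 127));  // column tile 1, next to the diagonal
    assert(!element_is_kept(mask, 0, 128)); // column tile 2, outside the band
}
```

Carrying the tile sizes as template parameters makes them compile-time constants, so the element-to-tile division can be resolved without storing any extra state in the mask object.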

The pipelines have been modified to use an IndexIterator to skip to the next non-zero tile, since non-zero tiles can now be non-contiguous. The index iterator steps through the tile mask indices in increasing order, checking each one until it finds a non-zero tile.
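The skipping behavior can be sketched as follows. This is a simplified host-side stand-in, not the IndexIterator from this PR: the class name, its methods, and the GlobalPlusDiagonalSketch pattern are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the index iterator: it walks the tile columns of
// one tile row and only ever stops on tiles that the predicate keeps.
template <typename TilePredicate>
class TileIndexIteratorSketch
{
public:
    TileIndexIteratorSketch(TilePredicate pred, std::int32_t tile_row, std::int32_t num_tile_cols)
        : pred_(pred), tile_row_(tile_row), end_(num_tile_cols), cur_(0)
    {
        skip_masked(); // position on the first kept tile, if there is one
    }

    bool at_end() const { return cur_ >= end_; }
    std::int32_t index() const { return cur_; }

    void advance()
    {
        ++cur_;
        skip_masked();
    }

private:
    // Linear scan: every index between two kept tiles is tested once.
    void skip_masked()
    {
        while(cur_ < end_ && !pred_(tile_row_, cur_))
            ++cur_;
    }

    TilePredicate pred_;
    std::int32_t tile_row_;
    std::int32_t end_;
    std::int32_t cur_;
};

// Example non-contiguous pattern: the first tile column plus the diagonal tile.
struct GlobalPlusDiagonalSketch
{
    bool operator()(std::int32_t tile_row, std::int32_t tile_col) const
    {
        return tile_col == 0 || tile_col == tile_row;
    }
};

// Collect the tile columns a pipeline loop would visit for one tile row.
inline std::vector<std::int32_t> visited_tiles(std::int32_t tile_row, std::int32_t num_tile_cols)
{
    std::vector<std::int32_t> out;
    TileIndexIteratorSketch<GlobalPlusDiagonalSketch> it(
        GlobalPlusDiagonalSketch{}, tile_row, num_tile_cols);
    for(; !it.at_end(); it.advance())
        out.push_back(it.index());
    assert(out.empty() || out.front() == 0); // this pattern always keeps column 0
    return out;
}
```

With this pattern, tile row 3 of a 6-column tile grid visits only tile columns 0 and 3; columns 1 and 2 are tested by the scan but never processed.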

From what I can tell, the tradeoffs are as follows:

😊 Construct a variety of mask types easily
😊 Scalable to arbitrary sequence lengths
😊 Mask is defined in instructions rather than sparse data structure arrays
☹️ Mask size is unknown without evaluating the predicate across the entire attention matrix index space
☹️ Next non-zero in a row can't be determined without checking every index in-between
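To make the mask-size drawback concrete, here is a small sketch under the same assumption of a predicate-style tile mask. The names count_kept_tiles and CausalTilesSketch are hypothetical. Because the mask is code rather than a stored index array, the only general way to learn how many tiles are non-zero is to evaluate the predicate over every tile.

```cpp
#include <cassert>
#include <cstdint>

// Counts kept tiles by brute force over the whole tile index space.
template <typename TilePredicate>
std::int64_t count_kept_tiles(TilePredicate pred,
                              std::int32_t num_tile_rows,
                              std::int32_t num_tile_cols)
{
    std::int64_t kept = 0;
    for(std::int32_t r = 0; r < num_tile_rows; ++r)
        for(std::int32_t c = 0; c < num_tile_cols; ++c)
            if(pred(r, c))
                ++kept;
    return kept;
}

// Hypothetical causal pattern at tile granularity.
struct CausalTilesSketch
{
    bool operator()(std::int32_t tile_row, std::int32_t tile_col) const
    {
        return tile_col <= tile_row;
    }
};

inline void count_example()
{
    // A 4x4 causal tile grid keeps the lower triangle: 1 + 2 + 3 + 4 tiles.
    assert(count_kept_tiles(CausalTilesSketch{}, 4, 4) == 10);
}
```

A specific mask definition could still supply a closed-form count, but a generic interface cannot rely on one.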

I am opening this as a draft PR to start discussion. I am still working on adding mask definitions for Big Bird and Longformer, and on gathering performance results to verify that there is no perf regression for the existing masking.

cameronshinn added 12 commits May 23, 2024 16:00

Add support for block sparse fmha pipeline

9b348a4

Pipeline tile loop works with non-contiguous mask

c5217a3

Remove separate file for sparse pipeline

546b29b

Update all pipelines non-contiguous mask support

50a1533

Attempt to use tile sizes as members

258c8ab

Compiles

3d59e7a

Merge branch 'develop' into ck_tile/bsp_fmha

be3d833

Fix early exit for empty tile row

2e56caa

Remove unused variable

58f36e8

Fix tile window stepping to with masking

c79987d

Fix variable names and args

0608a50

Small error fixes

1a83f20

cameronshinn requested review from zjing14, junliume, illsilin, carlushuang and aosewski as code owners June 14, 2024 07:37

cameronshinn marked this pull request as draft June 14, 2024 07:38

cameronshinn mentioned this pull request Jun 14, 2024

Initial implementation of block sparse FMHA #1213

Closed

cameronshinn added 2 commits July 21, 2024 17:52

Pipeline fixes to masking

6e4747d

Merge branch 'develop' into ck_tile/bsp_fmha

8ae6661

poyenc assigned cameronshinn Jul 22, 2024
