doc: graph: document for complex fusions #2278

Open
wants to merge 9 commits into main
107 changes: 107 additions & 0 deletions doc/graph/fusion_patterns/fusion_patterns.md
@@ -0,0 +1,107 @@
Fusion Patterns {#dev_guide_graph_fusion_patterns}
==================================================

## Overview

The following fusion patterns are subgraphs that the oneDNN Graph API recognizes
as candidates for fusion. The patterns are described using the oneDNN Graph
operation (op) names with the following convention.

@note oneDNN Graph performs limited input validation to minimize performance
overhead. The application is responsible for sanitizing inputs passed to the
library. Because large `u8` or `s8` inputs may lead to accumulator overflow, you can use
floating-point patterns instead of quantized patterns.

`"+"` describes a chain of two ops. The preceding op produces an output tensor,
which is consumed by the following op as its first operand.

`"[]"` describes a component of the overall pattern description. For example,
it can enclose a subgraph or a set of op choices within the brackets.

`"|"` describes a choice among multiple operations; for example, A+[B|C] means
the graph partition contains A followed by either B or C.

`","` describes a graph composed of multiple subgraphs; each subgraph explicitly
marks its output tensor, which is consumed by other subgraphs.

`Superscript` denotes the number of repetitions of a pattern. For example,
A+[B|C]\f$^{3}\f$ means the graph partition contains A followed by three ops,
each of which is either B or C. The superscript can also be a range of numbers,
allowing a range of repetitions. If the range is between 0 and 1, the
superscript `"?"` is used.

`Subscript` denotes the input and output tensors that need an explicitly marked
producer-consumer relation within one graph partition. For example,
A\f$_{>t1}\f$+B+C\f$_{<t1}\f$ refers to a pattern that starts with A, followed
by B and C, where C takes an implicit input tensor from B and an extra tensor
t1 output by A. `">"` refers to an output tensor, and `"<"` to an input tensor.
Input and output tensors between neighboring
ops are not explicitly marked; for example, B consumes t1 implicitly in the
example above.

Subscript `"out"` marks the output tensor of a certain op as an output of
the graph partition. For example, in
A\f$_{>t1}\f$+B\f$_{>out}\f$+C\f$_{<t1,>out}\f$, B's output and C's output
are marked as output tensors.

Subscript `"in"` marks the input tensor of a certain op as an input of the
graph partition. For example, in A\f$_{<in1}\f$+B\f$_{<in1}\f$, A's input and
B's second input are graph partition inputs, and they share the same input
tensor in1. Most input tensors of a graph partition are not explicitly marked.
For example, the input tensors of the first op are implicitly regarded as graph
partition inputs. In addition, input tensors of other ops that are not produced
by any preceding ops are regarded as implicit graph partition inputs. In the
example A\f$_{>t1}\f$+B+C\f$_{<t1}\f$, A's inputs are regarded as implicit graph
partition inputs, and if B is a binary operation, its second input tensor is an
implicit graph partition input.

The following categories are used when describing a fusion pattern.

Unary = [Abs | Clamp | Elu | Exp | GELU | HardSwish | LeakyReLU |
Log | Sigmoid | SoftPlus | Pow | ReLU | Round | Sqrt | Square | Tanh]

Binary = [Add | Divide | Maximum | Minimum | Multiply | Subtract]

Reduction = [ReduceL1 | ReduceL2 | ReduceMax | ReduceMean | ReduceMin |
ReduceProd | ReduceSum]

### Inference

#### Floating Point Patterns

| Pattern | Description |
|:--------|:-----------------------------|
| Scaled Dot-Product Attention | Refer to @ref dev_guide_graph_sdpa for more details. |
| Grouped Query Attention | Refer to @ref dev_guide_graph_gqa for more details. |
| Gated Multi-Layer Perceptron (Gated-MLP) | Refer to @ref dev_guide_graph_gated_mlp for more details. |
| Convolution + BiasAdd\f$^?\f$ + BatchNormInference\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks, for example ResNet, ResNext, SSD, etc. |
| ConvTranspose + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks. |
| Interpolate + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for image processing. |
| MatMul + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$ + Select\f$^?\f$\f$_{>out}\f$ | This pattern is widely used in language models and recommendation models, for example BERT, DLRM, etc. |
| Reduction + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for data processing, for example loss reduction. |
| Unary + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks. |
| Binary + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks, for example ParallelWaveGAN. |
| [AvgPool \| MaxPool] + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks. |
| BatchNormInference + ReLU\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks, for example DenseNet. |
| Reciprocal + Multiply\f$_{>out}\f$ | N/A |
| Reorder + Add\f$_{>out}\f$ | N/A |
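
To show how one row of this table maps onto the API, the following minimal C++
sketch (not taken from the library's examples; tensor ids, shapes, and op names
are illustrative assumptions) builds a MatMul + BiasAdd + ReLU chain matching
the MatMul row above and asks the library to partition it. The three ops are
expected to be returned as a single fused partition.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    // MatMul: (64, 256) x (256, 512) -> (64, 512). Shapes are arbitrary.
    logical_tensor src    {0, dt::f32, {64, 256},  lt::strided};
    logical_tensor wei    {1, dt::f32, {256, 512}, lt::strided};
    logical_tensor mm_out {2, dt::f32, {64, 512},  lt::strided};
    op matmul {0, op::kind::MatMul, {src, wei}, {mm_out}, "matmul"};

    // BiasAdd^? : the optional bias addition.
    logical_tensor bias     {3, dt::f32, {512},     lt::strided};
    logical_tensor bias_out {4, dt::f32, {64, 512}, lt::strided};
    op bias_add {1, op::kind::BiasAdd, {mm_out, bias}, {bias_out}, "bias_add"};

    // One Unary op from the [Unary | Binary]^{0-3} group.
    logical_tensor relu_out {5, dt::f32, {64, 512}, lt::strided};
    op relu {2, op::kind::ReLU, {bias_out}, {relu_out}, "relu"};

    // Add the ops to a graph and let the library group them into partitions.
    graph g {dnnl::engine::kind::cpu};
    g.add_op(matmul);
    g.add_op(bias_add);
    g.add_op(relu);
    g.finalize();

    // Ops matching a supported fusion pattern are returned as one partition.
    std::vector<partition> parts = g.get_partitions();
    return parts.empty() ? 1 : 0;
}
```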

#### Quantized Patterns

| Pattern | Description |
|:--------|:-----------------------------|
| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + Convolution\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | N/A |
| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + ConvTranspose\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ |N/A |
| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + MatMul\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Select\f$^?\f$ + Quantize\f$^?\f$\f$_{>out}\f$ |N/A |
| Dequantize + [AvgPool \| MaxPool] + Quantize\f$_{>out}\f$ |N/A |
| Dequantize\f$_{>t1}\f$, Dequantize + [AvgPool \| MaxPool] + Add\f$_{<t1}\f$ + Quantize\f$_{>out}\f$ |N/A |
| Dequantize + Reorder + Quantize\f$_{>out}\f$ |N/A |
| Dequantize\f$_{>t1}\f$, Dequantize + Reorder + Add\f$_{<t1}\f$ + Quantize\f$_{>out}\f$ |N/A |
| [SoftMax \| LayerNorm \| GroupNorm] + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | This pattern is used in SmoothQuant to fuse scales and quantization into previous layers. |
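
As a hedged sketch of the core of the quantized MatMul row above (two
Dequantize ops feeding a MatMul, followed by a Quantize), the snippet below
builds the int8 chain. The `scales` and `zps` attributes follow the
Quantize/Dequantize operation documentation; all ids, shapes, and values are
illustrative assumptions. Partitioning then proceeds exactly as in the previous
sketch.

```cpp
#include <cstdint>
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

// Dequantize_{>t1}, Dequantize + MatMul_{<t1} + Quantize_{>out}
graph make_int8_matmul_graph() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    logical_tensor src_u8  {0, dt::u8,  {64, 256},  lt::strided};
    logical_tensor src_f32 {1, dt::f32, {64, 256},  lt::strided};
    logical_tensor wei_s8  {2, dt::s8,  {256, 512}, lt::strided};
    logical_tensor wei_f32 {3, dt::f32, {256, 512}, lt::strided};  // "t1"
    logical_tensor mm_f32  {4, dt::f32, {64, 512},  lt::strided};
    logical_tensor dst_u8  {5, dt::u8,  {64, 512},  lt::strided};

    // Dequantize the activation chain and the weights ("t1").
    op deq_src {0, op::kind::Dequantize, {src_u8}, {src_f32}, "deq_src"};
    deq_src.set_attr<std::vector<float>>(op::attr::scales, {0.05f});
    deq_src.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    op deq_wei {1, op::kind::Dequantize, {wei_s8}, {wei_f32}, "deq_wei"};
    deq_wei.set_attr<std::vector<float>>(op::attr::scales, {0.01f});
    deq_wei.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    // MatMul consumes the dequantized activation and the extra tensor "t1".
    op matmul {2, op::kind::MatMul, {src_f32, wei_f32}, {mm_f32}, "matmul"};

    // Re-quantize the result as the partition output.
    op quant {3, op::kind::Quantize, {mm_f32}, {dst_u8}, "quant_out"};
    quant.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
    quant.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    graph g {dnnl::engine::kind::cpu};
    for (const op &o : {deq_src, deq_wei, matmul, quant}) g.add_op(o);
    g.finalize();
    return g;
}
```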

### Training

| Pattern | Description |
|:--------|:-----------------------------|
| ConvolutionBackwardWeights + BiasAddBackward\f$_{>out}\f$ | N/A |
| ReLUBackward + BatchNormTrainingBackward\f$_{>out}\f$ |N/A |
123 changes: 123 additions & 0 deletions doc/graph/fusion_patterns/gated_mlp.md
@@ -0,0 +1,123 @@
Gated Multi-Layer Perceptron (Gated-MLP) {#dev_guide_graph_gated_mlp}
=====================================================================

## Overview

Gated Multi-Layer Perceptron (Gated-MLP) is a variant of MLP that is widely
used as the Feed-Forward Network (FFN) in many Transformer-based Large Language
Models (LLMs).

Typically, the FFN in the Transformer architecture [1] is defined as a two-layer
MLP with a ReLU activation in between, which can be replaced with other activations.

\f[

FFN(src,W,V) = ReLU(src \cdot W) \cdot V

\f]

A Gated Linear Unit (GLU) is adopted to replace the first linear layer to
improve the quality of Transformer-based models [2]:

\f[

GLU(src,W_1,W_2) = (src \cdot W_1) \otimes Sigmoid(src \cdot W_2) \\

FFN(src,W_1,W_2,V) = GLU(src,W_1,W_2) \cdot V

\f]

where \f$ src \cdot W_1 \f$ is usually called "FC (fully-connected) up",
\f$ src \cdot W_2 \f$ is called "FC gate", and the last linear layer is called
"FC down".

The Swish activation is further adopted to replace Sigmoid in the GLU, forming
swiGLU.

\f[

Swish(x) = x \otimes Sigmoid(x) \\

swiGLU(src,W_1,W_2) = (src \cdot W_1) \otimes Swish(src \cdot W_2) \\

FFN(src,W_1,W_2,V) = swiGLU(src,W_1,W_2) \cdot V

\f]

The Gated-MLP based on swiGLU is also adopted in LLMs like LLaMA [3], Qwen [4],
etc.

## Gated-MLP patterns

oneDNN supports Gated-MLP and its optimization through Graph API [5] by defining
the graph, getting partitions from the graph, and optimizing the kernels
underneath. In general, a Gated-MLP pattern is defined as a directed acyclic
graph (DAG) using the oneDNN Graph API.

### Floating-point Gated-MLP

oneDNN defines floating-point (f32, bf16, and f16) Gated-MLP as follows. The blue
nodes are required when defining a Gated-MLP pattern while the brown nodes are
optional.

![Gated-MLP pattern](images/fp-gated-mlp.png)

1. The first MatMul on the top left calculates "FC up": \f$ src \cdot W_1 \f$.
See [MatMul](@ref dev_guide_op_matmul) operation in Graph API.
2. The second MatMul on the top right calculates "FC gate": \f$ src \cdot W_2 \f$.
3. The Activation node is optional. If required, it can be constructed with the
activation operations in Graph API, for example, [ReLU](@ref dev_guide_op_relu),
[GELU](@ref dev_guide_op_gelu), [Sigmoid](@ref dev_guide_op_sigmoid), and so on.
For the Swish activation, the node can be constructed with the [Sigmoid](@ref dev_guide_op_sigmoid)
and [Multiply](@ref dev_guide_op_multiply) operations as shown below. You can also refer to the
[Gated-MLP example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp.cpp)
for the Swish definition.

![Swish Activation](images/gated-mlp-swish.png)

4. The last MatMul on the bottom performs the "FC down" operation between the
GLU output and \f$V\f$.
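
Putting steps 1-4 together, the following minimal C++ sketch defines a
swiGLU-based Gated-MLP graph and retrieves its partitions. All tensor ids,
shapes, and op names here are illustrative assumptions; refer to the Gated-MLP
example referenced in the Examples section below for the complete code.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    const int64_t M = 32, K = 4096, N = 11008; // illustrative sizes

    // Inputs: activations and the three weight tensors.
    logical_tensor src    {0, dt::f32, {M, K}, lt::strided};
    logical_tensor w_up   {1, dt::f32, {K, N}, lt::strided};
    logical_tensor w_gate {2, dt::f32, {K, N}, lt::strided};
    logical_tensor w_down {3, dt::f32, {N, K}, lt::strided};

    // Intermediate and output tensors.
    logical_tensor fc_up   {4, dt::f32, {M, N}, lt::strided};
    logical_tensor fc_gate {5, dt::f32, {M, N}, lt::strided};
    logical_tensor sig_out {6, dt::f32, {M, N}, lt::strided};
    logical_tensor swish   {7, dt::f32, {M, N}, lt::strided};
    logical_tensor gated   {8, dt::f32, {M, N}, lt::strided};
    logical_tensor dst     {9, dt::f32, {M, K}, lt::strided};

    // Steps 1-2: "FC up" and "FC gate".
    op mm_up   {0, op::kind::MatMul, {src, w_up},   {fc_up},   "fc_up"};
    op mm_gate {1, op::kind::MatMul, {src, w_gate}, {fc_gate}, "fc_gate"};

    // Step 3: Swish(x) = x * Sigmoid(x) on the gate branch, then the GLU
    // gating multiply with the "FC up" branch.
    op sigmoid {2, op::kind::Sigmoid,  {fc_gate},          {sig_out}, "sigmoid"};
    op mul_sw  {3, op::kind::Multiply, {fc_gate, sig_out}, {swish},   "swish_mul"};
    op mul_glu {4, op::kind::Multiply, {fc_up, swish},     {gated},   "glu_mul"};

    // Step 4: "FC down".
    op mm_down {5, op::kind::MatMul, {gated, w_down}, {dst}, "fc_down"};

    graph g {dnnl::engine::kind::cpu};
    for (const op &o : {mm_up, mm_gate, sigmoid, mul_sw, mul_glu, mm_down})
        g.add_op(o);
    g.finalize();

    // The Gated-MLP ops are expected to be grouped into a single partition,
    // which can then be compiled and executed as in the linked example.
    std::vector<partition> parts = g.get_partitions();
    return parts.empty() ? 1 : 0;
}
```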

## Data Types

oneDNN supports the floating-point Gated-MLP pattern with data types f32, bf16,
and f16. You can specify the data type via the input and output data type fields
of logical tensors for each operation. oneDNN does not support mixing different
floating-point data types in a floating-point Gated-MLP pattern.

The definition of the data types and support status on different CPU and GPU
platforms follow the general description in @ref dev_guide_data_types.

## Implementation Limitations

1. oneDNN primitive-based Gated-MLP is implemented as the reference
implementation on both Intel Architecture Processors and Intel Graphics
Products. In this case, floating-point Gated-MLP patterns are usually
implemented with three f32, bf16, or f16 matmul (with binary or eltwise
post-ops) primitives.
2. The Gated-MLP patterns functionally support all input shapes meeting the
shape requirements of each operation in the graph. For example, the `MatMul`
operation requires shape consistency for the `k` dimension. The `Multiply`
operation requires the input tensors to have the same shape, or shapes that can
be properly broadcast based on the operation attribute.

## Examples

oneDNN provides a [Gated-MLP
example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp.cpp)
demonstrating how to construct a typical floating-point Gated-MLP pattern with
oneDNN Graph API on CPU and GPU with different runtimes.

For applications where the weights of FC up and FC gate are combined as a single
tensor, oneDNN also provides an
[example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp_wei_combined.cpp)
demonstrating how to create the weight tensors for the pattern with the offsets
and strides from the combined weight tensor.

## References

1. Attention is all you need, https://arxiv.org/abs/1706.03762v7
2. GLU Variants Improve Transformer, https://arxiv.org/abs/2002.05202
3. LLaMA: Open and Efficient Foundation Language Models, https://arxiv.org/abs/2302.13971
4. Qwen Technical Report, https://arxiv.org/abs/2309.16609
5. oneDNN Graph API documentation, https://oneapi-src.github.io/oneDNN/graph_extension.html
106 changes: 106 additions & 0 deletions doc/graph/fusion_patterns/gqa.md
@@ -0,0 +1,106 @@
Grouped Query Attention (GQA) {#dev_guide_graph_gqa}
====================================================

## Overview

In a typical Scaled Dot-Product Attention (SDPA) [1], the input Query, Key, and
Value tensors have the same head number. Loading the Key and Value tensors in
each generation step becomes a performance bottleneck, especially as the
sequence length grows.

To reduce the memory bandwidth overhead of loading the Key and Value tensors,
Multi-Query Attention (MQA) [2] was created by reducing the head number of Key
and Value tensors to one, which means that multiple Queries map to the same
single Key and Value tensor. However, MQA may lead to model quality degradation
and training instability. Therefore, Grouped-Query Attention (GQA) [3], an
interpolation between typical SDPA and MQA, is proposed with a single Key and
Value head per subgroup of Query heads. The head number of Key and Value
equals the group number of Query heads.

The following notations are used in this document:

- N: the mini-batch size.
- H_q: the head number of Query.
- H_kv: the head number of Key or Value.
- N_rep: H_q / H_kv, indicates how many Query heads are mapped to one Key head.
- S: the sequence length.
- D: the size of each head.

## GQA Pattern

Similar to how SDPA is supported, the GQA pattern is also defined as a
directed acyclic graph (DAG) using oneDNN Graph API. oneDNN extends the
[SDPA pattern](@ref dev_guide_graph_sdpa) to support floating-point (f32, bf16,
and f16) GQA as follows. The blue nodes are required when defining a GQA pattern
while the brown nodes are optional.

![GQA pattern](images/gqa.png)

Compared to a typical SDPA pattern, there are a few differences in the GQA
pattern:

1. The input Query has shape (N, H_q, S, D). It will be reshaped to (N, H_kv,
N_rep, S, D) by splitting the H_q dimension into H_kv and N_rep. The reshape
can be constructed using the [StaticReshape](@ref dev_guide_op_staticreshape)
operation in Graph API.
2. Similarly, the input Key and Value have shape (N, H_kv, S, D). They will be
reshaped to (N, H_kv, 1, S, D) to meet the input shape requirement of the
[MatMul](@ref dev_guide_op_matmul) operation.
3. The second MatMul calculates the dot products between the probabilities
produced by SoftMax and the Value tensor, and generates an output with shape
(N, H_kv, N_rep, S, D).
4. Another StaticReshape operation is applied to the output of the second MatMul
to convert the shape into (N, H_q, S, D) by combining H_kv and N_rep
dimensions.
5. The input scale factor and mask in the pattern also need to meet the
operations' shape requirements, which can be achieved similarly through
StaticReshape. Apart from that, they have the same definition as described in
the typical SDPA pattern.
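
As a hedged illustration of items 1 and 2, the snippet below expresses the
Query and Key reshapes with StaticReshape. The `shape` and `special_zero`
attributes follow the StaticReshape operation documentation; all ids, shapes,
and names are illustrative assumptions, and Value is reshaped the same way as
Key.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

// Build the Query and Key reshape ops from steps 1-2 above.
std::vector<op> make_gqa_reshapes() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    const int64_t N = 1, H_q = 32, H_kv = 8, N_rep = H_q / H_kv, S = 128, D = 64;

    // Query: (N, H_q, S, D) -> (N, H_kv, N_rep, S, D).
    logical_tensor q    {0, dt::f16, {N, H_q, S, D}, lt::strided};
    logical_tensor q_5d {1, dt::f16, {N, H_kv, N_rep, S, D}, lt::strided};
    op reshape_q {0, op::kind::StaticReshape, {q}, {q_5d}, "reshape_q"};
    reshape_q.set_attr<std::vector<int64_t>>(op::attr::shape, {N, H_kv, N_rep, S, D});
    reshape_q.set_attr<bool>(op::attr::special_zero, false);

    // Key (and Value): (N, H_kv, S, D) -> (N, H_kv, 1, S, D).
    logical_tensor k    {2, dt::f16, {N, H_kv, S, D}, lt::strided};
    logical_tensor k_5d {3, dt::f16, {N, H_kv, 1, S, D}, lt::strided};
    op reshape_k {1, op::kind::StaticReshape, {k}, {k_5d}, "reshape_k"};
    reshape_k.set_attr<std::vector<int64_t>>(op::attr::shape, {N, H_kv, 1, S, D});
    reshape_k.set_attr<bool>(op::attr::special_zero, false);

    // q_5d and k_5d then feed the first MatMul of the SDPA-style body.
    return {reshape_q, reshape_k};
}
```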

## Data Types

oneDNN supports the floating-point GQA pattern with data types f32, bf16, and
f16. You can specify the data type via the input and output data type fields of
logical tensors for each operation. oneDNN does not support mixing different
floating-point data types in a floating-point GQA pattern.

The definition of the data types and support status on different CPU and GPU
platforms follow the general description in @ref dev_guide_data_types.

## Implementation Limitations

1. oneDNN primitive-based GQA is implemented as the reference implementation on
both Intel Architecture Processors and Intel Graphics Products. The reference
implementation requires memory to store the intermediate results of the dot
products between Query and Key, which takes \f$O(S^2)\f$ memory. It may lead
to an out-of-memory (OOM) error when computing long sequences on platforms with
limited memory.
2. The GQA patterns functionally support all input shapes meeting the shape
requirements of each operation in the graph.
3. CPU
- Optimized implementation is available for 4D Q/K/V tensors with shape
defined as (N, H_q, S, D) for Query and (N, H_kv, S, D) for Key and Value.
- Optimized implementation is available for OpenMP runtime and Threadpool
runtime on Intel Architecture Processors.
- Specifically for OpenMP runtime, the optimized implementation requires `N *
H_q > 2 * thread number` to get enough parallelism.
4. GPU
- Optimized implementation is available for 4D Q/K/V tensors with shape
defined as (N, H_q, S, D) for Query and (N, H_kv, S, D) for Key and Value.
- Optimized implementation is available for floating-point GQA with `f16`
data type and `D <= 256` on Intel Graphics Products with Intel(R) Xe Matrix
Extensions (Intel(R) XMX) support.

## Example

oneDNN provides a [GQA
example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gqa.cpp)
demonstrating how to construct a floating-point GQA pattern with oneDNN Graph
API on CPU and GPU with different runtimes.

## References

[1] Attention is all you need, https://arxiv.org/abs/1706.03762v7

[2] Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150

[3] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
Binary file added doc/graph/fusion_patterns/images/fp-gated-mlp.png
Binary file added doc/graph/fusion_patterns/images/gated-mlp-swish.png
Binary file added doc/graph/fusion_patterns/images/gqa.png