doc: graph: document for complex fusions #2278

Open
wants to merge 9 commits into main
107 changes: 107 additions & 0 deletions doc/graph/fusion_patterns/fusion_patterns.md
@@ -0,0 +1,107 @@
Fusion Patterns {#dev_guide_graph_fusion_patterns}
==================================================

## Overview

The following fusion patterns are subgraphs that the oneDNN Graph API recognizes
as candidates for fusion. The patterns are described using the oneDNN Graph
operation (op) names with the following convention.

@note oneDNN Graph performs limited input validation to minimize performance
overhead. The application is responsible for sanitizing inputs passed to the
library. Because large `u8` or `s8` inputs may lead to accumulator overflow, you can use
floating-point patterns instead of quantized patterns.

`"+"` describes a chain of two ops. The preceding op produces an output tensor,
which is consumed by the following op as its first operand.

`"[]"` describes a component of the overall pattern description. For example,
it can enclose a subgraph or a set of op choices within the brackets.

`"|"` describes a choice among multiple operations; for example, A+[B|C] means
the graph partition contains A followed by either B or C.

`","` describes a graph composed of multiple subgraphs; each subgraph explicitly
marks its output tensor, which is consumed by other subgraphs.

`Superscript` denotes the number of repetitions of a pattern. For example,
A+[B|C]\f$^{3}\f$ means the graph partition contains A followed by three ops,
each of which is either B or C. The superscript can also be a range of numbers,
allowing a range of repetitions. If the range is between 0 and 1, the
superscript `"?"` is used.

`Subscript` denotes the input and output tensors that need an explicitly marked
producer-consumer relation within one graph partition. For example,
A\f$_{>t1}\f$+B+C\f$_{<t1}\f$ refers to a pattern that starts with A, followed
by B and C, where C takes an implicit input tensor from B and an extra tensor
t1 output by A. `">"` refers to an output tensor, and `"<"` to an input tensor.
Input and output tensors between neighboring
ops are not explicitly marked; for example, B consumes t1 implicitly in the
example above.

Subscript `"out"` marks the output tensor of a certain op as an output of
the graph partition. For example, in
A\f$_{>t1}\f$+B\f$_{>out}\f$+C\f$_{<t1,>out}\f$, B's output and C's output
are marked as output tensors.

Subscript `"in"` marks the input tensor of a certain op as an input of the
graph partition. For example, in A\f$_{<in1}\f$+B\f$_{<in1}\f$, A's input and
B's second input are graph partition inputs, and they share the same input
tensor in1. Most input tensors of a graph partition are not explicitly marked.
For example, the input tensors of the first op are implicitly regarded as graph
partition inputs. In addition, input tensors of other ops that are not produced
by any preceding ops are regarded as implicit graph partition inputs. In the
example A\f$_{>t1}\f$+B+C\f$_{<t1}\f$, A's inputs are regarded as implicit graph
partition inputs, and if B is a binary operation, its second input tensor is an
implicit graph partition input.

The following categories are used when describing a fusion pattern.

Unary = [Abs | Clamp | Elu | Exp | GELU | HardSwish | LeakyReLU |
Log | Sigmoid | SoftPlus | Pow | ReLU | Round | Sqrt | Square | Tanh]

Binary = [Add | Divide | Maximum | Minimum | Multiply | Subtract]

Reduction = [ReduceL1 | ReduceL2 | ReduceMax | ReduceMean | ReduceMin |
ReduceProd | ReduceSum]

### Inference

#### Floating Point Patterns

| Pattern | Description |
|:--------|:-----------------------------|
| Scaled Dot-Product Attention | Refer to @ref dev_guide_graph_sdpa for more details. |
| Grouped Query Attention | Refer to @ref dev_guide_graph_gqa for more details. |
| Gated Multi-Layer Perceptron (Gated-MLP) | Refer to @ref dev_guide_graph_gated_mlp for more details. |
| Convolution + BiasAdd\f$^?\f$ + BatchNormInference\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks, for example ResNet, ResNext, SSD, etc. |
| ConvTranspose + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks. |
| Interpolate + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for image processing. |
| MatMul + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$ + Select\f$^?\f$\f$_{>out}\f$ | This pattern is widely used in language models and recommendation models, for example BERT, DLRM, etc. |
| Reduction + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for data processing, for example loss reduction. |
| Unary + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks. |
| Binary + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks, for example ParallelWaveGAN. |
| [AvgPool \| MaxPool] + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks. |
| BatchNormInference + ReLU\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks, for example DenseNet. |
| Reciprocal + Multiply\f$_{>out}\f$ | N/A |
| Reorder + Add\f$_{>out}\f$ | N/A |
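
To show how one row of this table maps onto the API, the following minimal C++
sketch (not taken from the library's examples; tensor ids, shapes, and op names
are illustrative assumptions) builds a MatMul + BiasAdd + ReLU chain matching
the MatMul row above and asks the library to partition it. The three ops are
expected to be returned as a single fused partition.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    // MatMul: (64, 256) x (256, 512) -> (64, 512). Shapes are arbitrary.
    logical_tensor src    {0, dt::f32, {64, 256},  lt::strided};
    logical_tensor wei    {1, dt::f32, {256, 512}, lt::strided};
    logical_tensor mm_out {2, dt::f32, {64, 512},  lt::strided};
    op matmul {0, op::kind::MatMul, {src, wei}, {mm_out}, "matmul"};

    // BiasAdd^? : the optional bias addition.
    logical_tensor bias     {3, dt::f32, {512},     lt::strided};
    logical_tensor bias_out {4, dt::f32, {64, 512}, lt::strided};
    op bias_add {1, op::kind::BiasAdd, {mm_out, bias}, {bias_out}, "bias_add"};

    // One Unary op from the [Unary | Binary]^{0-3} group.
    logical_tensor relu_out {5, dt::f32, {64, 512}, lt::strided};
    op relu {2, op::kind::ReLU, {bias_out}, {relu_out}, "relu"};

    // Add the ops to a graph and let the library group them into partitions.
    graph g {dnnl::engine::kind::cpu};
    g.add_op(matmul);
    g.add_op(bias_add);
    g.add_op(relu);
    g.finalize();

    // Ops matching a supported fusion pattern are returned as one partition.
    std::vector<partition> parts = g.get_partitions();
    return parts.empty() ? 1 : 0;
}
```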

#### Quantized Patterns

| Pattern | Description |
|:--------|:-----------------------------|
| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + Convolution\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | N/A |
| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + ConvTranspose\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ |N/A |
| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + MatMul\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Select\f$^?\f$ + Quantize\f$^?\f$\f$_{>out}\f$ |N/A |
| Dequantize + [AvgPool \| MaxPool] + Quantize\f$_{>out}\f$ |N/A |
| Dequantize\f$_{>t1}\f$, Dequantize + [AvgPool \| MaxPool] + Add\f$_{<t1}\f$ + Quantize\f$_{>out}\f$ |N/A |
| Dequantize + Reorder + Quantize\f$_{>out}\f$ |N/A |
| Dequantize\f$_{>t1}\f$, Dequantize + Reorder + Add\f$_{<t1}\f$ + Quantize\f$_{>out}\f$ |N/A |
| [SoftMax \| LayerNorm \| GroupNorm] + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | This pattern is used in SmoothQuant to fuse scales and quantization into previous layers. |
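
As a hedged sketch of the core of the quantized MatMul row above (two
Dequantize ops feeding a MatMul, followed by a Quantize), the snippet below
builds the int8 chain. The `scales` and `zps` attributes follow the
Quantize/Dequantize operation documentation; all ids, shapes, and values are
illustrative assumptions. Partitioning then proceeds exactly as in the previous
sketch.

```cpp
#include <cstdint>
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

// Dequantize_{>t1}, Dequantize + MatMul_{<t1} + Quantize_{>out}
graph make_int8_matmul_graph() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    logical_tensor src_u8  {0, dt::u8,  {64, 256},  lt::strided};
    logical_tensor src_f32 {1, dt::f32, {64, 256},  lt::strided};
    logical_tensor wei_s8  {2, dt::s8,  {256, 512}, lt::strided};
    logical_tensor wei_f32 {3, dt::f32, {256, 512}, lt::strided};  // "t1"
    logical_tensor mm_f32  {4, dt::f32, {64, 512},  lt::strided};
    logical_tensor dst_u8  {5, dt::u8,  {64, 512},  lt::strided};

    // Dequantize the activation chain and the weights ("t1").
    op deq_src {0, op::kind::Dequantize, {src_u8}, {src_f32}, "deq_src"};
    deq_src.set_attr<std::vector<float>>(op::attr::scales, {0.05f});
    deq_src.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    op deq_wei {1, op::kind::Dequantize, {wei_s8}, {wei_f32}, "deq_wei"};
    deq_wei.set_attr<std::vector<float>>(op::attr::scales, {0.01f});
    deq_wei.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    // MatMul consumes the dequantized activation and the extra tensor "t1".
    op matmul {2, op::kind::MatMul, {src_f32, wei_f32}, {mm_f32}, "matmul"};

    // Re-quantize the result as the partition output.
    op quant {3, op::kind::Quantize, {mm_f32}, {dst_u8}, "quant_out"};
    quant.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
    quant.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    graph g {dnnl::engine::kind::cpu};
    for (const op &o : {deq_src, deq_wei, matmul, quant}) g.add_op(o);
    g.finalize();
    return g;
}
```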

### Training

| Pattern | Description |
|:--------|:-----------------------------|
| ConvolutionBackwardWeights + BiasAddBackward\f$_{>out}\f$ | N/A |
| ReLUBackward + BatchNormTrainingBackward\f$_{>out}\f$ |N/A |
123 changes: 123 additions & 0 deletions doc/graph/fusion_patterns/gated_mlp.md
@@ -0,0 +1,123 @@
Gated Multi-Layer Perceptron (Gated-MLP) {#dev_guide_graph_gated_mlp}
=====================================================================

## Overview

Gated Multi-Layer Perceptron (Gated-MLP) is a variant of MLP that is widely
used as the Feed-Forward Network (FFN) in many Transformer-based Large Language
Models (LLMs).

Typically, the FFN in the Transformer architecture [1] is defined as a two-layer
MLP with a ReLU activation in between, which can be replaced with other activations.

\f[

FFN(src,W,V) = ReLU(src \cdot W) \cdot V

\f]

A Gated Linear Unit (GLU) is adopted to replace the first linear layer to
improve the quality of Transformer-based models [2]:

\f[

GLU(src,W_1,W_2) = (src \cdot W_1) \otimes Sigmoid(src \cdot W_2) \\

FFN(src,W_1,W_2,V) = GLU(src,W_1,W_2) \cdot V

\f]

where \f$ src \cdot W_1 \f$ is usually called "FC (fully-connected) up",
\f$ src \cdot W_2 \f$ is called "FC gate", and the last linear layer is called
"FC down".

The Swish activation is further adopted to replace Sigmoid in the GLU, forming
swiGLU.

\f[

Swish(x) = x \otimes Sigmoid(x) \\

swiGLU(src,W_1,W_2) = (src \cdot W_1) \otimes Swish(src \cdot W_2) \\

FFN(src,W_1,W_2,V) = swiGLU(src,W_1,W_2) \cdot V

\f]

The Gated-MLP based on swiGLU is also adopted in LLMs like LLaMA [3], Qwen [4],
etc.

## Gated-MLP patterns

oneDNN supports Gated-MLP and its optimization through Graph API [5] by defining
the graph, getting partitions from the graph, and optimizing the kernels
underneath. In general, a Gated-MLP pattern is defined as a directed acyclic
graph (DAG) using the oneDNN Graph API.

### Floating-point Gated-MLP

oneDNN defines floating-point (f32, bf16, and f16) Gated-MLP as follows. The blue
nodes are required when defining a Gated-MLP pattern while the brown nodes are
optional.

![Gated-MLP pattern](images/fp-gated-mlp.png)

1. The first MatMul on the top left calculates "FC up": \f$ src \cdot W_1 \f$.
See [MatMul](@ref dev_guide_op_matmul) operation in Graph API.
2. The second MatMul on the top right calculates "FC gate": \f$ src \cdot W_2 \f$.
3. The Activation node is optional. If required, it can be constructed with the
activation operations in Graph API, for example, [ReLU](@ref dev_guide_op_relu),
[GELU](@ref dev_guide_op_gelu), [Sigmoid](@ref dev_guide_op_sigmoid), and so on.
For the Swish activation, the node can be constructed with the [Sigmoid](@ref dev_guide_op_sigmoid)
and [Multiply](@ref dev_guide_op_multiply) operations as shown below. You can also refer to the
[Gated-MLP example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp.cpp)
for the Swish definition.

![Swish Activation](images/gated-mlp-swish.png)

4. The last MatMul on the bottom performs the "FC down" operation between the
GLU output and \f$V\f$.
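
Putting steps 1-4 together, the following minimal C++ sketch defines a
swiGLU-based Gated-MLP graph and retrieves its partitions. All tensor ids,
shapes, and op names here are illustrative assumptions; refer to the Gated-MLP
example referenced in the Examples section below for the complete code.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    const int64_t M = 32, K = 4096, N = 11008; // illustrative sizes

    // Inputs: activations and the three weight tensors.
    logical_tensor src    {0, dt::f32, {M, K}, lt::strided};
    logical_tensor w_up   {1, dt::f32, {K, N}, lt::strided};
    logical_tensor w_gate {2, dt::f32, {K, N}, lt::strided};
    logical_tensor w_down {3, dt::f32, {N, K}, lt::strided};

    // Intermediate and output tensors.
    logical_tensor fc_up   {4, dt::f32, {M, N}, lt::strided};
    logical_tensor fc_gate {5, dt::f32, {M, N}, lt::strided};
    logical_tensor sig_out {6, dt::f32, {M, N}, lt::strided};
    logical_tensor swish   {7, dt::f32, {M, N}, lt::strided};
    logical_tensor gated   {8, dt::f32, {M, N}, lt::strided};
    logical_tensor dst     {9, dt::f32, {M, K}, lt::strided};

    // Steps 1-2: "FC up" and "FC gate".
    op mm_up   {0, op::kind::MatMul, {src, w_up},   {fc_up},   "fc_up"};
    op mm_gate {1, op::kind::MatMul, {src, w_gate}, {fc_gate}, "fc_gate"};

    // Step 3: Swish(x) = x * Sigmoid(x) on the gate branch, then the GLU
    // gating multiply with the "FC up" branch.
    op sigmoid {2, op::kind::Sigmoid,  {fc_gate},          {sig_out}, "sigmoid"};
    op mul_sw  {3, op::kind::Multiply, {fc_gate, sig_out}, {swish},   "swish_mul"};
    op mul_glu {4, op::kind::Multiply, {fc_up, swish},     {gated},   "glu_mul"};

    // Step 4: "FC down".
    op mm_down {5, op::kind::MatMul, {gated, w_down}, {dst}, "fc_down"};

    graph g {dnnl::engine::kind::cpu};
    for (const op &o : {mm_up, mm_gate, sigmoid, mul_sw, mul_glu, mm_down})
        g.add_op(o);
    g.finalize();

    // The Gated-MLP ops are expected to be grouped into a single partition,
    // which can then be compiled and executed as in the linked example.
    std::vector<partition> parts = g.get_partitions();
    return parts.empty() ? 1 : 0;
}
```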

## Data Types

oneDNN supports the floating-point Gated-MLP pattern with data types f32, bf16,
and f16. You can specify the data type via the input and output data type fields
of logical tensors for each operation. oneDNN does not support mixing different
floating-point data types in a floating-point Gated-MLP pattern.

The definition of the data types and support status on different CPU and GPU
platforms follow the general description in @ref dev_guide_data_types.

## Implementation Limitations

1. oneDNN primitive-based Gated-MLP is implemented as the reference
implementation on both Intel Architecture Processors and Intel Graphics
Products. In this case, floating-point Gated-MLP patterns are usually
implemented with three f32, bf16, or f16 matmul (with binary or eltwise
post-ops) primitives.
2. The Gated-MLP patterns functionally support all input shapes meeting the
shape requirements of each operation in the graph. For example, the `MatMul`
operation requires shape consistency for the `k` dimension. The `Multiply`
operation requires the input tensors to have the same shape, or shapes that can
be properly broadcast based on the operation attribute.

## Examples

oneDNN provides a [Gated-MLP
example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp.cpp)
demonstrating how to construct a typical floating-point Gated-MLP pattern with
oneDNN Graph API on CPU and GPU with different runtimes.

For applications where the weights of FC up and FC gate are combined as a single
tensor, oneDNN also provides an
[example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp_wei_combined.cpp)
demonstrating how to create the weight tensors for the pattern with the offsets
and strides from the combined weight tensor.

## References

1. Attention is all you need, https://arxiv.org/abs/1706.03762v7
2. GLU Variants Improve Transformer, https://arxiv.org/abs/2002.05202
3. LLaMA: Open and Efficient Foundation Language Models, https://arxiv.org/abs/2302.13971
4. Qwen Technical Report, https://arxiv.org/abs/2309.16609
5. oneDNN Graph API documentation, https://oneapi-src.github.io/oneDNN/graph_extension.html
106 changes: 106 additions & 0 deletions doc/graph/fusion_patterns/gqa.md
@@ -0,0 +1,106 @@
Grouped Query Attention (GQA) {#dev_guide_graph_gqa}
====================================================

## Overview

In a typical Scaled Dot-Product Attention (SDPA) [1], the input Query, Key, and
Value tensors have the same head number. Loading the Key and Value tensors in
each generation step becomes a performance bottleneck, especially as the
sequence length grows.

To reduce the memory bandwidth overhead of loading the Key and Value tensors,
Multi-Query Attention (MQA) [2] was created by reducing the head number of Key
and Value tensors to one, which means that multiple Queries map to the same
single Key and Value tensor. However, MQA may lead to model quality degradation
and training instability. Therefore, Grouped-Query Attention (GQA) [3], an
interpolation between typical SDPA and MQA, is proposed with a single Key and
Value head per subgroup of Query heads. The head number of Key and Value
equals the group number of Query heads.

The following notations are used in this document:

- N: the mini-batch size.
- H_q: the head number of Query.
- H_kv: the head number of Key or Value.
- N_rep: H_q / H_kv, indicates how many Query heads are mapped to one Key head.
- S: the sequence length.
- D: the size of each head.

## GQA Pattern

Similar to how SDPA is supported, the GQA pattern is also defined as a
directed acyclic graph (DAG) using oneDNN Graph API. oneDNN extends the
[SDPA pattern](@ref dev_guide_graph_sdpa) to support floating-point (f32, bf16,
and f16) GQA as follows. The blue nodes are required when defining a GQA pattern
while the brown nodes are optional.

![GQA pattern](images/gqa.png)

Compared to a typical SDPA pattern, there are a few differences in the GQA
pattern:

1. The input Query has shape (N, H_q, S, D). It will be reshaped to (N, H_kv,
N_rep, S, D) by splitting the H_q dimension into H_kv and N_rep. The reshape
can be constructed using the [StaticReshape](@ref dev_guide_op_staticreshape)
operation in Graph API.
2. Similarly, the input Key and Value have shape (N, H_kv, S, D). They will be
reshaped to (N, H_kv, 1, S, D) to meet the input shape requirement of the
[MatMul](@ref dev_guide_op_matmul) operation.
3. The second MatMul calculates the dot products between the probabilities
produced by SoftMax and the Value tensor, and generates an output with shape
(N, H_kv, N_rep, S, D).
4. Another StaticReshape operation is applied to the output of the second MatMul
to convert the shape into (N, H_q, S, D) by combining H_kv and N_rep
dimensions.
5. The input scale factor and mask in the pattern also need to meet the
operations' shape requirements, which can be achieved similarly through
StaticReshape. Apart from that, they have the same definition as described in
the typical SDPA pattern.
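
As a hedged illustration of items 1 and 2, the snippet below expresses the
Query and Key reshapes with StaticReshape. The `shape` and `special_zero`
attributes follow the StaticReshape operation documentation; all ids, shapes,
and names are illustrative assumptions, and Value is reshaped the same way as
Key.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

// Build the Query and Key reshape ops from steps 1-2 above.
std::vector<op> make_gqa_reshapes() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    const int64_t N = 1, H_q = 32, H_kv = 8, N_rep = H_q / H_kv, S = 128, D = 64;

    // Query: (N, H_q, S, D) -> (N, H_kv, N_rep, S, D).
    logical_tensor q    {0, dt::f16, {N, H_q, S, D}, lt::strided};
    logical_tensor q_5d {1, dt::f16, {N, H_kv, N_rep, S, D}, lt::strided};
    op reshape_q {0, op::kind::StaticReshape, {q}, {q_5d}, "reshape_q"};
    reshape_q.set_attr<std::vector<int64_t>>(op::attr::shape, {N, H_kv, N_rep, S, D});
    reshape_q.set_attr<bool>(op::attr::special_zero, false);

    // Key (and Value): (N, H_kv, S, D) -> (N, H_kv, 1, S, D).
    logical_tensor k    {2, dt::f16, {N, H_kv, S, D}, lt::strided};
    logical_tensor k_5d {3, dt::f16, {N, H_kv, 1, S, D}, lt::strided};
    op reshape_k {1, op::kind::StaticReshape, {k}, {k_5d}, "reshape_k"};
    reshape_k.set_attr<std::vector<int64_t>>(op::attr::shape, {N, H_kv, 1, S, D});
    reshape_k.set_attr<bool>(op::attr::special_zero, false);

    // q_5d and k_5d then feed the first MatMul of the SDPA-style body.
    return {reshape_q, reshape_k};
}
```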

## Data Types

oneDNN supports the floating-point GQA pattern with data types f32, bf16, and
f16. You can specify the data type via the input and output data type fields of
logical tensors for each operation. oneDNN does not support mixing different
floating-point data types in a floating-point GQA pattern.

The definition of the data types and support status on different CPU and GPU
platforms follow the general description in @ref dev_guide_data_types.

## Implementation Limitations

1. oneDNN primitive-based GQA is implemented as the reference implementation on
both Intel Architecture Processors and Intel Graphics Products. The reference
implementation requires memory to store the intermediate results of the dot
products between Query and Key, which takes \f$O(S^2)\f$ memory. It may lead
to an out-of-memory (OOM) error when computing long sequences on platforms with
limited memory.
2. The GQA patterns functionally support all input shapes meeting the shape
requirements of each operation in the graph.
3. CPU
- Optimized implementation is available for 4D Q/K/V tensors with shape
defined as (N, H_q, S, D) for Query and (N, H_kv, S, D) for Key and Value.
- Optimized implementation is available for OpenMP runtime and Threadpool
runtime on Intel Architecture Processors.
- Specifically for OpenMP runtime, the optimized implementation requires `N *
H_q > 2 * thread number` to get enough parallelism.
4. GPU
- Optimized implementation is available for 4D Q/K/V tensors with shape
defined as (N, H_q, S, D) for Query and (N, H_kv, S, D) for Key and Value.
- Optimized implementation is available for floating-point GQA with `f16`
data type and `D <= 256` on Intel Graphics Products with Intel(R) Xe Matrix
Extensions (Intel(R) XMX) support.

## Example

oneDNN provides a [GQA
example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gqa.cpp)
demonstrating how to construct a floating-point GQA pattern with oneDNN Graph
API on CPU and GPU with different runtimes.

## References

[1] Attention is all you need, https://arxiv.org/abs/1706.03762v7

[2] Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150

[3] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
Binary file added doc/graph/fusion_patterns/images/fp-gated-mlp.png
Binary file added doc/graph/fusion_patterns/images/gated-mlp-swish.png
Binary file added doc/graph/fusion_patterns/images/gqa.png