- Llama Architecture: a stack of N Transformer decoder layers; each layer consists of Grouped-Query Attention (GQA), Rotary Position Embedding (RoPE), Residual Add, Root Mean Square Layer Normalization (RMSNorm), and a Multi-Layer Perceptron (MLP).
- Prompt: the initial text or instruction given to the model.
- Prompt Phase (Prefill Phase): the phase that processes the entire prompt and generates the first token.
- Generation Phase (Decoding Phase): generates the next token based on the prompt and the previously generated tokens, in a token-by-token manner.
- Autoregressive: predicting one token at a time, conditioned on the previously generated tokens.
- KV (Key-Value) Cache: caching the attention Keys and Values during the Generation Phase, eliminating recomputation of the Keys and Values of previous tokens (see the decoding sketch after this list).
- Weight: a parameter of the model, the $w$ in $y = w \cdot x + b$.
- Activation: the output of a neuron, computed by an activation function; the $z$ in $z = f(y)$, where $f$ is an activation function such as ReLU.
- GPU Kernel: a function executed on many GPU computing cores to perform parallel computations.
- HBM (High Bandwidth Memory): a stacked-DRAM memory technology that serves as the main device memory of data-center GPUs.
- Continuous Batching: as opposed to static batching (which groups requests together and starts processing only when every request in the batch is ready), continuous batching schedules at the iteration level, admitting new requests into the running batch as soon as slots free up, maximizing memory utilization (see the scheduler sketch after this list).
- Offloading: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
- Post-Training Quantization (PTQ): quantizing the weights and activations of the model after the model has been trained.
- Quantization-Aware Training (QAT): incorporating quantization considerations during training.
- W8A8: quantizing both weights and activations to 8-bit INT8; W16A16, W8A16, and similar terms follow the same pattern (see the quantization sketch after this list).
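The Prompt Phase, Generation Phase, and KV Cache are easiest to see in code. Below is a minimal greedy-decoding sketch using the Hugging Face Transformers API (the `transformers` library from the framework table below); the `gpt2` checkpoint and the 16-token generation budget are arbitrary example choices, not anything prescribed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model only; any Hugging Face causal LM that supports use_cache works.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "The key to efficient LLM inference is"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prompt (prefill) phase: process the whole prompt once, producing the
    # first new token and the KV cache for every prompt token.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values                      # cached Keys/Values per layer
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    generated = [next_id]

    # Generation (decoding) phase: autoregressive, one token per step.
    # Only the newest token is fed in; Keys/Values of earlier tokens come
    # from the cache instead of being recomputed.
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat([input_ids] + generated, dim=1)[0]))
```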
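The idea behind Continuous Batching can be shown with a toy scheduler that involves no real model; the request lengths and batch size below are made up purely for illustration. A static batcher would hold the whole batch until its longest request finishes, while the loop below admits waiting requests into freed slots at every decoding iteration, which is the iteration-level scheduling introduced by Orca (see the paper table below).

```python
from collections import deque

# Each request is just an id plus a number of decoding steps (illustrative lengths).
requests = deque([("r0", 8), ("r1", 2), ("r2", 5), ("r3", 3), ("r4", 6)])
MAX_BATCH = 2

running = {}          # request id -> remaining decoding steps
step = 0
while requests or running:
    # Continuous batching: at every iteration, admit new requests into any
    # free slots instead of waiting for the whole batch to drain.
    while requests and len(running) < MAX_BATCH:
        rid, steps = requests.popleft()
        running[rid] = steps
        print(f"step {step}: admitted {rid}")
    # One decoding iteration for everything currently in the batch.
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]          # slot freed immediately
            print(f"step {step}: finished {rid}")
    step += 1
print(f"total iterations: {step}")
```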
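As a concrete (and deliberately naive) W8A8 example in the PTQ setting, the sketch below applies per-tensor symmetric INT8 quantization to a random weight matrix and activation tensor and compares the result against the FP32 matmul. Production methods such as SmoothQuant, AWQ, and GPTQ (see the paper table below) differ mainly in how scales are chosen and error is compensated; this only shows the basic mechanics.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Per-tensor symmetric quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

torch.manual_seed(0)
w = torch.randn(256, 256)   # weight matrix  -> the "W8" part
a = torch.randn(8, 256)     # activations    -> the "A8" part

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

# Real INT8 kernels multiply int8 operands and accumulate in INT32; the integer
# values are carried in float here only to keep the sketch portable.
y_w8a8 = (qa.float() @ qw.float().T) * (sa * sw)
y_fp32 = a @ w.T

print("mean abs error:", (y_w8a8 - y_fp32).abs().mean().item())
```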
Name | Hardware | Org |
---|---|---|
Transformers | CPU / NVIDIA GPU / TPU / AMD GPU | Hugging Face |
Text Generation Inference | CPU / NVIDIA GPU / AMD GPU | Hugging Face |
gpt-fast | CPU / NVIDIA GPU / AMD GPU | PyTorch |
TensorRT-LLM | NVIDIA GPU | NVIDIA |
vLLM | NVIDIA GPU | University of California, Berkeley |
llama.cpp / ggml | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | ggml |
ctransformers | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | Ravindra Marella |
DeepSpeed | CPU / NVIDIA GPU | Microsoft |
FastChat | CPU / NVIDIA GPU / Apple Silicon | lmsys.org |
MLC-LLM | CPU / NVIDIA GPU | MLC |
LightLLM | CPU / NVIDIA GPU | SenseTime |
LMDeploy | CPU / NVIDIA GPU | Shanghai AI Lab & SenseTime |
PowerInfer | CPU / NVIDIA GPU | Shanghai Jiao Tong University |
OpenLLM | CPU / NVIDIA GPU / AMD GPU | BentoML |
OpenPPL.nn / OpenPPL.nn.llm | CPU / NVIDIA GPU | OpenMMLab & SenseTime |
ScaleLLM | NVIDIA GPU | Vectorch |
RayLLM | CPU / NVIDIA GPU / AMD GPU | Anyscale |
Xorbits Inference | CPU / NVIDIA GPU / AMD GPU | Xorbits |
Name | Paper Title | Paper Link | Artifact | Keywords | Recommend |
---|---|---|---|---|---|
GPT-3 | Language Models are Few-Shot Learners | NeurIPS 20 | | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | LLM / Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
MQA | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Multi-Query Attention | ⭐️⭐️⭐️ |
GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Grouped-Query Attention | ⭐️⭐️⭐️⭐️ |
RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Rotary Position Embedding | ⭐️⭐️⭐️⭐️ |
Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Tensor Parallel / Pipeline Parallel | ⭐️⭐️⭐️⭐️⭐️ |
Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Automatic Parallelism | ⭐️⭐️⭐️ |
GPipe | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 | | Pipeline Parallel | ⭐️⭐️⭐️ |
Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Partitioning | ⭐️⭐️⭐️⭐️ |
FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Memory Hierarchy / Softmax Tiling | ⭐️⭐️⭐️⭐️⭐️ |
Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | GPU Memory Paging | ⭐️⭐️⭐️⭐️⭐️ |
FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Speculative Decoding | ⭐️⭐️⭐️⭐️ |
LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Mixed-Precision Quantization | ⭐️⭐️⭐️⭐️ |
ZeroQuant | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 22 | Code | Group-wise and Token-wise Quantization | ⭐️⭐️⭐️⭐️ |
SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 23 | | Quantization by Scaling | ⭐️⭐️⭐️⭐️ |
AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 23 | Code | Activation-aware Scaling | ⭐️⭐️⭐️⭐️ |
GPTQ | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 23 | Code | Optimal Brain Quantization | ⭐️⭐️⭐️⭐️ |
FP8 | FP8 Formats for Deep Learning | arXiv 22 | | FP8 Format | ⭐️⭐️⭐️ |
Wanda | A Simple and Effective Pruning Approach for Large Language Models | ICLR 24 | Code | Pruning by Weights and Activations | ⭐️⭐️⭐️⭐️ |
Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 23 | Code | Pruning based on Contextual Sparsity | ⭐️⭐️⭐️ |
PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv 23 | Code | Deja Vu + CPU Offloading | ⭐️⭐️⭐️ |