LLM Inference

Table of Contents

  • Glossary and Illustration
  • Open Source Software
  • Paper List

Glossary and Illustration

  • Llama Architecture: a stack of N Transformer decoder blocks; each block consists of Grouped-Query Attention (GQA), Rotary Position Embedding (RoPE), residual additions, Root Mean Square Layer Normalization (RMSNorm), and a Multi-Layer Perceptron (MLP). A small GQA sketch follows the illustration below.

(Illustration: the Llama 2 architecture.)
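The snippet below is a minimal NumPy sketch of Grouped-Query Attention, in which several query heads share one key/value head. The head counts, dimensions, and random weights are illustrative assumptions rather than Llama's actual configuration, and the output projection is omitted.

```python
# Minimal Grouped-Query Attention (GQA) sketch; sizes are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """x: (seq_len, d_model). Each group of query heads shares one K/V head."""
    seq_len, d_model = x.shape
    head_dim = d_model // n_q_heads
    group = n_q_heads // n_kv_heads           # query heads per shared K/V head

    q = (x @ wq).reshape(seq_len, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq_len, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq_len, n_kv_heads, head_dim)

    # causal mask: each token may only attend to itself and earlier tokens
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

    outputs = []
    for h in range(n_q_heads):
        kv = h // group                        # which shared K/V head to use
        scores = q[:, h, :] @ k[:, kv, :].T / np.sqrt(head_dim)
        outputs.append(softmax(scores + mask) @ v[:, kv, :])
    return np.concatenate(outputs, axis=-1)    # (seq_len, d_model)

# toy usage: 8 query heads sharing 2 K/V heads
d_model, n_q, n_kv, seq = 64, 8, 2, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))
wk = rng.normal(size=(d_model, d_model // n_q * n_kv))
wv = rng.normal(size=(d_model, d_model // n_q * n_kv))
print(gqa(x, wq, wk, wv, n_q, n_kv).shape)     # (5, 64)
```

Setting n_kv_heads equal to n_q_heads recovers standard Multi-Head Attention, and setting it to 1 recovers Multi-Query Attention (MQA); GQA sits in between, shrinking the Key/Value tensors that later need to be cached.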

  • Prompt: the initial text or instruction given to the model.
  • Prompt Phase (Prefill Phase): the phase that processes the entire prompt and generates the first token.
  • Generation Phase (Decoding Phase): generating the next token based on the prompt and the previously generated tokens, in a token-by-token manner.
  • Autoregressive: predicting one token at a time, conditioned on the previously generated tokens.
  • KV (Key-Value) Cache: caching the attention Keys and Values computed for earlier tokens so that the Generation Phase does not recompute them; see the prefill/decode sketch after this glossary.
  • Weight: the parameter of the model, the $w$ in $y = w \cdot x + b$.
  • Activation: the output of a neuron, which is computed using an activation function, the $z$ in $z = f(y)$, where $f$ is the activation function like ReLU.
  • GPU Kernel: function that is executed on multiple GPU computing cores to perform parallel computations.
  • HBM (High Bandwidth Memory): a stacked, high-bandwidth memory technology that serves as the main (device) memory of data-center GPUs.
  • Continuous Batching: in contrast to static batching (which groups requests into a batch and starts processing only when every request in the batch is ready), new requests are merged into the running batch as soon as slots free up, keeping GPU memory and compute better utilized.
  • Offloading: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
  • Post-Training Quantization (PTQ): quantizing the weights and activations of the model after the model has been trained.
  • Quantization-Aware Training (QAT): incorporating quantization considerations during training.
  • W8A8: quantizing both weights and activations to 8-bit integers (INT8); W16A16, W8A16, and similar terms follow the same pattern (see the quantization sketch below).
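To make the prefill/decode split concrete, here is a minimal NumPy sketch of autoregressive generation with a KV cache. The "model" is a single attention head plus a random vocabulary projection, which is purely an illustrative assumption; real engines cache Keys and Values per layer and per head.

```python
# Minimal prefill/decode sketch with a KV cache; the toy "model" is a single
# attention head plus a random vocabulary projection (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 100, 32
embed = rng.normal(size=(VOCAB, D))           # token embeddings
wq, wk, wv = (rng.normal(size=(D, D)) for _ in range(3))
w_out = rng.normal(size=(D, VOCAB))           # hidden state -> logits

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k_cache, v_cache):
    """Attention of the query rows against every cached key/value row."""
    scores = q @ k_cache.T / np.sqrt(D)
    return softmax(scores) @ v_cache

def prefill(prompt_ids):
    """Prompt phase: process all prompt tokens in one pass and fill the cache."""
    x = embed[prompt_ids]                      # (prompt_len, D)
    k_cache, v_cache = x @ wk, x @ wv
    L = len(prompt_ids)
    scores = (x @ wq) @ k_cache.T / np.sqrt(D)
    scores += np.triu(np.full((L, L), -np.inf), k=1)   # causal mask
    hidden = softmax(scores) @ v_cache
    first_token = int(np.argmax(hidden[-1] @ w_out))
    return first_token, k_cache, v_cache

def decode_step(token_id, k_cache, v_cache):
    """Generation phase: one new token; reuse cached K/V, append its own."""
    x = embed[token_id][None, :]               # (1, D)
    k_cache = np.vstack([k_cache, x @ wk])
    v_cache = np.vstack([v_cache, x @ wv])
    hidden = attend(x @ wq, k_cache, v_cache)
    next_token = int(np.argmax(hidden[-1] @ w_out))
    return next_token, k_cache, v_cache

prompt = [1, 5, 9]
tok, k_cache, v_cache = prefill(prompt)
generated = [tok]
for _ in range(4):                             # autoregressive loop
    tok, k_cache, v_cache = decode_step(tok, k_cache, v_cache)
    generated.append(tok)
print(generated)
```

The point to notice is that decode_step computes Q, K, and V only for the newest token and reuses the cached Keys and Values of every earlier token, which is exactly the recomputation the KV cache eliminates.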

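And here is a minimal sketch of W8A8 post-training quantization using symmetric per-tensor INT8 scales. Per-tensor scaling is a deliberate simplification; schemes such as ZeroQuant and SmoothQuant use finer-grained (group-, token-, or channel-wise) scales and extra tricks to preserve accuracy.

```python
# W8A8 sketch: weight and activation are mapped to symmetric INT8, the matmul
# accumulates in INT32, and the result is dequantized with the product of the
# two scales. Per-tensor scaling is an illustrative simplification.
import numpy as np

def quantize_int8(t):
    """Symmetric per-tensor quantization: the max |value| maps to 127."""
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)).astype(np.float32)     # activation
w = rng.normal(size=(64, 64)).astype(np.float32)    # weight

qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)

# INT8 x INT8 matmul with INT32 accumulation, then dequantize
y_int32 = qx.astype(np.int32) @ qw.astype(np.int32)
y_quant = y_int32.astype(np.float32) * (sx * sw)

y_ref = x @ w
print("max abs error:", np.abs(y_quant - y_ref).max())
```

Relative to FP16, INT8 halves the memory footprint of whatever is quantized and allows the matmul to run on integer units, at the cost of the rounding error printed above.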
Open Source Software

| Name | Hardware | Org |
| --- | --- | --- |
| Transformers | CPU / NVIDIA GPU / TPU / AMD GPU | Hugging Face |
| Text Generation Inference | CPU / NVIDIA GPU / AMD GPU | Hugging Face |
| gpt-fast | CPU / NVIDIA GPU / AMD GPU | PyTorch |
| TensorRT-LLM | NVIDIA GPU | NVIDIA |
| vLLM | NVIDIA GPU | University of California, Berkeley |
| llama.cpp / ggml | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | ggml |
| ctransformers | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | Ravindra Marella |
| DeepSpeed | CPU / NVIDIA GPU | Microsoft |
| FastChat | CPU / NVIDIA GPU / Apple Silicon | lmsys.org |
| MLC-LLM | CPU / NVIDIA GPU | MLC |
| LightLLM | CPU / NVIDIA GPU | SenseTime |
| LMDeploy | CPU / NVIDIA GPU | Shanghai AI Lab & SenseTime |
| PowerInfer | CPU / NVIDIA GPU | Shanghai Jiao Tong University |
| OpenLLM | CPU / NVIDIA GPU / AMD GPU | BentoML |
| OpenPPL.nn / OpenPPL.nn.llm | CPU / NVIDIA GPU | OpenMMLab & SenseTime |
| ScaleLLM | NVIDIA GPU | Vectorch |
| RayLLM | CPU / NVIDIA GPU / AMD GPU | Anyscale |
| Xorbits Inference | CPU / NVIDIA GPU / AMD GPU | Xorbits |

Paper List

| Name | Paper Title | Paper Link | Artifact | Keywords | Recommend |
| --- | --- | --- | --- | --- | --- |
| GPT-3 | Language Models are Few-Shot Learners | NeurIPS 20 | | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
| LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
| Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | LLM / Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
| MQA | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Multi-Query Attention | ⭐️⭐️⭐️ |
| GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Grouped-Query Attention | ⭐️⭐️⭐️⭐️ |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Rotary Position Embedding | ⭐️⭐️⭐️⭐️ |
| Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Tensor Parallel / Pipeline Parallel | ⭐️⭐️⭐️⭐️⭐️ |
| Alpa | Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Automatic Parallel | ⭐️⭐️⭐️ |
| GPipe | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 | | Pipeline Parallel | ⭐️⭐️⭐️ |
| Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Partition | ⭐️⭐️⭐️⭐️ |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Memory Hierarchy / Softmax Tiling | ⭐️⭐️⭐️⭐️⭐️ |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | GPU Memory Paging | ⭐️⭐️⭐️⭐️⭐️ |
| FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Speculative Decoding | ⭐️⭐️⭐️⭐️ |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Mixed-Precision Quantization | ⭐️⭐️⭐️⭐️ |
| ZeroQuant | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 22 | Code | Group-wise and Token-wise Quantization | ⭐️⭐️⭐️⭐️ |
| SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 23 | | Quantization by Scaling | ⭐️⭐️⭐️⭐️ |
| AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 23 | Code | Activation-aware Scaling | ⭐️⭐️⭐️⭐️ |
| GPTQ | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 23 | Code | Optimal Brain Quantization | ⭐️⭐️⭐️⭐️ |
| FP8 | FP8 Formats for Deep Learning | arXiv 22 | | FP8 Format | ⭐️⭐️⭐️ |
| Wanda | A Simple and Effective Pruning Approach for Large Language Models | ICLR 24 | Code | Pruning by Weights and Activations | ⭐️⭐️⭐️⭐️ |
| Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 23 | Code | Pruning Based on Contextual Sparsity | ⭐️⭐️⭐️ |
| PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv 23 | Code | Deja Vu + CPU Offloading | ⭐️⭐️⭐️ |