Replies: 4 comments 1 reply
-
AMMO URL:
-
This is a much-needed feature. The benefits of supporting this dtype can be seen here.
-
Bump for this 👍
-
@HaiShaw, as for tensor scaling: does the Quark tool support log2 scaling, and can it export the scales as a file? I am asking because FP8 has better precision near 0 (about 1e-4 precision for fp8_e4m3fnuz), so we usually scale values into [-32, 32] before quantizing to fp8 for better precision. Simply scaling fp16 with a tensor-wise scalar using this equation won't give the best numeric accuracy in PTQ:
Hence we need to develop a routine:
This is simply because, as you know, fp8 values are non-uniformly distributed. I hope this question draws your attention.
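For reference, here is a minimal sketch of the plain per-tensor (amax-based) scaling baseline the comment refers to, using PyTorch's `torch.float8_e4m3fn` (OCP e4m3). The helper names are illustrative assumptions, not Quark or AMMO APIs, and it does not implement the log2 or [-32, 32] pre-scaling suggested above; it only shows where the per-tensor scale S enters.

```python
import torch

# OCP e4m3 as in the RFC; the fp8_e4m3fnuz (AMD) variant mentioned above
# differs slightly in range, but the scaling logic is the same.
FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

def quantize_per_tensor(x: torch.Tensor):
    """Quantize an fp16/fp32 tensor to FP8 with a single per-tensor scale S."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = amax / FP8_MAX                       # S maps |x| <= amax onto the FP8 range
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return x_fp8, scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4096, dtype=torch.float32)
x_fp8, s = quantize_per_tensor(x)
err = (dequantize(x_fp8, s) - x).abs().max()
print(f"scale={s.item():.6g}, max abs error={err.item():.6g}")
```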
-
This RFC is to help the community enable a new FP8 data type in vLLM, for benefits to both memory bandwidth and computation throughput (on FP8-capable hardware: AMD MI300, NVIDIA H100, etc.).
fp16/half precision is used throughout as the higher-precision example, but the same specs apply to bfloat16, fp32, etc.
- Support loading FP8-quantized models from AMMO or a similar quantizer; the quantized model includes:
- Support OCP e4m3 as the FP8 data type during inference
- Per-tensor scaling is required
- FP8 Tensor Core computation (e4m3 GEMM) feasibility:
  - Support both AMD and NVIDIA hardware
  - Computation kernel with FP8 input * 1/S (inverse scaling factor) for each FP8 input (see the sketch below)

Reference: RFC: FP8 Quantization Schema in vLLM #3218
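To make the scaling convention concrete, below is a rough reference (not a vLLM or AMMO implementation; the names are illustrative) of how the per-tensor scales combine in an FP8 GEMM. It simulates the accumulation in fp32 instead of calling a real FP8 Tensor Core kernel such as cuBLASLt on H100 or hipBLASLt on MI300, which would consume the FP8 operands and the (inverse) scaling factors directly.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def to_fp8(x: torch.Tensor):
    """Per-tensor quantization; returns the FP8 tensor and its scale S (x ~= x_fp8 * S)."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE), scale

def fp8_gemm_reference(a_fp8, scale_a, b_fp8, scale_b):
    # C = (A_fp8 * S_a) @ (B_fp8 * S_b) = (A_fp8 @ B_fp8) * S_a * S_b
    # Each FP8 input contributes one per-tensor factor; a real kernel folds the
    # (inverse) scaling factors into the epilogue instead of dequantizing first.
    acc = a_fp8.to(torch.float32) @ b_fp8.to(torch.float32)
    return acc * (scale_a * scale_b)

a = torch.randn(128, 256)
b = torch.randn(256, 64)
a_fp8, sa = to_fp8(a)
b_fp8, sb = to_fp8(b)
c = fp8_gemm_reference(a_fp8, sa, b_fp8, sb)
print((c - a @ b).abs().max())  # residual error comes from FP8 quantization
```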