Replies: 4 comments 1 reply
-
AMMO URL:
-
This is a much-needed feature. The benefits of supporting this dtype can be seen here.
-
Bump for this 👍
-
@HaiShaw, as for tensor scaling: does the Quark tool support log2 scaling, and can it export the scales as a file? I am asking because FP8 has better precision near 0 (about 1e-4 precision for fp8_e4m3fnuz), so we usually scale values into [-32, 32] before quantizing to fp8 for better precision. Simply scaling fp16 with a tensor-wise scalar using this equation won't give the best numeric accuracy in PTQ:
Hence we need to develop a routine:
This is simply because, as you know, fp8 values are non-uniformly distributed. I hope this question draws your attention.
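For reference, here is a minimal sketch of the plain per-tensor (amax-based) scaling baseline the comment refers to, using PyTorch's `torch.float8_e4m3fn` (OCP e4m3). The helper names are illustrative assumptions, not Quark or AMMO APIs, and it does not implement the log2 or [-32, 32] pre-scaling suggested above; it only shows where the per-tensor scale S enters.

```python
import torch

# OCP e4m3 as in the RFC; the fp8_e4m3fnuz (AMD) variant mentioned above
# differs slightly in range, but the scaling logic is the same.
FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

def quantize_per_tensor(x: torch.Tensor):
    """Quantize an fp16/fp32 tensor to FP8 with a single per-tensor scale S."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = amax / FP8_MAX                       # S maps |x| <= amax onto the FP8 range
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return x_fp8, scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4096, dtype=torch.float32)
x_fp8, s = quantize_per_tensor(x)
err = (dequantize(x_fp8, s) - x).abs().max()
print(f"scale={s.item():.6g}, max abs error={err.item():.6g}")
```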
-
This RFC is to help the community enable a new FP8 data type in vLLM, for benefits to both memory bandwidth and computation throughput (on FP8-capable hardware: AMD MI300, NVIDIA H100, etc.).
fp16/half precision is used throughout as the higher-precision example, but the same specs apply to bfloat16, fp32, etc.
- Support loading FP8-quantized models from AMMO or a similar quantizer; the quantized model includes:
- Support OCP e4m3 as the FP8 data type during inference
- Per-tensor scaling is required
- FP8 Tensor Core computation (e4m3 GEMM) feasibility:
  - Support both AMD and NVIDIA hardware
  - Computation kernel with FP8 input * 1/S (inverse scaling factor) for each FP8 input (see the sketch below)

Reference: RFC: FP8 Quantization Schema in vLLM #3218
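To make the scaling convention concrete, below is a rough reference (not a vLLM or AMMO implementation; the names are illustrative) of how the per-tensor scales combine in an FP8 GEMM. It simulates the accumulation in fp32 instead of calling a real FP8 Tensor Core kernel such as cuBLASLt on H100 or hipBLASLt on MI300, which would consume the FP8 operands and the (inverse) scaling factors directly.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def to_fp8(x: torch.Tensor):
    """Per-tensor quantization; returns the FP8 tensor and its scale S (x ~= x_fp8 * S)."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE), scale

def fp8_gemm_reference(a_fp8, scale_a, b_fp8, scale_b):
    # C = (A_fp8 * S_a) @ (B_fp8 * S_b) = (A_fp8 @ B_fp8) * S_a * S_b
    # Each FP8 input contributes one per-tensor factor; a real kernel folds the
    # (inverse) scaling factors into the epilogue instead of dequantizing first.
    acc = a_fp8.to(torch.float32) @ b_fp8.to(torch.float32)
    return acc * (scale_a * scale_b)

a = torch.randn(128, 256)
b = torch.randn(256, 64)
a_fp8, sa = to_fp8(a)
b_fp8, sb = to_fp8(b)
c = fp8_gemm_reference(a_fp8, sa, b_fp8, sb)
print((c - a @ b).abs().max())  # residual error comes from FP8 quantization
```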