Stable API
We provide stable API support starting from FBGEMM_GPU v1.0.0. This includes Table Batched Embedding (TBE) modules, pooled embedding operators and modules, sparse operators, jagged tensor operators, and quantization operators.
- API backward compatibility guarantees via thorough testing. We guarantee that our stable APIs will be backward compatible within a major version, meaning that the stable APIs released in v1.0.0 will remain compatible with every future release in that major version unless a breaking change is explicitly announced in advance.
- Enhanced documentation, ensuring that every stable API has comprehensive and up-to-date documentation.
- Functionality guarantees are provided only through our unit testing framework. We do NOT guarantee any functionality that is not explicitly tested and documented in our unit tests.
- No performance guarantees. However, we are committed to providing support on a best-effort basis.
More details can be found in the stable API documentation.
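As an illustration, the sketch below constructs one of the stable TBE training modules and runs a forward pass. The module paths and argument layout follow recent FBGEMM_GPU releases and are assumptions here (a CUDA-capable environment is also assumed); the stable API documentation above is the authoritative reference.

```python
import torch

from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# One embedding table with 1000 rows of dimension 64, stored in GPU device memory.
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
)

# Batch of two bags in CSR form: bag 0 holds rows [3, 7], bag 1 holds rows [42, 5].
indices = torch.tensor([3, 7, 42, 5], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 4], dtype=torch.int64, device="cuda")

out = tbe(indices, offsets)  # pooled output of shape (2, 64)
```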
Highlights
Table Batched Embedding (TBE)
- New optimizer support for TBE Training
- Enhanced Global weight decay support in TBE
- Improvements and bug fixes for TBE training and inference modules and sparse operators
For SSD
- New pipeline prefetching enabled
- New cache and indices related ops
- Integration of L2 cache into TBE operators
- Many kernel and logging improvements
For CPU
- New type support for CPU Sequence TBE
- Kernel improvements and bug fixes
Generative AI
- GenAI ops support and improvements
- Improvements to Triton-based and CUTLASS-based operators
- New and optimized FP8 GEMM and quantization operators
Others
- Optimized MX4 quantization operators
- New dequantization operator
- Removal of Python 3.8 support
Better engineering
- Code refactoring and reorganization for faster builds
- New and improved tests and benchmarks
- Improved AMD support
Software Requirements
FBGEMM_GPU v1.0.0 has been tested and is known to work on the following setups:
- PyTorch: v2.5
- CUDA: v11.8, 12.1, 12.4
- Python: v3.9, 3.10, 3.11, 3.12
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment and/or a Docker container.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.0.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.0.0
Alternatively, it can be fetched from the PyTorch PIP index:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cpu
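As a quick post-install smoke test, importing fbgemm_gpu registers the fbgemm operators with PyTorch; the sketch below calls one of the sparse operators, which should be available in both the CPU and CUDA variants (the specific op chosen here is illustrative):

```python
import torch
import fbgemm_gpu  # noqa: F401  (loads the native library and registers torch.ops.fbgemm.*)

# Exclusive cumulative sum over a small int64 tensor; expected result: tensor([0, 1, 3])
print(torch.ops.fbgemm.asynchronous_exclusive_cumsum(torch.tensor([1, 2, 3])))
```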
Changes
Table batched embedding (TBE) operators
For GPU
- [New] Ensemble adagrad optimizer (#3197, #2955, #2954, #3161, #3091, #2981, #2889, #3180, #3158)
- [New] Bounds check in prefetch in TBE training (#3015)
- [New] Method to update internal hyperparameters for FBGEMM TBE (#3025)
- [Improvement] Enhanced Global Weight Decay and state tracking (#2904, #2897, #2882, #2896, #2890, #2884, #2883)
- [Improvement] masked_index_* values index type fix (#2979)
- [Improvement] generate_vbe_metadata fixes (#3095, #3087)
- [Improvement] Fix inefficiency in VBE TBE forward caused by a blocking D2H copy (#2862)
- [Improvement] Workaround for offsets and indices type mismatch in TBE training (#3037)
- [Improvement] Add a host map option for a UVM tensor alloc (#3073)
- [Improvement] uvm_to_device: expose device as interface (#3030)
- [Improvement] Add Meta backend/dispatcher for new_unified_tensor (#3005)
- [Improvement] General TBE enhancements and bug fixes (#2892, #3114, #3022, #2958)
- [Improvement] Consolidate repeat code in TBE inference (#3028)
For CPU
- [New] Add int4 to int4 CPU Sequence TBE kernel (#2996, #2994)
- [New] Use auto-vec kernel in CPU sequential embedding lookup for int8 tables (#2863, #2878)
- [Improvement] Work around OMP barrier issue with MSVC and unused var error (#2918, #3084)
SSD Table batched embedding (TBE) operators
- [New] Enable pipeline prefetching (#2963)
- [New] Enable cache line locking support in SSD kernel (#2949)
- [New] Add L2 flush (#3110)
- [New] Added SSD ODS and IO/mem stats (#2906, #2913, #3035)
- [New] Add SSDScratchPadIndicesQueue (#2911, #2948)
- [New] Integrate L2 cache into TBE operator (#2959, #3032, #3031)
- [New] Add ssd_update_row_addrs (#2953)
- [New] Add bounds check in SSD-TBE (#3013)
- [New] Add 32-bit index support in SSD kernels (#3064)
- [New] Add kv cache related ops (#3001, #2968)
- [New] Add compact_indices op (#3075)
- [New] Create embedding cache interface and impl RocksDB cache (#2858)
- [New] Reduce prefetch SM usage when using pipeline prefetching (#2991)
- [New] Add a host map option for a UVM tensor alloc (#3003)
- [New] Add masked_index_select and refactor masked_index_put (#2910)
- [Improvement] Add parallelism on cache update (#3062)
- [Improvement] Add parameter server attributes (#2947)
- [Improvement] Make the scratch pad tensor UVA (#2844)
- [Improvement] Use less thread blocks for find_uncached kernel (#3101)
- [Improvement] Fix stream sync for scratch pad eviction (#2843)
- [Improvement] Make indices related to cache eviction UVA tensors (#3077)
- [Improvement] Split cachelib cache into header and src (#3063)
- [Improvement] Record more functions and logging in SSD TBE (#2854, #2867, #2975)
- [Improvement] Attach eviction filling logic to set_cache (#3034)
- [Improvement] Move set_cache and set_async to background thread (#3033)
- [Improvement] Refactoring vec copy in masked_index_put_kernel (#2861, #2908)
- [Improvement] Increase memcpy and compute overlap (#2860)
- [Improvement] Add set_async in background thread (#3036)
- [Improvement] Make evicted_rows a UVA buffer (#3079)
- [Improvement] General enhancement and bug fixes (#2937, #2993, #3151, #3089, #2898, #2930)
GenAI Support and Operators
- [New] Decode and Prefill support (#3009)
- [New] Support rope with block tables (#3146)
- [New] EP support (#3071)
- [New] Implement SDPA kernel wrapper to use run_kernel flow for perf (#2820)
- [Improvement] Move mqa code (#3011)
- [Improvement] BE improvements to init_comms (#3103)
Triton GEMM support
- [New] Enable torch.compile compatibility for triton fp8 rowwise gemm (#2978)
- [New] Add 3D+ input support for fp8 rowwise GEMM (#2845)
- [New] GEMM custom op enablement (#3046)
- [Improvement] Add fused bias to Triton FP8 Rowwise Kernels (#2852)
- [Improvement] Triton dependency (#3027)
- [Improvement] Fix triton fp8 handling of non-contiguous inputs (#2919)
- [Improvement] More autotune configs and bug fixes in TMA kernel (#3078, #3066, #3072)
- [Improvement] FP8 GEMM tweak for 405B Decoding (#3104)
FP8 and other Quantization support
- [New] CK FP8 Optimizations and fixes (#2940, #2912, #2987, #3017, #2893)
- [New] FP8 kernel development and enablement (#2866)
- [New] GenAI CK Version update and integration (#2865, #2971)
- [Improvement] Also hipify the fp8 related cuda functions (#2834)
- [Improvement] Auto-generation of CUTLASS Extension Kernel Templates (#2932)
- [Improvement] Marlin Mixed Input Kernel Productionization (#3008)
- [Improvement] Remove redundant torch.abs (#3020, #2822)
- [Improvement] Tuning for 405B/70B Prefill with small seqlen (#3042)
- [Improvement] Added new instances for 405B decoding (#2936)
Permute and Pooled Embeddings Ops
- [New] Implementation of permute_multi_embedding (#2833)
- [Improvement] Clean up and removal of unused exception (#2832, #2891)
- [Improvement] Use at::parallel_for in cpu kernel (#2817)
- [Improvement] Add dispatch_to_cpu for the operators (#2874, #2881)
- [Improvement] Print the exact variable values triggering the alert in Merge Pooled Embedding (#3038)
Sparse Operators
- [New] Support original indices for FBGEMM block bucketization flag (#2999, #2925)
- [Improvement] Fix pack_segments backward when grad is non-contig (#3006)
- [Improvement] Fix FBGEMM_GPU_MEMCHECK in sparse_ops_cuda (#2943)
- [Improvement] Update sparse_ops.py to use the generic GPU target fbgemm_gpu:input_combine to support both NVIDIA and AMD (#2905)
- [Improvement] Add abstract impl and functions (#2962, #2983, #3000)
- [Improvement] Use guard_size_oblivious in tbe_input_combine_abstract fake kernel (#2923)
- [Improvement] Out variant for asynchronous_exclusive_cumsum_cpu + some more static dispatch kernels (#3090)
Quantize ops
- [New] Add a CPU nbit to float dequantization op that supports torch.quintMxN type (#2995)
MX4 Ops
- [New] Optimize FBGEMM Triton MX4 Quantize-Dequantize (#2838, #2837)
- [New] Rounding Mode Support (#2821, #2816, #2933, #2859)
- [New] FBGEMM/TorchRec MX4 padding support (#3055, #3047, #3010)
- [New] Add Stochastic downcasting to MX4 Quantization (#2899)
- [New] Support for other MX4 formats in Triton kernels (#2900)
- [Improvement] Refactor MX4 Kernel to operate on flat tensors (#2836)
- [Improvement] Optimize MX4 padding to minimize need for tuning (#3040)
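As a rough illustration of the MX4 quantize/dequantize path touched by the changes above, the sketch below round-trips a tensor through the fp32_to_mx4 / mx4_to_fp32 helpers. The helper names, their location in fbgemm_gpu.quantize_utils, and the default group size of 32 are assumptions based on recent releases and may differ in v1.0.0:

```python
import torch

# Assumed helper locations; verify against the installed FBGEMM_GPU version.
from fbgemm_gpu.quantize_utils import fp32_to_mx4, mx4_to_fp32

x = torch.randn(4, 32, device="cuda")  # element count should be a multiple of the group size

packed = fp32_to_mx4(x, group_size=32)      # packed uint8 payload (shared exponents + 4-bit mantissas)
x_hat = mx4_to_fp32(packed, group_size=32)  # dequantized fp32 values (flattened)

print((x.flatten() - x_hat).abs().max())    # quantization error should be small
```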
Benchmarks / Tests
- [New] Add schema compatibility test (#3130)
- [New] Add SSD/UVM caching in TBE device benchmark (#3076)
- [New] Add EmbeddingSpMDM8BitBenchmarkOutTypeFloat16 (#2952)
- [New] Add benchmark EmbeddingSpMDMNBitBenchmarkOutTypeFloat16 (#2901)
- [New] Add unit test for int4 to int4 sequence CPU TBE (#2997)
- [New] Add rocm support for fp8 benchmarks (#2965)
- [New] Add rotating buffer feature to quantize_bench (#2857)
- [New] Benchmark of fbgemm op - permute_multi_embedding (#2828)
- [New] Add test for supporting torch.float16 and torch.bfloat16 (#2992)
- [Improvement] Fix logging and remove sync points in benchmarks (#3149, #3113, #2855)
- [Improvement] Update TBE training benchmark (#3112, #3074, #3051)
- [Improvement] Improve ssd-training benchmark (#2850, #3004, #3069, #2989)
- [Improvement] Fix segfault in ssd training unit tests (#2929)
- [Improvement] Fixes on genai tests (#2864, #2885, #2970, #2849, #2869)
- [Improvement] Fix minor issues in EmbeddingSpMDMNBitBenchmark (#2894)
- [Improvement] Fix test skipping for UVM tests (#3016)
- [Improvement] Fix failures_dict_fast.json in TBE inference test (#3024, #3060)
Build / CI improvements and Better Engineering
- [Improvement] General OSS fixes and script improvements (#2967, #2888, #2830, #2829, #2831, #2873, #2868, #2986, #2982, #3023, #2980, #2974, #2902, #3108, #3107, #3081, #3058, #3039, #3082, #3102, #3134, #2856, #3080)
- [Improvement] General enhancements (#2922, #2972, #2914, #2926, #3088, #3119, #3118, #3012, #3141, #3085)
- [Improvement] ROCm fixes (#2876, #2875, #2870, #2907, #3059)
- [Improvement] Refactoring (#2853, #2851, #2848, #2839, #2842, #2841, #2916, #2846, #3105, #3094, #3092, #3067, #3054, #3049, #2976, #2964, #2957, #2944, #3045, #3021, #3019, #3014, #3007, #2990)
- [New] Documentation (#3185, #3184, #3194, #3191, #3179, #3178, #3177, #3176, #3172, #3171, #3169, #3145, #3056, #3018, #2988, #2823, #2819, #2909, #2931)