
Bug and CUDA fixes + performance

@TimDettmers released this 23 Jul 14:10

Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We now trust PyTorch to find the proper CUDA binaries and use those. If you use a CUDA version that differs from PyTorch's, you can control which binary bitsandbytes loads by setting the BNB_CUDA_VERSION environment variable. See the custom CUDA guide for more information.
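As a minimal sketch of the override (the toolkit path is an assumption; BNB_CUDA_VERSION takes the version digits without a dot, e.g. "118" for CUDA 11.8):

```python
import os

# BNB_CUDA_VERSION is read by bitsandbytes at import time, so it must be
# set before the import happens.
os.environ["BNB_CUDA_VERSION"] = "118"

# The matching CUDA runtime must also be on the loader path, which has to
# be set in the shell before Python starts, e.g. (path is an assumption):
#   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64

import bitsandbytes as bnb  # loads the CUDA 11.8 binary instead of the default
```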

Besides that, this release features a wide range of bug fixes, CUDA 11.8 support for Ada and Hopper GPUs, and updated 4-bit inference performance.

Previous 4-bit inference kernels were optimized for RTX 4090 and Ampere A40 GPUs, but performance was poor on A100 GPUs, which are common. In this release, A100 performance improves by roughly 40% and is now faster than 16-bit inference, while RTX 4090 and A40 performance drops slightly (about 10%).

This yields the following approximate speedups over 16-bit (BF16) inference:

  • RTX 4090: 3.8x
  • RTX 3090 / A40: 3.1x
  • A100: 1.5x
  • RTX 6000: 1.3x
  • RTX 2080 Ti: 1.1x

0.41.0

Features:

  • Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
  • CUDA SETUP no longer searches for libcuda and libcudart itself and instead relies on PyTorch's CUDA libraries. To manually override this behavior, see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

  • Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
  • Fixed a missing scipy dependency in requirements.txt. #544
  • Fixed a bug where a view operation could cause an error in 8-bit layers.
  • Fixed a bug where CPU-only bitsandbytes would fail during import. #593 Thank you @bilelomrani
  • Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
  • Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
  • Fixed a bug where read permission was assumed for a file. #497
  • Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

Documentation:

  • Improved documentation for GPUs that do not support 8-bit matmul. #529
  • Added description and pointers for the NF4 data type. #543
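As a hedged illustration of the NF4 data type the documentation now covers, here is a minimal sketch using bitsandbytes' 4-bit quantization functions (a CUDA device and the tensor shape are assumptions):

```python
import torch
import bitsandbytes.functional as F

# Quantize a tensor to 4-bit NormalFloat (NF4); quantize_4bit returns the
# packed 4-bit tensor plus the state needed to dequantize it again.
x = torch.randn(1024, 1024, device="cuda")
x_4bit, quant_state = F.quantize_4bit(x, quant_type="nf4")

# Dequantize back to floating point; the result is a lossy reconstruction.
x_restored = F.dequantize_4bit(x_4bit, quant_state, quant_type="nf4")
print((x - x_restored).abs().mean())  # mean absolute quantization error
```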

User experience:

  • Improved handling of the default compute_dtype for Linear4bit layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32 or bfloat16, but not float16). See the sketch below.
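A hedged sketch of what this means in practice (layer sizes, dtypes, and the device placement here are illustrative):

```python
import torch
import bitsandbytes as bnb

# With compute_dtype left unset, a bfloat16 input now also drives bfloat16
# compute, since bfloat16 is considered numerically stable enough.
layer = bnb.nn.Linear4bit(128, 256, bias=False)
layer = layer.cuda()  # moving to the GPU quantizes the weights to 4-bit

x = torch.randn(1, 128, dtype=torch.bfloat16, device="cuda")
out = layer(x)

# float16 is not adopted as a default; pass compute_dtype explicitly if
# float16 compute is actually wanted.
layer_fp16 = bnb.nn.Linear4bit(128, 256, bias=False,
                               compute_dtype=torch.float16).cuda()
```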

Performance:

  • Improved 4-bit inference performance for A100 GPUs. This slightly degraded performance for A40/RTX 3090 and RTX 4090 GPUs.

Deprecated:

  • 8-bit quantization and optimizers that do not use blockwise quantization will be removed in 0.42.0. All blockwise methods will remain fully supported.
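For reference, a hedged sketch of the blockwise 8-bit quantization API that remains supported (tensor shape and blocksize are illustrative; blocksize=4096 is the documented default):

```python
import torch
import bitsandbytes.functional as F

# Blockwise 8-bit quantization: the tensor is split into blocks and each
# block is scaled by its own absmax statistic.
x = torch.randn(4096, device="cuda")
x_8bit, quant_state = F.quantize_blockwise(x, blocksize=4096)

# Round-trip back to floating point using the stored per-block statistics.
x_restored = F.dequantize_blockwise(x_8bit, quant_state)
print((x - x_restored).abs().max())  # worst-case quantization error
```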