
Bug and CUDA fixes + performance

@TimDettmers released this 23 Jul 14:10

Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We now trust PyTorch to find the proper CUDA binaries and use those. If you use a CUDA version that differs from PyTorch's, you can control which binary bitsandbytes loads by setting the BNB_CUDA_VERSION environment variable. See the custom CUDA guide for more information.
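As a minimal sketch of the override (the toolkit path is an assumption; BNB_CUDA_VERSION takes the version digits without a dot, e.g. "118" for CUDA 11.8):

```python
import os

# BNB_CUDA_VERSION is read by bitsandbytes at import time, so it must be
# set before the import happens.
os.environ["BNB_CUDA_VERSION"] = "118"

# The matching CUDA runtime must also be on the loader path, which has to
# be set in the shell before Python starts, e.g. (path is an assumption):
#   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64

import bitsandbytes as bnb  # loads the CUDA 11.8 binary instead of the default
```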

Besides that, this release features a wide range of bug fixes, CUDA 11.8 support for Ada and Hopper GPUs, and updated 4-bit inference performance.

Previous 4-bit inference kernels were optimized for RTX 4090 and Ampere A40 GPUs, but performance was poor on A100 GPUs, which are common. In this release, A100 performance improves by roughly 40% and is now faster than 16-bit inference, while RTX 4090 and A40 performance drops slightly (about 10%).

This yields the following approximate speedups over 16-bit (BF16) inference:

  • RTX 4090: 3.8x
  • RTX 3090 / A40: 3.1x
  • A100: 1.5x
  • RTX 6000: 1.3x
  • RTX 2080 Ti: 1.1x

0.41.0

Features:

  • Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
  • CUDA SETUP no longer searches for libcuda and libcudart itself and instead relies on PyTorch's CUDA libraries. To manually override this behavior, see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

  • Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
  • Fixed a missing scipy dependency in requirements.txt. #544
  • Fixed a bug where a view operation could cause an error in 8-bit layers.
  • Fixed a bug where CPU-only bitsandbytes would fail during import. #593 Thank you @bilelomrani
  • Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
  • Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
  • Fixed a bug where read permission was assumed for a file. #497
  • Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

Documentation:

  • Improved documentation for GPUs that do not support 8-bit matmul. #529
  • Added description and pointers for the NF4 data type. #543
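As a hedged illustration of the NF4 data type the documentation now covers, here is a minimal sketch using bitsandbytes' 4-bit quantization functions (a CUDA device and the tensor shape are assumptions):

```python
import torch
import bitsandbytes.functional as F

# Quantize a tensor to 4-bit NormalFloat (NF4); quantize_4bit returns the
# packed 4-bit tensor plus the state needed to dequantize it again.
x = torch.randn(1024, 1024, device="cuda")
x_4bit, quant_state = F.quantize_4bit(x, quant_type="nf4")

# Dequantize back to floating point; the result is a lossy reconstruction.
x_restored = F.dequantize_4bit(x_4bit, quant_state, quant_type="nf4")
print((x - x_restored).abs().mean())  # mean absolute quantization error
```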

User experience:

  • Improved handling of the default compute_dtype for Linear4bit layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32 or bfloat16, but not float16). See the sketch below.
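A hedged sketch of what this means in practice (layer sizes, dtypes, and the device placement here are illustrative):

```python
import torch
import bitsandbytes as bnb

# With compute_dtype left unset, a bfloat16 input now also drives bfloat16
# compute, since bfloat16 is considered numerically stable enough.
layer = bnb.nn.Linear4bit(128, 256, bias=False)
layer = layer.cuda()  # moving to the GPU quantizes the weights to 4-bit

x = torch.randn(1, 128, dtype=torch.bfloat16, device="cuda")
out = layer(x)

# float16 is not adopted as a default; pass compute_dtype explicitly if
# float16 compute is actually wanted.
layer_fp16 = bnb.nn.Linear4bit(128, 256, bias=False,
                               compute_dtype=torch.float16).cuda()
```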

Performance:

  • Improved 4-bit inference performance for A100 GPUs. This slightly degraded performance for A40/RTX 3090 and RTX 4090 GPUs.

Deprecated:

  • 8-bit quantization and optimizers that do not use blockwise quantization will be removed in 0.42.0. All blockwise methods will remain fully supported.
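For reference, a hedged sketch of the blockwise 8-bit quantization API that remains supported (tensor shape and blocksize are illustrative; blocksize=4096 is the documented default):

```python
import torch
import bitsandbytes.functional as F

# Blockwise 8-bit quantization: the tensor is split into blocks and each
# block is scaled by its own absmax statistic.
x = torch.randn(4096, device="cuda")
x_8bit, quant_state = F.quantize_blockwise(x, blocksize=4096)

# Round-trip back to floating point using the stored per-block statistics.
x_restored = F.dequantize_blockwise(x_8bit, quant_state)
print((x - x_restored).abs().max())  # worst-case quantization error
```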