diff --git a/CHANGELOG.md b/CHANGELOG.md
index ee4da2fe7..706387d9e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -283,3 +283,30 @@ Bug fixes:
  - Removed outdated get_cuda_lib_handle calls that lead to errors. #595 Thank you @ihsanturk
  - Fixed bug where read-permission was assumed for a file. #497
  - Fixed a bug where prefetchAsync lead to errors on GPUs that do not support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro
+
+
+### 0.41.0
+
+Features:
+ - Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation. #571
+ - CUDA SETUP no longer looks for libcuda and libcudart and instead relies on the PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk
+
+Bug fixes:
+ - Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
+ - Fixed a missing scipy dependency in requirements.txt. #544
+ - Fixed a bug where a view operation could cause an error in 8-bit layers.
+ - Fixed a bug where CPU bitsandbytes would fail during the import. #593 Thank you @bilelomrani
+ - Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes. #588
+ - Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
+ - Fixed a bug where read permission was assumed for a file. #497
+ - Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro
+
+Documentation:
+ - Improved documentation for GPUs that do not support 8-bit matmul. #529
+ - Added description and pointers for the NF4 data type. #543
+
+User experience:
+ - Improved handling of the default compute_dtype for Linear4bit layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32, bfloat16, but not float16).
+
+Performance:
+ - Improved 4-bit inference performance for A100 GPUs. This slightly degraded performance for A40, RTX 3090, and RTX 4090 GPUs.
diff --git a/setup.py b/setup.py
index 5769c856f..565bfe715 100644
--- a/setup.py
+++ b/setup.py
@@ -18,7 +18,7 @@ def read(fname):
 
 setup(
     name=f"bitsandbytes",
-    version=f"0.40.2",
+    version=f"0.41.0",
     author="Tim Dettmers",
     author_email="dettmers@cs.washington.edu",
     description="k-bit optimizers and matrix multiplication routines.",