Releases: bitsandbytes-foundation/bitsandbytes

0.44.1

30 Sep 16:21

What's Changed

Full Changelog: 0.44.0...0.44.1

Multi-Backend Preview

30 Sep 18:51
Pre-release

To try this out, simply pip install 'FULL_DOWNLOAD_LINK', using the download link of the correct wheel from the "Assets" section of this release below.

Note that Windows is not supported for the AMD ROCm backend.

Latest `main` wheel

30 Sep 22:53
Pre-release

To try this out, simply pip install 'FULL_DOWNLOAD_LINK', using the download link of the correct wheel from the "Assets" section of this release below.

These wheels get built on every commit and become available as soon as the python-package.yml GH workflow finishes executing.

0.44.0: New AdEMAMix optimizer, Embeddings quantization, and more!

29 Sep 16:31

New optimizer: AdEMAMix

The AdEMAMix optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.

We've implemented 8bit and paged variations: AdEMAMix, AdEMAMix8bit, PagedAdEMAMix, and PagedAdEMAMix8bit. These can be used with a similar API to existing optimizers.

import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdEMAMix8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999, 0.9999),
    alpha=5.0,
    eps=1e-8,
    weight_decay=1e-2,
)

8-bit Optimizers Update

The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This is a change from the original implementation proposed in the paper and improves accuracy.
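
The optimizer API itself is unchanged. A minimal usage sketch with the 8-bit AdamW variant; the linear layer here is only a placeholder model:

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 1024).cuda()  # placeholder model

# 8-bit AdamW; the smaller 256-element block size is applied internally.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.999))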

CUDA Graphs support

A fix to enable CUDA Graphs capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!

Quantization for Embeddings

The trend of LLMs to use larger vocabularies continues. The embeddings can take up a significant portion of a quantized model's footprint. We now have an implementation of Embedding4bit and Embedding8bit thanks to @galqiwi!

Example usage:

import torch
import torch.nn as nn

from bitsandbytes.nn import Embedding4bit

fp16_module = nn.Embedding(128, 64)
quantized_module = Embedding4bit(128, 64)

# Copy the full-precision weights into the quantized module.
quantized_module.load_state_dict(fp16_module.state_dict())

# Quantization is applied when the module is moved to the GPU.
quantized_module = quantized_module.to(0)
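
Embedding8bit can be used the same way; a short sketch, assuming it mirrors the nn.Embedding constructor used above:

from bitsandbytes.nn import Embedding8bit

quantized_module_8bit = Embedding8bit(128, 64)
quantized_module_8bit.load_state_dict(fp16_module.state_dict())
quantized_module_8bit = quantized_module_8bit.to(0)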

Continuous Builds

We are now building binary wheels for each change on main. These builds can be used to preview upcoming changes.

🚤 Continuous Build

What's Changed

New Contributors

Full Changelog: 0.43.3...v0.44.0

0.43.3: enabling Llama 405B with 8xH/A100 + 256GB RAM

30 Jul 20:48

Improvements:

  • FSDP: Enable loading prequantized weights with bf16/fp16/fp32 quant_storage
    • Background: This update, linked to Transformers PR #32276, allows loading prequantized weights with alternative storage formats. Metadata is tracked similarly to Params4bit.__new__ post PR #970. It supports models exported with non-default quant_storage, such as an NF4 model with BF16 storage; a usage sketch follows after this list.
    • Special thanks to @winglian and @matthewdouglas for enabling FSDP+QLoRA finetuning of Llama 3.1 405B on a single 8xH100 or 8xA100 node with as little as 256GB system RAM.
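
A minimal sketch of constructing such a layer directly; the shapes here are placeholders rather than values from the release:

import torch
import bitsandbytes as bnb

# quant_storage controls the dtype used to store the packed 4-bit weights;
# bf16 storage lets FSDP shard them like ordinary bf16 parameters.
layer = bnb.nn.Linear4bit(
    1024,
    4096,
    quant_type="nf4",
    compute_dtype=torch.bfloat16,
    quant_storage=torch.bfloat16,
)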

0.43.2: finetune Llama 405B on 4x GPUs with improved QLoRA+FSDP, CUDA 12.5 support

23 Jul 18:42

0.43.2

This release is quite significant as the QLoRA bug fix has big implications for higher seqlen and batch sizes.

For each sequence (i.e. batch size increase of one) we expect memory savings of:

  • 405B: 39GB for seqlen=1024, and 4888GB for seqlen=128,000
  • 70B: 10.1GB for seqlen=1024, and 1258GB for seqlen=128,000

The savings come from activations that are unnecessary for frozen parameters; the now-fixed bug still erroneously allocated memory for them.

Improvements:

Bug Fixes

  • 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
  • missing optimizers in str2optimizer32bit (#1222, thanks @EtienneDosSantos)
  • CUDA 12.5 build issue (#1273, thanks @HennerM)
  • fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
  • QLoRA mem bug (#1270, thanks @Ther-nullptr)
  • tests for cpu only platforms (#1259, thanks @galqiwi)
  • restoration of quant_storage for CPU offloading (#1279)
  • optim update error with non-contiguous grads/params (deepspeed) (#1187)

0.43.1: Improved CUDA setup/diagnostics + 8-bit serialization, CUDA 12.4 support, docs enhancements

11 Apr 18:36

Improvements:

  • Improved the serialization format for 8-bit weights; this change is fully backwards compatible. (#1164, thanks to @younesbelkada for the contributions and @akx for the review).
  • Added CUDA 12.4 support to the Linux x86-64 build workflow, expanding the library's compatibility with the latest CUDA versions. (#1171, kudos to @matthewdouglas for this addition).
  • Docs enhancement: Improved the instructions for installing the library from source. (#1149, special thanks to @stevhliu for the enhancements).

Bug Fixes

  • Fix 4bit quantization with blocksize = 4096, where an illegal memory access was encountered. (#1160, thanks @matthewdouglas for fixing and @YLGH for reporting) See the sketch below for the affected code path.
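
A minimal sketch (not from the release notes) of the functional API that exercises this block size, assuming a CUDA device is available:

import torch
import bitsandbytes.functional as F

A = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize with the previously failing block size, then dequantize to round-trip.
q, quant_state = F.quantize_4bit(A, blocksize=4096, quant_type="nf4")
restored = F.dequantize_4bit(q, quant_state)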

Internal Improvements:

0.43.0: FSDP support, Official documentation, Cross-compilation on Linux and CI, Windows support

08 Mar 01:42

Improvements and New Features:

  • QLoRA + FSDP official support is now live! #970 by @warner-benjamin and team - with FSDP you can train very large models (70b scale) on multiple 24GB consumer-type GPUs. See https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html for more details.
  • Introduced improvements to the CI process for enhanced performance and efficiency during builds, specifically enabling more effective cross-compilation on Linux platforms. This was accomplished by deprecating Make and migrating to CMake, as well as implementing new corresponding workflows. Huge thanks go to @wkpark, @rickardp, @matthewdouglas and @younesbelkada; #1055, #1050, #1111.
  • Windows should be officially supported in bitsandbytes with pip install bitsandbytes
  • Updated installation instructions to provide more comprehensive guidance for users. This includes clearer explanations and additional tips for various setup scenarios, making the library more accessible to a broader audience (@rickardp, #1047).
  • Enhanced the library's compatibility and setup process, including fixes for CPU-only installations and improvements in CUDA setup error messaging. This effort aims to streamline the installation process and improve user experience across different platforms and setups (@wkpark, @akx, #1038, #996, #1012).
  • Set up new documentation at https://huggingface.co/docs/bitsandbytes/main with extensive new sections and content to help users better understand and utilize the library. Especially notable are the new API docs (big thanks to @stevhliu and @mishig25 from HuggingFace, #1012). The API docs were further addressed in #1075.

Bug Fixes:

  • Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@pnunna93, #1061).
  • Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@nairbv, #1063).

Backwards Compatibility

  • After upgrading from v0.42 to v0.43, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, see this comment.

Internal and Build System Enhancements:

  • Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@rickardp, @akx, @wkpark, @matthewdouglas; #949, #1053, #1045, #1037).

Contributors:

This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!

We give a special thanks to @TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!

We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.

4-bit serialization and bug fixes

08 Jan 01:19

This release added 4-bit serialization, implemented by @poedator, to bitsandbytes. With this, you can call model.save() and model.load() for models that contain 4-bit bitsandbytes layers, meaning you can save and load 4-bit models. All of this is integrated with the Hugging Face transformers stack. The 0.42.0 release also comes with many bug fixes. See below for detailed change logs.
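
A minimal sketch of the round trip through the Hugging Face transformers integration; the model id and output path are placeholders, and transformers plus accelerate are assumed to be installed:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load in 4-bit via bitsandbytes, then save and reload the quantized checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model.save_pretrained("opt-350m-4bit")
reloaded = AutoModelForCausalLM.from_pretrained("opt-350m-4bit", device_map="auto")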

0.42.0

Features:

  • 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753
  • the bitsandbytes library now has a version attribute: bitsandbytes.__version__ @rasbt #710

Bug fixes:

  • Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152
  • Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
  • Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00
  • Fixed a bug where a missing access permission in a path searched for CUDA would lead to an error @osma #677
  • Fixed a bug where the GOOGLE_VM_CONFIG_LOCK_FILE variable could cause errors in colab environments @akrentsel @xaptronic #715 #883 #622
  • Fixed a bug where kgetColRowStats (LLM.int8()) would fail for certain dimensions @LucQueen #905
  • Fixed a bug where the adjusted regular Embedding layer was not available via bnb.nn.Embedding @neel04 #563
  • Added the missing scipy requirement @dulalbert #525

Bug and CUDA fixes + performance

23 Jul 14:10

Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We trust PyTorch to find the proper CUDA binaries and use those. If you use a CUDA version that differs from PyTorch, you can now control the binary that is loaded for bitsandbytes by setting the BNB_CUDA_VERSION variable. See the custom CUDA guide for more information.
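
A minimal sketch of that override; the version string "118" is only an example, and the matching CUDA libraries must still be discoverable (e.g. via LD_LIBRARY_PATH):

import os

# Must be set before bitsandbytes is imported; "118" selects the CUDA 11.8 binary.
os.environ["BNB_CUDA_VERSION"] = "118"

import bitsandbytes as bnb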

Besides that, this release features a wide range of bug fixes, CUDA 11.8 support for Ada and Hopper GPUs, and an update for 4-bit inference performance.

Previous 4-bit inference kernels were optimized for RTX 4090 and Ampere A40 GPUs, but performance was poor on the widely used A100. In this release, A100 performance improves by roughly 40% and is now faster than 16-bit inference, while RTX 4090 and A40 performance is slightly lower (about 10% lower).

This leads to approximate speedups compared to 16-bit (BF16) of roughly:

  • RTX 4090: 3.8x
  • RTX 3090 / A40: 3.1x
  • A100: 1.5x
  • RTX 6000: 1.3x
  • RTX 2080 Ti: 1.1x

0.41.0

Features:

  • Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
  • CUDA SETUP no longer looks for libcuda and libcudart and instead relies on PyTorch's CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

  • Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
  • Fixed a missing scipy dependency in requirements.txt. #544
  • Fixed a bug where a view operation could cause an error in 8-bit layers.
  • Fixed a bug where CPU-only bitsandbytes would fail during import. #593 Thank you @bilelomrani
  • Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
  • Removed outdated get_cuda_lib_handle calls that lead to errors. #595 Thank you @ihsanturk
  • Fixed bug where read-permission was assumed for a file. #497
  • Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

Documentation:

  • Improved documentation for GPUs that do not support 8-bit matmul. #529
  • Added description and pointers for the NF4 data type. #543

User experience:

  • Improved handling of the default compute_dtype for Linear4bit layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32, bfloat16, but not float16). A sketch follows below.
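
For illustration only, a sketch of setting compute_dtype explicitly instead of relying on the default; the layer sizes are placeholders:

import torch
import bitsandbytes as bnb

# Explicit compute dtype; otherwise it is inferred from the input dtype when that
# dtype is stable enough (float32 or bfloat16, but not float16).
layer = bnb.nn.Linear4bit(1024, 1024, compute_dtype=torch.bfloat16, quant_type="nf4")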

Performance:

  • Improved 4-bit inference performance for A100 GPUs. This slightly degraded performance for A40/RTX 3090 and RTX 4090 GPUs.

Deprecated:

  • 8-bit quantization and optimizers that do not use blockwise quantization will be removed in 0.42.0. All blockwise methods will remain fully supported.