Support optimum-quanto #1997
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
not stale
This would be extremely helpful. Currently, Flux with LoRA mostly runs on A100-class GPUs, but Flux can comfortably run on a 4090, a consumer card. If this issue is resolved, it would make it possible to run Flux with LoRA on consumer cards like the 4090. Thank you for your work @BenjaminBossan
I am able to run this on a 4090:

```python
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipeline.load_lora_weights(
    "TheLastBen/Jon_Snow_Flux_LoRA", weight_name="jon_snow.safetensors"
)
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline.enable_model_cpu_offload()

prompt = "jon snow eating pizza with ketchup"
out = pipeline(prompt, num_inference_steps=20, guidance_scale=4.0)
out.images[0].save("output.png")
```

What am I missing?
Thanks for the write-up. I will try it again later.
Thank you @sayakpaul. I was running it using the FP8 version following
Why are you quantizing when you can perfectly run it without quantization? And we haven't yet landed the support to load LoRAs in a quantized base model. So, I am not going to comment on that.
Understood. Really appreciate your work. Let me reframe my problem: as of now, quantized Flux can comfortably run on a 4090, but the 4090 cannot run the BF16 version of Flux seamlessly, at least in my case. I guess the BF16 version of Flux is really pushing the limit of 24GB of GPU memory.
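To put rough numbers on that, here is a back-of-the-envelope sketch of the weight memory (the ~12B parameter count for the FLUX.1-dev transformer is an assumption based on public descriptions of the model, not something stated in this thread):

```python
# Back-of-the-envelope memory estimate for the FLUX.1-dev transformer weights.
# The ~12B parameter count is an assumed figure for illustration.
params = 12e9
bytes_per_param = 2  # bfloat16 stores each parameter in 2 bytes

weights_gib = params * bytes_per_param / 2**30
print(f"~{weights_gib:.1f} GiB for transformer weights alone")
```

That is before the text encoders, activations, and CUDA overhead, which is why CPU offload or quantization is what makes a 24GB card workable.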
I have provided a working code snippet. If you run the example with the latest versions of the libraries, it should work.
Sure, will run the test again.
Sorry, you are right. Using the latest Hugging Face libraries, they do work on a 4090. Sorry for the trouble @sayakpaul
No issues, glad things worked out. You might also be interested in huggingface/diffusers#9213 :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
not stale
Feature request
Let's add a new quantization method to LoRA, namely optimum-quanto.
There is some more context in this diffusers issue.
Motivation
First of all, the more quantization methods we support, the better. But notably, quanto also works with MPS, which distinguishes it from other quantization methods.
Your contribution
I did some preliminary testing, and quanto already partly works with PEFT: the `QLinear` layer is a subclass of `nn.Linear`, and as such, `lora.Linear` is applied. Some features, like inference, appear to work already. However, some features don't work correctly, like merging. Here is a very quick test. Note that all the outputs involving merging are not as expected.
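For context on why `lora.Linear` gets picked up automatically, here is a minimal, library-free sketch of the isinstance-based dispatch. The class names stand in for the real `torch.nn.Linear` and quanto's `QLinear`; this is an illustration of the mechanism, not PEFT's actual code:

```python
class Linear:
    """Stand-in for torch.nn.Linear."""

class QLinear(Linear):
    """Stand-in for quanto's QLinear, which subclasses nn.Linear."""

def dispatch_lora_layer(module):
    # PEFT-style dispatch: the generic LoRA wrapper matches any
    # nn.Linear subclass, so a quanto QLinear silently takes the same
    # path as a regular Linear. Inference therefore "just works", while
    # quantization-aware operations like merging were never written for
    # quantized weights and produce wrong results.
    if isinstance(module, Linear):
        return "lora.Linear"
    return "unsupported"

print(dispatch_lora_layer(QLinear()))  # the quantized layer takes the generic path
```

A dedicated integration would presumably match `QLinear` explicitly instead, and implement merging by dequantizing the base weight, adding the LoRA delta, and requantizing.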
I can certainly take this when I have time but contributions are highly welcome. For inspiration, check out past PRs that add new quantization methods.