
FIX: Ensure Device Compatibility for BOFT Forward/Merging #2242

Merged: 8 commits merged into huggingface:main from d-kleine:boft_cuda on Dec 9, 2024

Conversation

@d-kleine (Contributor) commented Nov 29, 2024

Fixes #2219

Description

This pull request resolves the issue above concerning BOFT forward/merging on CUDA by ensuring that all relevant tensors and models are moved to the correct device. Without this, merging could produce zero matrices, and the tests failed when running on CUDA.

Changes

  • Ensure that all operations are performed on the same device (see the sketch after this list):
    • Added torch_device = infer_device() at the beginning of the MultipleActiveAdaptersTester class to infer and set the appropriate device (CPU or CUDA).
    • Modified prepare_inputs_for_testing() to move input tensors to the inferred device using .to(self.torch_device).
    • Updated model initialization in test_merge_layers_multi to move models to the inferred device with .to(self.torch_device).
  • Due to floating-point precision on CUDA, the absolute tolerance for test_multiple_active_adapters_merge_and_unmerge had to be relaxed from atol=1e-5 to atol=1e-4.
  • Updated the CUDA code to use Tensor.data_ptr<scalar_t>() instead of the deprecated Tensor.data<scalar_t>(), for compatibility with current PyTorch standards.
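
A minimal sketch of the device-handling pattern described above, assuming infer_device from peft.utils; the method bodies are simplified stand-ins for the real tests, and check_merge_unmerge is a hypothetical helper illustrating the relaxed tolerance:

```python
import torch

from peft.utils import infer_device  # resolves to "cuda" when available, else "cpu"


class MultipleActiveAdaptersTester:
    # Infer the device once so that every tensor and model in this class agrees.
    torch_device = infer_device()

    def prepare_inputs_for_testing(self):
        # Inputs must live on the same device as the model under test.
        X = torch.randn(9, 10).to(self.torch_device)
        return {"X": X}

    def check_merge_unmerge(self, out_before, out_after):
        # Hypothetical helper: CUDA float kernels are not bit-identical across
        # merge/unmerge, hence the tolerance relaxed from atol=1e-5 to atol=1e-4.
        assert torch.allclose(out_before, out_after, atol=1e-4)
```

Models created inside the tests are moved the same way, e.g. model = model_cls(**kwargs).to(self.torch_device), so forward passes and merges never mix CPU and CUDA tensors.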

Testing

  • Verified that the tests in tests/test_custom_models.py pass with these changes on both Linux (Ubuntu) and Windows.
  • Ensured compatibility across different environments, including CPU-only and CUDA-enabled setups.

BenjaminBossan and others added 5 commits October 8, 2024 14:16
See: huggingface/diffusers#9510 (comment)

Right now, the low_cpu_mem_usage=True option does not consolidate the devices. E.g. when the model is on the GPU and the state_dict is on the CPU, the adapter weight will be on the CPU after loading, when it should be on the GPU. This fix ensures that the devices are consolidated.
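
A rough sketch of the consolidation idea this commit describes (consolidate_adapter_weight is a hypothetical helper, not the actual PEFT code; the real fix lives in the low_cpu_mem_usage loading path):

```python
import torch
import torch.nn as nn


def consolidate_adapter_weight(module: nn.Module, loaded: torch.Tensor) -> torch.Tensor:
    # A state_dict tensor loaded on CPU must follow the module's device;
    # otherwise the adapter weight stays on CPU while the base model sits
    # on the GPU.
    target_device = next(module.parameters()).device
    return loaded.to(target_device)
```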
Solves the following bug:

huggingface/diffusers#9622 (comment)

The cause of the bug is as follows: when we have, say, a module called "bar.0.query" that we want to target and another module called "foo_bar.0.query" that we don't want to target, there was potential for an error. This is not caused by _find_minimal_target_modules directly; rather, the bug was inside BaseTuner.inject_adapter and how the names_no_target were chosen. Those used to be chosen based on suffix. In our example, however, "bar.0.query" is a string suffix of "foo_bar.0.query", so "foo_bar.0.query" was *not* added to names_no_target when it should have been. As a consequence, during the optimization it looked as if "query" were safe to use as target_modules, because we didn't see that it wrongly matches "foo_bar.0.query".
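
The pitfall is easy to reproduce with plain string checks (a standalone sketch, not the actual inject_adapter code):

```python
def matches_naive(name: str, target: str) -> bool:
    # Buggy check: a plain suffix test. "bar.0.query" is a string suffix of
    # "foo_bar.0.query", so the latter is wrongly treated as matching.
    return name.endswith(target)


def matches_fixed(name: str, target: str) -> bool:
    # Fixed check: only match on a whole module-path segment boundary.
    return name == target or name.endswith("." + target)


assert matches_naive("foo_bar.0.query", "bar.0.query")      # false positive
assert not matches_fixed("foo_bar.0.query", "bar.0.query")  # correctly rejected
assert matches_fixed("encoder.bar.0.query", "bar.0.query")  # genuine suffix still matches
```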
@d-kleine marked this pull request as ready for review on November 29, 2024 08:19
@BenjaminBossan (Member)

Thanks a lot for opening this PR to fix the issue with the BOFT tests. It appears that your fork was outdated when you opened the PR, as it contains unrelated changes and merge conflicts. Could you please try merging with or rebasing on the current main branch? If this creates new conflicts, opening a new PR from an up-to-date fork might be the easier solution.

@d-kleine (Contributor, Author) commented Nov 29, 2024

Just merged main into my PR's branch. (I was expecting this, as I had implemented the changes against v0.13.2; I was just waiting for a confirmation from your side before merging.)

@d-kleine changed the title from "Fix: Ensure Device Compatibility for BOFT Forward/Merging with CUDA" to "FIX: Ensure Device Compatibility for BOFT Forward/Merging with CUDA" on Nov 29, 2024
@d-kleine changed the title from "FIX: Ensure Device Compatibility for BOFT Forward/Merging with CUDA" to "FIX: Ensure Device Compatibility for BOFT Forward/Merging" on Nov 29, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@d-kleine (Contributor, Author) commented Nov 29, 2024

@BenjaminBossan Can you please re-run the failed test? It hit a rate limit: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/models/hf-internal-testing/tiny-stable-diffusion-torch

@BenjaminBossan (Member)

@d-kleine Not sure what's going on with the CI right now; I'll rerun it next week and hopefully it will resolve itself.

@d-kleine (Contributor, Author)

Alright, thanks!

@d-kleine marked this pull request as draft on December 2, 2024 02:17
@d-kleine (Contributor, Author) commented Dec 2, 2024

I have also just fixed the fbd_cuda deprecation warning: #2219 (comment)

As the GitHub Actions workflows seem to run on CPU only, please make sure when reviewing to run the tests on a CUDA-enabled device, and also empty the cache to check that the fbd_cuda warnings are gone.

@d-kleine marked this pull request as ready for review on December 2, 2024 09:01
@BenjaminBossan (Member) left a comment


Thanks so much for your work on this @d-kleine. I can confirm that this works for me locally on CUDA.

Pinging @Zeju1997, who can hopefully confirm that this fix is good. I'll wait ~2-3 days for a reply; if I don't hear back by then, I'll merge the PR.

@d-kleine (Contributor, Author) commented Dec 2, 2024

Great, thanks! 👍🏻 In case you need further changes, please feel free to ping me 🙂

@BenjaminBossan (Member)

Okay, we haven't heard back in a week, so I'll just go ahead and merge this now. Again, thanks for this fix @d-kleine.

@BenjaminBossan merged commit ec92cdc into huggingface:main on Dec 9, 2024
14 checks passed
@d-kleine (Contributor, Author) commented Dec 9, 2024

Thanks for merging! 🙂

@d-kleine deleted the boft_cuda branch on December 9, 2024 14:02
Successfully merging this pull request may close issue #2219: Bug: BOFT forward/merging with CUDA