
[Feature request] merge methods #113

Open · Enferlain opened this issue Feb 1, 2024 · 5 comments
Labels: enhancement (New feature or request)


Enferlain commented Feb 1, 2024

Thank you, this is a great addition to the merger space, but I just wanted to ask: have you thought about adding various merge_methods?

https://github.com/s1dlx/meh/blob/81515bda3cba52edd263964c5517f4713faad86e/sd_meh/merge_methods.py

Most up to date: https://github.com/ljleb/sd-mecha/blob/main/sd_mecha/merge_methods.py


wkpark commented Feb 1, 2024

Thank you for the information!

wkpark self-assigned this Feb 1, 2024
wkpark added the enhancement (New feature or request) label Feb 1, 2024

wkpark commented Feb 22, 2024

Please see #35


wkpark commented Mar 8, 2024

These routines are tensor-level, which means they could be applied to the model mixer easily, but we also need to check each algorithm, its meaning, and its speed.


ljleb commented Mar 8, 2024

There is a more up-to-date set of merge method implementations:
https://github.com/ljleb/sd-mecha/blob/main/sd_mecha/merge_methods.py

Everything marked as @convert_to_recipe is a merge method.

I can explain any or all of them if you want, as I came up with most of them. You can decide whether any are worth implementing. rotate is the slowest one; it takes ~1h on SDXL and ~9 minutes on SD1.5.


ljleb commented Mar 8, 2024

All merge methods are key-level. I omitted weighted sum and add difference as they are trivial.

  • slerp(a, b): spherical linear interpolation. Normalize A and B, slerp between them, then recover a proper norm by interpolating the norm of A and the norm of B. (A rough sketch of this and the next two methods appears after this list.)
  • perpendicular_component(a, b): intended to be used in delta space: c + perpendicular_component(a - c, b - c). It finds the component of A perpendicular to B, which lets an add-difference merge contribute only orthogonal information.
  • geometric_sum(a, b): AND gate. It works either in delta space or directly with model weights, and is equivalent to a weighted sum in log space. For any corresponding parameters in A and B, if A or B is 0, then the result is 0; and if A and B have the same value, then that value is returned.
  • add_cosine_a(a, b), add_cosine_b(a, b): brought from supermerger; not exactly sure whether they are good or not.
  • ties_sum(a, b, c, ...): implementation of TIES (https://arxiv.org/abs/2306.01708)
  • tensor_sum(a, b): copy parameters from A and from B, using a window over dimension 0 to decide which model to pick the weights from. Brought from supermerger. (Sketched after this list.)
  • top_k_tensor_sum(a, b): reorder the parameters of A into the order of the parameters of B (call this reordered weight C). Then determine a mask that picks the top-k weights in A to give up on, and replace them with the values from C at the corresponding indices in the weight.
  • train_difference(a, b, c): original supermerger train difference, except that I found a better filter metric; see hako-mikan/sd-webui-supermerger#264 (reply in thread), "New method suggestions for additional merge potential (code+output comparisons included)".
  • multiply_quotient(a, b, c): train difference in log space. It tries to make the equation $\frac{AB}{C}$ work. Without the dissimilarity filter, it completely breaks down. Otherwise the resulting model gives very similar outputs to train difference, although it is still different. alpha can be brought up to 4 before it starts breaking down with NaNs at generation. @John-WL came up with the idea and I found a way to implement it.
  • distribution_crossover(a, b, c): reorder the weights of A and B into the order of C. Apply a crossover filter between A and B. A contributes the low end of the model, B contributes the high end. Then, reorder the merged weights back in the order of C.
  • crossover(a, b): n-dimensional crossover between A and B (when A and B are conv layers, the spatial dimensions stay on their axis). A contributes the low end, B contributes the high end. The weights are not reordered or reshaped. The filter should be isotropic, but honestly I'm not an expert in filter modelling, so this might need to be verified.
  • rotate(a, b): find an orthogonal transform Q that minimizes the Frobenius norm between AQ and B, then return $A^{'}Q^{\alpha}$ with $\alpha \in [0,1]$ the alignment factor. $A^{'}$ is the weighted sum between A and $BQ^T$, which effectively interpolates the relationship between the neurons of A and the neurons of B oriented towards A. Contrary to the other methods, this one works on the "neurons" of A and B. A "neuron" is just a quick way to refer to "all parameters that contribute to a single weighted sum operation during inference" (matrix multiplication can be seen as one weighted sum per output value). This is highly inspired by OPT https://opt-training.github.io/ (a rough Procrustes sketch appears after this list).
  • clip(a, b): weight clipping, but it allows softening the clip bounds by using multiple models.
  • dropout(a, b, c, ...): implementation of DARE, but with multiple models as input. I went a bit experimental with it by adding parameters to control the way the Bernoulli mask is created.
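
To make the first few concrete, here is a minimal, hypothetical PyTorch sketch of slerp, perpendicular_component, and geometric_sum as described above, operating on a single pair of tensors (one state-dict key). The function signatures, the `alpha` parameter, and the sign handling in geometric_sum are assumptions for illustration; see sd-mecha's merge_methods.py for the actual implementations.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, alpha: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation of the directions, linear interpolation of the norms."""
    a_norm, b_norm = a.norm(), b.norm()
    a_unit, b_unit = a / (a_norm + eps), b / (b_norm + eps)
    dot = (a_unit * b_unit).sum().clamp(-1.0, 1.0)   # cosine of the angle between the flattened tensors
    omega = torch.acos(dot)
    if omega.abs() < eps:
        mixed = torch.lerp(a_unit, b_unit, alpha)    # nearly parallel: fall back to lerp
    else:
        mixed = (torch.sin((1 - alpha) * omega) * a_unit
                 + torch.sin(alpha * omega) * b_unit) / torch.sin(omega)
    # recover a proper norm by interpolating the norms of A and B
    return mixed * torch.lerp(a_norm, b_norm, alpha)

def perpendicular_component(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Component of A orthogonal to B; intended for delta space: c + perpendicular_component(a - c, b - c)."""
    projection = (a * b).sum() / ((b * b).sum() + eps) * b
    return a - projection

def geometric_sum(a: torch.Tensor, b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Weighted sum of the magnitudes in log space; zero wherever A or B is zero."""
    # assumption: keep the sign only where A and B agree, zero out the rest
    sign = torch.where(torch.sign(a) == torch.sign(b), torch.sign(a), torch.zeros_like(a))
    return sign * a.abs() ** (1 - alpha) * b.abs() ** alpha
```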
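
The windowing idea behind tensor_sum could look roughly like the following sketch: take a contiguous window of rows (dimension 0) from B and keep the rest from A. The `width` and `offset` parameters (both fractions of dimension 0, with wrap-around) are assumed names for illustration, not necessarily the ones supermerger or sd-mecha use.

```python
import torch

def tensor_sum(a: torch.Tensor, b: torch.Tensor, width: float, offset: float) -> torch.Tensor:
    n = a.shape[0]
    start = int(offset * n) % n
    length = int(width * n)
    indices = torch.arange(start, start + length) % n   # window may wrap around dimension 0
    result = a.clone()
    result[indices] = b[indices]                         # rows inside the window come from B
    return result
```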
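
And a rough sketch of the rotate idea, assuming 2D weight matrices and full alignment ($\alpha = 1$) so that the fractional matrix power $Q^{\alpha}$ is not needed; `beta` (the weighted-sum factor for $A^{'}$) is an assumed parameter name.

```python
import torch

def rotate_full_alignment(a: torch.Tensor, b: torch.Tensor, beta: float = 0.0) -> torch.Tensor:
    # orthogonal Procrustes: Q minimizes ||AQ - B||_F, with Q = U V^T where U S V^T = svd(A^T B)
    u, _, v_t = torch.linalg.svd(a.T @ b)
    q = u @ v_t
    # A' interpolates between A and B Q^T (B expressed in A's neuron orientation)
    a_prime = torch.lerp(a, b @ q.T, beta)
    # with alpha = 1 the result is A' Q; for 0 < alpha < 1 the actual method
    # applies a fractional matrix power Q^alpha instead of Q
    return a_prime @ q
```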

I apologize if this is too much text to read. Let me know if I can clarify anything.
