Add Sample Sparsification Method #250
base: main
Conversation
This reverts commit bb1c74f.
…/mergekit into Sample-Sparsification
I have now realized that this isn't the only merge of this kind that is possible. This one is quite similar to dropping, but there is another, closer to trimming, that reaches a similar result. I might try implementing that as well.
Here are the empirical test results of sampling vs. DARE and magnitude:
Several runs are shown to illustrate the randomness. As can be seen, sampling shows less variance than DARE and does not appear to have the "bad runs" that DARE does. It is also nondeterministic, unlike magnitude. We cannot conclude from this small, low-sample-size run that sampling is better than magnitude or DARE, but it shows promise and already has some good effects (low variance compared to DARE).
…/mergekit into Sample-Sparsification
Added another method, as well as a parameter that affects both of them. The new method is based on TopK, but instead of simply restricting the tensor to a certain percentage, we use the ranks of its entries to build a Bernoulli distribution (just like the sampling method). It should behave closer to TopK and have even lower variance than sampling. The new parameter is more experimental: it simply skips the Bernoulli step, resulting in a spread-out distribution without all the zeros. Since the value is still concentrated in part of the tensor, as with sparsification, this should still reduce conflict. Because many more of the original values are retained, this option should be more robust, allowing lower density values and more stable iterative merges. (Smooth selection is also deterministic, which may be preferred.) A rough sketch of the idea is shown below.
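To make the description above concrete, here is a minimal sketch of a rank-based variant, not the PR's actual code; the function name `rank_sparsify` and the `smooth` flag are placeholders, and the exact probability scaling is an assumption.

```python
import torch


def rank_sparsify(delta: torch.Tensor, density: float, smooth: bool = False) -> torch.Tensor:
    """Keep entries with probability derived from their magnitude rank (TopK-like).

    This is an illustrative sketch, not the PR's implementation.
    """
    if density >= 1.0:
        return delta

    flat = delta.abs().flatten().float()
    # Rank each entry by magnitude: smallest magnitude -> rank 1, largest -> rank N.
    ranks = torch.empty_like(flat)
    ranks[flat.argsort()] = torch.arange(
        1, flat.numel() + 1, dtype=flat.dtype, device=flat.device
    )
    # Turn ranks into keep-probabilities whose mean roughly equals the target density.
    probs = ranks / ranks.numel()
    probs = (probs * density / probs.mean()).clamp(0.0, 1.0)
    probs = probs.reshape(delta.shape)

    if smooth:
        # "Smooth selection": skip the Bernoulli draw and scale by the probabilities
        # directly, spreading the value over the tensor deterministically.
        return delta * probs.to(delta.dtype)

    mask = torch.bernoulli(probs)
    # Rescale surviving entries so the expected value of the tensor is preserved.
    return delta * mask.to(delta.dtype) / probs.clamp_min(1e-8).to(delta.dtype)
```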
Looks like this breaks the sparsification unit tests - could you update them to pass in the new arguments (or give them default values)?
Oh, I think I know what the problem is. Sorry, I haven't really looked at the sparsification unit tests, so I'll try to fix them.
This is a new sparsification method that I have been thinking about. The trimming and dropping methods resemble the Top-P and Typical-P samplers used when sampling from LLMs. However, by far the most popular sampler is the temperature sampler.
This sparsification method samples the tensor itself to create its mask. It is FAR more computationally expensive, but it should, theoretically, outperform the other methods.
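For illustration, here is a minimal sketch of what "sampling the tensor itself to create its mask" could look like, assuming a temperature-scaled softmax over magnitudes followed by a Bernoulli draw; the name `sample_sparsify` and the rescaling details are assumptions, not the PR's actual implementation.

```python
import torch


def sample_sparsify(delta: torch.Tensor, density: float, temperature: float = 1.0) -> torch.Tensor:
    """Keep each entry of `delta` with a probability derived from its magnitude.

    Illustrative sketch only; the real method in the PR may differ.
    """
    if density >= 1.0:
        return delta

    flat = delta.abs().flatten().float()
    # Temperature-scaled softmax over magnitudes, analogous to temperature
    # sampling when decoding from an LLM.
    probs = torch.softmax(flat / max(temperature, 1e-8), dim=0)
    # Scale so the expected number of kept entries matches the target density,
    # then clamp to valid Bernoulli probabilities.
    probs = (probs * density * flat.numel()).clamp(0.0, 1.0)
    probs = probs.reshape(delta.shape)

    mask = torch.bernoulli(probs)
    # Rescale surviving entries (as DARE-style dropping does) so the expected
    # value of the sparsified tensor matches the original.
    return delta * mask.to(delta.dtype) / probs.clamp_min(1e-8).to(delta.dtype)
```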