I'm running some experiments on the scaling laws of Mamba2 and was unable to find the hyperparameter values used for the multi-head structure ablation.
Any details on the learning rate, batch size, training steps, weight decay, gradient clipping, LR schedule, and optimizer settings would be awesome!
Also, what expansion factor is used for these ablations? Thanks so much for the help!
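For concreteness, here is a minimal sketch of the config fields being asked about. Every value below is a placeholder assumption (e.g. AdamW, a warmup-plus-cosine schedule, and `expand=2`, which is the documented default in the `mamba_ssm` package), not the authors' actual ablation settings; those settings are exactly what this issue is requesting.

```python
# Sketch of the training hyperparameters this issue asks about.
# All values are placeholders / assumptions, NOT the authors' actual
# ablation settings.
from dataclasses import dataclass


@dataclass
class AblationTrainConfig:
    # Optimizer (AdamW assumed; betas are a common LM-training choice)
    optimizer: str = "AdamW"
    lr: float = 1e-3                 # peak learning rate (placeholder)
    betas: tuple = (0.9, 0.95)       # assumed Adam betas
    weight_decay: float = 0.1        # placeholder
    grad_clip: float = 1.0           # placeholder gradient-clip norm

    # Schedule (warmup + cosine decay assumed)
    lr_schedule: str = "cosine"
    warmup_steps: int = 1_000        # placeholder

    # Data / duration
    batch_size_tokens: int = 524_288  # tokens per step (placeholder)
    train_steps: int = 10_000         # placeholder

    # Architecture knob raised in this issue: d_inner = expand * d_model.
    # expand=2 is Mamba's default; whether the ablations use it is the
    # open question here.
    expand: int = 2


cfg = AblationTrainConfig()
print(cfg)
```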