I'm running some experiments on the scaling laws of Mamba2 and was unable to find the hyperparameter values used for the multi-head structure ablation.
Any details on the learning rate, batch size, training steps, weight decay, gradient clipping, LR schedule, and optimizer settings would be awesome!
Also, what expansion factor is used for these ablations? Thanks so much for the help!
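For concreteness, here is a minimal sketch of the config fields being asked about. Every value below is a placeholder assumption (e.g. AdamW, a warmup-plus-cosine schedule, and `expand=2`, which is the documented default in the `mamba_ssm` package), not the authors' actual ablation settings; those settings are exactly what this issue is requesting.

```python
# Sketch of the training hyperparameters this issue asks about.
# All values are placeholders / assumptions, NOT the authors' actual
# ablation settings.
from dataclasses import dataclass


@dataclass
class AblationTrainConfig:
    # Optimizer (AdamW assumed; betas are a common LM-training choice)
    optimizer: str = "AdamW"
    lr: float = 1e-3                 # peak learning rate (placeholder)
    betas: tuple = (0.9, 0.95)       # assumed Adam betas
    weight_decay: float = 0.1        # placeholder
    grad_clip: float = 1.0           # placeholder gradient-clip norm

    # Schedule (warmup + cosine decay assumed)
    lr_schedule: str = "cosine"
    warmup_steps: int = 1_000        # placeholder

    # Data / duration
    batch_size_tokens: int = 524_288  # tokens per step (placeholder)
    train_steps: int = 10_000         # placeholder

    # Architecture knob raised in this issue: d_inner = expand * d_model.
    # expand=2 is Mamba's default; whether the ablations use it is the
    # open question here.
    expand: int = 2


cfg = AblationTrainConfig()
print(cfg)
```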