The OLMES paper is a pretty interesting read and is complementary to LM-Eval. I think there are a few features we can consider implementing based on lessons and recommendations from the paper.
Implement a mode-switch between the multiple-choice formulation and the completion/cloze formulation. Empirically, models respond to the two versions differently depending on how many tokens they were trained on, with the multiple-choice formulation showing a stronger signal in later stages of training (400B tokens and above). The paper's recommendation is to evaluate on both and take the higher score. lm-eval would benefit from letting users write one prompt format and have it automatically rendered in both formulations.
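As a rough illustration of what "write once, render both" could look like (names and structure here are hypothetical, not an existing lm-eval API), a single task definition could be expanded into both formulations:

```python
# Hypothetical sketch: expand one task definition into both the
# multiple-choice (MCF) and cloze/completion (CF) formulations.
def render_prompts(question, choices):
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    # MCF: the model sees all options and scores the answer letter.
    mcf_prompt = (
        question + "\n"
        + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
        + "\nAnswer:"
    )
    # CF: the model scores each full answer string as a continuation
    # of the question, without seeing the other options.
    return {
        "mcf": {"prompt": mcf_prompt, "targets": letters},
        "cf": {"prefix": question + "\nAnswer:", "continuations": choices},
    }
```

A harness could then run both renderings of the same task and report the max, per the paper's recommendation.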
Normalization should be a configuration feature. We currently use both non-normalized accuracy and normalized accuracy, specifically dividing the log-probability by the number of characters. It would be great to be able to choose the normalization per task and to add more options, e.g. normalization by token length or pointwise mutual information.
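A pluggable normalization option might look something like the following sketch (the function name and `mode` strings are illustrative, not lm-eval's actual interface):

```python
# Hypothetical sketch of selectable scoring normalizations.
def normalize_score(logprob, continuation, n_tokens,
                    uncond_logprob=None, mode="none"):
    if mode == "none":
        # Raw summed log-probability of the continuation.
        return logprob
    if mode == "char":
        # Length-normalize by character count (acc_norm-style).
        return logprob / max(len(continuation), 1)
    if mode == "token":
        # Length-normalize by token count instead of characters.
        return logprob / max(n_tokens, 1)
    if mode == "pmi":
        # Pointwise mutual information:
        # log p(cont | context) - log p(cont | neutral prompt).
        assert uncond_logprob is not None
        return logprob - uncond_logprob
    raise ValueError(f"unknown normalization mode: {mode}")
```

The PMI mode needs a second forward pass to score the continuation against an unconditional/neutral prompt, which is why it is worth treating as an explicit configuration choice rather than a default.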
Further, add support for fewshot selection. We have some support for hardcoding fewshot samples, but we've never really supported selecting fewshots under conditions like "make sure the answer choices are not all A", or passing a list of index numbers that point directly to the samples to use.
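One possible shape for this, sketched with hypothetical names (`select_fewshot` and the constraint callback are not existing lm-eval functions): accept either an explicit index list or a predicate that resampling must satisfy.

```python
import random

# Hypothetical sketch: pick few-shot examples either by explicit
# indices or by seeded sampling under a user-supplied constraint.
def select_fewshot(pool, k, indices=None, constraint=None,
                   seed=0, max_tries=100):
    if indices is not None:
        # Direct selection: the user points at specific samples.
        return [pool[i] for i in indices]
    rng = random.Random(seed)
    for _ in range(max_tries):
        sample = rng.sample(pool, k)
        if constraint is None or constraint(sample):
            return sample
    raise RuntimeError("no fewshot sample satisfied the constraint")

# Example constraint: the gold answer letters must not all match.
def not_all_same_answer(sample):
    return len({ex["answer"] for ex in sample}) > 1
```

Keeping the constraint as a plain callable would let task authors encode checks like label balance without the harness having to anticipate every condition.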
cc: @haileyschoelkopf @StellaAthena