The OLMES paper is a pretty interesting read and is complementary to LM-Eval. I think there are a few features we can consider implementing based on lessons and recommendations from the paper.
Implement a mode-switch between the multiple-choice formulation and the completion/cloze formulation. Empirically, models respond to the two versions differently depending on how many tokens they were trained on, with the multiple-choice formulation showing a stronger signal in later stages of training (400B tokens and above). The paper's recommendation is to evaluate on both and take the higher score. lm-eval would benefit from letting users write one prompt format and have it automatically rendered in both formulations.
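As a rough illustration of what "write once, render both" could look like (names and structure here are hypothetical, not an existing lm-eval API), a single task definition could be expanded into both formulations:

```python
# Hypothetical sketch: expand one task definition into both the
# multiple-choice (MCF) and cloze/completion (CF) formulations.
def render_prompts(question, choices):
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    # MCF: the model sees all options and scores the answer letter.
    mcf_prompt = (
        question + "\n"
        + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
        + "\nAnswer:"
    )
    # CF: the model scores each full answer string as a continuation
    # of the question, without seeing the other options.
    return {
        "mcf": {"prompt": mcf_prompt, "targets": letters},
        "cf": {"prefix": question + "\nAnswer:", "continuations": choices},
    }
```

A harness could then run both renderings of the same task and report the max, per the paper's recommendation.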
Normalization should be a configuration feature. We currently use both non-normalized accuracy and normalized accuracy, specifically dividing the log-probability by the number of characters. It would be great to be able to choose the normalization per task and to add more options, e.g. normalization by token length or pointwise mutual information.
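A pluggable normalization option might look something like the following sketch (the function name and `mode` strings are illustrative, not lm-eval's actual interface):

```python
# Hypothetical sketch of selectable scoring normalizations.
def normalize_score(logprob, continuation, n_tokens,
                    uncond_logprob=None, mode="none"):
    if mode == "none":
        # Raw summed log-probability of the continuation.
        return logprob
    if mode == "char":
        # Length-normalize by character count (acc_norm-style).
        return logprob / max(len(continuation), 1)
    if mode == "token":
        # Length-normalize by token count instead of characters.
        return logprob / max(n_tokens, 1)
    if mode == "pmi":
        # Pointwise mutual information:
        # log p(cont | context) - log p(cont | neutral prompt).
        assert uncond_logprob is not None
        return logprob - uncond_logprob
    raise ValueError(f"unknown normalization mode: {mode}")
```

The PMI mode needs a second forward pass to score the continuation against an unconditional/neutral prompt, which is why it is worth treating as an explicit configuration choice rather than a default.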
Further, add support for fewshot selection. We have some support for hardcoding fewshot samples, but we've never really supported selecting fewshots under conditions like "make sure the answer choices are not all A", or passing a list of index numbers that point directly to the samples to use.
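One possible shape for this, sketched with hypothetical names (`select_fewshot` and the constraint callback are not existing lm-eval functions): accept either an explicit index list or a predicate that resampling must satisfy.

```python
import random

# Hypothetical sketch: pick few-shot examples either by explicit
# indices or by seeded sampling under a user-supplied constraint.
def select_fewshot(pool, k, indices=None, constraint=None,
                   seed=0, max_tries=100):
    if indices is not None:
        # Direct selection: the user points at specific samples.
        return [pool[i] for i in indices]
    rng = random.Random(seed)
    for _ in range(max_tries):
        sample = rng.sample(pool, k)
        if constraint is None or constraint(sample):
            return sample
    raise RuntimeError("no fewshot sample satisfied the constraint")

# Example constraint: the gold answer letters must not all match.
def not_all_same_answer(sample):
    return len({ex["answer"] for ex in sample}) > 1
```

Keeping the constraint as a plain callable would let task authors encode checks like label balance without the harness having to anticipate every condition.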
cc: @haileyschoelkopf @StellaAthena