pass@k results silently wrong when n<k #58
daniel-vainsencher started this conversation in General
Replies: 1 comment
-
This is a partial solution to this problem. The script that calculates pass@k now prints the minimum and maximum number of completions per row: https://github.com/nuprl/MultiPL-E/blob/dev/pass_k.py#L53 For the informed user, MinCompletions < k means that the number in that row is unreliable. The gold standard is But, when operating at scale, it helps to look at intermediate results.
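To illustrate the kind of check described above, here is a minimal sketch that reports the smallest and largest number of completions and lists the values of k that cannot be trusted. The function names and the input shape (a dict from problem name to a list of pass/fail results) are assumptions made for this example, not the actual layout of pass_k.py.

```python
from typing import Dict, List, Sequence, Tuple


def completion_range(results: Dict[str, List[bool]]) -> Tuple[int, int]:
    """Return (min, max) number of completions across all problems.

    `results` is a hypothetical structure: problem name -> pass/fail per completion.
    """
    counts = [len(outcomes) for outcomes in results.values()]
    return min(counts), max(counts)


def unreliable_ks(results: Dict[str, List[bool]],
                  ks: Sequence[int] = (1, 10, 100)) -> List[int]:
    """List every k that exceeds the smallest completion count."""
    min_completions, _ = completion_range(results)
    return [k for k in ks if min_completions < k]
```

With one problem holding 20 completions and another holding 5, the range is (5, 20), so pass@10 and pass@100 would both be flagged as unreliable.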
-
pass@k = 1 should be evidence that in k generations by this model, at least 1 is very likely to pass the test.
However, the definition of the estimator returns 1 even when there are 0 passes among 99 tries if k = 100. Nothing in the callers prevents using too small an n; in fact, someone in a hurry is quite likely to use a small n (as I did in the original issue, oops).
Note in contrast how huggingface/evaluate does deal correctly with the n<k case: if that happens for any result, pass@k for that k is elided from the dictionary.
Originally posted by @daniel-vainsencher in #31 (comment)
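The failure mode can be reproduced with a small sketch. It assumes the estimator follows the widely used unbiased formula 1 - C(n-c, k) / C(n, k) with an early return of 1 when n - c < k; the function names are hypothetical, and the elision logic only mirrors the behavior described above rather than copying huggingface/evaluate.

```python
import math
from typing import Dict, List, Sequence, Tuple


def estimate(n: int, c: int, k: int) -> float:
    """Assumed estimator: n completions, c of them passing, budget k."""
    if n - c < k:
        # Intended for "too many passes to pick k failures", but it also
        # fires when n < k, even with zero passes.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_at_k_table(counts: List[Tuple[int, int]],
                    ks: Sequence[int] = (1, 10, 100)) -> Dict[str, float]:
    """Average the estimate over problems, eliding any k with n < k.

    `counts` is a hypothetical list of (n, c) pairs, one per problem.
    """
    table = {}
    for k in ks:
        if all(n >= k for n, _ in counts):
            table[f"pass@{k}"] = sum(estimate(n, c, k) for n, c in counts) / len(counts)
    return table
```

Calling `estimate(99, 0, 100)` yields 1.0 despite zero passes, while `pass_at_k_table([(99, 0)])` contains no `pass@100` entry at all, so the misleading value never reaches the report.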