pass@k results silently wrong when n<k #58

daniel-vainsencher · 2023-02-17T21:14:30Z


pass@k = 1 should be evidence that in k generations by this model, at least 1 is very likely to pass the test.

However, the estimator as defined returns 1 even when there are 0 passes among 99 tries if k=100. Nothing in the callers prevents using too small an n; in fact, someone in a hurry is quite likely to use a small n (as I did in the original issue, oops).
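To make the failure concrete, here is a minimal sketch. It assumes the estimator has the usual unbiased pass@k form, with an early return of 1.0 when n - c < k. It is written in pure Python for illustration, not copied from this repository, and the name `pass_at_k` is hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed.

    Assumed form: 1 - C(n - c, k) / C(n, k), computed as a running product.
    """
    if n - c < k:
        # Intended for "every k-subset must contain a pass", but this branch
        # also fires whenever n < k, regardless of how many samples passed.
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# 0 passes among 99 tries, k=100: the early return reports a perfect score.
print(pass_at_k(99, 0, 100))   # 1.0

# With enough samples (n >= k) and 0 passes, the estimate is 0 as expected.
print(pass_at_k(200, 0, 100))  # 0.0
```

The first call is the silent failure: the result says nothing about the model, only that n was too small for the requested k.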

Note in contrast how huggingface/evaluate does deal correctly with the n<k case: if that happens for any result, pass@k for that k is elided from the dictionary.
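A guard in that spirit could look like the sketch below. The helper `pass_at_k_table` and its input shape (a list of `(n, c)` pairs, one per problem) are hypothetical and chosen only for illustration. The estimator is the same assumed form as in the usual pass@k definition.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed (assumed form)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def pass_at_k_table(results, ks=(1, 10, 100)):
    """Mean pass@k per k, omitting any k for which some problem has n < k.

    results: list of (n, c) pairs, one per problem (hypothetical input shape).
    """
    table = {}
    for k in ks:
        if any(n < k for n, _ in results):
            # Too few samples for at least one problem: leave this k out
            # instead of reporting a misleading number.
            continue
        table[f"pass@{k}"] = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    return table


# One problem has only 99 samples, so pass@100 is dropped from the result.
print(sorted(pass_at_k_table([(99, 0), (200, 10)])))  # ['pass@1', 'pass@10']
```

Dropping the key makes the problem visible to the caller, who gets a missing entry rather than a plausible-looking 1.0.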

Originally posted by @daniel-vainsencher in #31 (comment)

arjunguha (Maintainer) · 2023-04-20T16:27:47Z

This is a partial solution to this problem.

The script to calculate pass@k now prints the minimum and maximum number of completions per row:

https://github.com/nuprl/MultiPL-E/blob/dev/pass_k.py#L53

For the informed user, MinCompletions < k means that the number in that row is unreliable.

The gold standard is MinCompletions == MaxCompletions == 200.

But, when operating at scale, it helps to look at intermediate results.
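As a rough illustration of how those two numbers can be read, here is a small sketch. The function `completion_report` is hypothetical and is not the code in `pass_k.py`. It only assumes that the number of completions per problem is available as a list of integers.

```python
def completion_report(completions_per_problem, k):
    """Summarize completion counts for one row of results.

    Returns the minimum and maximum number of completions, plus a flag that
    is False when some problem has fewer than k completions.
    """
    lo = min(completions_per_problem)
    hi = max(completions_per_problem)
    return {"MinCompletions": lo, "MaxCompletions": hi, "reliable": lo >= k}


# An intermediate run where one problem has only 99 completions so far.
print(completion_report([200, 99, 200], k=100))
# {'MinCompletions': 99, 'MaxCompletions': 200, 'reliable': False}
```

A row with MinCompletions == MaxCompletions == 200 would come back with `reliable` set to True for any k up to 200.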
