
leaderboard v2.0: disagreement in scores between v2.0 and v1.0 #1516

KennethEnevoldsen commented Nov 27, 2024

Some results seem to disagree between the old and new leaderboards; e.g., the ranking and scores for the Law tab look quite different.

Here is an example:

[Screenshots, 2024-11-27: Law tab rankings and scores on the v1.0 and v2.0 leaderboards]

For gritlm, multilingual-e5-large-instruct, and multilingual-e5-base, at least, the scores (generally) agree:

The tasks seem to match between v1 and v2.

The scores on both benchmarks seem to be:

| Task | Score |
| --- | ---: |
| AILACasedocs | 35.29 |
| AILAStatutes | 41.8 |
| GerDaLIRSmall | 20.61 |
| LeCaRDv2 | 64.22 |
| LegalBenchConsumerContractsQA | 82.05 |
| LegalBenchCorporateLobbying | 95 |
| LegalQuAD | 44.18 |
| LegalSummarization | 70.64 |
| **avg** | **56.723** |
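
For reference, a quick sanity check of the average and the rounding (plain Python over the task scores from the table above, not pulled from the leaderboard's code):

```python
# Law tab task scores from the table above.
scores = {
    "AILACasedocs": 35.29,
    "AILAStatutes": 41.8,
    "GerDaLIRSmall": 20.61,
    "LeCaRDv2": 64.22,
    "LegalBenchConsumerContractsQA": 82.05,
    "LegalBenchCorporateLobbying": 95,
    "LegalQuAD": 44.18,
    "LegalSummarization": 70.64,
}

avg = sum(scores.values()) / len(scores)
print(avg)            # 56.72375
print(round(avg, 2))  # 56.72
```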

@x-tabdeveloping is there an issue with rounding here? It should be 56.72, not 56.73 (minor, though).

Additionally, I'm not sure why the mean retrieval score isn't the same as mean(task scores)?
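
One hypothetical way the two means could diverge (purely illustrative; I haven't checked how the leaderboard actually aggregates): if the displayed mean is a mean of per-group means rather than a flat mean over tasks, the two disagree whenever the groups have unequal sizes. The grouping below is made up for the example:

```python
# Hypothetical grouping of the eight Law-tab scores; the split is invented
# purely to illustrate the arithmetic.
group_a = [35.29, 41.8]                            # 2 tasks
group_b = [20.61, 64.22, 82.05, 95, 44.18, 70.64]  # 6 tasks

all_scores = group_a + group_b
flat_mean = sum(all_scores) / len(all_scores)
mean_of_means = (sum(group_a) / len(group_a)
                 + sum(group_b) / len(group_b)) / 2

print(flat_mean)      # 56.72375
print(mean_of_means)  # ~50.66, differs from the flat mean
```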

Originally posted in #1317.

@isaac-chung added the `leaderboard` label on Nov 30, 2024