
leaderboard v2.0: disagreement in scores between v2.0 and v1.0 #1516

KennethEnevoldsen commented Nov 27, 2024

Some results seem to disagree between the old and new leaderboards; e.g., the ranking and scores for the Law tab look quite different.

Here is an example:

[Screenshots, 2024-11-27: Law tab rankings and scores on the v1.0 and v2.0 leaderboards]

For gritlm, multilingual-e5-large-instruct, and multilingual-e5-base, at least, the scores (generally) agree:

The tasks seem to match between v1 and v2.

The scores on both benchmarks seem to be:

| Task | Score |
| --- | ---: |
| AILACasedocs | 35.29 |
| AILAStatutes | 41.8 |
| GerDaLIRSmall | 20.61 |
| LeCaRDv2 | 64.22 |
| LegalBenchConsumerContractsQA | 82.05 |
| LegalBenchCorporateLobbying | 95 |
| LegalQuAD | 44.18 |
| LegalSummarization | 70.64 |
| **avg** | **56.723** |
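
For reference, a quick sanity check of the average and the rounding (plain Python over the task scores from the table above, not pulled from the leaderboard's code):

```python
# Law tab task scores from the table above.
scores = {
    "AILACasedocs": 35.29,
    "AILAStatutes": 41.8,
    "GerDaLIRSmall": 20.61,
    "LeCaRDv2": 64.22,
    "LegalBenchConsumerContractsQA": 82.05,
    "LegalBenchCorporateLobbying": 95,
    "LegalQuAD": 44.18,
    "LegalSummarization": 70.64,
}

avg = sum(scores.values()) / len(scores)
print(avg)            # 56.72375
print(round(avg, 2))  # 56.72
```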

@x-tabdeveloping is there an issue with rounding here? It should be 56.72, not 56.73 (minor, though).

Additionally, I'm not sure why the mean retrieval score isn't the same as mean(task scores)?
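
One hypothetical way the two means could diverge (purely illustrative; I haven't checked how the leaderboard actually aggregates): if the displayed mean is a mean of per-group means rather than a flat mean over tasks, the two disagree whenever the groups have unequal sizes. The grouping below is made up for the example:

```python
# Hypothetical grouping of the eight Law-tab scores; the split is invented
# purely to illustrate the arithmetic.
group_a = [35.29, 41.8]                            # 2 tasks
group_b = [20.61, 64.22, 82.05, 95, 44.18, 70.64]  # 6 tasks

all_scores = group_a + group_b
flat_mean = sum(all_scores) / len(all_scores)
mean_of_means = (sum(group_a) / len(group_a)
                 + sum(group_b) / len(group_b)) / 2

print(flat_mean)      # 56.72375
print(mean_of_means)  # ~50.66, differs from the flat mean
```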

Originally posted in #1317.

@isaac-chung added the `leaderboard` label on Nov 30, 2024