Investigate/Fix ClimateFever #1498

Open
Muennighoff opened this issue Nov 25, 2024 · 0 comments

From @jhyuklee:

Jay found that the TFDS version of ClimateFever (https://www.tensorflow.org/datasets/community_catalog/huggingface/climate_fever) does not match the one uploaded for MTEB (https://huggingface.co/datasets/mteb/climate-fever/tree/main).

Specifically, the TFDS version indexes specific portions of the wiki articles (in some cases, two different parts of the same article are linked to the same query ID), whereas MTEB/BEIR takes each wiki article as a whole. More importantly, the MTEB corpus text for an article does not necessarily contain the original target sentences, passages, or subsections; it appears to be only the first x characters or tokens, or some similar truncation.
It is also worth noting that all qrels are scored as 1 in the MTEB version, regardless of the original rater annotations.
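One way to quantify both points is sketched below. It is a minimal sketch, not the check Jay actually ran: the function names, the dict/list layouts, and the toy data are all assumptions made for illustration, and the real datasets would first need to be loaded and mapped into these shapes.

```python
from collections import Counter


def evidence_coverage(corpus, evidence):
    """Return the fraction of evidence sentences found verbatim in the corpus text.

    corpus:   dict mapping article title -> corpus text (hypothetical layout)
    evidence: list of (article title, evidence sentence) pairs (hypothetical layout)
    """
    if not evidence:
        return 0.0
    found = 0
    for title, sentence in evidence:
        text = corpus.get(title, "")
        # Normalise whitespace and case so trivial formatting differences
        # do not count as misses.
        if " ".join(sentence.lower().split()) in " ".join(text.lower().split()):
            found += 1
    return found / len(evidence)


def qrel_score_distribution(qrels):
    """Count how often each relevance score occurs in {query_id: {doc_id: score}}."""
    return Counter(score for docs in qrels.values() for score in docs.values())


# Toy data (invented for illustration, not taken from either dataset).
toy_corpus = {"Sea level rise": "Sea level rise is caused primarily by two factors."}
toy_evidence = [
    ("Sea level rise", "caused primarily by two factors"),
    ("Sea level rise", "Glaciers have retreated since 1850."),
]
toy_qrels = {"q1": {"Sea level rise": 1}, "q2": {"Sea level rise": 1, "Glacier": 1}}

coverage = evidence_coverage(toy_corpus, toy_evidence)
score_counts = qrel_score_distribution(toy_qrels)
print(coverage, dict(score_counts))
```

A low coverage value on the real data would support the truncation hypothesis, and a score distribution containing only the value 1 would confirm that the original annotations were flattened.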

Since MTEB derived its preprocessing from BEIR, we suspect that the discrepancy originated in BEIR.

I think it would be great to investigate this and, if it is indeed an issue, create an updated version of the task that supersedes the current one, similar to Touche2020v3:

class Touche2020v3Retrieval(AbsTaskRetrieval):
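If the investigation confirms the problem, the data for an updated task could be rebuilt at the passage level. The following is only a sketch under stated assumptions: the record keys (`claim_id`, `claim`, `evidences`, `evidence_id`, `evidence`, `evidence_label`), the function name, and the label-to-score mapping are all hypothetical choices for illustration and would need to be checked against the source dataset and agreed on by the maintainers.

```python
def build_passage_level_data(claims, label_to_score):
    """Build a passage-level corpus, queries, and qrels from claim records.

    claims: list of dicts with hypothetical keys "claim_id", "claim", and
            "evidences" (each with "evidence_id", "evidence", "evidence_label").
    label_to_score: mapping from annotation label to relevance score; this
            mapping is a design choice, not something fixed by the dataset.
    """
    corpus, queries, qrels = {}, {}, {}
    for record in claims:
        qid = str(record["claim_id"])
        queries[qid] = record["claim"]
        for ev in record["evidences"]:
            doc_id = str(ev["evidence_id"])
            # Index the annotated passage itself rather than a truncated article,
            # so two passages from one article get distinct document IDs.
            corpus[doc_id] = ev["evidence"]
            qrels.setdefault(qid, {})[doc_id] = label_to_score[ev["evidence_label"]]
    return corpus, queries, qrels


# Toy record (invented for illustration).
toy_claims = [
    {
        "claim_id": 7,
        "claim": "Sea levels are rising.",
        "evidences": [
            {"evidence_id": "Sea level rise:12", "evidence": "Passage A.",
             "evidence_label": "SUPPORTS"},
            {"evidence_id": "Sea level rise:40", "evidence": "Passage B.",
             "evidence_label": "NOT_ENOUGH_INFO"},
        ],
    }
]
# Assumed mapping: keep the rater label as a graded score instead of a flat 1.
toy_mapping = {"SUPPORTS": 1, "REFUTES": 1, "NOT_ENOUGH_INFO": 0}

new_corpus, new_queries, new_qrels = build_passage_level_data(toy_claims, toy_mapping)
```

Keeping the passages as separate documents means each query can link to more than one part of the same article without a collision, and the scores retain the distinction that a flat value of 1 loses.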
