Investigate/Fix ClimateFever #1498

Muennighoff · 2024-11-25T23:48:47Z

Jay found out that the TFDS version of ClimateFever (https://www.tensorflow.org/datasets/community_catalog/huggingface/climate_fever) does not match the one uploaded for MTEB (https://huggingface.co/datasets/mteb/climate-fever/tree/main).

Specifically, the TFDS version indexes specific portions of the wiki articles (in some cases two different parts of the same article are linked to the same query id), whereas MTEB/BEIR takes the wiki article as a whole. More importantly, the corpus text for an article does not necessarily contain the text of the original target sentences/passages/subsections; it seems to be just the first x chars/tokens or something similar.
Also worth noting: all of the qrels are scored as 1 in the MTEB version, regardless of the original rater annotations.
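
A minimal sketch of how the two observations above could be checked, assuming the corpus, the original evidence sentences, and the qrels have already been loaded into plain Python structures. The helper names `missing_evidence` and `qrel_score_counts` and the toy data are hypothetical and only illustrate the idea; they are not taken from MTEB, BEIR, or TFDS.

```python
from collections import Counter


def missing_evidence(corpus, evidence):
    """Return (doc_id, sentence) pairs whose sentence is absent from the corpus text.

    corpus:   {doc_id: article text as stored in the retrieval corpus}
    evidence: iterable of (doc_id, evidence sentence) pairs from the original annotations
    """
    def norm(text):
        # Collapse whitespace and lowercase so formatting differences are not counted.
        return " ".join(text.split()).lower()

    return [
        (doc_id, sentence)
        for doc_id, sentence in evidence
        if norm(sentence) not in norm(corpus.get(doc_id, ""))
    ]


def qrel_score_counts(qrels):
    """Count relevance scores in a {query_id: {doc_id: score}} mapping."""
    return Counter(score for docs in qrels.values() for score in docs.values())


# Toy data, invented for illustration only (not taken from either dataset).
corpus = {
    "Sea_level_rise": "Sea level has risen since 1900.",  # truncated article text
}
evidence = [
    ("Sea_level_rise", "Sea level has risen since 1900."),
    ("Sea_level_rise", "The rate is accelerating."),  # lies beyond the truncation
]
qrels = {"q0": {"Sea_level_rise": 1}, "q1": {"Sea_level_rise": 1}}

print(missing_evidence(corpus, evidence))
print(qrel_score_counts(qrels))
```

On the real data, a non-empty result from `missing_evidence` would support the truncation hypothesis, and a `Counter` with 1 as its only key would confirm that the original rater labels were flattened.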

Since MTEB derived its preprocessing from BEIR, we suspect that the discrepancy originated in BEIR.

I think it would be great to investigate this and, if it is indeed an issue, create an updated version of the task to supersede the current one, similar to Touche2020v3:

mteb/mteb/tasks/Retrieval/eng/Touche2020Retrieval.py

Line 54 in 3ff38ec

class Touche2020v3Retrieval(AbsTaskRetrieval):
