New Italian tasks #1553

Open
MattiaSangermano opened this issue Dec 4, 2024 · 5 comments

@MattiaSangermano

Hello everyone 👋🏼

I would like to contribute to the library by adding Italian datasets to the existing tasks. My goal is to eventually create an Italian leaderboard and/or make it possible to add an Italian tab on the HF space.

Before creating the tasks, I would like to share the datasets I plan to add. This way, you can give me feedback, tell me whether any of the datasets are unnecessary or already exist, and suggest other interesting Italian datasets that should be included.

Here is the list:

Please add your suggestions and comments

@KennethEnevoldsen
Contributor

Hi @MattiaSangermano!

  • XMarket: Sounds great to update the existing dataset to include the additional languages
  • Europarl: Great add
  • MLDOC: sounds like classification to me
  • WIKILingua: We have previously done these as retrieval tasks (from summary retrieve document)
  • Mkqa: QA retrieval I would say
  • Xglue: good
  • MultiEURLEX: Already there under the name "MultiEURLEXMultilabelClassification"
  • SWIM-IR: sounds a bit like WikipediaRetrievalMultilingual. Might be worth looking into
  • parallel: probably reasonable (we would have to consider a case by case basis)
  • eurlex sum: great

There seems to be quite a lot of translated data here (though high quality), so I would consider finding some native Italian datasets as well. I would also check out the current tasks.

The code here should get you started:

import mteb

tasks = mteb.get_tasks(languages=["ita"])
for t in tasks:
    print(t.metadata.name)
    print("\tDescription: ", t.metadata.description)
    print("\tAnnotations: ", t.metadata.annotations_creators)

I know that @rbroc has added some of these
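
To see at a glance which task types already have Italian coverage, the listing above could be grouped by type. This is only a minimal sketch: it assumes each task's metadata exposes a `type` field alongside `name`, and the helper names (`group_task_names_by_type`, `italian_tasks_by_type`) and the stand-in tasks are hypothetical, not part of mteb.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_task_names_by_type(tasks):
    """Map each task type to a sorted list of the task names of that type."""
    grouped = defaultdict(list)
    for task in tasks:
        # Assumption: task.metadata has `type` and `name` attributes.
        grouped[task.metadata.type].append(task.metadata.name)
    return {task_type: sorted(names) for task_type, names in grouped.items()}


def italian_tasks_by_type():
    """Group the Italian tasks registered in mteb (requires mteb installed)."""
    import mteb  # imported lazily so the helper above works without mteb

    return group_task_names_by_type(mteb.get_tasks(languages=["ita"]))


# Stand-in objects that mimic the metadata attributes used above.
# The second task name is made up for illustration.
example_tasks = [
    SimpleNamespace(
        metadata=SimpleNamespace(
            name="MultiEURLEXMultilabelClassification",
            type="MultilabelClassification",
        )
    ),
    SimpleNamespace(
        metadata=SimpleNamespace(
            name="HypotheticalItalianRetrieval",
            type="Retrieval",
        )
    ),
]
example_groups = group_task_names_by_type(example_tasks)
```

With mteb installed, calling `italian_tasks_by_type()` should return the same kind of mapping for the real task list, which makes it easier to spot proposed datasets that duplicate an existing task.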

@MattiaSangermano
Author

Hi @KennethEnevoldsen, thank you for the feedback! For the native Italian datasets, I found the following ones from the well-known Italian competition Evalita:

Based on an initial review, it appears that all the datasets are under the CC BY-NC-SA 4.0 license (I will look into this in more detail once we choose the dataset we want to focus on). However, access to the data is not straightforward and must be explicitly requested.

Please add any comments or feedback on these datasets as well. Thank you!

@KennethEnevoldsen
Contributor

Looks very promising and sounds like something we could start working on.

> However, access to the data is not straightforward and must be explicitly requested.

Probably worth starting out by requesting these datasets

@MattiaSangermano
Author

Yes, I'm currently working on them. In the meantime, I'm also starting to upload the datasets to Hugging Face. Is this the preferred way to use them in MTEB?

@Samoed
Collaborator

Samoed commented Dec 18, 2024

Yes, that's the only way
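
For anyone preparing data before uploading it, one possible first step is to write each split as a JSON Lines file, a format the Hugging Face `datasets` library can load. The sketch below uses only the standard library and leaves the upload itself out; the helper names and the `text`/`label` fields are hypothetical, and the sample sentences are made up.

```python
import json
import tempfile
from pathlib import Path


def write_split_jsonl(records, out_dir, split):
    """Write one JSON object per line to <out_dir>/<split>.jsonl; return the path."""
    out_path = Path(out_dir) / f"{split}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps accented Italian characters readable.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up records with hypothetical `text` and `label` fields.
sample_records = [
    {"text": "Che bella città!", "label": 1},
    {"text": "Il servizio è stato pessimo.", "label": 0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    written_path = write_split_jsonl(sample_records, tmp_dir, "test")
    written_name = written_path.name
    round_trip = read_jsonl(written_path)
```

Keeping one file per split makes it straightforward to check the data locally before it is published.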
