New Italian tasks #1553
Comments
There seems to be quite a lot of translated data here (though it is high quality), so I would probably consider finding some native Italian datasets as well. I would also check out the current tasks; the code here should get you started:
I know that @rbroc has added some of these |
Hi @KennethEnevoldsen, thank you for the feedback! For native Italian datasets, I found the following ones from the well-known Italian competition Evalita:
Based on an initial review, it appears that all the datasets are under the CC BY-NC-SA 4.0 license (I will look into this in more detail once we choose the dataset we want to focus on). However, access to the data is not straightforward and must be explicitly requested. Please add any comments or feedback on these datasets as well. Thank you! |
Looks very promising and sounds like something we could start working on.
Probably worth starting out by requesting these datasets |
Yes, I'm currently working on them. In the meantime, I'm also starting to upload the datasets to Hugging Face. That is the preferred way to use them in MTEB, right? |
Yes, that's the only way |
Hello everyone 👋🏼
I would like to contribute to the library by adding Italian datasets to the existing tasks. My goal is to eventually create an Italian leaderboard and/or make it possible to add an Italian tab on the HF space.
Before creating the tasks, I would like to share the datasets I plan to add. This way, you can give me feedback, tell me if any of the datasets are unnecessary or already exist, and suggest any other interesting Italian datasets that should be included.
Here is the list:
Tasks: Retrieval
Links: https://xmrec.github.io/
Tasks: BitextMining
Links: https://huggingface.co/datasets/Helsinki-NLP/europarl
Tasks: Classification, Bitext mining on documents?
Links: https://github.com/facebookresearch/MLDoc https://huggingface.co/datasets/PlanTL-GOB-ES/MLDoc
Tasks: Bitext mining on article content, bitext mining on summaries, Text to summary
Links: https://huggingface.co/datasets/esdurmus/wiki_lingua/
Tasks: Bitext mining on answers, Bitext mining on questions, QA retrieval, Cross-lingual QA retrieval
Links: https://huggingface.co/datasets/apple/mkqa
Links: https://huggingface.co/datasets/microsoft/xglue
Tasks: Bitext mining on questions, bitext mining on answers, QA retrieval, Cross-lingual QA retrieval
Links: https://huggingface.co/datasets/coastalcph/multi_eurlex
Tasks: Bitext mining and classification on long documents
Links: https://github.com/google-research-datasets/swim-ir?tab=readme-ov-file
Tasks: Bitext mining, QA retrieval and Cross-lingual QA
Links: https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl (the links to all the datasets are listed in this dataset card)
Tasks: Bitext mining
Links: https://huggingface.co/datasets/dennlinger/eur-lex-sum
Tasks: Bitext mining, Text to summary retrieval
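To make the overlap between task types easier to see, here is a rough Python sketch that groups the candidates above by task. The pairing of each link with its task line follows my reading of the list, the task labels are coarse simplifications ("QA retrieval" and "Text to summary retrieval" are both folded into "Retrieval"), and `CANDIDATES` and `group_by_task` are hypothetical names for illustration, not part of MTEB.

```python
from collections import defaultdict

# Candidate datasets mapped to coarse task labels.
# Assumption: keys are short identifiers taken from the links above;
# the labels are simplified and are not official MTEB task names.
CANDIDATES = {
    "xmrec": ["Retrieval"],
    "Helsinki-NLP/europarl": ["BitextMining"],
    "PlanTL-GOB-ES/MLDoc": ["Classification", "BitextMining"],
    "esdurmus/wiki_lingua": ["BitextMining", "Retrieval"],
    "apple/mkqa": ["BitextMining", "Retrieval"],
    "microsoft/xglue": ["BitextMining", "Retrieval"],
    "coastalcph/multi_eurlex": ["BitextMining", "Classification"],
    "swim-ir": ["BitextMining", "Retrieval"],
    "sentence-transformers/parallel-sentences-europarl": ["BitextMining"],
    "dennlinger/eur-lex-sum": ["BitextMining", "Retrieval"],
}


def group_by_task(candidates):
    """Invert the dataset -> tasks mapping into task -> datasets."""
    grouped = defaultdict(list)
    for dataset, tasks in candidates.items():
        for task in tasks:
            grouped[task].append(dataset)
    return dict(grouped)


if __name__ == "__main__":
    # Print one line per task with the number of candidate datasets.
    for task, datasets in sorted(group_by_task(CANDIDATES).items()):
        print(f"{task} ({len(datasets)}): {', '.join(datasets)}")
```

Under this reading, most candidates end up under bitext mining, which is where the list relies most on parallel or translated data.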
Please add your suggestions and comments