New Italian tasks #1553

MattiaSangermano · 2024-12-04T22:45:37Z

Hello everyone 👋🏼

I would like to contribute to the library by adding Italian datasets to the existing tasks. My goal is to eventually create an Italian leaderboard and/or make it possible to add an Italian tab on the HF space.

Before proceeding to create the tasks, I would like to share the datasets I plan to add. In this way, you can provide me feedback, inform me if any of the datasets are unnecessary or already exist, and suggest any other interesting Italian datasets that should be included.

Here is the list:

XMarket: An e-commerce category-to-product retrieval dataset in multiple languages. It should already be available in the mteb library, but only in Spanish, English and German.
Tasks: Retrieval
Links: https://xmrec.github.io/
Europarl: A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.
Tasks: BitextMining
Links: https://huggingface.co/datasets/Helsinki-NLP/europarl
MLDOC: A Corpus for Multilingual Document Classification in Eight Languages.
Tasks: Classification, Bitext mining on documents?
Links: https://github.com/facebookresearch/MLDoc https://huggingface.co/datasets/PlanTL-GOB-ES/MLDoc
WikiLingua: Dataset containing article, summary and title pairs in 18 languages from WikiHow.
Tasks: Bitext mining on article content, bitext mining on summaries, Text to summary
Links: https://huggingface.co/datasets/esdurmus/wiki_lingua/
MKQA: Dataset containing queries (from the Google Natural Questions dataset) and new passage-independent answers. These queries and answers are then human-translated into 25 non-English languages.
Tasks: Bitext mining on answers, Bitext mining on questions, QA retrieval, Cross-lingual QA retrieval
Links: https://huggingface.co/datasets/apple/mkqa
XGlue:
Links : https://huggingface.co/datasets/microsoft/xglue
Tasks: Bitext mining on questions, bitext mining on answers, QA retrieval, Cross-lingual QA retrieval
MultiEurlex: MultiEURLEX comprises 65k EU laws in 23 official EU languages. Each EU law has been annotated with EUROVOC concepts (labels) by the Publication Office of EU.
Links: https://huggingface.co/datasets/coastalcph/multi_eurlex
Tasks: Bitext mining and classification on long documents
SWIM-IR: A Synthetic Wikipedia-based Multilingual Information Retrieval training dataset consisting of 28 million query-passage pairs spanning 33 languages.
Links: https://github.com/google-research-datasets/swim-ir?tab=readme-ov-file
Tasks: Bitext mining, QA retrieval and Cross-lingual QA
Parallel datasets: 11 parallel datasets
Links: https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl (the links to all the datasets are listed in this dataset card)
Tasks: Bitext mining
Eurlex sum: Dataset based on human-written summaries of legal acts issued by the European Union.
Links: https://huggingface.co/datasets/dennlinger/eur-lex-sum
Tasks: Bitext mining, Text to summary retrieval

Please add your suggestions and comments

KennethEnevoldsen · 2024-12-05T08:42:12Z

Hi @MattiaSangermano!

XMarket: Sounds great to update the existing dataset to include the additional languages
Europarl: Great add
MLDOC: sounds like classification to me
WIKILingua: We have previously done these as retrieval tasks (from summary retrieve document)
Mkqa: QA retrieval I would say
Xglue: good
MultiEURLEX: Already there under the name "MultiEURLEXMultilabelClassification"
SWIM-IR: sounds a bit like WikipediaRetrievalMultilingual. Might be worth looking into
parallel: probably reasonable (we would have to consider a case by case basis)
eurlex sum: great

There seems to be quite a lot of translated data here (though of high quality), so I would probably consider finding some native Italian datasets as well. I would also check out the current tasks.

the code here should get you started:

import mteb

tasks = mteb.get_tasks(languages=["ita"])
for t in tasks:
   print(t.metadata.name)
   print("\tDescription: ", t.metadata.description)
   print("\tAnnotations: ", t.metadata.annotations_creators)

I know that @rbroc has added some of these
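
A minimal sketch of how the proposed list could be checked against existing task names, which is how an overlap like MultiEurlex and "MultiEURLEXMultilabelClassification" would surface. The helper `find_existing` and its matching rule are assumptions for illustration and are not part of mteb; the task names are passed in as plain strings.

```python
def find_existing(proposed, existing_names):
    """Map each proposed dataset name to the existing task names that contain it.

    Matching is case-insensitive and ignores non-alphanumeric characters.
    This heuristic is an assumption for illustration, not mteb behaviour.
    """
    def normalize(name):
        return "".join(ch for ch in name.lower() if ch.isalnum())

    matches = {}
    for dataset in proposed:
        key = normalize(dataset)
        hits = [name for name in existing_names if key in normalize(name)]
        if hits:
            matches[dataset] = hits
    return matches


# Toy input taken from names mentioned in this thread.
existing = ["MultiEURLEXMultilabelClassification", "WikipediaRetrievalMultilingual"]
proposed = ["MultiEurlex", "Europarl", "SWIM-IR"]
overlap = find_existing(proposed, existing)
print(overlap)
```

In practice the existing names could come from `[t.metadata.name for t in mteb.get_tasks(languages=["ita"])]`. A substring match only catches datasets whose name appears in the task name, so near-duplicates under a different name (as suspected for SWIM-IR) still need a manual look.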

MattiaSangermano · 2024-12-12T21:54:52Z

Hi @KennethEnevoldsen, thank you for the feedback! For the native Italian datasets, I found the following ones from the well-known Italian evaluation campaign EVALITA:

MultiEmotionsIt: Detection of emotions in social media messages about TV shows, TV series, music videos and advertisements.
Links: http://www.di.unito.it/~tutreeb/emit23/ - https://github.com/oaraque/emit
Task: Classification (Task A)
PoliticIt: Classification of political ideology information from clusters of Italian tweets.
Link: https://codalab.lisn.upsaclay.fr/competitions/8507#learn_the_details-overview
Task: Classification (Task 2 - multiclass)
Note: According to the dataset description, the classification should be made on the cluster of texts rather than on the single samples. How can this task be handled inside the mteb library?
GeoLingIt: The first shared task on geolocation of linguistic variation in Italy from social media posts exhibiting non-standard Italian language.
Links: https://sites.google.com/view/geolingit - https://sites.google.com/view/geolingit/data
Task: Classification (Task A)
Discotex: Datasets focused on modelling discourse coherence for Italian real-world texts.
Link: https://sites.google.com/view/discotex/task - https://sites.google.com/view/discotex/data
Task: Pair classification (Task 1), STS (Task 2)
Wic ita: Task focused on establishing if a word w occurring in two different sentences s1 and s2 has the same meaning or not.
Link: https://wic-ita.github.io/task/
Task: STS (Task 2)
Note: In this case, the task focuses on the contextualized meaning of individual words rather than the meaning of entire sentences. Therefore, we should consider the embeddings of individual words instead of sentence embeddings (i.e., the CLS token). However, I believe the task still falls under the STS task. Do you agree?
Sardi-Stance: Stance Detection in Italian tweets.
Links: http://www.di.unito.it/~tutreeb/sardistance-evalita2020/index.html
Task: Classification (Task A)
Change-it: Dataset containing style transfer task for headlines of Italian newspapers.
Link: https://sites.google.com/view/change-it/home
Task: Retrieval (from headline to document), classification (identify if the headline or article is written in a left or right-wing style)
Tag-IT:
Link: https://sites.google.com/view/tag-it-2020/home-page
Task: Clustering
DaDoEval: assigning a temporal span to a document, i.e. recognising when a document was issued (sub-task 2)
Link: https://dhfbk.github.io/DaDoEval/
Task: Classification (Task 2)
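
On the PoliticIt note: one possible way to fit cluster-level labels into a per-sample classification setup is to join each cluster's tweets into a single text, so that every cluster becomes one sample. This is only a sketch under assumptions: the field names `cluster_id`, `text` and `label`, and the separator, are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def clusters_to_samples(rows, separator="\n"):
    """Collapse tweet-level rows into one (text, label) sample per cluster.

    Assumes every row in a cluster carries the same label; raises otherwise.
    """
    texts = defaultdict(list)
    labels = {}
    for row in rows:
        cid = row["cluster_id"]
        texts[cid].append(row["text"])
        # setdefault stores the first label seen and returns it afterwards.
        if labels.setdefault(cid, row["label"]) != row["label"]:
            raise ValueError(f"cluster {cid} has conflicting labels")
    return [
        {"cluster_id": cid, "text": separator.join(parts), "label": labels[cid]}
        for cid, parts in texts.items()
    ]


# Toy rows with hypothetical field names.
rows = [
    {"cluster_id": 1, "text": "primo tweet", "label": "left"},
    {"cluster_id": 1, "text": "secondo tweet", "label": "left"},
    {"cluster_id": 2, "text": "terzo tweet", "label": "right"},
]
samples = clusters_to_samples(rows)
print(samples)
```

An alternative would be to embed each tweet separately and average the embeddings per cluster. Joined texts may exceed a model's maximum input length, so which option works better would need to be tested.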

Based on an initial review, it appears that all the datasets are under the CC BY-NC-SA 4.0 license (I will look into this in more detail once we choose the dataset we want to focus on). However, access to the data is not straightforward and must be explicitly requested.

Please add any comments or feedback on these datasets as well. Thank you!
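
On the Wic ita note: a minimal sketch of comparing a target word across two sentences by mean-pooling the token vectors that cover the word, instead of using a sentence-level vector. The token vectors are toy values and the span indices are assumed to be known; the function names are hypothetical, and how mteb would expose token-level embeddings is not covered here.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def word_similarity(tokens_a, span_a, tokens_b, span_b):
    """Cosine similarity of a target word's pooled vectors in two sentences.

    tokens_*: per-token vectors for a sentence.
    span_*: (start, end) token indices of the target word, end exclusive.
    """
    vec_a = mean_pool(tokens_a[span_a[0]:span_a[1]])
    vec_b = mean_pool(tokens_b[span_b[0]:span_b[1]])
    return cosine(vec_a, vec_b)


# Toy 2-d token vectors; the target word spans two sub-tokens in sentence A.
sent_a = [[0.0, 1.0], [1.0, 0.0], [1.0, 2.0]]
sent_b = [[2.0, 2.0], [0.5, 0.5]]
score = word_similarity(sent_a, (1, 3), sent_b, (0, 1))
print(round(score, 3))
```

The resulting score could then be compared against the gold annotation in the same way a sentence-level similarity would be.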

KennethEnevoldsen · 2024-12-12T23:15:37Z

Looks very promising and sounds like something we could start working on.

However, access to the data is not straightforward and must be explicitly requested.

Probably worth starting out by requesting these datasets

MattiaSangermano · 2024-12-18T17:00:05Z

Yes, I'm currently working on them. In the meantime, I'm also starting to upload the datasets to Hugging Face. Is this the preferred way to use them in MTEB?

Samoed · 2024-12-18T17:30:37Z

Yes, that's the only way
