fix: add Multilingual Hate Speech detection task #439

rbroc · 2024-04-19T09:07:34Z

Checklist for adding MMTEB dataset

closes #395

I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
I have added points for my submission to the POINTS.md file.
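As a rough sketch of the run step in the checklist, the snippet below builds one `mteb run -m {model_name} -t {task_name}` invocation per listed model. The task name `MultiHateClassification` is an assumption taken from the file name in this PR, and the commands are only printed, not executed.

```python
import shlex

# The two models named in the checklist above.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]

# Assumption: the task name matches the class file added in this PR.
TASK_NAME = "MultiHateClassification"


def build_commands(models, task_name):
    """Return one `mteb run` argument list per model."""
    return [["mteb", "run", "-m", model, "-t", task_name] for model in models]


commands = build_commands(MODELS, TASK_NAME)
for cmd in commands:
    # Printed only; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping each command as an argument list avoids shell-quoting issues if it is later handed to `subprocess.run`.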

rbroc · 2024-04-19T09:14:07Z

First stab at this, but before I go ahead would love to have input on the following:

The dataset includes 10 languages. The first release, part of a 2021 ACL paper, was a monolingual English dataset. The remaining 9 languages were released later, expanding the original dataset, and there is a separate 2022 workshop paper for these additional datasets. I am using the latter for reference and bibtex_citation, but any input on this? I hope we are fine license-wise if we do that?
All datasets are released separately on HF. I am passing a dictionary with language-specific revision tags for each dataset to revision. Feels a bit weird though, any better suggestion?
The name of the dataset passed to metadata is technically the name of one of the datasets, is that an issue?
n_samples and avg_character_length are computed across all languages. n_samples is 18250 in total. This seems high, but we are benchmarking separately on each monolingual dataset -- so size should be okay? Or do I need to downsample?
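The downsampling question above could be handled per language, since each monolingual subset is benchmarked separately. The following is a minimal sketch under stated assumptions, not the code from this PR: the column names `lang`, `text`, and `label`, the language codes, and the cap of 3 rows are all hypothetical, and the data is synthetic.

```python
import random


def downsample_per_language(rows, max_per_language, seed=42):
    """Keep at most `max_per_language` rows for each language, chosen at random."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    by_language = {}
    for row in rows:
        by_language.setdefault(row["lang"], []).append(row)

    sampled = []
    for lang in sorted(by_language):
        group = by_language[lang]
        if len(group) > max_per_language:
            group = rng.sample(group, max_per_language)
        sampled.extend(group)
    return sampled


# Synthetic stand-in data (hypothetical columns): 5 rows for one language, 2 for another.
rows = [{"lang": "eng", "text": f"example {i}", "label": i % 2} for i in range(5)]
rows += [{"lang": "ita", "text": f"esempio {i}", "label": i % 2} for i in range(2)]

subset = downsample_per_language(rows, max_per_language=3)
counts = {}
for row in subset:
    counts[row["lang"]] = counts.get(row["lang"], 0) + 1
print(counts)
```

Languages that are already below the cap are left untouched, so only the oversized subsets shrink.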

KennethEnevoldsen · 2024-04-19T09:20:06Z

The dataset includes 10 languages. The first release, part of a 2021 ACL paper, was a monolingual English dataset. The remaining 9 languages were released later, expanding the original dataset, and there is a separate 2022 workshop paper for these additional datasets. I am using the latter for reference and bibtex_citation, but any input on this?

Let us use both.

All datasets are released separately on HF. I am passing a dictionary with language-specific revision tags for each dataset to revision. Feels a bit weird though, any better suggestion?
The name of the dataset passed to metadata is technically the name of one of the datasets, is that an issue?

Given the license, I would rehost it. Feel free to host it in the mteb group on HF (you can request to join and I will add you).

KennethEnevoldsen

Generally, this looks really promising, @rbroc. I think with rehosting and downsampling, it will be great.

Also note that the point system changed (to avoid merge conflicts) - see #438

mteb/tasks/Classification/multilingual/MultiHateClassification.py

rbroc · 2024-04-19T09:29:56Z

thanks @KennethEnevoldsen!
double-checking real quick:

making this a single dataset on HF, hosted on mteb -- not multiple datasets, correct?
when you say crediting both papers, how would i do that? reference and bibtex_citation metadata fields can only take strings -- i could of course reference those in the README of the new dataset, but we're still left with what these two metadata fields should look like in the new Task.

KennethEnevoldsen · 2024-04-19T09:34:41Z

making this a single dataset on HF, hosted on mteb -- not multiple datasets, correct?

Yep!

when you say crediting both papers, how would i do that? reference and bibtex_citation metadata fields can only take strings -- i could of course reference those in the README of the new dataset, but we're still left with what these two metadata fields should look like in the new Task.

Let reference be the newest, but the BibTeX citation can be just two BibTeX citations.
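To illustrate how a string-only field can credit both papers, here is a small sketch that joins two BibTeX entries into one string. The entries are placeholders, not the real citations: the keys, titles, and the `@inproceedings` entry type are assumptions made for illustration only.

```python
# Placeholder entries only: keys and fields are NOT the real citations.
BIBTEX_ORIGINAL = """@inproceedings{placeholder_original_2021,
  title = {PLACEHOLDER: original English dataset paper},
  year = {2021},
}"""

BIBTEX_MULTILINGUAL = """@inproceedings{placeholder_multilingual_2022,
  title = {PLACEHOLDER: multilingual extension paper},
  year = {2022},
}"""

# A single string field can hold both entries, separated by a blank line.
bibtex_citation = "\n\n".join([BIBTEX_ORIGINAL, BIBTEX_MULTILINGUAL])

entry_count = bibtex_citation.count("@inproceedings{")
print(entry_count)
```

The result is still a plain `str`, so it fits a metadata field that only accepts strings.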

rbroc · 2024-04-19T09:44:22Z

awesome, thanks! i'll most probably have to focus on other stuff for the rest of today, but hoping to wrap this up early next week at the latest.

rbroc · 2024-04-19T21:47:19Z

@KennethEnevoldsen I have implemented requested changes, new dataset hosted here: https://huggingface.co/datasets/mteb/multi-hatecheck

I have kept all data from the original dataset in the HF dataset, as well as additional interesting columns one could focus on in the future (type of hate speech). I do subsampling and splitting here, easier to deal with if we want to later increase the number of samples.

Not sure what the points for this should be: is this 1 or 9 datasets? Also feels like repeated review should grant more than 2x points for you?

KennethEnevoldsen

@rbroc you might be interested in #440 for the multilabel case. Otherwise, everything looks good!

Will you add in the points?

docs/mmteb/points/439.jsonl

rbroc · 2024-04-23T15:23:03Z

@KennethEnevoldsen should be ready to merge

KennethEnevoldsen · 2024-04-23T16:14:05Z

Ahh @rbroc sorry missed the question related to points:

You should get 2 points for the dataset and 4 bonus points per language that does not have a classification task.

rbroc · 2024-04-24T05:06:10Z

no worries at all! it seems like all languages are covered in multilingual datasets, so the 2 I added is correct. :)
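The points rule described above (2 points for the dataset, plus 4 bonus points per language without an existing classification task) can be written as a short calculation. This is only a sketch of the arithmetic: the function name and the language codes are hypothetical, and the set of already-covered languages is made up for the example.

```python
BASE_POINTS = 2             # points for adding the dataset (from the thread)
BONUS_PER_NEW_LANGUAGE = 4  # bonus per language with no classification task yet


def dataset_points(task_languages, languages_with_classification):
    """Base points plus a bonus for every language not yet covered."""
    new_languages = [
        lang for lang in task_languages if lang not in languages_with_classification
    ]
    return BASE_POINTS + BONUS_PER_NEW_LANGUAGE * len(new_languages)


# Hypothetical language codes for illustration.
task_languages = ["eng", "ita", "pol"]

# Case from the thread: every language is already covered, so only the base applies.
all_covered = dataset_points(task_languages, {"eng", "ita", "pol"})

# Contrasting case: one language has no classification task yet.
one_new = dataset_points(task_languages, {"eng", "ita"})

print(all_covered, one_new)
```

With every language already covered the bonus term is zero, which matches the 2 points entered for this PR.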

KennethEnevoldsen reviewed Apr 19, 2024

mteb/tasks/Classification/multilingual/MultiHateClassification.py

KennethEnevoldsen self-assigned this Apr 19, 2024

rbroc added 7 commits April 19, 2024 23:23

resolve conflicts

b7a1e71

remove legacy results

3cf9757

remove legacy itahate results

0c061b7

load from new mteb-hosted dataset and align metadata

1b64310

add results

bf7dfc6

complete rebase

79709a5

create points file

70917fb

rbroc force-pushed the multi-hate branch from 110ec0b to 70917fb Compare April 19, 2024 21:33

rbroc added 2 commits April 19, 2024 23:36

remove old ita hatespeech results

fff31d7

restore correct init file for classification tasks

a8b1c9d

rerun with all languages

07f8325

rbroc mentioned this pull request Apr 22, 2024

Added PolHateClassification task #465

Closed

10 tasks

rbroc requested a review from KennethEnevoldsen April 23, 2024 10:51

KennethEnevoldsen approved these changes Apr 23, 2024

docs/mmteb/points/439.jsonl

add points

9b2f6d8

rbroc changed the title ~~Multilingual Hate Speech detection task~~ fix: add Multilingual Hate Speech detection task Apr 23, 2024

KennethEnevoldsen merged commit eee7175 into embeddings-benchmark:main Apr 24, 2024
7 checks passed
