fix: add Multilingual Hate Speech detection task #439
First stab at this, but before I go ahead would love to have input on the following:
Let us use both.
Given the license, I would rehost it. Feel free to host it on the mteb HF group (you can request to join and I will add you).
Generally, this looks really promising, @rbroc. I think with rehosting and downsampling, it will be great.
Also note that the point system changed (to avoid merge conflicts) - see #438
mteb/tasks/Classification/multilingual/MultiHateClassification.py
thanks @KennethEnevoldsen!
Yep!
Let the reference be the newest one, but the BibTeX citation can simply contain both BibTeX entries.
awesome thanks! i'll most probably have to focus on other stuff for the rest of today, but hoping to wrap this up early next week at the latest.
@KennethEnevoldsen I have implemented the requested changes. The new dataset is hosted here: https://huggingface.co/datasets/mteb/multi-hatecheck
I have kept all data from the original dataset in the HF dataset, as well as additional interesting columns one could focus on in the future (type of hate speech). I do the subsampling and splitting here, which is easier to deal with if we later want to increase the number of samples.
Not sure what the point system for this should be: is this 1 or 9 datasets? It also feels like repeated review should grant more than 2x points for you?
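For readers unfamiliar with the subsample-then-split step mentioned above, here is a minimal standard-library sketch. It is not the code from this PR: the `is_hateful` column name, the sample sizes, the seed, and both helper functions are hypothetical and only illustrate the idea of keeping the full dataset hosted and downsampling at load time.

```python
import random
from collections import defaultdict


def stratified_subsample(rows, label_key, n_samples, seed=42):
    """Pick roughly n_samples rows, keeping label proportions close to the original."""
    if n_samples >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Allocate samples proportionally to the label's share of the data.
        k = max(1, round(n_samples * len(group) / len(rows)))
        sampled.extend(rng.sample(group, min(k, len(group))))
    rng.shuffle(sampled)
    return sampled


def train_test_split(rows, test_fraction=0.5, seed=42):
    """Shuffle rows and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data standing in for one language subset (hypothetical column names).
rows = [{"text": f"example {i}", "is_hateful": i % 4 == 0} for i in range(1000)]
subset = stratified_subsample(rows, "is_hateful", n_samples=200)
train, test = train_test_split(subset, test_fraction=0.5)
```

Because the hosted dataset stays complete, raising `n_samples` later only changes the loading step, not the data on the Hub.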
@KennethEnevoldsen should be ready to merge
Ahh @rbroc, sorry, I missed the question related to points: you should get 2 points for the dataset and 4 bonus points per language that does not yet have a classification task.
no worries at all! it seems like all languages are already covered by multilingual datasets, so the 2 points I added are correct. :)
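The arithmetic discussed above can be written out as a small sketch. The function name and parameters are hypothetical; only the values (2 points for the dataset, 4 bonus points per language without an existing classification task) come from the thread.

```python
def submission_points(n_new_languages, dataset_points=2, bonus_per_language=4):
    """Points for one dataset plus a bonus per language lacking a classification task."""
    return dataset_points + bonus_per_language * n_new_languages


# In this PR every language was already covered, so no bonus applies.
points_for_this_pr = submission_points(n_new_languages=0)
```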
Checklist for adding MMTEB dataset
closes #395
- Dataset runs with the mteb package.
- Models run on the task using the mteb run -m {model_name} -t {task_name} command:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- make test
- make lint
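As a convenience, the two checklist model runs can be spelled out by filling in the command template. The sketch below only builds and prints the command strings; it does not invoke `mteb`. The task name is an assumption taken from the file name in this PR, and `build_commands` is a hypothetical helper.

```python
import shlex

MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]
# Assumption: the task name matches the file name MultiHateClassification.py.
TASK_NAME = "MultiHateClassification"


def build_commands(models, task_name):
    """Fill the `mteb run -m {model_name} -t {task_name}` template for each model."""
    return [shlex.join(["mteb", "run", "-m", model, "-t", task_name]) for model in models]


commands = build_commands(MODELS, TASK_NAME)
for command in commands:
    print(command)
```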