Add Indic STS benchmark dataset #524

jaygala24 · 2024-04-23T14:18:59Z

Checklist for adding MMTEB dataset

Reason for dataset addition:

jaygala24 · 2024-04-23T14:25:39Z

Add Semantic Textual Similarity benchmark between English and 12 high-resource Indic languages.

Team:

Diganta Misra (@digantamisra98) - MILA
Jay Gala (@jaygala24) - AI4Bharat

KennethEnevoldsen

Looks great! Feel free to add the points as well. I only have one pointer, related to size.

mteb/tasks/STS/multilingual/IndicCrosslingualSTS.py

jaygala24 · 2024-04-23T15:18:05Z

Points Summary:

This is the first dataset for Indic languages in the STS task category, so all 12 Indic languages are new to this task category.

This dataset = 2 pts
Any new language not covered previously in STS task = 4 pts x 12 languages

Total points = 2 + 4 x 12 = 50 pts
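For anyone double-checking the total, here is a minimal sketch of the arithmetic. The constant and function names are hypothetical and only for illustration; the point values themselves are the ones listed in the summary above.

```python
# Hypothetical helper for illustration; not part of the mteb codebase.
# Point values are taken from the summary in this comment.
DATASET_POINTS = 2        # points for contributing the dataset
NEW_LANGUAGE_POINTS = 4   # points per language new to the STS task category


def contribution_points(num_new_languages: int) -> int:
    """Return the dataset points plus the bonus for each new language."""
    return DATASET_POINTS + NEW_LANGUAGE_POINTS * num_new_languages


total = contribution_points(12)  # 12 Indic languages, all new to STS
print(f"Total points = {total}")
```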

KennethEnevoldsen

Thanks for downsampling. Will merge this PR in.

mteb/tasks/STS/multilingual/IndicCrosslingualSTS.py

add Indic STS benchmark dataset

6571247

KennethEnevoldsen reviewed Apr 23, 2024

mteb/tasks/STS/multilingual/IndicCrosslingualSTS.py

update metadata for Indic STS benchmark

f9f9029

jaygala24 added 3 commits April 23, 2024 20:52

add points for the contribution

902e88a

update reviewer name in points

68221be

downsample the test set size

0f01d27

KennethEnevoldsen approved these changes Apr 23, 2024

mteb/tasks/STS/multilingual/IndicCrosslingualSTS.py

Update mteb/tasks/STS/multilingual/IndicCrosslingualSTS.py

ceb9726

KennethEnevoldsen enabled auto-merge (squash) April 23, 2024 18:15

KennethEnevoldsen merged commit 1f26615 into embeddings-benchmark:main Apr 23, 2024
7 checks passed
