fix: remove * imports (#1569)

* fix: Count unique texts, data leaks in calculate metrics (#1438)
* add more stat
* add more stat
* update statistics
* fix: update task metadata to allow for null (#1448)
* Update tasks table
* 1.19.5 Automatically generated by python-semantic-release
* Fix: Made data parsing in the leaderboard figure more robust (#1450)
  Bugfixes with data parsing in main figure
* Fixed task loading (#1451)
* Fixed task result loading from disk
* Fixed task result loading from disk
* fix: publish (#1452)
* 1.19.6 Automatically generated by python-semantic-release
* fix: Fix load external results with `None` mteb_version (#1453)
* fix
* lint
* 1.19.7 Automatically generated by python-semantic-release
* WIP: Polishing up leaderboard UI (#1461)
* fix: Removed column wrapping on the table, so that it remains readable
* Added disclaimer to figure
* fix: Added links to task info table, switched out license with metric
* fix: loading pre 1.11.0 (#1460)
* small fix
* fix: fix
* 1.19.8 Automatically generated by python-semantic-release
* fix: swap touche2020 to maintain compatibility (#1469)
  swap touche2020 for parity
* 1.19.9 Automatically generated by python-semantic-release
* docs: Add sum per language for task counts (#1468)
* add sum per lang
* add sort by sum option
* make lint
* fix: pinned datasets to <3.0.0 (#1470)
* 1.19.10 Automatically generated by python-semantic-release
* feat: add CUREv1 retrieval dataset (#1459)
* feat: add CUREv1 dataset
---------
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>
* feat: add missing domains to medical tasks
* feat: modify benchmark tasks
* chore: benchmark naming
---------
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
* Update tasks table
* 1.20.0 Automatically generated by python-semantic-release
* fix: check if `model` attr of model exists (#1499)
* check if model attr of model exists
* lint
* Fix retrieval evaluator
* 1.20.1 Automatically generated by python-semantic-release
* fix: Leaderboard demo data loading (#1507)
* Made get_scores error tolerant
* Added join_revisions, made get_scores failsafe
* Fetching metadata fixed fr HF models
* Added failsafe metadata fetching to leaderboard code
* Added revision joining to leaderboard app
* fix
* Only show models that have metadata, when filter_models is called
* Ran linting
* 1.20.2 Automatically generated by python-semantic-release
* fix: leaderboard only shows models that have ModelMeta (#1508)
  Filtering for models that have metadata
* 1.20.3 Automatically generated by python-semantic-release
* fix: align readme with current mteb (#1493)
* align readme with current mteb
* align with mieb branch
* fix test
* 1.20.4 Automatically generated by python-semantic-release
* docs: Add lang family mapping and map to task table (#1486)
* add lang family mapping and map to task table
* make lint
* add back some unclassified lang codes
* Update tasks table
* fix: Ensure that models match the names on embedding-benchmarks/results (#1519)
* 1.20.5 Automatically generated by python-semantic-release
* fix: Adding missing metadata on models and mathcing names up with the results repo (#1528)
* Added Voyage 3 models
* Added correct metadata to Cohere models and matched names with the results repo
* 1.20.6 Automatically generated by python-semantic-release
* feat: Evaluate missing splits (#1525)
* fix: evaluate missing splits (#1268)
* implement partial evaluation for missing splits
* lint
* requested changes done from scratch
* test for missing split evaluation added
* uncomment test
* lint
* avoid circular import
* use TaskResult
* skip tests for now
---------
Co-authored-by: Isaac Chung <[email protected]>
* got test_all_splits_evaluated passing
* tests passing
* address review comments
* make lint
* handle None cases for kg_co2_emissions
* use new results info
---------
Co-authored-by: Thivyanth <[email protected]>
* 1.21.0 Automatically generated by python-semantic-release
* fix: Correct typos superseeded -> superseded (#1532)
  fix typo -> superseded
* 1.21.1 Automatically generated by python-semantic-release
* fix: Task load data error for SICK-BR-STS and XStance (#1534)
* fix task load data for two tasks
* correct dataset keys
* 1.21.2 Automatically generated by python-semantic-release
* fix: Proprietary models now get correctly shown in leaderboard (#1530)
* Fixed showing proprietary models in leaderboard
* Added links to all OpenAI models
* Fixed table formatting issues
* Bumped Gradio version
* 1.21.3 Automatically generated by python-semantic-release
* docs: Add Model Meta parameters and metadata (#1536)
* add multi_qa_MiniLM_L6_cos_v1 model meta
* add all_mpnet_base_v2
* add parameters to model meta
* make lint
* add extra params to meta
* fix: add more model meta (jina, e5) (#1537)
* add e5 model meta
* address review comments
* 1.21.4 Automatically generated by python-semantic-release
* Add cohere models (#1538)
* fix: bug cohere names
* format
* fix: add nomic models (#1543)
  #1515
* fix: Added all-minilm-l12-v2 (#1542)
  #1515
* fix: Added arctic models (#1541)
  #1515
* fix: add sentence trimming to OpenAIWrapper (#1526)
* fix: add sentence trimming to OpenAIWrapper
* fix: import tiktoken library inside encode function
* fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name
* fix: pass tokenizer_name, max_tokens to loader
* fix: make tokenizer_name None for default
* fix: delete changes for ModelMeta
* fix: fix revision to 2 for OpenAI models
* fix: add docstring for OpenAIWrapper
* fix: lint
* feat: add openai optional dependency set
* fix: add sleep for too many requests
* fix: add lint
* fix: delete evaluate file
* 1.21.5 Automatically generated by python-semantic-release
* fix: Fixed metadata errors (#1547)
* 1.21.6 Automatically generated by python-semantic-release
* fix: remove curev1 from multlingual (#1552)
  Seems like it was added here: 1cc6c9e
* 1.21.7 Automatically generated by python-semantic-release
* fix: Add Model2vec (#1546)
* Added Model2Vec wrapper
* Added Model2vec models
* Added model2vec models to registry
* Added model2vec as a dependency
* Ran linting
* Update mteb/models/model2vec_models.py
  Co-authored-by: Kenneth Enevoldsen <[email protected]>
* Update mteb/models/model2vec_models.py
  Co-authored-by: Kenneth Enevoldsen <[email protected]>
* Added adapted_from and superseeded_by to model2vec models.
* Added missing import
* Moved pyproject.toml to optional dependencies
* Fixed typos
* Added import error and changed model to model_name
* Added Numpy to frameworks
* Added Numpy to frameworks
* Corrected false info on model2vec models
* Replaced np.inf with maxint
* Update mteb/models/model2vec_models.py
  Co-authored-by: Isaac Chung <[email protected]>
* Added option to have infinite max tokens, added it to Model2vec
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
* Made result loading more permissive, changed eval splits for HotPotQA and DBPedia (#1554)
* Removed train and dev from eval splits on HotpotQA
* Removed dev from eval splits on DBPedia
* Made task_results validation more permissive
* Readded exception in get_score
* Ran linting
* 1.21.8 Automatically generated by python-semantic-release
* docs: Correction of SICK-R metadata (#1558)
* Correction of SICK-R metadata
* Correction of SICK-R metadata
---------
Co-authored-by: rposwiata <[email protected]>
* feat(google_models): fix issues and add support for `text-embedding-005` and `text-multilingual-embedding-002` (#1562)
* fix: google_models batching and prompt
* feat: add text-embedding-005 and text-multilingual-embedding-002
* chore: `make lint` errors
* fix: address PR comments
* 1.22.0 Automatically generated by python-semantic-release
* fix(bm25s): search implementation (#1566)
  fix: bm25s implementation
* 1.22.1 Automatically generated by python-semantic-release
* docs: Fix dependency library name for bm25s (#1568)
* fix: bm25s implementation
* correct library name
---------
Co-authored-by: Daniel Buades Marcos <[email protected]>
* fix: Add training dataset to model meta (#1561)
* fix: Add training dataset to model meta
  Adresses #1556
* Added docs
* format
* feat: (cohere_models) cohere_task_type issue, batch requests and tqdm for visualization (#1564)
* feat: batch requests to cohere models
* fix: use correct task_type
* feat: use tqdm with openai
* fix: explicitely set `show_progress_bar` to False
* fix(publichealth-qa): ignore rows with `None` values in `question` or `answer` (#1565)
* 1.23.0 Automatically generated by python-semantic-release
* fix wongnai
* update inits
* fix tests
* lint
* update imports
* fix tests
* lint
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Márton Kardos <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Napuh <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
Co-authored-by: Thivyanth <[email protected]>
Co-authored-by: Youngjoon Jang <[email protected]>
Co-authored-by: Rafał Poświata <[email protected]>
embeddings-benchmark · Dec 9, 2024 · d0aa3a7
1 parent dec5d6a
commit d0aa3a7
Showing 207 changed files with 69,186 additions and 819 deletions.
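
The core change in this commit, replacing `from x import *` with explicit imports plus an `__all__` list, can be illustrated outside mteb. The following is a minimal sketch, not code from this repository: the module names `demo_implicit` and `demo_explicit`, the classes `PublicTask` and `AnotherTask`, and the helper `star_exports` are all hypothetical and exist only for this illustration.

```python
import sys
import types

# Hypothetical module body: two public classes, one private helper, and an
# incidental import that a wildcard import would otherwise re-export.
SOURCE = """
import os

class PublicTask: ...
class AnotherTask: ...
def _helper(): ...
"""


def star_exports(module_name: str, source: str) -> list[str]:
    """Build an in-memory module and list what `from module import *` binds."""
    module = types.ModuleType(module_name)
    exec(source, module.__dict__)
    sys.modules[module_name] = module  # make it importable by name

    namespace: dict = {}
    exec(f"from {module_name} import *", namespace)
    # Drop dunder entries such as __builtins__ that exec() adds itself.
    return sorted(name for name in namespace if not name.startswith("__"))


# Without __all__, every name not starting with an underscore leaks out,
# including the incidental `os` import.
without_all = star_exports("demo_implicit", SOURCE)

# With __all__, only the listed names are exported.
with_all = star_exports("demo_explicit", SOURCE + '\n__all__ = ["PublicTask"]\n')

print(without_all)
print(with_all)
```

Listing names explicitly, as the `__init__.py` changes below do, keeps a package's public surface deliberate and lets linters detect undefined or unused names.
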
diff --git a/README.md b/README.md
@@ -46,10 +46,8 @@ from sentence_transformers import SentenceTransformer
 
 # Define the sentence-transformers model name
 model_name = "average_word_embeddings_komninos"
-# or directly from huggingface:
-# model_name = "sentence-transformers/all-MiniLM-L6-v2"
 
-model = SentenceTransformer(model_name)
+model = mteb.get_model(model_name) # if the model is not implemented in MTEB it will be eq. to SentenceTransformer(model_name)
 tasks = mteb.get_tasks(tasks=["Banking77Classification"])
 evaluation = mteb.MTEB(tasks=tasks)
 results = evaluation.run(model, output_folder=f"results/{model_name}")
@@ -221,7 +219,10 @@ Note that the public leaderboard uses the test splits for all datasets except MS
 Models should implement the following interface, implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.
 
 ```python
+import mteb
 from mteb.encoder_interface import PromptType
+import numpy as np
+
 
 class CustomModel:
     def encode(
@@ -245,7 +246,7 @@ class CustomModel:
         pass
 
 model = CustomModel()
-tasks = mteb.get_task("Banking77Classification")
+tasks = mteb.get_tasks(tasks=["Banking77Classification"])
 evaluation = mteb.MTEB(tasks=tasks)
 evaluation.run(model)
 ```
@@ -379,6 +380,28 @@ results = mteb.load_results(models=models, tasks=tasks)
 df = results_to_dataframe(results)
 ```
 
+</details>
+
+
+<details>
+  <summary>  Annotate Contamination in the training data of a model  </summary>
+
+### Annotate Contamination
+
+Have you found contamination in the training data of a model? Please let us know, either by opening an issue or, ideally, by submitting a PR
+annotating the training datasets of the model:
+
+```py
+model_w_contamination = ModelMeta(
+    name="model-with-contamination",
+    ...
+    training_datasets={"ArguAna":  # name of dataset within MTEB
+                       ["test"]},  # the splits that have been trained on
+    ...
+)
+```
+
+
 </details>
 
 <details>

diff --git a/docs/create_tasks_table.py b/docs/create_tasks_table.py
@@ -8,6 +8,7 @@
 
 import mteb
 from mteb.abstasks.TaskMetadata import PROGRAMMING_LANGS, TASK_TYPE
+from mteb.languages import ISO_TO_FAM_LEVEL0, ISO_TO_LANGUAGE
 
 
 def author_from_bibtex(bibtex: str | None) -> str:
@@ -82,10 +83,21 @@ def create_task_lang_table(tasks: list[mteb.AbsTask], sort_by_sum=False) -> str:
     ## Wrangle for polars
     pl_table_dict = []
     for lang, d in table_dict.items():
-        d.update({"0-lang": lang})  # for sorting columns
+        d.update({"0-lang-code": lang})  # for sorting columns
         pl_table_dict.append(d)
 
-    df = pl.DataFrame(pl_table_dict).sort(by="0-lang")
+    df = pl.DataFrame(pl_table_dict).sort(by="0-lang-code")
+    df = df.with_columns(
+        pl.col("0-lang-code")
+        .replace_strict(ISO_TO_LANGUAGE, default="unknown")
+        .alias("1-lang-name")
+    )
+    df = df.with_columns(
+        pl.col("0-lang-code")
+        .replace_strict(ISO_TO_FAM_LEVEL0, default="Unclassified")
+        .alias("2-lang-fam")
+    )
+
     df = df.with_columns(sum=pl.sum_horizontal(get_args(TASK_TYPE)))
     df = df.select(sorted(df.columns))
     if sort_by_sum:
@@ -96,7 +108,7 @@ def create_task_lang_table(tasks: list[mteb.AbsTask], sort_by_sum=False) -> str:
     task_names_md = " | ".join(sorted(get_args(TASK_TYPE)))
     horizontal_line_md = "---|---" * (len(sorted(get_args(TASK_TYPE))) + 1)
     table = f"""
-| Language | {task_names_md} | Sum |
+| ISO Code | Language | Family | {task_names_md} | Sum |
 |{horizontal_line_md}|
 """
 
@@ -119,14 +131,14 @@ def insert_tables(
     file_path: str, tables: list[str], tags: list[str] = ["TASKS TABLE"]
 ) -> None:
     """Insert tables within <!-- TABLE START --> and <!-- TABLE END --> or similar tags."""
-    md = Path(file_path).read_text()
+    md = Path(file_path).read_text(encoding="utf-8")
 
     for table, tag in zip(tables, tags):
         start = f"<!-- {tag} START -->"
         end = f"<!-- {tag} END -->"
         md = md.replace(md[md.index(start) + len(start) : md.index(end)], table)
 
-    Path(file_path).write_text(md)
+    Path(file_path).write_text(md, encoding="utf-8")
 
 
 def main():

diff --git a/mteb/__init__.py b/mteb/__init__.py
@@ -10,17 +10,23 @@
     MTEB_RETRIEVAL_WITH_INSTRUCTIONS,
     CoIR,
 )
-from mteb.evaluation import *
+from mteb.encoder_interface import Encoder
+from mteb.evaluation import MTEB
 from mteb.load_results import BenchmarkResults, load_results
-from mteb.models import get_model, get_model_meta, get_model_metas
+from mteb.load_results.task_results import TaskResult
+from mteb.models import (
+    SentenceTransformerWrapper,
+    get_model,
+    get_model_meta,
+    get_model_metas,
+)
 from mteb.overview import TASKS_REGISTRY, get_task, get_tasks
 
 from .benchmarks.benchmarks import Benchmark
 from .benchmarks.get_benchmark import BENCHMARK_REGISTRY, get_benchmark, get_benchmarks
 
 __version__ = version("mteb")  # fetch version from install metadata
 
-
 __all__ = [
     "MTEB_ENG_CLASSIC",
     "MTEB_MAIN_RU",
@@ -40,4 +46,8 @@
     "get_benchmarks",
     "BenchmarkResults",
     "BENCHMARK_REGISTRY",
+    "MTEB",
+    "TaskResult",
+    "SentenceTransformerWrapper",
+    "Encoder",
 ]
diff --git a/mteb/abstasks/AbsTask.py b/mteb/abstasks/AbsTask.py
@@ -72,11 +72,11 @@ def __init__(self, seed: int = 42, **kwargs: Any):
         torch.manual_seed(self.seed)
         torch.cuda.manual_seed_all(self.seed)
 
-    def check_if_dataset_is_superseeded(self):
-        """Check if the dataset is superseeded by a newer version"""
+    def check_if_dataset_is_superseded(self):
+        """Check if the dataset is superseded by a newer version"""
         if self.superseded_by:
             logger.warning(
-                f"Dataset '{self.metadata.name}' is superseeded by '{self.superseded_by}', you might consider using the newer version of the dataset."
+                f"Dataset '{self.metadata.name}' is superseded by '{self.superseded_by}', you might consider using the newer version of the dataset."
             )
 
     def dataset_transform(self):

diff --git a/mteb/abstasks/TaskMetadata.py b/mteb/abstasks/TaskMetadata.py
@@ -168,6 +168,7 @@
         "cc0-1.0",
         "bsd-3-clause",
         "gpl-3.0",
+        "lgpl-3.0",
         "cdla-sharing-1.0",
         "mpl-2.0",
     ]

diff --git a/mteb/abstasks/__init__.py b/mteb/abstasks/__init__.py
@@ -1,15 +1,33 @@
 from __future__ import annotations
 
-from ..evaluation.LangMapping import *
-from .AbsTask import *
-from .AbsTaskBitextMining import *
-from .AbsTaskClassification import *
-from .AbsTaskClustering import *
-from .AbsTaskMultilabelClassification import *
-from .AbsTaskPairClassification import *
-from .AbsTaskReranking import *
-from .AbsTaskRetrieval import *
-from .AbsTaskSpeedTask import *
-from .AbsTaskSTS import *
-from .AbsTaskSummarization import *
-from .MultilingualTask import *
+from .AbsTask import AbsTask
+from .AbsTaskBitextMining import AbsTaskBitextMining
+from .AbsTaskClassification import AbsTaskClassification
+from .AbsTaskClustering import AbsTaskClustering
+from .AbsTaskClusteringFast import AbsTaskClusteringFast
+from .AbsTaskMultilabelClassification import AbsTaskMultilabelClassification
+from .AbsTaskPairClassification import AbsTaskPairClassification
+from .AbsTaskReranking import AbsTaskReranking
+from .AbsTaskRetrieval import AbsTaskRetrieval
+from .AbsTaskSpeedTask import AbsTaskSpeedTask
+from .AbsTaskSTS import AbsTaskSTS
+from .AbsTaskSummarization import AbsTaskSummarization
+from .MultilingualTask import MultilingualTask
+from .TaskMetadata import TaskMetadata
+
+__all__ = [
+    "AbsTask",
+    "AbsTaskBitextMining",
+    "AbsTaskClassification",
+    "AbsTaskClustering",
+    "AbsTaskClusteringFast",
+    "AbsTaskMultilabelClassification",
+    "AbsTaskPairClassification",
+    "AbsTaskReranking",
+    "AbsTaskRetrieval",
+    "AbsTaskSpeedTask",
+    "AbsTaskSTS",
+    "AbsTaskSummarization",
+    "MultilingualTask",
+    "TaskMetadata",
+]
diff --git a/mteb/benchmarks/__init__.py b/mteb/benchmarks/__init__.py
@@ -1,4 +1,57 @@
 from __future__ import annotations
 
-from mteb.benchmarks.benchmarks import *
-from mteb.benchmarks.get_benchmark import *
+from mteb.benchmarks.benchmarks import (
+    BRIGHT,
+    LONG_EMBED,
+    MTEB_DEU,
+    MTEB_EN,
+    MTEB_ENG_CLASSIC,
+    MTEB_EU,
+    MTEB_FRA,
+    MTEB_INDIC,
+    MTEB_JPN,
+    MTEB_KOR,
+    MTEB_MAIN_RU,
+    MTEB_MINERS_BITEXT_MINING,
+    MTEB_POL,
+    MTEB_RETRIEVAL_LAW,
+    MTEB_RETRIEVAL_MEDICAL,
+    MTEB_RETRIEVAL_WITH_INSTRUCTIONS,
+    SEB,
+    Benchmark,
+    CoIR,
+    MTEB_code,
+    MTEB_multilingual,
+)
+from mteb.benchmarks.get_benchmark import (
+    BENCHMARK_REGISTRY,
+    get_benchmark,
+    get_benchmarks,
+)
+
+__all__ = [
+    "Benchmark",
+    "MTEB_EN",
+    "MTEB_ENG_CLASSIC",
+    "MTEB_MAIN_RU",
+    "MTEB_RETRIEVAL_WITH_INSTRUCTIONS",
+    "MTEB_RETRIEVAL_LAW",
+    "MTEB_RETRIEVAL_MEDICAL",
+    "MTEB_MINERS_BITEXT_MINING",
+    "SEB",
+    "CoIR",
+    "MTEB_FRA",
+    "MTEB_DEU",
+    "MTEB_KOR",
+    "MTEB_POL",
+    "MTEB_code",
+    "MTEB_multilingual",
+    "MTEB_JPN",
+    "MTEB_INDIC",
+    "MTEB_EU",
+    "LONG_EMBED",
+    "BRIGHT",
+    "BENCHMARK_REGISTRY",
+    "get_benchmarks",
+    "get_benchmark",
+]
diff --git a/mteb/descriptive_stats/Classification/Ddisco.json b/mteb/descriptive_stats/Classification/Ddisco.json
@@ -0,0 +1,44 @@
+{
+    "test": {
+        "num_samples": 201,
+        "number_of_characters": 200062,
+        "number_texts_intersect_with_train": 1,
+        "min_text_length": 529,
+        "average_text_length": 995.3333333333334,
+        "max_text_length": 2050,
+        "unique_text": 201,
+        "unique_labels": 3,
+        "labels": {
+            "2": {
+                "count": 76
+            },
+            "3": {
+                "count": 115
+            },
+            "1": {
+                "count": 10
+            }
+        }
+    },
+    "train": {
+        "num_samples": 801,
+        "number_of_characters": 779241,
+        "number_texts_intersect_with_train": null,
+        "min_text_length": 492,
+        "average_text_length": 972.8352059925094,
+        "max_text_length": 2411,
+        "unique_text": 796,
+        "unique_labels": 3,
+        "labels": {
+            "1": {
+                "count": 30
+            },
+            "2": {
+                "count": 325
+            },
+            "3": {
+                "count": 446
+            }
+        }
+    }
+}
diff --git a/mteb/descriptive_stats/Classification/GeorgianSentimentClassification.json b/mteb/descriptive_stats/Classification/GeorgianSentimentClassification.json
@@ -0,0 +1,38 @@
+{
+    "test": {
+        "num_samples": 1200,
+        "number_of_characters": 141679,
+        "number_texts_intersect_with_train": 0,
+        "min_text_length": 25,
+        "average_text_length": 118.06583333333333,
+        "max_text_length": 566,
+        "unique_text": 1200,
+        "unique_labels": 2,
+        "labels": {
+            "1": {
+                "count": 600
+            },
+            "0": {
+                "count": 600
+            }
+        }
+    },
+    "train": {
+        "num_samples": 330,
+        "number_of_characters": 37706,
+        "number_texts_intersect_with_train": null,
+        "min_text_length": 19,
+        "average_text_length": 114.26060606060607,
+        "max_text_length": 315,
+        "unique_text": 330,
+        "unique_labels": 2,
+        "labels": {
+            "1": {
+                "count": 165
+            },
+            "0": {
+                "count": 165
+            }
+        }
+    }
+}