
MTEB Evaluation Running Time #140

Open
stefanhgm opened this issue Aug 16, 2024 · 14 comments

@stefanhgm

Hi everyone!

Thanks for developing LLM2Vec and making the source code available.

I was trying to reproduce LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised and to train a model based on Llama 3.1 8B. I trained both models and now want to obtain results on the MTEB benchmark for comparison. Unfortunately, it seems to take a very long time to run the benchmark with the LLM2Vec models. I have currently finished the tasks CQADupstackWordpressRetrieval and ClimateFever (also see #135), and the next task (I think it is DBPedia) takes over 48h on a single A100 80GB. Is this the expected behavior? Can you share some insights about the running times of LLM2Vec on MTEB, or advice on how to speed it up?

I use the below snippet to run MTEB based on the script you provided:

    model = mteb.get_model(args.model_name, **model_kwargs)
    tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, output_folder=args.output_dir)

Thanks for any help!

@vaibhavad
Collaborator

Hi @stefanhgm,

Yes, unfortunately evaluating 7B models on MTEB is an extremely long and arduous process. The only thing that can help speed up the evaluation is a multi-GPU setup, in case one is available.

The library supports multi-GPU evaluation without any code changes.
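
For illustration, a minimal sketch of what that looks like in practice, assuming the standard CUDA_VISIBLE_DEVICES mechanism and the same mteb calls as in your snippet:

import os
# Make the GPUs you want to use visible before anything touches CUDA; with several
# devices visible, encoding is spread across them automatically, so the evaluation
# code itself stays unchanged.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")
import mteb
from mteb.benchmarks import MTEB_MAIN_EN
model = mteb.get_model("McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised")
tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results")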

@stefanhgm
Author

Hi @vaibhavad,

Thanks for coming back on this! My experience on 4 GPUs is that it only gets ~2.5x faster. Could you maybe give me an estimate of the overall running time, or the time you needed for DBPedia, if that's available?

Otherwise I will just try again with a longer time limit or more GPUs. Thank you!

@vaibhavad
Collaborator

Unfortunately, I don't remember the running time of DBPedia and I don't have the log files anymore. However, I do remember that out of all tasks, MSMARCO took the longest, at 7 hours on 8 A100 GPUs. So DBPedia should take less than that.

@vaibhavad
Collaborator

vaibhavad commented Aug 30, 2024

Hi @stefanhgm,

I just ran the DBPedia evaluation for the Llama 3.1 8B model; it took 2.5 hours on 8 x H100 80GB GPUs.

@stefanhgm
Author

Thank you! That was helpful.

@stefanhgm
Author

Hi @vaibhavad,

Sorry, I stumbled across another issue: do we actually have to run the tasks on the train, dev, and test splits (as is done by default), or does the test split suffice? It seems only the test scores are uploaded to the leaderboard, and this would drastically reduce the running time.

I use the following code snippet to run the MTEB benchmark in mteb_eval.py:

 tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
 evaluation = mteb.MTEB(tasks=tasks)
 results = evaluation.run(model, verbosity=2, output_folder=args.output_dir)

I looked for a way to filter for only the test splits, but I did not find a straightforward one. What approach did you use to create the results for the MTEB leaderboard?

Thank you!

@stefanhgm
Author

I am now trying it with the following code, using only the test splits:

tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
# Remove MSMARCO because it is evaluated on dev set
tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
evaluation = mteb.MTEB(tasks=tasks)
# Only run on test set for leaderboard, exception: MSMARCO manually on dev set
results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
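
The MSMARCO exception can then be handled in a separate run; a minimal sketch using the same API, just with the dev split selected:

# MSMARCO is the one leaderboard task scored on the dev split, so run it separately.
msmarco_tasks = mteb.get_tasks(tasks=["MSMARCO"], languages=["eng"])
msmarco_eval = mteb.MTEB(tasks=msmarco_tasks)
msmarco_results = msmarco_eval.run(model, eval_splits=["dev"], verbosity=2, output_folder=args.output_dir)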

@nasosger

nasosger commented Oct 1, 2024

Hi @stefanhgm,
I am interested in the LLM2Vec MTEB evaluation of custom models. I have trained a Gemma 2B with the bi-mntp-simcse setting, but cannot get the evaluation script to work for the custom model.
Could you provide me with some details on the modifications you had to make to the MTEB source code? I imagine that for every custom model the same modifications should work, given the correct versions of the libraries.

@vaibhavad
Collaborator

Hi @stefanhgm ,

Do we actually have to run the tasks on the train, dev, and test splits (as is done by default), or does the test split suffice?

Just the test split suffices. I believe you have already figured out a way to run on just the dev/test splits with the MTEB package. Let me know if you need anything else.

@stefanhgm
Author

Hi @nasosger,

Sorry for the very late reply. I basically changed the things I pointed out earlier. Here is my mteb_eval.py:

import argparse
import mteb
from mteb.benchmarks import MTEB_MAIN_EN
import json

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name",
        type=str,
        default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    )
    parser.add_argument("--task_name", type=str, default="STS16")
    parser.add_argument("--task_types", type=str, default="")
    parser.add_argument("--do_mteb_main_en", action="store_true", default=False)
    parser.add_argument(
        "--task_to_instructions_fp",
        type=str,
        default="test_configs/mteb/task_to_instructions.json",
    )
    parser.add_argument("--output_dir", type=str, default="results")

    args = parser.parse_args()

    model_kwargs = {}
    if args.task_to_instructions_fp is not None:
        with open(args.task_to_instructions_fp, "r") as f:
            task_to_instructions = json.load(f)
        model_kwargs["task_to_instructions"] = task_to_instructions

    model = mteb.get_model(args.model_name, **model_kwargs)

    if args.do_mteb_main_en:
        tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
        # Remove MSMARCO
        # "Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used."
        # See: https://github.com/embeddings-benchmark/mteb
        tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
        assert len(tasks_orig) == 67 and len(tasks) == 66
    elif args.task_types:
        tasks = mteb.get_tasks(task_types=[args.task_types], languages=["eng"])
    else:
        tasks = mteb.get_tasks(tasks=[args.task_name], languages=["eng"])

    evaluation = mteb.MTEB(tasks=tasks)

    # Set logging to debug
    mteb.logger.setLevel(mteb.logging.DEBUG)
    # Only run on test set for leaderboard, exception: MSMARCO manually on dev set
    results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
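
For completeness, I invoke the script with something like python experiments/mteb_eval.py --do_mteb_main_en --output_dir results, where both flags correspond to the argparse options defined above; adjust the path to wherever the script lives in your setup.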

@BtlWolf

BtlWolf commented Nov 23, 2024

@stefanhgm Hi there, I hope you're doing well. Would you be able to share your multi-GPU evaluation code with me? I'd really appreciate it.

@stefanhgm
Author

Hi @BtlWolf,

I think the above script ran on multiple GPUs automatically for me. I checked my run commands, and there is no part where I explicitly enabled a multi-GPU setting.

@BtlWolf

BtlWolf commented Nov 23, 2024

@stefanhgm
Specifically, I ran into the following error when using multiple GPUs. Is this related to the platform? I am using Linux:
ERROR:mteb.evaluation.MTEB:Error while evaluating NFCorpus:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html  

Traceback (most recent call last):
File "", line 1, in
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
prepare(preparation_data)
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 291, in run_path
File "", line 98, in _run_module_code
File "", line 88, in _run_code
File "/data1/jiyifan/llm2vec-main/experiments/mteb_eval.py", line 21, in
evaluation.run(model, output_folder="/data1/jiyifan/llm2vec-main/mteb_result/Sheared-Llama-1.3B-eos_token")
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 422, in run
raise e
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 383, in run
results, tick, tock = self._run_eval(
^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 260, in _run_eval
results = task.evaluate(
^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 286, in evaluate
scores[hf_subset] = self._evaluate_subset(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 295, in _evaluate_subset
results = retriever(corpus, queries)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/evaluators/RetrievalEvaluator.py", line 492, in call
return self.retriever.search(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/evaluators/RetrievalEvaluator.py", line 109, in search
query_embeddings = self.model.encode_queries(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/models/llm2vec_models.py", line 124, in encode_queries
return self.encode(queries, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/models/llm2vec_models.py", line 107, in encode
return self.model.encode(sentences, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/jiyifan/llm2vec-main/llm2vec/llm2vec.py", line 399, in encode
with cuda_compatible_multiprocess.Pool(num_proc) as p:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/pool.py", line 215, in init
self._repopulate_pool()
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/pool.py", line 329, in _repopulate_pool_static
w.start()
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
_check_not_importing_main()
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

@BtlWolf

BtlWolf commented Nov 23, 2024

@stefanhgm I have solved this problem: I needed to add an `if __name__ == "__main__":` guard to the mteb evaluation script.
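
For anyone hitting the same error, a minimal sketch of the guard, assuming the rest of the script matches the mteb_eval.py posted above:

import mteb
from mteb.benchmarks import MTEB_MAIN_EN

def main():
    model = mteb.get_model("McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised")
    tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    evaluation = mteb.MTEB(tasks=tasks)
    # LLM2Vec encodes with a multiprocessing pool; under the "spawn" start method
    # the main module is re-imported in every worker, so the evaluation call must
    # only run when the script is executed directly.
    evaluation.run(model, eval_splits=["test"], output_folder="results")

if __name__ == "__main__":
    main()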
