
MTEB Evaluation Running Time #140

Open
stefanhgm opened this issue Aug 16, 2024 · 14 comments

@stefanhgm

Hi everyone!

Thanks for developing LLM2Vec and making the source code available.

I was trying to reproduce LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised and to train a model based on Llama 3.1 8B. I trained both models and now want to obtain results on the MTEB benchmark for comparison. Unfortunately, it seems to take a very long time to run the benchmark with the LLM2Vec models. I have currently finished the tasks CQADupstackWordpressRetrieval and ClimateFever (also see #135), and the next task (I think it is DBPedia) takes over 48h on a single A100 80GB. Is this the expected behavior? Can you share some insights about the running times of LLM2Vec on MTEB, or advice on how to speed it up?

I use the below snippet to run MTEB based on the script you provided:

    model = mteb.get_model(args.model_name, **model_kwargs)
    tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, output_folder=args.output_dir)

Thanks for any help!

@vaibhavad
Collaborator

Hi @stefanhgm,

Yes, unfortunately evaluating 7B models on MTEB is an extremely long and arduous process. The only thing that can help speed up the evaluation is a multi-GPU setup, in case one is available.

The library supports multi-GPU evaluation without any code changes.
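
For illustration, a minimal sketch of what that looks like in practice, assuming the standard CUDA_VISIBLE_DEVICES mechanism and the same mteb calls as in your snippet:

import os
# Make the GPUs you want to use visible before anything touches CUDA; with several
# devices visible, encoding is spread across them automatically, so the evaluation
# code itself stays unchanged.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")
import mteb
from mteb.benchmarks import MTEB_MAIN_EN
model = mteb.get_model("McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised")
tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results")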

@stefanhgm
Author

Hi @vaibhavad,

Thanks for coming back on this! My experience on 4 GPUs is that it only gets ~2.5x faster. Could you maybe give me an estimate of the overall running time, or the time you needed for DBPedia, if that's available?

Otherwise I will just try again with a longer time limit or more GPUs. Thank you!

@vaibhavad
Collaborator

Unfortunately, I don't remember the running time of DBPedia and I don't have the log files anymore. However, I do remember that out of all tasks, MSMARCO took the longest, at 7 hours on 8 A100 GPUs. So DBPedia should take less than that.

@vaibhavad
Collaborator

vaibhavad commented Aug 30, 2024

Hi @stefanhgm,

I just ran the DBPedia evaluation for the Llama 3.1 8B model; it took 2.5 hours on 8 x H100 80GB GPUs.

@stefanhgm
Author

Thank you! That was helpful.

@stefanhgm
Author

Hi @vaibhavad,

Sorry, I stumbled across another issue: do we actually have to run the tasks on the train, dev, and test splits (as is done by default), or does the test split suffice? It seems only the test scores are uploaded to the leaderboard, and this would drastically reduce the running time.

I use the following code snippet to run the MTEB benchmark in mteb_eval.py:

 tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
 evaluation = mteb.MTEB(tasks=tasks)
 results = evaluation.run(model, verbosity=2, output_folder=args.output_dir)

I looked for a way to filter for only the test splits, but I did not find a straightforward one. What approach did you use to create the results for the MTEB leaderboard?

Thank you!

@stefanhgm
Author

I am now trying it with the following code, using only the test splits:

tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
# Remove MSMARCO because it is evaluated on dev set
tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
evaluation = mteb.MTEB(tasks=tasks)
# Only run on test set for leaderboard, exception: MSMARCO manually on dev set
results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
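
The MSMARCO exception can then be handled in a separate run; a minimal sketch using the same API, just with the dev split selected:

# MSMARCO is the one leaderboard task scored on the dev split, so run it separately.
msmarco_tasks = mteb.get_tasks(tasks=["MSMARCO"], languages=["eng"])
msmarco_eval = mteb.MTEB(tasks=msmarco_tasks)
msmarco_results = msmarco_eval.run(model, eval_splits=["dev"], verbosity=2, output_folder=args.output_dir)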

@nasosger

nasosger commented Oct 1, 2024

Hi @stefanhgm,
I am interested in the LLM2Vec MTEB evaluation of custom models. I have trained a Gemma 2B with the bi-mntp-simcse setting, but cannot get the evaluation script to work for the custom model.
Could you provide me with some details on the modifications you had to make to the MTEB source code? I imagine that for every custom model the same modifications should work, given the correct versions of the libraries.

@vaibhavad
Collaborator

Hi @stefanhgm ,

Do we actually have to run the tasks on the train, dev, and test splits (as is done by default), or does the test split suffice?

Just the test split suffices. I believe you have already figured out a way to run on just the dev/test splits with the MTEB package. Let me know if you need anything else.

@stefanhgm
Author

Hi @nasosger,

Sorry for the very late reply. I basically changed the things I pointed out earlier. Here is my mteb_eval.py:

import argparse
import mteb
from mteb.benchmarks import MTEB_MAIN_EN
import json

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name",
        type=str,
        default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    )
    parser.add_argument("--task_name", type=str, default="STS16")
    parser.add_argument("--task_types", type=str, default="")
    parser.add_argument("--do_mteb_main_en", action="store_true", default=False)
    parser.add_argument(
        "--task_to_instructions_fp",
        type=str,
        default="test_configs/mteb/task_to_instructions.json",
    )
    parser.add_argument("--output_dir", type=str, default="results")

    args = parser.parse_args()

    model_kwargs = {}
    if args.task_to_instructions_fp is not None:
        with open(args.task_to_instructions_fp, "r") as f:
            task_to_instructions = json.load(f)
        model_kwargs["task_to_instructions"] = task_to_instructions

    model = mteb.get_model(args.model_name, **model_kwargs)

    if args.do_mteb_main_en:
        tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
        # Remove MSMARCO
        # "Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used."
        # See: https://github.com/embeddings-benchmark/mteb
        tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
        assert len(tasks_orig) == 67 and len(tasks) == 66
    elif args.task_types:
        tasks = mteb.get_tasks(task_types=[args.task_types], languages=["eng"])
    else:
        tasks = mteb.get_tasks(tasks=[args.task_name], languages=["eng"])

    evaluation = mteb.MTEB(tasks=tasks)

    # Set logging to debug
    mteb.logger.setLevel(mteb.logging.DEBUG)
    # Only run on test set for leaderboard, exception: MSMARCO manually on dev set
    results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
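
For completeness, I invoke the script with something like python experiments/mteb_eval.py --do_mteb_main_en --output_dir results, where both flags correspond to the argparse options defined above; adjust the path to wherever the script lives in your setup.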

@BtlWolf

BtlWolf commented Nov 23, 2024

@stefanhgm Hi there, I hope you're doing well. Would you be able to share your multi-GPU evaluation code with me? I'd really appreciate it.

@stefanhgm
Author

Hi @BtlWolf,

I think the above script ran on multiple GPUs automatically for me. I checked my run commands, and there is no part where I explicitly enabled a multi-GPU setting.

@BtlWolf

BtlWolf commented Nov 23, 2024

@stefanhgm
Specifically, I ran into the following error when using multiple GPUs. Is this related to the platform? I am using Linux:
ERROR:mteb.evaluation.MTEB:Error while evaluating NFCorpus:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html  

Traceback (most recent call last):
File "", line 1, in
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
prepare(preparation_data)
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 291, in run_path
File "", line 98, in _run_module_code
File "", line 88, in _run_code
File "/data1/jiyifan/llm2vec-main/experiments/mteb_eval.py", line 21, in
evaluation.run(model, output_folder="/data1/jiyifan/llm2vec-main/mteb_result/Sheared-Llama-1.3B-eos_token")
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 422, in run
raise e
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 383, in run
results, tick, tock = self._run_eval(
^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 260, in _run_eval
results = task.evaluate(
^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 286, in evaluate
scores[hf_subset] = self._evaluate_subset(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 295, in _evaluate_subset
results = retriever(corpus, queries)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/evaluators/RetrievalEvaluator.py", line 492, in call
return self.retriever.search(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/evaluation/evaluators/RetrievalEvaluator.py", line 109, in search
query_embeddings = self.model.encode_queries(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/models/llm2vec_models.py", line 124, in encode_queries
return self.encode(queries, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/site-packages/mteb/models/llm2vec_models.py", line 107, in encode
return self.model.encode(sentences, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/jiyifan/llm2vec-main/llm2vec/llm2vec.py", line 399, in encode
with cuda_compatible_multiprocess.Pool(num_proc) as p:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/pool.py", line 215, in init
self._repopulate_pool()
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/pool.py", line 329, in _repopulate_pool_static
w.start()
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
_check_not_importing_main()
File "/home/jiyifan/miniconda3/envs/llm2vec/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

@BtlWolf

BtlWolf commented Nov 23, 2024

@stefanhgm I have solved this problem: I needed to add an `if __name__ == "__main__":` guard to the mteb evaluation script.
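
For anyone hitting the same error, a minimal sketch of the guard, assuming the rest of the script matches the mteb_eval.py posted above:

import mteb
from mteb.benchmarks import MTEB_MAIN_EN

def main():
    model = mteb.get_model("McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised")
    tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    evaluation = mteb.MTEB(tasks=tasks)
    # LLM2Vec encodes with a multiprocessing pool; under the "spawn" start method
    # the main module is re-imported in every worker, so the evaluation call must
    # only run when the script is executed directly.
    evaluation.run(model, eval_splits=["test"], output_folder="results")

if __name__ == "__main__":
    main()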
