
Unable to Reproduce LLM2Vec Training Results Using GradCache on Echo Dataset #135

Open
viet-data opened this issue Aug 4, 2024 · 10 comments

Comments

@viet-data

I have been attempting to reproduce the training results on the same echo data. Due to hardware limitations, I had to reimplement the training process using GradCache.

Although my model code can load the LLM2Vec public checkpoint and perform inference correctly, I am unable to achieve comparable performance to LLM2Vec when training a bidirectional Mistral model (without MNTP and unsupervised SimCSE) using GradCache. My training used a batch size of 512 on the echo dataset and stopped after 750 iterations.
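
For context, my setup looks roughly like the sketch below. It assumes the `grad_cache` package from luyug/GradCache; `encoder`, `query_inputs`, `doc_inputs`, `optimizer`, and the InfoNCE loss are placeholders for my own components, not the actual LLM2Vec code.

```python
import torch
import torch.nn.functional as F
from grad_cache import GradCache  # luyug/GradCache

def info_nce_loss(q_reps, d_reps, temperature=0.05):
    # In-batch contrastive loss: the positive document for query i sits at index i.
    scores = q_reps @ d_reps.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# `encoder` is the bidirectional Mistral embedding model (placeholder); it is shared
# between the query and document sides, so the same module is passed twice.
gc = GradCache(
    models=[encoder, encoder],
    chunk_sizes=8,                 # sub-batch size that actually fits in GPU memory
    loss_fn=info_nce_loss,
    get_rep_fn=lambda out: out,    # adapt if the model returns a ModelOutput, not a tensor
)

# One step with a large effective batch (e.g. 512): GradCache splits the inputs into
# chunks, caches representation gradients, and backpropagates chunk by chunk.
loss = gc(query_inputs, doc_inputs)  # populates gradients on `encoder`
optimizer.step()
optimizer.zero_grad()
```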

Specifically, on the STS tasks, I have not been able to exceed 75 on SICK-R and 65 on STS-12 (other tasks also show low performance, except for BIOSSES).

Has anyone else tried to train LLM2Vec with GradCache, or has anyone successfully reproduced the LLM2Vec results using the original code? Any insights or suggestions would be greatly appreciated.

@vaibhavad
Collaborator

Hello @viet-data,

Were you able to reproduce the results with GradCache? If you are interested, we'd like to integrate GradCache into the LLM2Vec library.

vaibhavad reopened this Aug 6, 2024
@viet-data
Author

Hi @vaibhavad ,

I have successfully trained with GradCache using a batch size of 128 and achieved results close to those reported for LLM2Vec. However, I'm curious about LLM2Vec's performance when scaling up the data. I haven't been able to improve performance with more training data, which might be due to the smaller batch size.

Could you share the LLM2Vec results when training on the full dataset? It would also be very useful if you could integrate GradCache into LLM2Vec so that we can train with fewer GPUs. Thank you.

@stefanhgm

Hi @viet-data,

I reproduced the Llama 3 supervised version, trained for 1000 steps on the MNTP task and 1000 steps on the E5 dataset (following the original LLM2Vec training configs). I am currently running the full MTEB evaluation, and the first results look very similar to the ones reported on HuggingFace for that model.

I am currently training a Llama 3.1 version with the same training recipe.

@viet-data
Author

@stefanhgm Thanks so much for sharing! I agree, LLM2Vec seems quite reproducible. Excited to see your results with Llama 3.1.

@stefanhgm

Currently, the MTEB evaluation of the Llama 3.1 version hangs on a task where 391 batches are processed repeatedly. It has been repeating this for over a day now. I think it is the DBPedia task, as CQADupstackWordpressRetrieval and ClimateFEVER were completed last and DBPedia should come next.

@vaibhavad any chance you observed a similar behavior when evaluating on MTEB?

(Screenshot of the evaluation log attached.)

@atutej

atutej commented Sep 20, 2024

Hi @stefanhgm!

Would it be possible to share your reproduced numbers? I am currently following the LLM2Vec recipe, and for some benchmarks (e.g., FiQA2018) the numbers I get are way off from what was reported: 48% vs. 55%.

@stefanhgm

Hi @atutej,

I still have problems running all the tasks because the runtime is very long, even when using multiple GPUs. I also asked a question about this here: #140

I did get FiQA2018 running, though, and obtained "main_score": 0.55441 on the test set, so for me the results seem reproducible.
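
For anyone who wants to spot-check a single task, a minimal sketch could look like the following (this is not the script from #140; the checkpoint ids and the older `MTEB(tasks=[...])` usage are assumptions, and the task-specific instructions used in the paper are omitted here, so scores may differ slightly):

```python
import numpy as np
import torch
from llm2vec import LLM2Vec
from mteb import MTEB

class LLM2VecForMTEB:
    """Thin adapter exposing the encode(sentences, **kwargs) interface MTEB expects."""

    def __init__(self, model, instruction=""):
        self.model = model
        self.instruction = instruction  # task instruction; empty string = no instruction

    def encode(self, sentences, **kwargs):
        # LLM2Vec accepts [instruction, text] pairs for instructed embeddings.
        inputs = [[self.instruction, s] for s in sentences]
        with torch.no_grad():
            reps = self.model.encode(inputs)
        # bfloat16 tensors cannot be converted to numpy directly, hence .float().
        return reps.float().cpu().numpy() if torch.is_tensor(reps) else np.asarray(reps)

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

evaluation = MTEB(tasks=["FiQA2018"])  # or the full task list for a complete run
evaluation.run(LLM2VecForMTEB(l2v), output_folder="results", eval_splits=["test"])
```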

@vaibhavad
Collaborator

vaibhavad commented Oct 2, 2024

> Currently, the MTEB evaluation of the Llama 3.1 version hangs on a task where 391 batches are processed repeatedly. It has been repeating this for over a day now. I think it is the DBPedia task, as CQADupstackWordpressRetrieval and ClimateFEVER were completed last and DBPedia should come next.
>
> @vaibhavad any chance you observed a similar behavior when evaluating on MTEB?

This is strange behaviour, I haven't faced this issue. Can you share your evaluation script?

@atutej

atutej commented Oct 28, 2024

Hi @stefanhgm

I think the 391 batches are just sub-batches in the dataset. I see the same thing but it eventually finishes evaluating.

Regarding training: are you doing MNTP followed by supervised training? I'm starting from the MNTP checkpoint provided on HuggingFace for the supervised training. @vaibhavad, is it possible there are some differences between these two approaches?
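
For concreteness, starting from the HuggingFace MNTP checkpoint could look roughly like the sketch below: the published MNTP LoRA adapter is merged into the base model before the supervised stage. The ids are assumptions on my side, and the actual run still goes through the LLM2Vec experiment scripts, which also handle the bidirectional attention conversion that a plain transformers load does not.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Assumed ids -- substitute the actual base model and published MNTP adapter.
base_id = "mistralai/Mistral-7B-Instruct-v0.2"
mntp_adapter_id = "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Load the MNTP LoRA weights on top of the base model and merge them, so that the
# supervised contrastive stage starts from MNTP-adapted weights instead of the raw base.
model = PeftModel.from_pretrained(base, mntp_adapter_id)
model = model.merge_and_unload()

# Note: this plain transformers load keeps causal attention; LLM2Vec's own model
# classes enable bidirectional attention, so the repo's training scripts should
# still be used on top of these merged weights.
```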

@stefanhgm

Hi @atutej

I trained MNTP from scratch and ran the supervised training after it, so I did not use the version from HuggingFace.

Here is my MTEB eval code: #140
