
Unable to Reproduce LLM2Vec Training Results Using GradCache on Echo Dataset #135

Open
viet-data opened this issue Aug 4, 2024 · 10 comments

Comments

@viet-data

I have been attempting to reproduce the training results on the same echo data. Due to hardware limitations, I had to reimplement the training process using GradCache.

Although my model code can load the LLM2Vec public checkpoint and perform inference correctly, I am unable to achieve comparable performance to LLM2Vec when training a bidirectional Mistral model (without MNTP and unsupervised SimCSE) using GradCache. My training used a batch size of 512 on the echo dataset and stopped after 750 iterations.
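
For context, my setup looks roughly like the sketch below. It assumes the `grad_cache` package from luyug/GradCache; `encoder`, `query_inputs`, `doc_inputs`, `optimizer`, and the InfoNCE loss are placeholders for my own components, not the actual LLM2Vec code.

```python
import torch
import torch.nn.functional as F
from grad_cache import GradCache  # luyug/GradCache

def info_nce_loss(q_reps, d_reps, temperature=0.05):
    # In-batch contrastive loss: the positive document for query i sits at index i.
    scores = q_reps @ d_reps.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# `encoder` is the bidirectional Mistral embedding model (placeholder); it is shared
# between the query and document sides, so the same module is passed twice.
gc = GradCache(
    models=[encoder, encoder],
    chunk_sizes=8,                 # sub-batch size that actually fits in GPU memory
    loss_fn=info_nce_loss,
    get_rep_fn=lambda out: out,    # adapt if the model returns a ModelOutput, not a tensor
)

# One step with a large effective batch (e.g. 512): GradCache splits the inputs into
# chunks, caches representation gradients, and backpropagates chunk by chunk.
loss = gc(query_inputs, doc_inputs)  # populates gradients on `encoder`
optimizer.step()
optimizer.zero_grad()
```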

Specifically, on the STS tasks, I have not been able to exceed 75 on SICK-R and 65 on STS-12 (other tasks also show low performance, except for BIOSSES).

Has anyone else tried to train LLM2Vec with GradCache, or has anyone successfully reproduced the LLM2Vec results using the original code? Any insights or suggestions would be greatly appreciated.

@vaibhavad
Collaborator

Hello @viet-data,

Were you able to reproduce the results with GradCache? If you are interested, we'd like to integrate GradCache into the LLM2Vec library.

vaibhavad reopened this Aug 6, 2024
@viet-data
Author

Hi @vaibhavad ,

I have successfully trained with GradCache using a batch size of 128 and achieved results close to those reported for LLM2Vec. However, I'm curious about LLM2Vec's performance when scaling up the data. I haven't been able to improve performance with more training data, which might be due to the smaller batch size.

Could you share the LLM2Vec results when training on the full dataset? It would also be very useful if you could integrate GradCache into LLM2Vec so that we can train with fewer GPUs. Thank you.

@stefanhgm

Hi @viet-data,

I reproduced the Llama 3 supervised version, trained for 1000 steps on the MNTP task and 1000 steps on the E5 dataset (following the original LLM2Vec training configs). I am currently running the full MTEB evaluation, and the first results look very similar to the ones reported on HuggingFace for that model.

I am currently training a Llama 3.1 version with the same training recipe.

@viet-data
Author

@stefanhgm Thanks so much for sharing! I agree, LLM2Vec seems quite reproducible. Excited to see your results with Llama 3.1.

@stefanhgm

Currently, the MTEB evaluation of the Llama 3.1 version hangs on a task where 391 batches are processed repeatedly. It has been repeating this for over a day now. I think it is the DBPedia task, as CQADupstackWordpressRetrieval and ClimateFEVER were completed last and DBPedia should come next.

@vaibhavad any chance you observed a similar behavior when evaluating on MTEB?

(Screenshot of the evaluation log attached.)

@atutej

atutej commented Sep 20, 2024

Hi @stefanhgm!

Would it be possible to share your reproduced numbers? I am currently following the LLM2Vec recipe, and for some benchmarks (e.g., FiQA2018) the numbers I get are way off from what was reported: 48% vs. 55%.

@stefanhgm

Hi @atutej,

I still have problems running all the tasks because the runtime is very long, even when using multiple GPUs. I also asked a question about this here: #140

I did get FiQA2018 running, though, and obtained "main_score": 0.55441 on the test set, so for me the results seem reproducible.
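
For anyone who wants to spot-check a single task, a minimal sketch could look like the following (this is not the script from #140; the checkpoint ids and the older `MTEB(tasks=[...])` usage are assumptions, and the task-specific instructions used in the paper are omitted here, so scores may differ slightly):

```python
import numpy as np
import torch
from llm2vec import LLM2Vec
from mteb import MTEB

class LLM2VecForMTEB:
    """Thin adapter exposing the encode(sentences, **kwargs) interface MTEB expects."""

    def __init__(self, model, instruction=""):
        self.model = model
        self.instruction = instruction  # task instruction; empty string = no instruction

    def encode(self, sentences, **kwargs):
        # LLM2Vec accepts [instruction, text] pairs for instructed embeddings.
        inputs = [[self.instruction, s] for s in sentences]
        with torch.no_grad():
            reps = self.model.encode(inputs)
        # bfloat16 tensors cannot be converted to numpy directly, hence .float().
        return reps.float().cpu().numpy() if torch.is_tensor(reps) else np.asarray(reps)

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

evaluation = MTEB(tasks=["FiQA2018"])  # or the full task list for a complete run
evaluation.run(LLM2VecForMTEB(l2v), output_folder="results", eval_splits=["test"])
```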

@vaibhavad
Collaborator

vaibhavad commented Oct 2, 2024

> Currently, the MTEB evaluation of the Llama 3.1 version hangs on a task where 391 batches are processed repeatedly. It has been repeating this for over a day now. I think it is the DBPedia task, as CQADupstackWordpressRetrieval and ClimateFEVER were completed last and DBPedia should come next.
>
> @vaibhavad any chance you observed a similar behavior when evaluating on MTEB?

This is strange behaviour, I haven't faced this issue. Can you share your evaluation script?

@atutej

atutej commented Oct 28, 2024

Hi @stefanhgm

I think the 391 batches are just sub-batches in the dataset. I see the same thing but it eventually finishes evaluating.

Regarding training: are you doing MNTP followed by supervised training? I'm starting from the MNTP checkpoint provided on HuggingFace for the supervised training. @vaibhavad, is it possible there are some differences between these two approaches?
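
For concreteness, starting from the HuggingFace MNTP checkpoint could look roughly like the sketch below: the published MNTP LoRA adapter is merged into the base model before the supervised stage. The ids are assumptions on my side, and the actual run still goes through the LLM2Vec experiment scripts, which also handle the bidirectional attention conversion that a plain transformers load does not.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Assumed ids -- substitute the actual base model and published MNTP adapter.
base_id = "mistralai/Mistral-7B-Instruct-v0.2"
mntp_adapter_id = "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Load the MNTP LoRA weights on top of the base model and merge them, so that the
# supervised contrastive stage starts from MNTP-adapted weights instead of the raw base.
model = PeftModel.from_pretrained(base, mntp_adapter_id)
model = model.merge_and_unload()

# Note: this plain transformers load keeps causal attention; LLM2Vec's own model
# classes enable bidirectional attention, so the repo's training scripts should
# still be used on top of these merged weights.
```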

@stefanhgm

Hi @atutej

I trained MNTP from scratch and ran the supervised training after it, so I did not use the version from HuggingFace.

Here is my MTEB eval code: #140
