Checkpoint path should be absolute #111

Open
bhargav25dave1996 opened this issue Mar 28, 2024 · 4 comments

@bhargav25dave1996

At training time I am getting this error.

Command:

python -m tevatron.tevax.experimental.mp.train_lora \
    --checkpoint_dir retriever-mistral-jax \
    --train_file Tevatron/msmarco-passage-aug \
    --model_name mistralai/Mistral-7B-v0.1 \
    --model_type mistral \
    --batch_size 128 \
    --num_target_passages 16 \
    --learning_rate 1e-4 \
    --seed 12345 \
    --mesh_shape 1 -1 \
    --weight_decay 0.00001 \
    --num_epochs 1 \
    --max_query_length 64 \
    --max_passage_length 128 \
    --pooling eos \
    --scale_by_dim True \
    --grad_cache \
    --passage_num_chunks 32 \
    --query_num_chunks 4

Error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 394, in <module>
    main()
  File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 375, in main
    checkpoint_manager.save(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 515, in save
    self._checkpointers[k].save(item_dir, item, **kwargs)
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/async_checkpointer.py", line 281, in save
    commit_ops = asyncio.run(self._handler.async_save(tmpdir, args=ckpt_args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 835, in async_save
    commit_futures = await asyncio.gather(*serialize_ops)
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1376, in serialize
    tspec = self._get_json_tspec_write(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1273, in _get_json_tspec_write
    tspec = self._get_json_tspec(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1253, in _get_json_tspec
    tspec: Dict[str, Any] = get_tensorstore_spec(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 821, in get_tensorstore_spec
    raise ValueError(f'Checkpoint path should be absolute. Got {directory}')
ValueError: Checkpoint path should be absolute. Got retriever-mistral-jax/0.orbax-checkpoint-tmp-1711610071493337/lora.orbax-checkpoint-tmp-1711610103114668

@luyug
Contributor

luyug commented Mar 28, 2024

As the error notes:

Checkpoint path should be absolute.

This is for compatibility with cloud storage.
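
For illustration (a sketch, not something from the Tevatron docs): on a Linux shell you can check which absolute path the relative directory from the training command resolves to, and pass that instead.

# Print the absolute path corresponding to the relative checkpoint directory.
# realpath -m does not require the directory to exist yet.
realpath -m retriever-mistral-jax
# e.g. /home/irlab/retriever-mistral-jax, depending on the current working directory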

@bhargav25dave1996
Author

bhargav25dave1996 commented Mar 29, 2024

@luyug I am using a Google Cloud TPU v4-8 as suggested. Can you help with the issue?

@MXueguang
Contributor

--checkpoint_dir retriever-mistral-jax

Changing it to something like /home/<your user name>/retriever-mistral-jax should work.
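
For example, here is the training command from the top of this issue with only --checkpoint_dir changed to an absolute location (a sketch; $HOME expands to /home/<your user name>, and any other absolute directory works the same way):

python -m tevatron.tevax.experimental.mp.train_lora \
    --checkpoint_dir "$HOME/retriever-mistral-jax" \
    --train_file Tevatron/msmarco-passage-aug \
    --model_name mistralai/Mistral-7B-v0.1 \
    --model_type mistral \
    --batch_size 128 \
    --num_target_passages 16 \
    --learning_rate 1e-4 \
    --seed 12345 \
    --mesh_shape 1 -1 \
    --weight_decay 0.00001 \
    --num_epochs 1 \
    --max_query_length 64 \
    --max_passage_length 128 \
    --pooling eos \
    --scale_by_dim True \
    --grad_cache \
    --passage_num_chunks 32 \
    --query_num_chunks 4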

@bhargav25dave1996
Author

bhargav25dave1996 commented Apr 2, 2024

@MXueguang Thanks, that resolved the issue.

I am now facing an issue with encoding. I am using the command below to encode the MS MARCO corpus.

python -m tevatron.tevax.experimental.mp.encode \
    --model_type mistral \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --model_config_name_or_path mistralai/Mistral-7B-v0.1 \
    --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
    --dataset_name_or_path Tevatron/msmarco-passage-corpus \
    --output_dir /mnt/disk/corpus-embedding \
    --batch_size 32 \
    --input_type passage \
    --max_seq_length 128 \
    --mesh_shape 1 -1 \
    --lora /mnt/disk/retriever-mistral-jax/lora \
    --scale_by_dim

But it does not save the embeddings to the output path. Please see the screenshot below. @MXueguang @luyug can you help me with this?
[Screenshot: Screenshot 2024-04-02 072835]
