
Train in multi-node multi-card environment #110

Open
Atlantic8 opened this issue Mar 27, 2024 · 2 comments

Comments

@Atlantic8

Can I use tevatron to train models in a multi-node, multi-GPU environment?
If yes, could you please give example scripts showing how to start such a job? Thank you.

@MXueguang
Contributor

Hi @Atlantic8, for the PyTorch implementation, unfortunately we haven't had a chance to run and test in a multi-node environment yet.
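As a starting point for experimentation, a generic multi-node launch can be sketched with PyTorch's standard `torchrun` elastic launcher. This is untested with tevatron (per the comment above), and the training module name below is a placeholder for whichever entry point your tevatron version uses:

```shell
# Hypothetical sketch: standard multi-node launch via torchrun.
# MASTER_ADDR and the training module are placeholders; adjust for your setup.
NNODES=2                # total number of machines
GPUS_PER_NODE=8         # cards per machine
MASTER_ADDR=10.0.0.1    # reachable address of the rank-0 node (placeholder)
MASTER_PORT=29500

# Run this on every node, with --node_rank set to 0..NNODES-1.
# The command is printed rather than executed; drop the echo to launch for real.
echo torchrun \
    --nnodes="$NNODES" \
    --nproc_per_node="$GPUS_PER_NODE" \
    --node_rank=0 \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    -m tevatron.driver.train  # placeholder: your version's training module
```

`torchrun` sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each spawned process; whether tevatron's PyTorch trainer then behaves correctly across nodes is exactly what the maintainers note is untested.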

@luyug
Contributor

luyug commented Mar 28, 2024

To add on top of that: for JAX it really depends on your cluster. For Cloud TPU,

gcloud compute tpus tpu-vm ssh YOUR_TPU_NAME \
    --zone=us-central2-b \
    --worker=all \
    --command="python -m tevatron.tevax.experimental.mp.train ..."

Adjust this to use whatever launch mechanism fits your cluster configuration.
