how to run RL using multi-nodes in cluster #1133

Open
HYB777 opened this issue May 2, 2024 · 2 comments
Labels
documentation question Further information is requested

Comments

@HYB777

HYB777 commented May 2, 2024

How do I use RayVecEnv in a cluster? I want to run my RL code with multi-node training. I'm new to Ray; are there any demo scripts?

@MischaPanch
Collaborator

Hi @HYB777. This is a Ray configuration issue: as long as you configure Ray on a multi-node cluster, run ray.init appropriately, and use RayVecEnv, things should work.

That being said, I haven't tested personally on a multi-node cluster yet.

Since we're not Ray developers, I think this question is outside the scope of support from the tianshou team. However, if you encounter tianshou-specific issues on the cluster, feel free to let us know!

Ray has a large community and a lot of documentation, so I suggest you start there. If you want to contribute a multi-node running example, I'm happy to review a PR.
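As a rough illustration of "run ray.init appropriately", the sketch below connects to an existing cluster and checks how many nodes are alive before any environments are created. It is untested on a real multi-node setup; the helper names `alive_node_addresses` and `check_cluster` and the `expected_nodes` parameter are hypothetical, and the sketch assumes the cluster was already started outside of Python.

```python
from typing import Any, Dict, List


def alive_node_addresses(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted addresses of live nodes from the output of ray.nodes()."""
    return sorted(n["NodeManagerAddress"] for n in nodes if n.get("Alive"))


def check_cluster(address: str, expected_nodes: int) -> List[str]:
    """Connect to a running Ray cluster and verify that enough nodes joined."""
    # Imported lazily so the pure helper above works without Ray installed.
    import ray

    # `address` is assumed to point at the head node of an existing cluster.
    ray.init(address=address)
    addresses = alive_node_addresses(ray.nodes())
    if len(addresses) < expected_nodes:
        raise RuntimeError(
            f"expected {expected_nodes} nodes, found {len(addresses)}: {addresses}"
        )
    return addresses
```

If the check fails, the problem is most likely in the Ray cluster setup rather than in tianshou.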

@MischaPanch MischaPanch added question Further information is requested documentation labels May 3, 2024
@destin-v

If you want to run RayVecEnv in a cluster, you have to set up multiple Ray workers and connect all of them to the IP address of the Ray head node. This is done using the ray.init command. Here is an example that gets the IP address of every worker node connected to a Ray cluster. If this runs on your multi-node server, you will be able to do the same with RayVecEnv.

import socket
import time
from collections import Counter

import ray

@ray.remote
def f():
    time.sleep(0.001)
    return socket.gethostbyname(socket.gethostname())


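def main(address: str):
    # `address` must point at your Ray head node, e.g. "<head-node-ip>:<port>".
    ray.init(address=address)

    # Launch many short tasks so that Ray spreads them across the cluster.
    futures = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(futures)
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print(" {} tasks on {}".format(num_tasks, ip_address))


if __name__ == "__main__":
    # "auto" attaches to a Ray cluster that is already running on this machine.
    main("auto")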
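Once tasks show up on more than one IP address, the same connection can be reused for RayVecEnv. The following is a minimal sketch rather than a tested multi-node recipe: the helper names `make_env_fns` and `build_ray_envs`, the `CartPole-v1` task, and the use of `gymnasium` are assumptions for illustration. It relies only on `RayVecEnv` accepting a list of environment factories, like the other tianshou vector environments.

```python
from typing import Callable, List


def make_env_fns(task: str, num_envs: int) -> List[Callable[[], object]]:
    """Return one environment factory per vectorized environment."""

    def make_env():
        # Imported lazily so this module can be loaded without gymnasium.
        import gymnasium as gym

        return gym.make(task)

    return [make_env for _ in range(num_envs)]


def build_ray_envs(address: str, task: str = "CartPole-v1", num_envs: int = 8):
    """Connect to an existing Ray cluster and wrap the environments in RayVecEnv."""
    import ray
    from tianshou.env import RayVecEnv

    # `address` is assumed to point at the Ray head node, as in the example above.
    ray.init(address=address)
    return RayVecEnv(make_env_fns(task, num_envs))
```

The object returned by `build_ray_envs` should be usable wherever a `DummyVectorEnv` or `SubprocVectorEnv` would be, for example as the training environments passed to a tianshou collector.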
