The self.worker parameter obtained by loading the checkpoint file has a value of None and how to train on GPU #5
Hi @MurrayMa0816, have you ever tried to pin your …
Hi @XuehaiPan, thank you for your response. I have tried two versions, Ray 1.12.0 and Ray 1.13.0, and encountered the same issue. My current workaround, which does not address the root cause, is to check each value before converting it to a tensor: if the value is None, I replace it with False first. This avoids the error and is equivalent to forcibly setting the "fused" and "foreach" parameters to False. Although the program runs without errors, I'm not sure whether this approach may introduce other issues.
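A minimal sketch of such a workaround, assuming the failure happens when the optimizer state loaded from the checkpoint is converted value by value into tensors (the helper name below is hypothetical, not an RLlib API):

```python
def sanitize_param_groups(optimizer_state):
    # Replace None-valued optimizer flags (e.g. "fused", "foreach") with False
    # in each param group before any tensor conversion, since torch cannot
    # infer a dtype for None.
    for group in optimizer_state.get('param_groups', []):
        for key in ('fused', 'foreach'):
            if group.get(key) is None:
                group[key] = False
    return optimizer_state
```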
@MurrayMa0816 Since only the policy weights are needed in the checkpoint, I think you can remove the optimizer-related items in the worker state: self.worker['state']['shared_policy'].pop('_optimizer_variables', None)
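For instance, once self.worker has been loaded from the checkpoint and before its state is restored, the optimizer entry could be dropped like this (a sketch using the key layout quoted in this issue; adjust to your own loading code):

```python
def strip_optimizer_variables(worker_state):
    # Drop the optimizer-related items so that only the policy weights are
    # restored from the checkpoint (key layout as described in this issue).
    worker_state['state']['shared_policy'].pop('_optimizer_variables', None)
    return worker_state

# e.g. self.worker = strip_optimizer_variables(self.worker)
```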
@XuehaiPan, thank you very much for helping me confirm this issue that has been bothering me for a long time. There is another issue during training that I would like to consult you about:
My hardware configuration and the num_workers / num_gpus settings during training are as follows. In psro/train.py they are set like this:

```python
train_kwargs = {
    'num_workers': num_workers,  # 6 worker processes each for the camera and the target team, based on the number of available CPUs
    'num_gpus': num_gpus,  # 1 GPU, intended to be shared among the 12 workers
    'num_envs_per_worker': num_envs_per_worker,  # 8
    'seed': seed,
}
```
```python
not_ready = [
    camera_trainer.result.remote(skip_train_if_exists=True, **train_kwargs),
    target_trainer.result.remote(skip_train_if_exists=True, **train_kwargs),
]
```

However, when calling hrl/mappo/camera/train.py and mappo/target/train.py, you set num_gpus_per_worker=0 in both. Does this mean that the GPU is not being used?

```python
experiment.spec['config'].update(
    num_cpus_for_driver=NUM_CPUS_FOR_TRAINER,
    num_gpus=num_gpus,
    num_gpus_per_worker=0,
    num_workers=num_workers,
    num_envs_per_worker=num_envs_per_worker,
)
```

The code that indicates training runs on the CPU is in ray/rllib/policy/torch_policy.py:
```python
worker_idx = self.config.get("worker_index", 0)
if not config["_fake_gpus"] and ray.worker._mode() == ray.worker.LOCAL_MODE:
    num_gpus = 0
elif worker_idx == 0:
    num_gpus = config["num_gpus"]
else:
    num_gpus = config["num_gpus_per_worker"]
gpu_ids = list(range(torch.cuda.device_count()))
# Place on one or more CPU(s) when either:
# - Fake GPU mode.
# - num_gpus=0 (either set by the user or we are in local_mode=True).
# - No GPUs available.
if config["_fake_gpus"] or num_gpus == 0 or not gpu_ids:
    logger.info(
        "TorchPolicy (worker={}) running on {}.".format(
            worker_idx if worker_idx > 0 else "local",
            "{} fake-GPUs".format(num_gpus) if config["_fake_gpus"] else "CPU",
        )
    )
```

Approaches I have tried (I would like all 12 worker processes to run on the GPU):
1. Decorating PlayerTrainer with an explicit GPU request:

```python
@ray.remote(max_restarts=1, num_gpus=1)
class PlayerTrainer:
    def __init__(self, iteration, player, train_fn, base_experiment,
                 opponent_agent_factory, from_checkpoint, timesteps_total,
                 local_dir, project=None, group=None, **kwargs):
```
2. Passing a fractional --num-gpus on the command line:

```bash
python3 -m examples.psro.train \
    --project mate-psro \
    --meta-solver NE \
    --num-workers 32 --num-envs-per-worker 8 --num-gpus 0.5 \
    --timesteps-total 5E6 --num-evaluation-episodes 10 --seed 0
```

Neither of these approaches has managed to get training running on the GPU, which has left me confused. Could you please give me some suggestions for the configuration? Also, I would like to know how long it takes to complete the PSRO training in your repository. Thank you very much.
There is a trainer process and several worker processes for rollout sampling. The configuration …
If you set …
You will not need to do this. The resource requirements are set in the experiment configuration. If you decorate the …
I set this because the network is relatively small. If you set …
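A sketch of what a fractional-GPU split could look like, based on the resource-selection logic in the torch_policy snippet above (the trainer process reads num_gpus, each rollout worker reads num_gpus_per_worker); the exact fractions below are illustrative assumptions, not values from the repository:

```python
experiment.spec['config'].update(
    num_cpus_for_driver=NUM_CPUS_FOR_TRAINER,
    num_gpus=0.4,              # GPU fraction reserved for the trainer process
    num_gpus_per_worker=0.05,  # GPU fraction per rollout worker: 12 workers x 0.05 = 0.6
    num_workers=num_workers,
    num_envs_per_worker=num_envs_per_worker,
)
```

The fractions should sum to no more than the number of GPUs Ray sees (here 0.4 + 0.6 = 1.0).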
Ideally, the PSRO algorithm requires the underlying MARL problem (e.g., MAPPO for the camera team only, against targets with fixed policies) to produce the best response (BR) to its opponent. So you need to wait for each PlayerTrainer to converge …
@XuehaiPan Thank you very much for helping me resolve each of my confusions. Thank you again.
Hi @XuehaiPan, I apologize for the inconvenience, but I have a few questions I'd like to ask you. I noticed that when you were using the PSRO algorithm, each team was trained for 5 million steps, and each PlayerTrainer process converged within 2-5 hours. I'd like to know what kind of resources you were using, such as how many CPU cores and how much GPU memory?
Hi @MurrayMa0816. The computation requirements in the experiment are specified as: Lines 3 to 13 in 3e631c0
Lines 52 to 56 in 3e631c0
Each team is trained with 32 workers (each uses 1 CPU). Around 80 CPU cores in total are used for the PSRO algorithm. The GPU memory consumption is relatively small because the network is not very deep (if you do not assign any GPU resources to the rollout workers and only the trainer has GPUs). Maybe you need to set up a Ray cluster to use more CPUs from other nodes to speed up your training.
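A minimal sketch of pointing the training script at an existing multi-node Ray cluster rather than a single machine (the address value is a placeholder; the cluster itself must already be running):

```python
import ray

# Connect to an already-running Ray cluster so rollout workers can be
# scheduled on CPUs from other nodes as well. "auto" discovers a cluster
# started on this machine; otherwise pass the head node address explicitly,
# e.g. ray.init(address="10.0.0.1:6379").
ray.init(address="auto")
```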
Hi @XuehaiPan, I see. Thanks again for your kind reply.
Hello, the MATE code you shared has been extremely helpful for my research. I am currently studying your code, but I have encountered a bug that I haven't been able to solve despite spending a lot of time on it. Could you please take a look and assist me?
Here is the process that triggered my issue:
Loading the aforementioned checkpoint-1 retrieves self.worker. When setting the state of self.worker, the values of the "fused" and "foreach" parameters in self.worker['state']['shared_policy']['_optimizer_variables'][0]['param_groups'][0] are both None.
When these None values of fused and foreach are converted to tensors as state items, an error is raised.
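A minimal snippet that reproduces this kind of failure, assuming the error comes from torch being unable to infer a dtype for a Python None when the loaded state is converted item by item (the param group values are illustrative):

```python
import torch

param_group = {'lr': 0.0005, 'foreach': None, 'fused': None}  # illustrative values

for key, value in param_group.items():
    # 'lr' converts fine; the None entries fail, e.g. with
    # "RuntimeError: Could not infer dtype of NoneType" (the exact message
    # may vary between torch versions).
    print(key, torch.as_tensor(value))
```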
My approaches have been:
1. Find where checkpoint-1 is generated. When loading checkpoint-1, the values of foreach and fused in self.worker are None, which suggests that either these parameters were not present when checkpoint-1 was generated or their values were intentionally set to None. I have gone through the code inside tune.run() line by line but have been unable to find the location where checkpoint-1 is generated, so I cannot confirm how foreach and fused were set when it was created.
2. Add the parameters in the config file. Within example/hrl/mappo/camera/config.py, I added the parameters 'foreach': False and 'fused': True under config['model']['custom_model_config']. However, when loading checkpoint-1, the values of these two parameters remained None.
These are the two approaches I have tried, but neither of them has resolved the issue. I would greatly appreciate any insights you can provide.
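As a side note, the values actually stored in the checkpoint can be double-checked once self.worker has been loaded, using the key path from the description above (a sketch; the variable name is a placeholder for the loaded worker state):

```python
# Inspect the optimizer state restored from checkpoint-1.
param_group = worker_state['state']['shared_policy']['_optimizer_variables'][0]['param_groups'][0]
print({key: param_group.get(key) for key in ('lr', 'foreach', 'fused')})
# For the problematic checkpoint this shows foreach and fused as None.
```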