Unable to replicate original PPO performance #1157
@rajfly Are you able to use [...]?
@dantp-ai Hi, thanks for your prompt reply. In fact, I used [...]. TLDR: No, I was not able to just use [...].
Thanks! I will look into it and see how I can help.
@dantp-ai Thanks for your help! Also, perhaps this might help narrow down the issue: I tested on 56 Atari games, and Tianshou failed to learn anything at all on the majority of them; it only did well on very simple games (approximately 5 out of the 56). For example, in the Boxing game shown below (Tianshou in green), the agent can learn and is able to compete with the implementations by Baselines, Stable Baselines3, and CleanRL. But this holds only in very simple environments, which is weird.

[Image: Boxing training curves, Tianshou in green]
Thanks for reporting! Keeping the algorithms and examples performing well is of the highest priority to us (otherwise, what's the point ^^). It seems the small performance tests that run in CI were not enough to catch this. I have been training PPO agents on MuJoCo with the current Tianshou version with no issues, so maybe it only affects discrete envs. We will look into it asap. The first thing to clarify is whether the problem was caused by the recent refactorings, so the first step is to go back to version 0.5.1 and run on Atari there. Btw, before the 2.0.0 release of Tianshou we will implement #935 and #1110, as well as check in a script that reproduces the results currently displayed in the docs. From then on, all releases will be guaranteed to have no performance regressions. At the moment we're not there yet.
@rajfly Were you able to verify that the reward scalings and reward outputs are consistent across the experiments using the different RL libraries (OpenAI Baselines, Stable Baselines3, CleanRL)?
@dantp-ai Yes. I used the same Atari wrappers for all of the experiments, so the rewards were only clipped, and they were clipped identically across all RL libraries. Furthermore, when comparing the reward outputs, I used statistical techniques such as stratified bootstrap confidence intervals (SBCI) to combat stochasticity and obtain more accurate estimates. In particular, for each RL library tested, I ran 5 trials for each of the 56 Atari environments, amounting to a total of 56 × 5 = 280 trials per library. I then took the mean reward over the last 100 training episodes as the score for a single trial and human-normalized it. The plot below compares the human-normalized scores attained by the different RL libraries across the 56 environments, using SBCI. The bands are 95% confidence intervals, and it can be seen that Baselines, Stable Baselines3, and CleanRL are consistent with each other's scores for most metrics (IQM refers to the interquartile mean).

[Image: Human-normalized score comparison across 56 Atari environments with 95% SBCI bands]
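For reference, this kind of stratified bootstrap aggregation can be computed with the `rliable` library. The sketch below is illustrative only: the score arrays are placeholders, and the exact evaluation code used in these experiments is not shown in this issue.

```python
# Sketch: stratified bootstrap CIs over per-trial, per-game scores,
# using the rliable library (github.com/google-research/rliable).
import numpy as np
from rliable import library as rly, metrics

# Placeholder scores of shape (num_trials=5, num_games=56), assumed
# already human-normalized: (score - random) / (human - random).
rng = np.random.default_rng(0)
score_dict = {
    "baselines": rng.random((5, 56)),
    "tianshou": rng.random((5, 56)),
}

# Aggregate metrics reported per library: IQM, median, mean.
def aggregate(scores: np.ndarray) -> np.ndarray:
    return np.array([
        metrics.aggregate_iqm(scores),
        metrics.aggregate_median(scores),
        metrics.aggregate_mean(scores),
    ])

# Stratified bootstrap (resampling trials within each game), returning
# point estimates and 95% confidence intervals per metric.
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate, reps=2000
)
print(point_estimates)
print(interval_estimates)
```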
Sorry for the late response - I got pulled into urgent project business for 6 weeks, but now I'm back with more time on my hands. PPO on MuJoCo runs well, and the discrete variant of PPO is the exact same implementation as the continuous one (just the action distribution is different). So I suspect the problem is mainly in some Atari-specific things. I can't promise next week, but within the next two I'll devote time to solving this. @carlocagnetta might actually have some insights, as he used Atari-like networks and envs recently.
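To illustrate the point about the action distribution being the only difference: in Tianshou's example scripts, the same PPOPolicy is used for both cases and only dist_fn changes. A minimal sketch (the function names below are illustrative, patterned after those examples):

```python
# Sketch: discrete vs. continuous action distributions for Tianshou's
# PPOPolicy; the policy class itself is shared, only dist_fn differs.
from torch.distributions import Categorical, Independent, Normal

# Discrete actions (Atari): the actor network outputs logits.
def dist_fn_discrete(logits):
    return Categorical(logits=logits)

# Continuous actions (MuJoCo): the actor outputs (mu, sigma), and the
# per-dimension Normals are combined into one event dimension.
def dist_fn_continuous(mu_sigma):
    mu, sigma = mu_sigma
    return Independent(Normal(mu, sigma), 1)
```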
@rajfly Your experiments required quite a lot of compute, thanks for running them! Would you be interested in solving this together?
@MischaPanch Yes, I think so too. The problem is likely either in one of the existing (but not frequently used) Tianshou methods that I relied on to make the PPO implementation match the Baselines implementation, or a configuration error on my part.
No problem! Yes, though my schedule is quite packed for the upcoming two weeks (I will try to help solve this when I can). After that, I will be able to dedicate more time to solving this.
I can't seem to replicate the original PPO algorithm's performance when using Tianshou's PPO implementation. The hyperparameters used are listed below. They follow the hyperparameters discussed in an ICLR blog post that aims to replicate the results from the original PPO paper (without LSTM).
Hyperparameters
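As a rough sketch, here is how hyperparameters like these are typically passed to Tianshou's PPOPolicy. This assumes the 0.5.x-style API; the values shown are the commonly cited Atari PPO settings from the ICLR blog post, not necessarily the exact configuration used in this issue, and actor/critic construction is omitted.

```python
# Sketch: Tianshou PPOPolicy with ICLR-blog-style Atari hyperparameters
# (0.5.x-style API assumed; values are the commonly cited defaults and
# may not match this issue's exact configuration).
import torch
from torch.distributions import Categorical
from tianshou.policy import PPOPolicy

def make_ppo_policy(actor, critic):
    # One optimizer over both actor and critic parameters.
    optim = torch.optim.Adam(
        list(actor.parameters()) + list(critic.parameters()),
        lr=2.5e-4, eps=1e-5,
    )
    return PPOPolicy(
        actor, critic, optim,
        dist_fn=lambda logits: Categorical(logits=logits),
        discount_factor=0.99,          # gamma
        gae_lambda=0.95,
        eps_clip=0.1,                  # PPO clipping coefficient
        vf_coef=0.5,                   # value loss weight
        ent_coef=0.01,                 # entropy bonus weight
        max_grad_norm=0.5,
        value_clip=True,
        advantage_normalization=True,  # per-minibatch advantage norm
        recompute_advantage=False,
    )
```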
I have tried these same hyperparameters with the Baselines, Stable Baselines3, and CleanRL implementations of the PPO algorithm and they all achieved the expected results. However, the Tianshou agent fails to train at all, as seen in the training curves below (Tianshou's PPO trials are shown in green). Am I missing something in my Tianshou configuration (see reproduction scripts) or is there a bug (or intentional discrepancy) in Tianshou's PPO implementation?
Tianshou training curves in green for 5 games, compared to other implementations
[Image: Screenshot 2024-05-31 at 9 03 57 PM]
NOTE: The y-axis represents the mean reward and the x-axis represents in-game frames (40 million in total).
Other Issues Found
Another issue is that for some games, such as Atlantis, BankHeist, or YarsRevenge, training can sometimes stop at random with the following error, though I am not entirely sure why:
Reproduction Scripts
Run command:
python ppo_atari.py --gpu 0 --env Alien --trials 5
Main Script (ppo_atari.py):
Dependencies of Main Script (include these 3 scripts in the same directory as the main script):
atari_network.py
atari_wrapper.py
common.py