Do you also have an LSTM implementation? #1
Comments
Hi, I'm glad someone found it useful. Unfortunately, I haven't had time to implement it yet. I definitely will one day, but I'm not sure when I'll find the time for it.
I just finished implementing it. It's still a massive mess with lots of hackery, so I won't bother you with it, but I might clean it up and let you know if you'd like :) I really like how clear every function is in your code. You make me want to improve my own coding.
Heh, everything emerges from mess :) Yes, sure, I'd be happy to see your take on it. It's always nice to have some reference during coding, especially with ML, where the devil is in the details.
I will :) I'm cleaning it up while I'm figuring out how to connect the models to Java through ONNX/TensorFlow/Keras. I also changed some of the algorithm in my version. For example, I'm normalizing the curiosity rewards, and instead of using .exp() on the difference of the logs I'm using an approximation that doesn't explode. I also simplified some of the hyperparameters. I'm getting full solves of Pendulum in a bit less than 20 epochs, so the renders all end up sticking up in the air like yours. Btw, your TensorBoard logs are super useful! Because of them I realised that Tanhs are preferred in the agent model because they are slower than ReLUs, allowing the ICM to keep up. They're also probably less prone to jumping to conclusions, which makes them more stable. Also, I'm not using the "recurrent" parameter yet since it makes saving the hidden states tricky while maintaining compatibility with the run_[...].py files, but I guess I'll figure that out after further cleaning.
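For reference, here's roughly what I mean by the reward normalization and the non-exploding ratio. This is just a minimal sketch with placeholder names (`normalize_curiosity`, `stable_ratio`, `icm_reward`, `logp_new`, `logp_old`), not the actual code from either repo, and the clamp-before-exp trick is only one possible way to keep the ratio bounded:

```python
import torch

def normalize_curiosity(icm_reward: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize the intrinsic (curiosity) reward per batch so its scale
    # stays comparable to the extrinsic reward.
    return (icm_reward - icm_reward.mean()) / (icm_reward.std() + eps)

def stable_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor, clip: float = 10.0) -> torch.Tensor:
    # Instead of exp(logp_new - logp_old) directly, clamp the log-difference
    # first so a badly mismatched policy early in training can't blow the
    # PPO ratio up to huge values.
    return torch.exp(torch.clamp(logp_new - logp_old, -clip, clip))
```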
Hi, I'm also interested in a (stateful) LSTM implementation. So far I have changed some of your code to use a stateful LSTM and removed the multi-env setup to run my env in a single process (felt easier to work with). ICM now runs on each episode separately (instead of your [n_env, batch_size, n_features] it's [batch_size, n_timesteps, n_features]), and later it's concatenated to [n_env_episodes, batch_size, n_timesteps, n_features] as PPO training input. But I have problems with diverging losses and rewards (see my post here ). So now I'm curious whether my approach with the LSTM is correct.
The divergence persists even after reworking it to use batches in all the places the models are used (ICM for the reward and loss, PPO for getting the old policies and for training).
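To make the per-episode batching concrete, this is a rough sketch of what I mean, under my own assumptions (the class name `EpisodeICM` and `build_ppo_input` are placeholders, episodes are assumed padded to the same length, and the LSTM is only the feature extractor, not the full ICM forward/inverse model):

```python
import torch
import torch.nn as nn

class EpisodeICM(nn.Module):
    # Placeholder sketch of an ICM feature extractor with an LSTM that
    # consumes one episode at a time as [batch_size, n_timesteps, n_features].
    def __init__(self, n_features: int, hidden_size: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)

    def forward(self, obs, hidden=None):
        # obs: [batch_size, n_timesteps, n_features]
        out, hidden = self.lstm(obs, hidden)
        return out, hidden

def build_ppo_input(episodes, icm):
    # episodes: list of tensors, each [batch_size, n_timesteps, n_features],
    # assumed padded to the same n_timesteps so they can be stacked.
    feats = []
    for ep in episodes:
        # Reset the hidden state at every episode boundary; the LSTM rolls
        # it forward over the episode's timesteps internally.
        out, _ = icm(ep, hidden=None)
        feats.append(out)
    # -> [n_env_episodes, batch_size, n_timesteps, hidden_size] for PPO training.
    return torch.stack(feats, dim=0)
```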
I really love this implementation, and I see that LSTM is still on the TODO list. Have you made any progress on this in the last two months, or should I just do it myself?