Hello there,

Thanks for writing this excellent manuscript on RLHF for the summarization task. It is definitely one of the best pieces in recent years. I have some quick questions about how the data is split among the SFT, reward-model, and PPO stages, and would like to confirm my understanding below:
For SFT, we used the training dataset here, with a size of ~116k posts, and ran for a single epoch to get the supervised baseline (as suggested in Appendix B.1).
To train the reward model, we used the human feedback dataset with a size of 64,832.
For the final PPO step, we used the validation set here, with a size of 6,447. This reading is based on Appendix A.1, which says "... 5% as a validation set. ... We used this set of posts for RL training".
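To make sure I am reading the splits correctly, here is a minimal sketch of my current understanding of the three-stage pipeline. The counts are simply the ones quoted above, and the source labels are my own paraphrases rather than identifiers from your codebase:

```python
# My current understanding of how the released data maps onto the three training stages.
# The counts restate the numbers quoted above; please correct me if any mapping is wrong.
splits = {
    "SFT":          {"source": "filtered TL;DR train split (1 epoch)", "size": 116_000},  # ~116k, as quoted
    "reward model": {"source": "human feedback dataset",               "size": 64_832},
    "PPO":          {"source": "filtered TL;DR validation split",      "size": 6_447},
}

for stage, info in splits.items():
    print(f"{stage:>12}: {info['size']:>7,} examples from {info['source']}")
```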
A follow-up question about the PPO process: Appendix B.1 mentions "... do 4 epochs of optimization for each batch of rollouts ... and run for 1 million episodes". Could you clarify what the 1 million episodes entail?
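To make my confusion concrete, here is the back-of-the-envelope interpretation I currently have. The rollout batch size of 512 is purely an assumed placeholder (not a value I have confirmed from the paper or the code); the rest just restates the numbers above:

```python
# Back-of-the-envelope reading of "1 million episodes" with
# "4 epochs of optimization for each batch of rollouts" (Appendix B.1).
# ASSUMPTION: the rollout batch size of 512 below is a placeholder chosen
# for illustration; I have not confirmed the actual value.

total_episodes = 1_000_000     # my reading: one episode = one sampled summary for one post
rollout_batch_size = 512       # hypothetical placeholder
ppo_epochs_per_batch = 4       # quoted from Appendix B.1
ppo_prompt_pool = 6_447        # validation posts, per my reading of Appendix A.1

num_rollout_batches = total_episodes // rollout_batch_size
optimization_passes = num_rollout_batches * ppo_epochs_per_batch
avg_episodes_per_post = total_episodes / ppo_prompt_pool

print(f"rollout batches:        {num_rollout_batches:,}")      # ~1,953 under this assumption
print(f"optimization passes:    {optimization_passes:,}")      # each batch of rollouts reused 4 times
print(f"avg. episodes per post: {avg_episodes_per_post:.0f}")  # ~155, i.e. each post sampled many times
```

If that reading is right, each of the 6,447 validation posts would be sampled roughly 155 times over training, which is part of what I would like to confirm.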
Many thanks.