Question on the SFT/reward model/PPO dataset numbers #25

Open

hanyin88 opened this issue Dec 19, 2023 · 0 comments

hanyin88 commented Dec 19, 2023

Hello there,

Thanks for writing this excellent manuscript on RLHF for the summarization task. It is definitely one of the best pieces in recent years. I have some quick questions about the dataset splits across the SFT, reward model, and PPO stages, and would like to confirm my understanding below:

  1. For SFT, we used the training dataset here, with a size of 116k, and ran for a single epoch to get the supervised baseline (as suggested in Appendix B.1).
  2. To train the reward model, we used the human feedback dataset here, with a size of 64,832 comparisons.
  3. For the final PPO step, we used the validation set here, with a size of 6,447. This reading is based on Appendix A.1, which says "… 5% as a validation set. … We used this set of posts for RL training". (A small sketch of this reading follows the list.)
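
To make sure I am reading the splits correctly, here is a minimal sketch of my understanding in code. The names and structure are my own, not taken from this repo; the only numbers are the sizes quoted above.

```python
# My reading of the three data splits. Names are mine; sizes are the ones
# quoted above (~116k train posts, 64,832 comparisons, 6,447 validation posts).
stages = {
    "sft":          {"data": "TL;DR train split (posts)",      "size": 116_000},
    "reward_model": {"data": "human feedback comparisons",     "size": 64_832},
    "ppo":          {"data": "TL;DR validation split (posts)", "size": 6_447},
}

for name, info in stages.items():
    print(f"{name}: {info['data']}, ~{info['size']:,} examples")
```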

A follow-up question about the PPO process: Appendix B.1 mentions "… do 4 epochs of optimization for each batch of rollouts … and run for 1 million episodes". Could I clarify what the 1 million episodes entail?
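
To make the question concrete, here is roughly how I currently picture the loop. The batch size and the two helper functions are placeholders of mine, not anything from the paper or this repo; only the 4 inner epochs and the 1M-episode total come from Appendix B.1.

```python
# Hypothetical PPO loop shape, only to make my question concrete.
TOTAL_EPISODES = 1_000_000   # "run for 1 million episodes" (Appendix B.1)
PPO_EPOCHS = 4               # "4 epochs of optimization for each batch of rollouts"
ROLLOUT_BATCH = 512          # placeholder batch size, not from the paper

def collect_rollouts(n):
    # placeholder: sample n posts, generate summaries, score them with the reward model
    return [None] * n

def ppo_update(rollouts):
    # placeholder: one optimization pass over this batch of rollouts
    pass

episodes_seen = 0
while episodes_seen < TOTAL_EPISODES:
    rollouts = collect_rollouts(ROLLOUT_BATCH)
    for _ in range(PPO_EPOCHS):      # reuse the same rollouts 4 times
        ppo_update(rollouts)
    episodes_seen += ROLLOUT_BATCH   # i.e., is one "episode" one sampled summary?
```

Is that the intended meaning of an episode (one sampled summary per post), or does an episode count something else?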

Many thanks.
