Question on the SFT/reward model/PPO dataset numbers #25

Open

hanyin88 opened this issue Dec 19, 2023 · 0 comments

hanyin88 commented Dec 19, 2023

Hello there,

Thanks for writing this excellent manuscript on RLHF for the summarization task. It is definitely one of the best pieces in recent years. I have some quick questions about the dataset splits across the SFT, reward model, and PPO stages, and would like to confirm my understanding below:

  1. For SFT, we used the training dataset here, with a size of 116k, and ran for a single epoch to get the supervised baseline (as suggested in Appendix B.1).
  2. To train the reward model, we used the human feedback dataset here, with a size of 64,832 comparisons.
  3. For the final PPO step, we used the validation set here, with a size of 6,447. This reading is based on Appendix A.1, which says "… 5% as a validation set. … We used this set of posts for RL training". (A small sketch of this reading follows the list.)
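
To make sure I am reading the splits correctly, here is a minimal sketch of my understanding in code. The names and structure are my own, not taken from this repo; the only numbers are the sizes quoted above.

```python
# My reading of the three data splits. Names are mine; sizes are the ones
# quoted above (~116k train posts, 64,832 comparisons, 6,447 validation posts).
stages = {
    "sft":          {"data": "TL;DR train split (posts)",      "size": 116_000},
    "reward_model": {"data": "human feedback comparisons",     "size": 64_832},
    "ppo":          {"data": "TL;DR validation split (posts)", "size": 6_447},
}

for name, info in stages.items():
    print(f"{name}: {info['data']}, ~{info['size']:,} examples")
```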

A follow-up question about the PPO process: Appendix B.1 mentions "… do 4 epochs of optimization for each batch of rollouts … and run for 1 million episodes". Could I clarify what the 1 million episodes entail?
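
To make the question concrete, here is roughly how I currently picture the loop. The batch size and the two helper functions are placeholders of mine, not anything from the paper or this repo; only the 4 inner epochs and the 1M-episode total come from Appendix B.1.

```python
# Hypothetical PPO loop shape, only to make my question concrete.
TOTAL_EPISODES = 1_000_000   # "run for 1 million episodes" (Appendix B.1)
PPO_EPOCHS = 4               # "4 epochs of optimization for each batch of rollouts"
ROLLOUT_BATCH = 512          # placeholder batch size, not from the paper

def collect_rollouts(n):
    # placeholder: sample n posts, generate summaries, score them with the reward model
    return [None] * n

def ppo_update(rollouts):
    # placeholder: one optimization pass over this batch of rollouts
    pass

episodes_seen = 0
while episodes_seen < TOTAL_EPISODES:
    rollouts = collect_rollouts(ROLLOUT_BATCH)
    for _ in range(PPO_EPOCHS):      # reuse the same rollouts 4 times
        ppo_update(rollouts)
    episodes_seen += ROLLOUT_BATCH   # i.e., is one "episode" one sampled summary?
```

Is that the intended meaning of an episode (one sampled summary per post), or does an episode count something else?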

Many thanks.
