To create the dataset, we first selected 100K high-quality Magpie instructions spanning diverse task categories, then generated five responses per instruction with Llama 3 8B Instruct at a temperature of 0.8. We then annotated RM scores using RLHFlow/ArmoRM-Llama3-8B-v0.1, labeling the response with the highest RM score as the chosen response and the one with the lowest RM score as the rejected response.
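For reference, here is a minimal sketch of the pairing step described above, not the authors' actual pipeline. The model names and temperature come from the quoted description; the helper names (`score_response`, `build_dpo_pair`) are hypothetical, and the reward-model call follows the ArmoRM model card, which exposes a scalar `score` output via its custom head.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
DEVICE = "cuda"

rm_tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(
    RM_NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE)

def score_response(instruction: str, response: str) -> float:
    """Return the scalar ArmoRM preference score for one (prompt, response) pair."""
    messages = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ]
    input_ids = rm_tokenizer.apply_chat_template(messages, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        out = rm(input_ids)
    # `score` is the scalar preference head exposed by ArmoRM's custom model code
    return out.score.float().item()

def build_dpo_pair(instruction: str, candidates: list[str]) -> dict:
    """Label the highest-scoring of the sampled responses as chosen, the lowest as rejected."""
    scores = [score_response(instruction, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return {
        "prompt": instruction,
        "chosen": candidates[best],
        "rejected": candidates[worst],
    }
```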
Very wonderful work!
After filtering down to the 300K data, how do you obtain the 100K subset used to synthesize the DPO data?
If you could share the data-filtering code for this part, it would be very helpful.
Thank you for your question. This 100K was filtered empirically lol. We noticed that the original Magpie dataset had too many information-seeking and advice-seeking entries, so we manually decreased their proportion for the DPO phase to make the task categories more diverse and balanced.
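Since the comment above says the 100K was chosen empirically, the snippet below is only a sketch of one way to down-weight over-represented categories when subsampling from 300K to roughly 100K. It assumes each record carries a `task_category` field, and the per-category caps in `CATEGORY_CAPS` are hypothetical values you would tune by hand.

```python
import random
from collections import defaultdict

# Hypothetical caps: shrink over-represented categories, tune until ~100K remain.
CATEGORY_CAPS = {
    "Information seeking": 15_000,
    "Advice seeking": 8_000,
}
DEFAULT_CAP = 12_000  # generous cap for all other categories

def rebalance(records: list[dict], seed: int = 42) -> list[dict]:
    """Cap each task category so the resulting subset is more diverse and balanced."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for rec in records:
        by_cat[rec["task_category"]].append(rec)
    kept = []
    for cat, items in by_cat.items():
        cap = CATEGORY_CAPS.get(cat, DEFAULT_CAP)
        rng.shuffle(items)
        kept.extend(items[:cap])
    rng.shuffle(kept)
    return kept
```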