
[SUBMISSION] Automated Policy-based Preference Alignment using Synthetic Data Generation #108

Open · wants to merge 2 commits into base: december-2024
Conversation

souzatharsis
Contributor

December 2024 Student Submission

See the HTML-rendered submission here for ease of review.
The accompanying Python notebook is in the PR.

Module Completed

  • Module 1: Instruction Tuning
  • Module 2: Preference Alignment
  • Module 3: Parameter-efficient Fine-tuning
  • Module 4: Evaluation
  • Module 5: Vision-language Models
  • Module 6: Synthetic Datasets
  • Module 7: Inference
  • Module 8: Deployment

Changes Made

In this case study, we demonstrate how to use DPO to align a language model with a user-provided policy, further automating the process via synthetic data generation and LLM-as-a-judge evaluation.

We go over a Case Study for Acme Inc., a company dedicated to democratizing access to computer science education for K-12 students. Acme Inc. is in the process of creating a chatbot named smolK-12, a small open-source LLM designed specifically for K-12 students.

We’ll explore how to align a language model with Acme Inc.’s policy to ensure its LLM-powered applications are safe and appropriate for K-12 students.
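For a sense of what the alignment step looks like in code, here is a minimal sketch of DPO training with TRL's DPOTrainer. The base model name, the example preference pair, and the hyperparameters are illustrative placeholders rather than the notebook's actual values, and the snippet assumes a recent TRL release where DPOTrainer takes a DPOConfig.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Illustrative base model; the notebook's actual choice may differ.
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: each record holds a prompt, a policy-compliant "chosen"
# response, and a policy-violating "rejected" response. In the case study
# these pairs come from synthetic generation plus LLM-as-a-judge filtering;
# here a single hard-coded example stands in for that data.
train_dataset = Dataset.from_list([
    {
        "prompt": "How do I talk to strangers online?",
        "chosen": "It's safest to only chat with people you know. Always tell "
                  "a parent or teacher before talking to someone new online.",
        "rejected": "Just share your personal details so they can find you.",
    },
])

args = DPOConfig(output_dir="smolk12-dpo", num_train_epochs=1, beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions take `tokenizer=` instead
)
trainer.train()
```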

Notebooks Added/Modified

List any notebooks you've added or modified:

  • Added new example in 2_preference_alignment/notebooks/smolk12
  • Modified existing notebook with additional examples
  • Added documentation or comments

Checklist

Questions or Discussion Points

Add any questions you have or points you'd like to discuss:
I am particularly interested in your feedback on the points I've raised in the Discussion section around:

  1. Synthetic Data Generation
  2. Choice of Base Model
  3. Evaluation Methodology
  4. DPO Dataset Composition
  5. Fine-tuning Process

Additional Notes

Any other information that might be helpful for reviewers:

This Case Study is part of an open-source book I am writing, "Taming LLMs".

I would love to highlight the great smolLM work you are doing here, so I would truly appreciate your feedback on the Case Study submitted here.

Cheers,
Tharsis.

@burtenshaw
Collaborator

This is a really nice submission @souzatharsis. Thanks!

My first question is: Have you considered using a library like distilabel?

We're currently working on the synthetic data module and I think your use case could fit there.

@souzatharsis
Contributor Author

Hi @burtenshaw, thanks for the feedback!
distilabel indeed sounds like an elegant way to replicate my data generation process; thanks for the recommendation.
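For reference, a minimal sketch of what that replication could look like with distilabel's Pipeline API. This assumes distilabel >= 1.0, and the seed instruction and generator model are placeholders I picked for illustration, not the ones from my notebook.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Seed prompts would be derived from Acme Inc.'s policy; this is a placeholder.
seed_data = [{"instruction": "Explain photosynthesis to a 5th grader."}]

with Pipeline(name="smolk12-synthetic-data") as pipeline:
    # Load the seed instructions, then generate candidate responses with an LLM.
    load_seeds = LoadDataFromDicts(data=seed_data)
    generate = TextGeneration(
        # Placeholder generator model served via Hugging Face Inference Endpoints.
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
    )
    load_seeds >> generate

if __name__ == "__main__":
    distiset = pipeline.run()
    print(distiset)
```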

@souzatharsis
Contributor Author

Looking forward to additional feedback!
