
Finish Synthetic Datasets module #83

Open
davidberenstein1957 opened this issue Dec 11, 2024 · 1 comment · May be fixed by #121
Comments

@davidberenstein1957 (Member) commented Dec 11, 2024

The synthetic datasets module is not complete. It needs a finalised structure, some more information, and exercises.

Structure
Here is a basic proposal for a structure:

  • What is synthetic data?
  • Synthetic data for instruction tuning + adding seed knowledge (Magpie, SelfInstruct)
  • Synthetic data for preference tuning + LLM evals + adding seed knowledge (idem + response generation + UltraFeedback)
  • Improving synthetic data (injecting diversity, evolving/DEITA)
  • Evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)
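To make the instruction-tuning step above concrete, here is a minimal sketch of a Magpie-style sample generator. Everything here is hypothetical: `generate` is a stub standing in for a real chat-model call, and `PRE_QUERY_TEMPLATE` is an illustrative chat template. The idea, per the Magpie paper, is that prompting an aligned model with only the template up to the user turn makes it complete the prompt with a plausible instruction.

```python
# Hypothetical sketch of Magpie-style instruction generation.
# `generate` is a stub; a real pipeline would call an aligned chat model.

def generate(prompt: str) -> str:
    """Stub LLM call. A real implementation samples a completion here."""
    return "Explain the difference between SFT and DPO."

# Chat template truncated right after the user-turn marker (illustrative).
PRE_QUERY_TEMPLATE = "<|im_start|>user\n"

def magpie_sample() -> dict:
    # Step 1: give the model only the pre-query template, so it
    # "completes" the prompt by writing a user instruction itself.
    instruction = generate(PRE_QUERY_TEMPLATE)
    # Step 2: generate a response to that instruction as usual.
    full_prompt = (
        f"{PRE_QUERY_TEMPLATE}{instruction}<|im_end|>\n<|im_start|>assistant\n"
    )
    response = generate(full_prompt)
    return {"instruction": instruction, "response": response}
```

Repeating `magpie_sample` with real sampling (non-zero temperature) would yield an SFT dataset of instruction/response pairs without any seed prompts.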

Project

  • create an SFT dataset
  • transform the SFT dataset into a preference dataset
  • improve the preference dataset
  • evaluate and compare the improved and basic datasets
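The second project step (SFT dataset → preference dataset) can be sketched as: sample several candidate responses per instruction, score them with a judge, and keep the best and worst as `chosen`/`rejected`. This is a hypothetical outline, not the course's implementation; `generate_responses` and `judge` are stubs where a real pipeline would use one or more LLMs and an UltraFeedback-style judge prompt.

```python
import random

def generate_responses(instruction: str, n: int = 2) -> list[str]:
    # Stub: a real pipeline samples n completions from one or more models.
    return [f"response {i} to: {instruction}" for i in range(n)]

def judge(instruction: str, response: str) -> float:
    # Stub LLM-as-judge: a real pipeline would rate each response
    # (e.g. with UltraFeedback-style criteria) and return a score.
    return random.random()

def sft_to_preference(sft_rows: list[dict]) -> list[dict]:
    """Turn SFT rows into preference rows with chosen/rejected pairs."""
    preference_rows = []
    for row in sft_rows:
        candidates = generate_responses(row["instruction"])
        # Rank candidates by judge score, best first.
        ranked = sorted(
            candidates,
            key=lambda r: judge(row["instruction"], r),
            reverse=True,
        )
        preference_rows.append(
            {"prompt": row["instruction"], "chosen": ranked[0], "rejected": ranked[-1]}
        )
    return preference_rows
```

The resulting `prompt`/`chosen`/`rejected` rows match the format commonly expected by DPO trainers, which is what the later "improve" and "evaluate" steps would operate on.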


@burtenshaw (Collaborator) commented

Great. Thanks for outlining this.

The material you outlined sounds good. For now, I would focus on getting the core material for synthetic data in place and structuring it like the other modules, which would be something like this:

  • README
  • instruction_datasets.md
    • Magpie
    • SelfInstruct
  • preference_datasets.md
    • UltraFeedback
  • notebooks/
    • SFT dataset project
    • DPO dataset project

I would say this is the minimum which aligns with the previous modules.

> improving synthetic data (injecting diversity, evolving/DEITA)
> evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)

I would say that these are good extras, which we can come back to if we have time.

@davidberenstein1957 davidberenstein1957 linked a pull request Dec 19, 2024 that will close this issue