
Finish Synthetic Datasets module #83

Open
davidberenstein1957 opened this issue Dec 11, 2024 · 1 comment · May be fixed by #121
Comments

@davidberenstein1957 (Member) commented Dec 11, 2024

The synthetic datasets module is not complete. It needs a finalised structure, some more information, and exercises.

Structure
Here is a basic proposal for a structure:

  • What is synthetic data?
  • Synthetic data for instruction tuning + adding seed knowledge (Magpie, SelfInstruct)
  • Synthetic data for preference tuning + LLM evals + adding seed knowledge (idem + response generation + UltraFeedback)
  • Improving synthetic data (injecting diversity, evolving/DEITA)
  • Evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)
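To make the instruction-tuning step above concrete, here is a minimal sketch of a Magpie-style sample generator. Everything here is hypothetical: `generate` is a stub standing in for a real chat-model call, and `PRE_QUERY_TEMPLATE` is an illustrative chat template. The idea, per the Magpie paper, is that prompting an aligned model with only the template up to the user turn makes it complete the prompt with a plausible instruction.

```python
# Hypothetical sketch of Magpie-style instruction generation.
# `generate` is a stub; a real pipeline would call an aligned chat model.

def generate(prompt: str) -> str:
    """Stub LLM call. A real implementation samples a completion here."""
    return "Explain the difference between SFT and DPO."

# Chat template truncated right after the user-turn marker (illustrative).
PRE_QUERY_TEMPLATE = "<|im_start|>user\n"

def magpie_sample() -> dict:
    # Step 1: give the model only the pre-query template, so it
    # "completes" the prompt by writing a user instruction itself.
    instruction = generate(PRE_QUERY_TEMPLATE)
    # Step 2: generate a response to that instruction as usual.
    full_prompt = (
        f"{PRE_QUERY_TEMPLATE}{instruction}<|im_end|>\n<|im_start|>assistant\n"
    )
    response = generate(full_prompt)
    return {"instruction": instruction, "response": response}
```

Repeating `magpie_sample` with real sampling (non-zero temperature) would yield an SFT dataset of instruction/response pairs without any seed prompts.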

Project

  • create an SFT dataset
  • transform the SFT dataset into a preference dataset
  • improve the preference dataset
  • evaluate and compare the improved and basic datasets
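The second project step (SFT dataset → preference dataset) can be sketched as: sample several candidate responses per instruction, score them with a judge, and keep the best and worst as `chosen`/`rejected`. This is a hypothetical outline, not the course's implementation; `generate_responses` and `judge` are stubs where a real pipeline would use one or more LLMs and an UltraFeedback-style judge prompt.

```python
import random

def generate_responses(instruction: str, n: int = 2) -> list[str]:
    # Stub: a real pipeline samples n completions from one or more models.
    return [f"response {i} to: {instruction}" for i in range(n)]

def judge(instruction: str, response: str) -> float:
    # Stub LLM-as-judge: a real pipeline would rate each response
    # (e.g. with UltraFeedback-style criteria) and return a score.
    return random.random()

def sft_to_preference(sft_rows: list[dict]) -> list[dict]:
    """Turn SFT rows into preference rows with chosen/rejected pairs."""
    preference_rows = []
    for row in sft_rows:
        candidates = generate_responses(row["instruction"])
        # Rank candidates by judge score, best first.
        ranked = sorted(
            candidates,
            key=lambda r: judge(row["instruction"], r),
            reverse=True,
        )
        preference_rows.append(
            {"prompt": row["instruction"], "chosen": ranked[0], "rejected": ranked[-1]}
        )
    return preference_rows
```

The resulting `prompt`/`chosen`/`rejected` rows match the format commonly expected by DPO trainers, which is what the later "improve" and "evaluate" steps would operate on.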


@burtenshaw (Collaborator) commented

Great. Thanks for outlining this.

The material you outlined sounds good. For now, I would focus on getting the core material for synthetic data in place and structuring it like the other modules, which would be something like this:

  • README
  • instruction_datasets.md
    • Magpie
    • SelfInstruct
  • preference_datasets.md
    • UltraFeedback
  • notebooks/
    • SFT dataset project
    • DPO dataset project

I would say this is the minimum which aligns with the previous modules.

> improving synthetic data (injecting diversity, evolving/DEITA)
> evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)

I would say that these are good extras, which we can come back to if we have time.

@davidberenstein1957 davidberenstein1957 linked a pull request Dec 19, 2024 that will close this issue