gretelai/awesome-synthetic-data

awesome-synthetic-data

A curated list of resources dedicated to Synthetic Data

To contribute to this list, read the contribution guidelines first, then add your favorite synthetic data resource by opening a pull request.

A listed repository should be marked as deprecated if:

  • The repository's owner explicitly states that the library is not maintained.
  • It has had no commits for a long time (2 to 3 years).

Contents

Research Summaries and Trends

Back to Top

Tutorials

Back to Top

Reading Content

Back to Top

Introductions and Guides to Synthetic Data

Blogs and Newsletters

Videos and Online Courses

Back to Top

Diffusion Models

Libraries

Open Source Generative Synthetic Data Models, Libraries and Frameworks | Back to Top

Text, Tabular and Time-Series

  • gretel-synthetics - Generative models for structured and unstructured text, tabular, and multivariate time-series data, featuring differentially private learning.
  • SDV - Synthetic Data Generator for tabular, relational, and time series data.
  • Synthea - Synthetic Patient Population Simulator.
  • ydata-synthetic - Synthetic structured data generators.
  • synthpop - A tool for producing synthetic versions of microdata.

Image

Audio

  • Jukebox - OpenAI's generative model for music.

Simulation

  • AirSim - A simulator for drones, cars and more, built on Unreal and Unity engines.
  • Nvidia Dataset Synthesizer - NDDS is a UE4 plugin from NVIDIA to empower computer vision researchers to export high-quality synthetic images with metadata.
  • OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
  • Unity Perception - Perception toolkit for sim2real training and validation in Unity.

Video

Academic Papers

Back to Top

Language Models

  • Evaluating Large Language Models Trained on Code (2021) Mark Chen et al. [pdf]

Generative Adversarial Networks (GANs)

  • Modeling Tabular Data using Conditional GAN (2019) Xu et al. [pdf]
  • Generating Long Videos of Dynamic Scenes (2022) Tim Brooks et al. [pdf]
  • Generative Adversarial Networks (2014) Ian J. Goodfellow et al. [pdf]
  • Conditional Generative Adversarial Nets (2014) Mehdi Mirza et al. [pdf]
  • Wasserstein GAN (2017) Martin Arjovsky et al. [pdf]
  • Improved Training of Wasserstein GANs (2017) Ishaan Gulrajani, et al. [pdf]
  • Time-series Generative Adversarial Networks (2019) Jinsung Yoon et al. [pdf]

Diffusion Models

  • Generative Modeling by Estimating Gradients of the Data Distribution (2021) Yang Song [pdf]
  • Diffusion Models are Autoencoders (2021) S. Dieleman [pdf]
  • Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015) J Sohl-Dickstein et al. [pdf]
  • KNN-Diffusion: Image Generation via Large-Scale Retrieval (2022) Oron Ashual [pdf]

Fair AI

  • A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle (2021) Harini Suresh, John Guttag [pdf]
  • DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks (2021) Boris van Breugel et al. [pdf]
  • On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? (2021) Emily M. Bender, et al. [pdf]
  • A Survey on Bias and Fairness in Machine Learning (2022) Ninareh Mehrabi [pdf]
  • AI Fairness (Approaches & Mathematical Definitions) (2022) Jonathan Hui [blog]
  • AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias (2018) Rachel K. E. Bellamy et al. [pdf]

Algorithmic Privacy

  • Deep Learning with Differential Privacy (2016) Abadi et al. [pdf]
  • An Efficient DP-SGD Mechanism for Large Scale NLP Models (2021) Dupuy et al. [pdf]
  • PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (2018) Jordon et al. [pdf]
  • Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence (2021) Cao et al. [pdf]
  • Differentially Private Fine-tuning of Language Models (2022) Yu et al. [pdf]

Services

Synthetic data as an API with higher-level functionality such as model training, fine-tuning, and generation | Back to Top

Prominent Synthetic Data Research Labs

Back to Top

Datasets

Back to Top

License

License - CC0