awesome-synthetic-data

A curated list of awesome synthetic data tools (open source and commercial).

Inspired by Awesome Synthetic Data

Table of contents

  • Open source tools
  • Commercial solutions
  • Online communities

Open source tools

  • Copulas: a Python library for modeling multivariate distributions and sampling from them using copula functions.
  • CTGAN: SDV’s collection of deep learning-based synthetic data generators for single-table data.
  • DataGene: a tool to train, test, and validate datasets, and to detect and compare similarity between real and synthetic datasets.
  • DoppelGANger: a synthetic data generation framework based on generative adversarial networks (GANs).
  • DP_WGAN-UCLANESL: this solution trains a Wasserstein generative adversarial network (WGAN) on the real private dataset.
  • DPSyn: an algorithm for synthesizing microdata while satisfying differential privacy.
  • Faker: a Python package that generates fake data (Note: this tool does not generate synthetic data but offers dummy data).
  • Generative adversarial nets for synthetic time series data: a repository that shows how to create synthetic time-series data using generative adversarial networks (GANs).
  • Gretel.ai: commercial synthetic data vendor that offers open source functionality.
  • mirrorGen: a Python tool that generates synthetic data based on user-specified causal relations among features in the data.
  • Plait.py: a program for generating fake data from composable YAML templates.
  • Pydbgen: a Python package that generates a random database table based on the user's choice of data types.
  • Smart noise synthesizer: a differentially private open source synthesizer for tabular data.
  • Synner: an open source tool to generate real-looking synthetic data by visually specifying the properties of the dataset.
  • Synth: an open source data-as-code tool that provides a simple CLI workflow for generating consistent data in a scalable way.
  • Synthea: an open source synthetic patient generator that models the medical history of synthetic patients.
  • Synthetic data vault (SDV): one of the first open source synthetic data solutions, SDV provides tools for generating synthetic tabular, relational, and time series data.
  • TGAN: generative adversarial training for generating synthetic tabular data.
  • Tofu: a Python library for generating synthetic UK Biobank data.
  • Twinify: a software package for privacy-preserving generation of a synthetic twin to a given sensitive data set.
  • YData: synthetic structured data generator by YData, a commercial vendor.
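The Faker entry above distinguishes dummy data from synthetic data. As a rough illustration of that distinction, here is a minimal sketch using only the Python standard library. It does not use any tool from this list; the function names, the sample `real_ages` values, and the choice of a normal distribution as the "model" are all assumptions made for the example.

```python
import random
import statistics


def dummy_ages(n, rng):
    """Dummy data: plausible-looking values generated without looking at any real dataset."""
    return [rng.randint(0, 100) for _ in range(n)]


def synthetic_ages(real, n, rng):
    """Synthetic data: values sampled from a model fitted to the real dataset.

    The "model" here is only a normal distribution with the real data's
    mean and standard deviation (an assumption for illustration; actual
    synthesizers use far richer models such as copulas or GANs).
    """
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "real" dataset, made up for this example.
real_ages = [34, 36, 35, 33, 37, 35, 34, 36]

dummy = dummy_ages(1000, rng)
synthetic = synthetic_ages(real_ages, 1000, rng)

print("real mean:     ", statistics.mean(real_ages))
print("dummy mean:    ", statistics.mean(dummy))
print("synthetic mean:", statistics.mean(synthetic))
```

The synthetic sample tracks the statistics of the real values, while the dummy sample only has the right shape (integers in a sensible range) and carries no information about the real data.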

Commercial solutions

  • Betterdata: vendor of a privacy-preserving synthetic data solution for AI, data sharing, or product development.
  • Datomize: vendor of a synthetic data solution for the development, training, and testing of AI/ML models and applications.
  • Diveplane: vendor of Geminai, a solution to generate synthetic ‘twin’ datasets with the same statistical properties as the original data.
  • Facteus: vendor of Mimic™, a synthetic data engine to synthesize data assets that protect consumer privacy.
  • Gretel: vendor of a synthetic data generation library and APIs for developers and data practitioners.
  • Hazy: vendor of a synthetic data platform for financial institutions that want to conduct data analysis.
  • Instill AI: vendor of a solution for synthetic data generation leveraging Generative Adversarial Networks and differential privacy.
  • Kymera Labs: vendor of Synthetic Data Fabrication Software, a solution that generates new data without relying on the ML/GAN approach.
  • Mirry.ai: vendor of a synthetic data platform for generating synthetic data using GANs, available in Community, Cloud or Enterprise editions.
  • Mostly AI: vendor of Mostly Generate, a synthetic data generator that provides as-good-as-real, yet fully anonymous data.
  • Replica Analytics: vendor of Replica Synthesis, a software solution that ingests data and builds synthesis models to generate synthetic datasets.
  • Sarus technologies: vendor of ML software to help data practitioners leverage sensitive data assets for innovation with privacy guarantees.
  • Sogeti: vendor of Artificial Data Amplifier (ADA), a solution by the Sogeti Testing AI team that generates realistic data based on real data sets.
  • Statice: vendor of a software solution that generates privacy-preserving synthetic data that can be used as a drop-in replacement for an original dataset.
  • Syndata AB: vendor of a synthetic data generator to generate data sets that match the statistical attributes of real data but are entirely synthetic.
  • Synthesized: vendor of a DataOps platform enabling data sharing and collaboration across internal groups, remote teams, and external partners.
  • Syntheticus: Swiss vendor of a platform dedicated to generating synthetic data.
  • Syntho: vendor of AI software for generating synthetic data.
  • Tonic: vendor of a synthetic data generator to mimic production data.
  • Ydata: vendor of a synthesizer that mimics the statistical information of real data in new datasets without transforming the original data.

Online communities

  • Open SDP: an open online community for sharing educational analytic tools and resources.
  • OpenSynthetic: an open community for creating and using synthetic data in computer vision and machine learning (ML).
  • GenRocket Community: a community from GenRocket for asking questions and exchanging ideas about test data and synthetic data.
  • Synthetic Data Vault Slack channel: the Slack channel from the SDV team.