Snowflake Synthetic Data Generator

Python scripts to generate synthetic data in Snowflake, based upon a list of input tables provided. The scripts rely on reading the input table schemas to understand the name & data types of columns, to in turn use the Python library Faker to generate meaningful fake data.

Summary

For each input table listed in ip/config.yaml, these scripts generate X amount of fake data records.
- Where X is based upon the value of num_records_to_generate in ip/config.yaml.
The script reads the table schemas of each input table list provided to understand the name & data types of inputs columns, to in turn generate fake data.
The script initially focusses on detecting the 5 groupings of Snowflake data types (see Summary of Data Types | docs.snowflake.com) listed below:
- String data types (i.e.: VARCHAR, TEXT and STRING)
- Numeric data types (i.e.: NUMBER, NUMERIC and DECIMAL)
- Date data types (i.e., DATE, DATETIME, TIME, TIMESTAMP, TIMESTAMP_{LTZ}{_NTZ}{_TZ})
- Boolean
- Binary
The script then inserts the generated fake data into each of the target tables.

Technologies Used

Makefile
Python
Python packages (see requirements.txt):
- snowflake-connector-python - to query the Snowflake DB from the python script.
- Faker - to generate fake data (see py/dt_data_generation.py).
- pandas - to write Snowflake query output to Pandas data frames.
- pyyaml & yq - to parse input from config.yaml
- cryptography - to render an input private RSA key to query the Snowflake DB.

Prerequisites

Before you begin, ensure you have met the following requirements:

Install the prerequisite python packages by running: make deps.
Provide values for each of the keys in ip/config.yaml. For a description breakdown of each of the input args, see ip/README.md.

How-to Run

Run make run to:

Write the table schemas for each input table listed underneath the input_tbls key in ip/config.yaml.
- Note: the table schema output is written to tmp/{input_tbl}.csv. See the python function get_table_schema() in snowflake_query.py for more details.
Generate X amount of fake data records (based upon the value of num_records_to_generate in ip/config.yaml) for each data type within each input table
- Note: the generated fake data for each table is written to op/{input_tbl}.csv. See the function generate_fake_data() in gen_fake_data.py for more details.
Insert the generated fake data into each of the target tables listed underneath the input_tbls key in ip/config.yaml. generate_fake_data
- See the function insert_fake_data() in gen_fake_data.py for more details.

Credits

This is an adapted version of the following README.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snowflake Synthetic Data Generator

Summary

Technologies Used

Prerequisites

How-to Run

Credits

About

Releases

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
docs		docs
ip		ip
op		op
py		py
tests		tests
tmp		tmp
.gitignore		.gitignore
.markdownlint.rb		.markdownlint.rb
.pre-commit-config.yaml		.pre-commit-config.yaml
Makefile		Makefile
README.md		README.md
requirements.txt		requirements.txt
todos.txt		todos.txt

paulf-999/snowflake_synthetic_data_generator

Folders and files

Latest commit

History

Repository files navigation

Snowflake Synthetic Data Generator

Summary

Technologies Used

Prerequisites

How-to Run

Credits

About

Topics

Resources

Stars

Watchers

Forks

Releases

Languages