Python scripts to generate synthetic data in Snowflake, based on a list of input tables. The scripts read the input table schemas to learn the names and data types of the columns, then use the Python library Faker to generate meaningful fake data.
- For each input table listed in `ip/config.yaml`, these scripts generate X fake data records.
  - Where X is the value of `num_records_to_generate` in `ip/config.yaml`.
- The script reads the schema of each input table to learn the names and data types of its columns, then generates fake data to match.
- The script initially focuses on detecting the 5 groupings of Snowflake data types (see Summary of Data Types | docs.snowflake.com) listed below:
  - String data types (i.e., `VARCHAR`, `TEXT` and `STRING`)
  - Numeric data types (i.e., `NUMBER`, `NUMERIC` and `DECIMAL`)
  - Date data types (i.e., `DATE`, `DATETIME`, `TIME`, `TIMESTAMP`, `TIMESTAMP_LTZ`, `TIMESTAMP_NTZ`, `TIMESTAMP_TZ`)
  - Boolean
  - Binary
- The script then inserts the generated fake data into each of the target tables.
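
The type detection described above can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the names `TYPE_GROUPS`, `classify_snowflake_type` and `fake_value` are hypothetical, and the standard-library `random` module stands in for Faker so the snippet runs without extra packages. For brevity, every date and time type receives the same ISO date placeholder.

```python
import random
import re
import string
from datetime import date, timedelta

# Type names come from the five groupings listed above; the group labels are illustrative.
TYPE_GROUPS = {
    "string": {"VARCHAR", "TEXT", "STRING"},
    "numeric": {"NUMBER", "NUMERIC", "DECIMAL"},
    "date": {"DATE", "DATETIME", "TIME", "TIMESTAMP",
             "TIMESTAMP_LTZ", "TIMESTAMP_NTZ", "TIMESTAMP_TZ"},
    "boolean": {"BOOLEAN"},
    "binary": {"BINARY"},
}


def classify_snowflake_type(data_type):
    """Return the group for a type string such as 'NUMBER(38,0)', or None."""
    # Drop any precision/length suffix, e.g. 'VARCHAR(16777216)' -> 'VARCHAR'.
    base = re.sub(r"\(.*\)$", "", data_type.strip()).upper()
    for group, names in TYPE_GROUPS.items():
        if base in names:
            return group
    return None


def fake_value(group, rng):
    """Produce one placeholder value for a group (stdlib stand-in for Faker)."""
    if group == "string":
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    if group == "numeric":
        return rng.randint(0, 9999)
    if group == "date":
        return (date(2020, 1, 1) + timedelta(days=rng.randint(0, 365))).isoformat()
    if group == "boolean":
        return rng.choice([True, False])
    if group == "binary":
        # Four random bytes, hex-encoded so they can be written to a CSV file.
        return bytes(rng.getrandbits(8) for _ in range(4)).hex()
    raise ValueError(f"unsupported type group: {group!r}")


rng = random.Random(0)  # seeded so repeated runs give the same values
for column_type in ["VARCHAR(16777216)", "NUMBER(38,0)", "TIMESTAMP_NTZ(9)", "BOOLEAN"]:
    group = classify_snowflake_type(column_type)
    print(column_type, "->", group, "->", fake_value(group, rng))
```

Types outside the five groupings return `None` from the classifier, so a caller can decide whether to skip the column or fail.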
Makefile
Python
- Python packages (see `requirements.txt`):
  - `snowflake-connector-python` - to query the Snowflake DB from the Python script.
  - `Faker` - to generate fake data (see `py/dt_data_generation.py`).
  - `pandas` - to write Snowflake query output to Pandas data frames.
  - `pyyaml` & `yq` - to parse input from `config.yaml`.
  - `cryptography` - to render an input private RSA key to query the Snowflake DB.
Before you begin, ensure you have met the following requirements:
- Install the prerequisite Python packages by running `make deps`.
- Provide values for each of the keys in `ip/config.yaml`. For a description of each input argument, see `ip/README.md`.
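
As a rough illustration of the second requirement, the check below validates the two keys this README names, `input_tbls` and `num_records_to_generate`. It is a sketch under assumptions: `validate_config` is a hypothetical helper, `input_tbls` is assumed to be a list of table names, and the dictionary literal stands in for whatever `yaml.safe_load()` would return for `ip/config.yaml`. Any other keys the real file requires are described in `ip/README.md` and are not covered here.

```python
def validate_config(config):
    """Return (tables, num_records) or raise ValueError with a clear message."""
    tables = config.get("input_tbls")
    if not isinstance(tables, list) or not tables:
        raise ValueError("input_tbls must be a non-empty list of table names")
    num_records = config.get("num_records_to_generate")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(num_records, bool) or not isinstance(num_records, int) or num_records < 1:
        raise ValueError("num_records_to_generate must be a positive integer")
    return tables, num_records


# Stand-in for the parsed contents of ip/config.yaml (assumed shape, example values).
example_config = {"input_tbls": ["CUSTOMERS", "ORDERS"], "num_records_to_generate": 100}
tables, num_records = validate_config(example_config)
print(f"{num_records} records for each of {tables}")
```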
Run `make run` to:
- Write the table schemas for each input table listed under the `input_tbls` key in `ip/config.yaml`.
  - Note: the table schema output is written to `tmp/{input_tbl}.csv`. See the Python function `get_table_schema()` in `snowflake_query.py` for more details.
- Generate X fake data records (based on the value of `num_records_to_generate` in `ip/config.yaml`) for each data type within each input table.
  - Note: the generated fake data for each table is written to `op/{input_tbl}.csv`. See the function `generate_fake_data()` in `gen_fake_data.py` for more details.
- Insert the generated fake data into each of the target tables listed under the `input_tbls` key in `ip/config.yaml`.
  - See the function `insert_fake_data()` in `gen_fake_data.py` for more details.
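
The last two steps above can be sketched roughly as shown below. This is not the code in `gen_fake_data.py`: `write_records_csv` and `build_insert_statement` are hypothetical helpers, the table and column names are made up, and the demo writes to a temporary directory instead of `op/`. The `%s` placeholders assume the Snowflake connector's default `pyformat` parameter style.

```python
import csv
import tempfile
from pathlib import Path


def write_records_csv(input_tbl, columns, rows, out_dir="op"):
    """Write a header plus data rows to {out_dir}/{input_tbl}.csv and return the path."""
    path = Path(out_dir) / f"{input_tbl}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)
    return path


def build_insert_statement(input_tbl, columns):
    """Build a parameterised INSERT with one %s placeholder per column."""
    # Identifiers are interpolated directly, so only pass trusted table/column names.
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {input_tbl} ({', '.join(columns)}) VALUES ({placeholders})"


columns = ["ID", "NAME"]               # illustrative column names
rows = [(1, "alice"), (2, "bob")]      # illustrative generated records
demo_dir = tempfile.mkdtemp()          # temporary stand-in for the op/ directory
csv_path = write_records_csv("CUSTOMERS", columns, rows, out_dir=demo_dir)
insert_sql = build_insert_statement("CUSTOMERS", columns)
print(csv_path)
print(insert_sql)
```

With an open connection from `snowflake-connector-python`, a statement built this way could be passed together with the rows to `cursor.executemany(insert_sql, rows)`, which keeps the values out of the SQL string itself.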
This is an adapted version of the following README.