A tool to automatically infer column data types in .csv files
Check the article here: Building a Schema Inference Data Pipeline for Large CSV files
pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9
from csv_schema_inference import csv_schema_inference
# If the inferred data type is INTEGER but FLOAT values were also found in the column, the final type will be FLOAT
conditions = {"INTEGER":"FLOAT"}
csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size = 200000, acc = 0.8, seed=2, header=True, sep=",", conditions = conditions)
pathfile = "/content/file__500k.csv"
aprox_schema = csv_infer.run_inference(pathfile)
csv_infer.pretty(aprox_schema)
0
    name
        id
    type
        INTEGER
    nullable
        False
1
    name
        full_name
    type
        STRING
    nullable
        True
2
    name
        age
    type
        INTEGER
    nullable
        False
3
    name
        city
    type
        STRING
    nullable
        True
4
    name
        weight
    type
        FLOAT
    nullable
        False
5
    name
        height
    type
        FLOAT
    nullable
        False
6
    name
        isActive
    type
        BOOLEAN
    nullable
        False
7
    name
        col_int1
    type
        INTEGER
    nullable
        False
8
    name
        col_int2
    type
        INTEGER
    nullable
        False
9
    name
        col_int3
    type
        INTEGER
    nullable
        False
10
    name
        col_float1
    type
        FLOAT
    nullable
        False
11
    name
        col_float2
    type
        FLOAT
    nullable
        False
12
    name
        col_float3
    type
        FLOAT
    nullable
        False
13
    name
        col_float4
    type
        FLOAT
    nullable
        False
14
    name
        col_float5
    type
        FLOAT
    nullable
        False
15
    name
        col_float6
    type
        FLOAT
    nullable
        False
16
    name
        col_float7
    type
        FLOAT
    nullable
        False
17
    name
        col_float8
    type
        FLOAT
    nullable
        False
18
    name
        col_float9
    type
        FLOAT
    nullable
        False
19
    name
        col_float10
    type
        FLOAT
    nullable
        False
20
    name
        test_column
    type
        FLOAT
    nullable
        False
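To give an intuition for what a schema like the one above is built from, here is a minimal sketch of per-value type detection and per-column counting. It is not the library's implementation: `infer_value_type` and `summarize_column` are hypothetical helpers, the regular expressions are an assumption, and only the four types that appear in the output above are handled.

```python
import re
from collections import Counter

# Hypothetical patterns; the real library may recognize more formats.
_INT_RE = re.compile(r"[+-]?[0-9]+")
_FLOAT_RE = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")


def infer_value_type(value):
    """Return a type label for one CSV cell, or None for an empty cell."""
    text = value.strip()
    if text == "":
        return None
    if text.lower() in ("true", "false"):
        return "BOOLEAN"
    if _INT_RE.fullmatch(text):
        return "INTEGER"
    if _FLOAT_RE.fullmatch(text):
        return "FLOAT"
    return "STRING"


def summarize_column(values):
    """Count the types seen in a column and flag it nullable if any cell is empty."""
    counts = Counter()
    nullable = False
    for value in values:
        label = infer_value_type(value)
        if label is None:
            nullable = True
        else:
            counts[label] += 1
    return {"types_found": dict(counts), "nullable": nullable}
```

For example, `summarize_column(["1", "2.5", "", "3"])` counts two INTEGER values and one FLOAT value, and marks the column as nullable because of the empty cell.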
result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)
20
    _name
        test_column
    types_found
        INTEGER
            cnt
                406130
        FLOAT
            cnt
                50964
    nullable
        False
    type
        FLOAT
result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)
20
    name
        test_column
    types_found
        INTEGER
            88.85043339006856
        FLOAT
            11.149566609931437
    nullable
        False
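The percentages for `test_column` follow directly from the counts reported by `get_schema_columns`, and the final FLOAT type follows from the `conditions` rule. The sketch below reproduces that arithmetic under stated assumptions: `type_percentages` and `resolve_type` are hypothetical helpers, not part of the library, and the way the `acc` threshold enters the decision is not described in this README, so it is left out.

```python
def type_percentages(counts):
    """Convert per-type counts into percentages of all typed values."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def resolve_type(counts, conditions):
    """Pick the most frequent type, then apply a promotion rule if it matches.

    Assumption: a rule {"INTEGER": "FLOAT"} promotes INTEGER to FLOAT
    only when at least one FLOAT value was actually found in the column.
    `counts` must not be empty.
    """
    dominant = max(counts, key=counts.get)
    promoted = conditions.get(dominant)
    if promoted is not None and counts.get(promoted, 0) > 0:
        return promoted
    return dominant


counts = {"INTEGER": 406130, "FLOAT": 50964}  # counts shown above for test_column
shares = type_percentages(counts)
final_type = resolve_type(counts, {"INTEGER": "FLOAT"})
```

With these counts, INTEGER accounts for roughly 88.85% of the values and FLOAT for roughly 11.15%, yet `final_type` is `"FLOAT"` because the rule promotes INTEGER whenever FLOAT is present.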
The tests were run on 9 .csv files, each with 21 columns but with different sizes and record counts. For each process (shuffling and inference), the reported time is the average of 5 executions.
- file__20m.csv: 20 million records
- file__15m.csv: 15 million records
- file__12m.csv: 12 million records
- file__10m.csv: 10 million records
- And so on...
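The averaging procedure described above can be sketched as follows. This is only an illustration of the measurement approach, not the benchmark script that was actually used; `average_runtime` is a hypothetical helper.

```python
import time


def average_runtime(func, runs=5):
    """Call func `runs` times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

It could be applied once to the shuffling step and once to the inference step, for instance `average_runtime(lambda: csv_infer.run_inference(pathfile))`.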
To learn more about the shuffling process, see this other repository: A tool to automatically Shuffle lines in .csv files. Shuffling helps us to:
- Increase the probability of finding all the data types present in a single column.
- Avoid iterating over the entire dataset.
- Avoid being misled by biases in the data that may be part of its organic behavior, since we do not know how the data was constructed.
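The idea behind the points above can be illustrated with a small in-memory sketch: shuffle the data lines with a fixed seed, keep the header in place, and read only a portion of the result. This is an assumption about the general approach, not the code of the shuffling repository, which has to handle files too large to hold in memory; `shuffle_and_sample` is a hypothetical helper.

```python
import random


def shuffle_and_sample(lines, portion, seed, header=True):
    """Shuffle data lines reproducibly and keep only the first `portion` of them."""
    head = list(lines[:1]) if header else []
    body = list(lines[1:]) if header else list(lines)
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    rng.shuffle(body)
    keep = int(len(body) * portion)
    return head + body[:keep]
```

Because the sample is drawn after shuffling, rows that happen to be sorted or grouped in the original file are spread evenly, so a rare data type in a column is more likely to appear in the portion that is inspected.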
Any ideas or feedback about this repository? Help me improve it.
- Created by Ramses Alexander Coraspe Valdez
- Created in 2022
This project is licensed under the terms of the MIT License.