Csv Schema Inference

A tool to automatically infer column data types in .csv files.

Check the article here: Building a Schema Inference Data Pipeline for Large CSV files

Installing csv-schema-inference 🔧

pip install csv-schema-inference

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
  Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9

Importing csv-schema-inference library ⚡

from csv_schema_inference import csv_schema_inference

Setting csv-schema-inference configuration ✍

# If the inferred data type is INTEGER and FLOAT values are also present in the results, the final type will be FLOAT
conditions = {"INTEGER": "FLOAT"}

csv_infer = csv_schema_inference.CsvSchemaInference(
    portion=0.9, max_length=100, batch_size=200000, acc=0.8,
    seed=2, header=True, sep=",", conditions=conditions
)
pathfile = "/content/file__500k.csv"
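
To make the effect of the conditions rule more concrete, here is a minimal sketch of how such a promotion could work. The helper apply_conditions is hypothetical and is not taken from the library's source; it only assumes that the most frequent type is chosen first and then replaced when the configured wider type was also observed. The counts are the ones reported for test_column later in this README.

```python
def apply_conditions(type_counts, conditions):
    """Hypothetical helper (not the library's code): pick the most frequent
    type, then promote it if the configured wider type was also observed."""
    dominant = max(type_counts, key=type_counts.get)
    promoted = conditions.get(dominant)
    if promoted is not None and type_counts.get(promoted, 0) > 0:
        return promoted
    return dominant


conditions = {"INTEGER": "FLOAT"}
mixed = {"INTEGER": 406130, "FLOAT": 50964}  # counts reported for test_column
only_ints = {"INTEGER": 1000}                # illustrative counts

print(apply_conditions(mixed, conditions))      # FLOAT
print(apply_conditions(only_ints, conditions))  # INTEGER
```

Under this reading, a column that is mostly INTEGER but contains some FLOAT values is reported as FLOAT, which is consistent with the result shown for test_column.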

Run inference 🏃

aprox_schema = csv_infer.run_inference(pathfile)

Showing the approximate data type inference for each column 🔍

csv_infer.pretty(aprox_schema)

0
	name
		id
	type
		INTEGER
	nullable
		False
1
	name
		full_name
	type
		STRING
	nullable
		True
2
	name
		age
	type
		INTEGER
	nullable
		False
3
	name
		city
	type
		STRING
	nullable
		True
4
	name
		weight
	type
		FLOAT
	nullable
		False
5
	name
		height
	type
		FLOAT
	nullable
		False
6
	name
		isActive
	type
		BOOLEAN
	nullable
		False
7
	name
		col_int1
	type
		INTEGER
	nullable
		False
8
	name
		col_int2
	type
		INTEGER
	nullable
		False
9
	name
		col_int3
	type
		INTEGER
	nullable
		False
10
	name
		col_float1
	type
		FLOAT
	nullable
		False
11
	name
		col_float2
	type
		FLOAT
	nullable
		False
12
	name
		col_float3
	type
		FLOAT
	nullable
		False
13
	name
		col_float4
	type
		FLOAT
	nullable
		False
14
	name
		col_float5
	type
		FLOAT
	nullable
		False
15
	name
		col_float6
	type
		FLOAT
	nullable
		False
16
	name
		col_float7
	type
		FLOAT
	nullable
		False
17
	name
		col_float8
	type
		FLOAT
	nullable
		False
18
	name
		col_float9
	type
		FLOAT
	nullable
		False
19
	name
		col_float10
	type
		FLOAT
	nullable
		False
20
	name
		test_column
	type
		FLOAT
	nullable
		False
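
One possible use of the inferred schema is to build a dtype mapping for loading the file afterwards. The sketch below assumes the schema is a dictionary keyed by column position, with name, type and nullable entries as the printed output suggests; that shape is an assumption, and the helper to_pandas_dtypes and both lookup tables are hypothetical.

```python
# Hypothetical lookup tables from the type labels printed above to pandas dtype strings.
PANDAS_DTYPES = {"INTEGER": "int64", "FLOAT": "float64", "BOOLEAN": "bool", "STRING": "string"}
# pandas extension dtypes that can hold missing values
NULLABLE_DTYPES = {"INTEGER": "Int64", "FLOAT": "Float64", "BOOLEAN": "boolean", "STRING": "string"}


def to_pandas_dtypes(schema):
    """Map each column name to a pandas dtype string, in column order."""
    dtypes = {}
    for _, col in sorted(schema.items(), key=lambda kv: int(kv[0])):
        table = NULLABLE_DTYPES if col["nullable"] else PANDAS_DTYPES
        # Fall back to "object" for any label not listed in the tables
        dtypes[col["name"]] = table.get(col["type"], "object")
    return dtypes


# Assumed shape of the schema, copied from a few of the columns printed above
sample_schema = {
    "0": {"name": "id", "type": "INTEGER", "nullable": False},
    "1": {"name": "full_name", "type": "STRING", "nullable": True},
    "4": {"name": "weight", "type": "FLOAT", "nullable": False},
    "6": {"name": "isActive", "type": "BOOLEAN", "nullable": False},
}

dtypes = to_pandas_dtypes(sample_schema)
print(dtypes)
```

A mapping like this could then be passed as the dtype argument of pandas.read_csv, although whether every column parses cleanly depends on the actual file contents.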

Checking schema values for specific columns ✔

result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)

20
	_name
		test_column
	types_found
		INTEGER
			cnt
				406130
		FLOAT
			cnt
				50964
	nullable
		False
	type
		FLOAT

Explore all possible data types for a specific column ✅

result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)

20
	name
		test_column
	types_found
		INTEGER
			88.85043339006856
		FLOAT
			11.149566609931437
	nullable
		False
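
The percentages above agree with the counts returned by get_schema_columns for the same column: each value is the type's count divided by the total number of values inspected. A short sketch of that arithmetic (the helper type_percentages is hypothetical, not part of the library):

```python
def type_percentages(type_counts):
    """Convert per-type counts into percentages of the total."""
    total = sum(type_counts.values())
    if total == 0:
        return {}
    return {name: 100 * cnt / total for name, cnt in type_counts.items()}


counts = {"INTEGER": 406130, "FLOAT": 50964}  # counts reported for test_column
percentages = type_percentages(counts)

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
# INTEGER: 88.85%
# FLOAT: 11.15%
```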

Benchmark

The tests were run on 9 .csv files with 21 columns each and varying sizes and record counts. For each process (shuffle time and inference time), the average of 5 executions was calculated.

file__20m.csv: 20 million records
file__15m.csv: 15 million records
file__12m.csv: 12 million records
file__10m.csv: 10 million records
And so on...

If you want to know more about the shuffling process, you can check this other repository: A tool to automatically Shuffle lines in .csv files. The shuffling process helps us to:

Increase the probability of finding all the data types present in a single column.
Avoid iterating over the entire dataset.
Avoid biases caused by the order of the data, which may be part of its organic behavior and which we cannot anticipate without knowing how the file was constructed.
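
The benefits listed above can be illustrated with a small in-memory sketch. This is not the algorithm used by the shuffling repository, which targets large files; the function sample_rows is hypothetical and only borrows the names portion, seed and header from the configuration shown earlier.

```python
import random


def sample_rows(lines, portion, seed, header=True):
    """Return the header (if any) plus a seeded random sample of the data rows."""
    head = list(lines[:1]) if header else []
    rows = list(lines[1:]) if header else list(lines)
    rng = random.Random(seed)          # fixed seed makes the sample repeatable
    k = int(len(rows) * portion)       # number of data rows to keep
    return head + rng.sample(rows, k)


# A file sorted so that the rare FLOAT values only appear at the very end
lines = ["value"] + [str(i) for i in range(95)] + [f"{i}.5" for i in range(5)]

first_half = lines[1:51]               # reading the first 50 rows in file order
sampled = sample_rows(lines, portion=0.5, seed=2)

print("floats seen in first half:", any("." in r for r in first_half))
print("floats seen in random sample:", any("." in r for r in sampled[1:]))
```

Reading the first half in file order never meets a FLOAT value, while a random sample of the same size is likely, though not guaranteed, to include at least one.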

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

Created by Ramses Alexander Coraspe Valdez
Created in 2022

License

This project is licensed under the terms of the MIT License.
