A TA1 primitive for the D3M project. Currently it generates data profiles for tabular data. We use the pandas DataFrame as our main data type.
See setup.py.
Install the d3m project first.
Then install with pip:
pip install dsbox-dataprofiling
See example.py for a usage example.
You can choose the metafeatures based on what you need. The computable metafeatures are:
computable_metafeatures = ['ratio_of_values_containing_numeric_char', 'ratio_of_numeric_values',
'number_of_outlier_numeric_values', 'num_filename', 'number_of_tokens_containing_numeric_char',
'number_of_numeric_values_equal_-1', 'most_common_numeric_tokens', 'most_common_tokens',
'ratio_of_distinct_tokens', 'number_of_missing_values',
'number_of_distinct_tokens_split_by_punctuation', 'number_of_distinct_tokens',
'ratio_of_missing_values', 'semantic_types', 'number_of_numeric_values_equal_0',
'number_of_positive_numeric_values', 'most_common_alphanumeric_tokens',
'numeric_char_density', 'ratio_of_distinct_values', 'number_of_negative_numeric_values',
'target_values', 'ratio_of_tokens_split_by_punctuation_containing_numeric_char',
'ratio_of_values_with_leading_spaces', 'number_of_values_with_trailing_spaces',
'ratio_of_values_with_trailing_spaces', 'number_of_numeric_values_equal_1',
'natural_language_of_feature', 'most_common_punctuations', 'spearman_correlation_of_features',
'number_of_values_with_leading_spaces', 'ratio_of_tokens_containing_numeric_char',
'number_of_tokens_split_by_punctuation_containing_numeric_char', 'number_of_numeric_values',
'ratio_of_distinct_tokens_split_by_punctuation', 'number_of_values_containing_numeric_char',
'most_common_tokens_split_by_punctuation', 'number_of_distinct_values',
'pearson_correlation_of_features']
For the specific meaning and data structure of each metafeature, see this page: data_metafeatures
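To give a feel for what a few of these metafeatures measure, here is a minimal sketch in plain pandas. It is not the dsbox-dataprofiling API or its implementation: the helpers `sketch_column_profile` and `sketch_profile` are hypothetical names invented for illustration, and the exact definitions (for example, that missing values are excluded when counting distinct values) are assumptions. For real usage, see example.py.

```python
import pandas as pd


def sketch_column_profile(column: pd.Series, wanted: list) -> dict:
    """Hypothetical helper: compute a chosen subset of metafeatures for one column."""
    total = len(column)
    non_missing = column.dropna()
    # Values that cannot be parsed as numbers become NaN and are dropped.
    numeric = pd.to_numeric(non_missing, errors="coerce").dropna()

    # Only a handful of names from the list above are sketched here.
    calculators = {
        "number_of_missing_values": lambda: int(column.isna().sum()),
        "ratio_of_missing_values": lambda: float(column.isna().sum()) / total if total else 0.0,
        # Assumption: missing values do not count as a distinct value.
        "number_of_distinct_values": lambda: int(non_missing.nunique()),
        "number_of_numeric_values": lambda: int(len(numeric)),
        "number_of_numeric_values_equal_0": lambda: int((numeric == 0).sum()),
    }

    unknown = [name for name in wanted if name not in calculators]
    if unknown:
        raise ValueError("metafeatures not covered by this sketch: %s" % unknown)
    return {name: calculators[name]() for name in wanted}


def sketch_profile(df: pd.DataFrame, wanted: list) -> dict:
    """Hypothetical helper: profile every column of a DataFrame."""
    return {col: sketch_column_profile(df[col], wanted) for col in df.columns}


# Pick only the metafeatures you need.
wanted = [
    "number_of_missing_values",
    "ratio_of_missing_values",
    "number_of_distinct_values",
]
df = pd.DataFrame({
    "age": [23, 0, None, 41],
    "city": ["LA", "NYC", "LA", None],
})
profile = sketch_profile(df, wanted)
print(profile)
```

The result is a dictionary keyed by column name, with one entry per requested metafeature. The structures produced by the actual primitive may differ; the data_metafeatures page is the reference for those.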