understanding the data

Tansu Dasli edited this page Sep 18, 2023 · 19 revisions

understanding the data simply means checking for

null, missing, duplicates, wrong values, and types (numeric & categorical),
outliers,
relations,

relations -> statistical tests

                        categorical         |  continuous
         categorical    chi-square          |  T-test or ANOVA
         continuous     logistic-regression |  correlation
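
A minimal sketch of how the table above might be applied, assuming SciPy is available. The DataFrame and its column names (`group`, `segment`, `churn`, `x`, `y`) are synthetic and purely illustrative. Logistic regression is left out here because it needs a modelling library.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (hypothetical columns, for illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=n),            # categorical, 2 levels
    "segment": rng.choice(["s1", "s2", "s3"], size=n),  # categorical, 3 levels
    "churn": rng.choice(["yes", "no"], size=n),         # categorical
    "x": rng.normal(size=n),                            # continuous
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=n)  # continuous, tied to x

# categorical vs categorical -> chi-square on the contingency table
table = pd.crosstab(df["group"], df["churn"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# categorical (2 groups) vs continuous -> two-sample t-test
a = df.loc[df["group"] == "a", "x"]
b = df.loc[df["group"] == "b", "x"]
t_stat, p_t = stats.ttest_ind(a, b)

# categorical (3+ groups) vs continuous -> one-way ANOVA
groups = [g["x"] for _, g in df.groupby("segment")]
f_stat, p_f = stats.f_oneway(*groups)

# continuous vs continuous -> Pearson correlation
r, p_r = stats.pearsonr(df["x"], df["y"])

print(f"chi-square p={p_chi2:.3f}, t-test p={p_t:.3f}, "
      f"ANOVA p={p_f:.3f}, pearson r={r:.3f}")
```

A small p-value suggests the two variables are related. Each test has its own assumptions (expected cell counts, normality, equal variances), which is why the column types need to be right first.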

bias vs variance -> in data (histogram, skewness = Δ(median, mean)); in model (under-fit, over-fit)
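
The skewness check in the data can be sketched roughly as follows. The sample is synthetic (a right-skewed exponential draw) and the names `s`, `gap` and `skew` are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (illustrative only)
rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=1.0, size=1000), name="x")

# Quick check: mean above median hints at a right (positive) skew
gap = s.mean() - s.median()

# pandas also provides a sample skewness estimate
skew = s.skew()

print(f"mean - median = {gap:.3f}, skew = {skew:.3f}")
```

Values near zero for both suggest a roughly symmetric distribution. A histogram (`s.hist()`) shows the same thing visually.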

outliers -> μ ± 3 * σ, if the histogram is normally distributed (check skewness = Δ(median, mean))

                interquartile method => `[q1 - 1.5 * IQR, q3 + 1.5 * IQR]`, where `IQR = q3 - q1`,
                anomaly detection
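
A rough sketch of the two rule-based outlier checks above, on a synthetic series with one injected extreme value. The data and the names `sigma_mask` and `iqr_mask` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: 500 roughly normal values plus one injected outlier (120.0)
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=500), 120.0)
s = pd.Series(values, name="x")

# 3-sigma rule: reasonable only when the data looks normally distributed
mu, sigma = s.mean(), s.std()
sigma_mask = (s < mu - 3 * sigma) | (s > mu + 3 * sigma)

# Interquartile rule: flag values outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print("3-sigma outliers:", int(sigma_mask.sum()))
print("IQR outliers:", int(iqr_mask.sum()))
```

The IQR rule does not depend on the mean and standard deviation, so it tends to be less affected by the outliers it is trying to find.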

types -> df.info() (critical for choosing statistical tests!)
null -> .isna().sum()
category -> .describe(include='all')
wrong data ->
imbalanced -> df.y.value_counts() (only in classification problems)
statistics -> .describe(include="all")
scatter matrix -> pd.plotting.scatter_matrix(df)
heatmap -> sns.heatmap(df.corr(numeric_only=True))
correlations -> df.corr(numeric_only=True)
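
The non-plotting checks above could be run together roughly like this. The toy DataFrame and its columns (`age`, `income`, `city`, `y`) are hypothetical. `numeric_only=True` assumes a recent pandas (1.5 or later), where it keeps `corr()` from failing on text columns.

```python
import numpy as np
import pandas as pd

# Toy data with missing values and a binary target `y` (illustrative only)
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "income": [30_000, 42_000, 61_000, 58_000, np.nan, 45_000],
    "city": ["A", "B", "A", "C", "B", "A"],
    "y": [0, 0, 1, 0, 1, 0],
})

df.info()                                 # types and non-null counts
null_counts = df.isna().sum()             # missing values per column
summary = df.describe(include="all")      # numeric and categorical summary
class_counts = df["y"].value_counts()     # class balance of the target
corr = df.corr(numeric_only=True)         # pairwise correlations, numeric columns

print(null_counts)
print(class_counts)
print(corr.round(2))

# Plots need matplotlib / seaborn installed:
# pd.plotting.scatter_matrix(df)
# sns.heatmap(corr)
```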