understanding the data

Tansu Dasli edited this page Sep 18, 2023 · 19 revisions

Understanding the data simply means checking for:

  1. null, missing, duplicates, wrong values, and types (numeric & categorical),
  2. outliers,
  3. relations,
  • relations -> statistical tests
                        categorical         |  continuous
         categorical    chi-square          |  t-test or ANOVA
         continuous     logistic regression |  correlation
  • bias vs variance -> in data: histogram, skewness = Δ(median, mean); in model: under-fit (high bias), over-fit (high variance)
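
The relations table above pairs variable types with a test. A minimal sketch of three of its cells, assuming `scipy` is installed; the column names (`group`, `flag`, `x`, `y`) and the random data are made up for illustration. The logistic-regression cell is left out because it needs a modelling library.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),  # categorical
    "flag": rng.choice(["yes", "no"], size=n),     # categorical
    "x": rng.normal(size=n),                       # continuous
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=n)  # continuous, depends on x

# categorical vs categorical: chi-square test of independence
table = pd.crosstab(df["group"], df["flag"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# categorical vs continuous: one-way ANOVA (a t-test covers the two-group case)
samples = [g["x"].to_numpy() for _, g in df.groupby("group")]
f_stat, p_anova = stats.f_oneway(*samples)

# continuous vs continuous: Pearson correlation
r, p_corr = stats.pearsonr(df["x"], df["y"])

print(f"chi-square p={p_chi2:.3f}, ANOVA p={p_anova:.3f}, pearson r={r:.3f}")
```

A small p-value suggests the two variables are related; which threshold to use (0.05 is common) is a choice, not something the data decides.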

  • outliers -> if the histogram is roughly normal (skewness = Δ(median, mean) ≈ 0): values outside μ ± 3 * σ

                    interquartile range (IQR) method => values outside `[q1 - 1.5 * IQR, q3 + 1.5 * IQR]`, where `IQR = q3 - q1`,
                    anomaly detection
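
The two outlier rules above can be sketched as follows. The sample values are made up, and the 1.5 and 3 multipliers are the conventional defaults rather than fixed requirements.

```python
import pandas as pd

# Toy sample with one obvious outlier (95)
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95], name="value")

# Skewness hint: a mean far from the median suggests a skewed distribution
mean_median_gap = s.mean() - s.median()

# 3-sigma rule (reasonable only if the histogram looks roughly normal)
mu, sigma = s.mean(), s.std()
sigma_outliers = s[(s < mu - 3 * sigma) | (s > mu + 3 * sigma)]

# IQR rule: flag values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

print("mean - median:", round(mean_median_gap, 2))
print("IQR bounds:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
```

An extreme value inflates σ itself, so on small samples the 3-sigma rule tends to be less sensitive than the IQR rule.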
    
  • types -> df.info() (critical for choosing statistical tests!)

  • null -> df.isna().sum()

  • category -> .describe(include='all')
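
A minimal sketch of the type, null, and category checks above on a small made-up frame. The columns `age`, `city`, and `income` are hypothetical; `df.duplicated()` is added here to cover the duplicates item from the numbered list.

```python
import numpy as np
import pandas as pd

# Made-up data with a missing value per column and one duplicated row
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Ankara", "Izmir", "Izmir", None, "Izmir"],
    "income": ["1000", "1500", "1200", "900", "1500"],  # numbers stored as text
})

df.info()                             # dtypes and non-null counts per column
null_counts = df.isna().sum()         # missing values per column
dup_count = df.duplicated().sum()     # rows that repeat an earlier row exactly
summary = df.describe(include="all")  # numeric stats plus unique/top/freq

# A numeric column read as text must be converted before statistical tests
df["income"] = pd.to_numeric(df["income"], errors="coerce")

print(null_counts)
print("duplicated rows:", dup_count)
```

With `errors="coerce"`, values that cannot be parsed become `NaN`, which is also a quick way to surface wrong entries in a supposedly numeric column.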

  • wrong data ->

  • imbalanced -> df.y.value_counts() (only in classification problem)
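
A short sketch of the imbalance check, assuming the target column is named `y` as in the note above; the 90/10 split is invented for illustration.

```python
import pandas as pd

# Made-up binary target: 90 negatives, 10 positives
df = pd.DataFrame({"y": [0] * 90 + [1] * 10})

counts = df["y"].value_counts()                # absolute class counts
shares = df["y"].value_counts(normalize=True)  # class proportions
imbalance_ratio = counts.max() / counts.min()  # majority / minority

print(shares)
print("imbalance ratio:", imbalance_ratio)
```

`normalize=True` is often easier to read than raw counts, since the proportions do not depend on the size of the dataset.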

  • statistics -> df.describe(include="all")

  • scatter matrix -> pd.plotting.scatter_matrix(df)

  • heatmap -> sns.heatmap(df.corr())

  • correlations -> df.corr()
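
The last three items can be combined into one sketch. The data is synthetic, and `plot_overview` is a hypothetical helper (not part of pandas or seaborn) that assumes `matplotlib` and `seaborn` are installed. Note that with pandas 2.x, `df.corr()` raises on non-numeric columns unless `numeric_only=True` is passed.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
x = rng.normal(size=300)
df = pd.DataFrame({
    "x": x,
    "y": 3 * x + rng.normal(scale=0.5, size=300),  # strongly related to x
    "z": rng.normal(size=300),                      # unrelated noise
    "label": rng.choice(["a", "b"], size=300),      # non-numeric column
})

# numeric_only=True skips the text column instead of raising an error
corr = df.corr(numeric_only=True)
print(corr.round(2))


def plot_overview(frame):
    """Hypothetical helper: scatter matrix plus correlation heatmap."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    numeric = frame.select_dtypes(include="number")
    pd.plotting.scatter_matrix(numeric, figsize=(6, 6))
    plt.figure()
    sns.heatmap(numeric.corr(), annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    plt.show()
```

Calling `plot_overview(df)` opens both figures. Fixing `vmin=-1, vmax=1` keeps the colour scale comparable across datasets.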