understanding the data

Tansu Dasli edited this page Sep 19, 2023 · 19 revisions

understanding the data is about

  1. null, missing, duplicates, wrong values (-99,..) & types (numeric/categorical)
  2. outliers
  3. relations
                        categorical         |  continuous
         categorical    chi-square          |  T-test or ANOVA
         continuous     logistic-regression |  correlation
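
As an illustration of the categorical vs categorical cell, here is a minimal sketch of the chi-square statistic computed by hand from a contingency table. The DataFrame and its column names (`gender`, `churn`) are made up for the example and are not part of these notes. In practice `scipy.stats.chi2_contingency` returns the same kind of statistic together with a p-value (note that it applies a continuity correction to 2x2 tables by default).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two categorical columns (names are assumptions).
df = pd.DataFrame({
    "gender": ["f", "f", "f", "f", "m", "m", "m", "m"],
    "churn":  ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
})

# Observed counts for every (gender, churn) combination.
observed = pd.crosstab(df["gender"], df["churn"])
obs = observed.to_numpy()

# Expected counts if the two columns were independent:
# (row total * column total) / grand total.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Chi-square statistic (no continuity correction) and degrees of freedom.
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(observed)
print("chi2 =", chi2, "dof =", dof)
```

A larger `chi2` for the same `dof` means the observed counts are further from what independence would predict.
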
  • bias vs variance -> in data (histogram; skewness shows up as a gap between mean and median)

  • outliers -> skewness shows up as a gap between mean and median; values outside μ ± 3 * σ, if the histogram is roughly normally distributed

                    interquartile (IQR) method => values outside `[q1 - 1.5 * IQR, q3 + 1.5 * IQR]`, where `IQR = q3 - q1`,
                    anomaly detection
    
  • types -> df.info()

  • null -> df.isna().sum()

  • category -> df.describe(include='all')

  • median -> median == 50th percentile (the 50% row of df.describe())

  • wrong data ->

  • imbalanced -> df.fieldName.value_counts()

  • statistics -> df.describe(include="all")

  • scatter matrix -> pd.plotting.scatter_matrix(df)

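  • heat map -> sns.heatmap(df.corr(numeric_only=True))

  • correlations -> df.corr(numeric_only=True)  (pandas >= 2.0 raises on non-numeric columns without numeric_only=True)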
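
The wrong-value and outlier checks from the list can be sketched roughly as follows. The `age` column and its values are invented for the example, and treating `-99` as a placeholder for "unknown" is an assumption about the data, not a general rule.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one sentinel (-99) and one extreme value (120).
df = pd.DataFrame({"age": [23, 25, 27, 29, 31, 33, 35, -99, 120, 27]})

# Wrong values: assume -99 encodes "unknown" and turn it into a real null.
age = df["age"].replace(-99, np.nan)
print("nulls:", age.isna().sum())

# IQR method: flag values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR].
# quantile() skips NaN by default.
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = age[(age < lower) | (age > upper)]
print("IQR outliers:", iqr_outliers.tolist())

# 3-sigma rule: flag values outside mean ± 3 * std.
# Only sensible for roughly normal data; on a sample this small the
# extreme value inflates std, so the rule may flag nothing at all.
mu, sigma = age.mean(), age.std()
sigma_outliers = age[(age < mu - 3 * sigma) | (age > mu + 3 * sigma)]
print("3-sigma outliers:", sigma_outliers.tolist())
```

The IQR fences are built from quartiles, so a single extreme value barely moves them; the mean and standard deviation used by the 3-sigma rule are pulled toward the outlier, which is one reason to check the histogram first.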