
understanding the data


understanding the data is about

  1. null, missing, duplicates, wrong values (-99,..) & types (numeric/categorical)
  2. outliers
  3. relations
  • statistical tests -> chi-square, t-test, ANOVA, logistic regression, correlation (see the tests sketch after this list)
  • bias vs variance -> in the data: histogram, skewness (gap between mean and median)
  • outliers
    • identify -> histogram, skewness (gap between mean and median)
    • handle -> IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] with IQR = Q3 - Q1; μ ± 3 * σ if normally distributed; or anomaly detection (see the outlier sketch after this list)
  • types -> df.info()
  • null -> df.isna().sum()
  • category -> df.describe(include='all')
  • median -> the median is the 50th percentile
  • wrong data -> df.groupby(['field-name', ...]).agg(...)
  • imbalanced -> df['field-name'].value_counts()
  • statistics -> df.describe(include="all") (see the quick-checks sketch after this list)
  • scatter matrix -> pd.plotting.scatter_matrix(df)
  • correlation -> sns.heatmap(df.corr())
  • bird's-eye look -> df.plot()
  • plotting -> fig, ax = plt.subplots(nrows=..., ncols=..., figsize=(..., ...)), then pass an axis to each plot, e.g. sns.heatmap(..., ax=ax[0]) (see the plotting sketch after this list)
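
A minimal sketch of the relation tests listed above, using scipy.stats on a toy frame. It only covers chi-square, t-test and correlation; the column names (gender, churn, income) are hypothetical examples, not from these notes.

```python
# a minimal sketch of the relation tests above; column names are hypothetical
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'churn':  [1, 0, 1, 0, 0, 0, 1, 1],
    'income': [42, 55, 38, 61, 47, 58, 40, 52],
})

# categorical vs categorical -> chi-square test of independence
contingency = pd.crosstab(df['gender'], df['churn'])
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)

# numeric vs binary target -> two-sample t-test
t_stat, p_t = stats.ttest_ind(df.loc[df['churn'] == 0, 'income'],
                              df.loc[df['churn'] == 1, 'income'])

# numeric vs numeric (here numeric vs 0/1 target) -> Pearson correlation
r, p_r = stats.pearsonr(df['income'], df['churn'])

print(f"chi-square p={p_chi:.3f}, t-test p={p_t:.3f}, correlation r={r:.2f}")
```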
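A minimal sketch of the two outlier rules above (IQR and μ ± 3σ) on a single numeric column; the column name amount and its values are hypothetical.

```python
# a minimal sketch of the IQR rule and the mu ± 3*sigma rule
import pandas as pd

df = pd.DataFrame({'amount': [10, 12, 11, 13, 12, 11, 95, 10, 12, 11]})

# IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = df[(df['amount'] < lower) | (df['amount'] > upper)]

# mu ± 3*sigma rule: only meaningful if the column is roughly normal
mu, sigma = df['amount'].mean(), df['amount'].std()
z_outliers = df[(df['amount'] - mu).abs() > 3 * sigma]

print(iqr_outliers)
print(z_outliers)
```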
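A minimal sketch putting the one-liner checks above (types, nulls, duplicates, statistics, imbalance, wrong values per group) together on a toy frame; the columns city and price and the -99 sentinel are hypothetical.

```python
# a minimal sketch of the quick checks above; column names are hypothetical
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city':  ['A', 'B', 'A', 'A', None, 'B'],
    'price': [100, -99, 120, 100, 110, np.nan],   # -99 stands in for a wrong/sentinel value
})

df.info()                              # types and non-null counts
print(df.isna().sum())                 # nulls per column
print(df.duplicated().sum())           # exact duplicate rows
print(df.describe(include='all'))      # statistics for numeric and categorical columns
print(df['city'].value_counts())       # class balance / imbalance
print(df.groupby(['city']).agg({'price': ['min', 'max', 'mean']}))  # spot wrong values per group
```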
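A minimal sketch of the plotting calls above; the 1x2 layout, figure size, and the toy columns x and y are example choices, not fixed.

```python
# a minimal sketch of scatter matrix, correlation heatmap and df.plot on shared subplots
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = df['x'] * 2 + rng.normal(size=100)

pd.plotting.scatter_matrix(df)                     # pairwise scatter plots

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
sns.heatmap(df.corr(), annot=True, ax=ax[0])       # correlation heatmap on the left
df.plot(ax=ax[1])                                  # bird's-eye line plot on the right
plt.show()
```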