understanding the data
Tansu Dasli edited this page Sep 18, 2023 · 19 revisions
Understanding the data simply means checking for:
- null, missing, duplicates, wrong values, and types (numeric & categorical),
- outliers,
- relations,
- relations -> statistical tests

| | categorical | continuous |
|---|---|---|
| categorical | chi-square | t-test or ANOVA |
| continuous | logistic regression | correlation |

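
As a rough illustration of the pairings above, here is a minimal sketch using `scipy.stats` on synthetic data. The DataFrame, its column names (`group`, `flag`, `x`, `y`) and the effect sizes are assumptions made up for this example, not part of the original notes. Logistic regression is left out because it needs a modelling library such as statsmodels or scikit-learn.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic, hypothetical data (not from the original page).
group = rng.choice(["a", "b"], size=n)
df = pd.DataFrame({
    "group": group,                                  # categorical
    "flag": np.where(group == "a",
                     rng.random(n) < 0.7,
                     rng.random(n) < 0.3),           # categorical, depends on group
    "x": rng.normal(size=n),                         # continuous
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=n)  # continuous, depends on x

# categorical vs categorical -> chi-square test of independence
table = pd.crosstab(df["group"], df["flag"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# categorical vs continuous -> t-test for two groups, ANOVA for two or more
a = df.loc[df["group"] == "a", "y"]
b = df.loc[df["group"] == "b", "y"]
t_stat, p_t = stats.ttest_ind(a, b, equal_var=False)
f_stat, p_f = stats.f_oneway(a, b)

# continuous vs continuous -> Pearson correlation
r, p_r = stats.pearsonr(df["x"], df["y"])

print(f"chi-square p={p_chi2:.3g}")
print(f"t-test p={p_t:.3g}, ANOVA p={p_f:.3g}")
print(f"pearson r={r:.2f}")
```

Each test assumes the column types are what they appear to be, so it is worth checking the dtypes before choosing one.
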
- bias vs variance -> in data (histogram, skewness ≈ Δ(median, mean)); in model (under-fit, over-fit)
- outliers -> skewness ≈ Δ(median, mean); `μ ± 3σ` if normally distributed (check the histogram); interquartile method => `[q1 - 1.5 * IQR, q3 + 1.5 * IQR]` with `IQR = q3 - q1`; anomaly detection
- types -> `df.info()`, critical for statistical tests!
- null -> `df.isna().sum()`
- category -> `df.describe(include='all')`
- wrong data ->
- imbalanced -> `df.y.value_counts()` (only in classification problems)
- statistics -> `df.describe(include="all")`
- scatter matrix -> `pd.plotting.scatter_matrix(df)`
- heatmap -> `sns.heatmap(df.corr(numeric_only=True))`
- correlations -> `df.corr(numeric_only=True)`
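
The two outlier rules from the list (the 3-sigma rule and the interquartile method) can be sketched roughly as below. The helper names `sigma_bounds`, `iqr_bounds` and `outlier_mask`, and the synthetic series with two injected extreme values, are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

def sigma_bounds(s: pd.Series, k: float = 3.0) -> tuple[float, float]:
    """Bounds mu ± k * sigma; only meaningful for roughly normal data."""
    mu, sigma = s.mean(), s.std()
    return mu - k * sigma, mu + k * sigma

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Bounds q1 - k * IQR and q3 + k * IQR, with IQR = q3 - q1."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_mask(s: pd.Series, bounds: tuple[float, float]) -> pd.Series:
    """Boolean mask that is True where a value falls outside the bounds."""
    low, high = bounds
    return (s < low) | (s > high)

# Synthetic data: 200 roughly normal values plus two injected extremes.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120.0, 140.0]])
s = pd.Series(values)

# A gap between mean and median, or a non-zero skew, hints at a long tail.
print("mean - median:", s.mean() - s.median())
print("skew:", s.skew())

sigma_outliers = s[outlier_mask(s, sigma_bounds(s))]
iqr_outliers = s[outlier_mask(s, iqr_bounds(s))]
print("3-sigma rule flags:", len(sigma_outliers))
print("IQR rule flags:", len(iqr_outliers))
```

Because the mean and standard deviation are themselves pulled by extreme values, the 3-sigma bounds widen when outliers are present; the quartile-based bounds are less affected by them.
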
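
The non-plotting checks in the list can be run together on a small frame, as in the sketch below. The toy DataFrame and its column names (`age`, `city`, `income`, `y`) are assumptions for illustration; `df.duplicated().sum()` is added here as one common way to count duplicate rows, which the notes mention but do not show.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32, 29],
    "city": ["A", "B", "B", None, "B", "A"],
    "income": [30.0, 45.5, 50.0, 61.2, 45.5, 38.0],
    "y": [0, 1, 0, 0, 1, 0],
})

df.info()                                # dtypes and non-null counts per column
null_counts = df.isna().sum()            # number of nulls per column
n_duplicates = df.duplicated().sum()     # number of fully duplicated rows
summary = df.describe(include="all")     # numeric and categorical summaries
class_counts = df["y"].value_counts()    # class balance (classification only)
corr = df.corr(numeric_only=True)        # Pearson correlations, numeric columns only

print(null_counts)
print("duplicates:", n_duplicates)
print(summary)
print(class_counts)
print(corr)
```

The scatter matrix and heatmap are omitted here because they additionally require matplotlib and seaborn.
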