understanding the data
Tansu Dasli edited this page Sep 19, 2023 · 19 revisions
understanding the data is about:
- nulls, missing values, duplicates, wrong values (sentinels such as -99) & types (numeric/categorical)
- outliers
- relations
- statistical tests for relations, by variable type:
| | categorical | continuous |
| --- | --- | --- |
| categorical | chi-square | t-test or ANOVA |
| continuous | logistic regression | correlation |
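The test choices listed above can be sketched with `scipy.stats`. This is a minimal illustration, not a recipe: the DataFrame, its column names (`group`, `churned`, `age`, `income`) and the random data are all hypothetical, and logistic regression is left out to keep the dependencies small.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data, generated only for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=n),       # categorical
    "churned": rng.choice(["yes", "no"], size=n),  # categorical
    "age": rng.normal(40, 10, size=n),             # continuous
})
# A second continuous column, deliberately tied to age.
df["income"] = 1000 + 50 * df["age"] + rng.normal(0, 200, size=n)

# categorical vs categorical: chi-square test on the contingency table
table = pd.crosstab(df["group"], df["churned"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# categorical vs continuous, two groups: independent two-sample t-test
age_a = df.loc[df["group"] == "a", "age"]
age_b = df.loc[df["group"] == "b", "age"]
t_stat, p_t = stats.ttest_ind(age_a, age_b)

# categorical vs continuous, any number of groups: one-way ANOVA
f_stat, p_anova = stats.f_oneway(age_a, age_b)

# continuous vs continuous: Pearson correlation
r, p_r = stats.pearsonr(df["age"], df["income"])

print(f"chi-square p={p_chi2:.3f}, t-test p={p_t:.3f}, "
      f"ANOVA p={p_anova:.3f}, Pearson r={r:.3f}")
```

With only two groups the t-test and ANOVA are the same test (F equals t squared), so ANOVA only adds something when the categorical variable has three or more levels.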
- bias vs variance -> in the data: histogram, skewness = Δ(median, mean)
- outliers -> skewness = Δ(median, mean); `μ ± 3σ` if normally distributed; histogram; interquartile method => values outside `[q1 - 1.5 * IQR, q3 + 1.5 * IQR]` with `IQR = q3 - q1`; anomaly detection
- types -> `df.info()`
- null -> `df.isna().sum()`
- category -> `df.describe(include='all')`
- median -> the median is the 50th percentile (the `50%` row of `df.describe()`)
- wrong data ->
- imbalanced -> `df.fieldName.value_counts()`
- statistics -> `df.describe(include="all")`
- scatter matrix -> `pd.plotting.scatter_matrix(df)`
- heat map -> `sns.heatmap(df.corr())`
- correlations -> `df.corr()`
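Several of the checks above (nulls, imbalance, skewness, outliers) can be tried together on a small example. The sketch below assumes a hypothetical DataFrame with an `age` and a `segment` column; the values, including the `-99` sentinel, are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; -99 stands in for a sentinel "wrong value".
df = pd.DataFrame({
    "age": [23, 25, 27, 29, 31, 33, 35, 120, -99, None],
    "segment": ["a", "a", "a", "a", "a", "a", "a", "b", "b", "a"],
})

null_counts = df.isna().sum()                # nulls per column
class_counts = df["segment"].value_counts()  # imbalance check

age = df["age"].dropna()
skew_gap = age.mean() - age.median()         # Δ(median, mean) as a rough skewness signal

# Interquartile method: flag values outside [q1 - 1.5*IQR, q3 + 1.5*IQR].
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = age[(age < lower) | (age > upper)]

# 3-sigma rule: flag values further than 3σ from the mean.
mu, sigma = age.mean(), age.std()
sigma_outliers = age[(age - mu).abs() > 3 * sigma]

print(null_counts, class_counts, sep="\n")
print("IQR bounds:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
print("3-sigma outliers:", sigma_outliers.tolist())
```

On this sample the interquartile method flags both `120` and `-99`, while the 3σ rule flags nothing, because the two extreme values inflate σ themselves. This is one reason the 3σ rule is best kept for data that is roughly normally distributed.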