understanding the data
Tansu Dasli edited this page Sep 22, 2023 · 19 revisions
understanding the data is about
- null, missing, duplicates, wrong values (-99,..) & types (numeric/categorical)
- outliers
- relations
- statistical tests for relations -> chi-square, t-test, ANOVA, logistic regression, correlation
- bias vs variance -> in data (histogram, skewness = Δ(median, mean))
- outliers
  - identify -> skewness = Δ(median, mean), histogram
  - handling -> IQR fences [q1 - 1.5 * IQR, q3 + 1.5 * IQR] with IQR = q3 - q1, or (μ ± 3 * σ) if normally distributed, or anomaly detection
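The two outlier rules above can be sketched in pandas as follows. This is a minimal illustration on a made-up series (the values and the variable names are assumptions, not part of the original notes). Note that the μ ± 3σ rule needs a reasonably large sample, because a single extreme value inflates σ itself.

```python
import pandas as pd

# Hypothetical sample: 20 values between 10 and 14, plus one obvious outlier (95).
s = pd.Series([10, 11, 12, 13, 14] * 4 + [95], name="value")

# Rough skewness signal: a mean well above the median suggests a right tail.
mean_median_gap = s.mean() - s.median()

# IQR rule: flag points outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# 3-sigma rule: flag points more than 3 standard deviations from the mean.
# Only meaningful if the data is roughly normally distributed.
mu, sigma = s.mean(), s.std()
sigma_outliers = s[(s - mu).abs() > 3 * sigma]

print(iqr_outliers.tolist(), sigma_outliers.tolist())
```

Whether flagged points are dropped, capped, or kept depends on the use case; the sketch only identifies them.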
- identify ->
  - types -> df.info()
  - null -> df.isna().sum()
  - category -> df.describe(include='all')
  - median -> the 50% row of df.describe() (the 50th percentile)
  - wrong data -> df.groupby(['field-name','...']).agg(aggregationFunctions)
  - imbalanced -> df['field-name'].value_counts()
  - statistics -> df.describe(include="all")
  - scatter matrix -> pd.plotting.scatter_matrix(df)
  - correlation -> sns.heatmap(df.corr())
  - bird's-eye view -> df.plot()
  - plotting -> fig, ax = plt.subplots(nrows=, ncols=, figsize=(,)), use as sns.heatmap(... , ax=ax[0])
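As a rough illustration of the non-plotting checks listed above, the sketch below runs them on a small made-up DataFrame. The column names, the values, and the use of -99 as a "wrong value" sentinel are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: -99 stands in for a wrong-value sentinel.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", None],
    "age": [25, -99, 31, 40, np.nan, 28],
    "income": [30.0, 32.0, 45.0, 60.0, 52.0, 35.0],
})

df.info()                                   # types: dtypes and non-null counts
null_counts = df.isna().sum()               # null: missing values per column
summary = df.describe(include="all")        # statistics; the 50% row is the median
class_counts = df["city"].value_counts()    # imbalanced: rows per category (NaN excluded)

# wrong data: group-wise min/max/count makes the -99 sentinel stand out.
per_city = df.groupby("city")["age"].agg(["min", "max", "count"])

# correlation: restricted to the numeric columns of this toy frame.
corr = df[["age", "income"]].corr()

print(null_counts, class_counts, per_city, corr, sep="\n\n")
```

The correlation matrix here is what would be passed to sns.heatmap in the list above; selecting the numeric columns explicitly avoids errors from the text column.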