understanding the data
Tansu Dasli edited this page Sep 20, 2023 · 19 revisions
understanding the data is about
- null, missing, duplicates, wrong values (-99,..) & types (numeric/categorical)
- outliers
- relations
- relations, statistical tests -> chi-square, t-test, ANOVA, logistic regression, correlation
- bias vs variance -> in data (histogram, `skewness = Δ(median, mean)`)
- outliers -> `skewness = Δ(median, mean)`, histogram, IQR fences `[q1 - 1.5 * IQR, q3 + 1.5 * IQR]` with `IQR = q3 - q1`, `μ ± 3 * σ`, or anomaly detection
- types -> `df.info()`
- null -> `df.isna().sum()`
- category -> `df.describe(include='all')`
- median -> the median is the 50th percentile (the `50%` row of `df.describe()`)
- wrong data -> `df.groupby(['field-name', '...']).agg(aggregationFunctions)`
- imbalanced -> `df['field-name'].value_counts()`
- statistics -> `df.describe(include='all')`
- scatter matrix -> `pd.plotting.scatter_matrix(df)`
- correlation -> `sns.heatmap(df.corr())` (pass `numeric_only=True` to `df.corr()` on pandas 1.5+ if the frame has non-numeric columns)
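
The outlier rules in the list can be sketched on a single numeric column. This is a minimal illustration, not a fixed recipe: the column name `value` and the sample numbers are made up, and the 1.5 and 3 multipliers are the conventional defaults named above.

```python
import pandas as pd

# Hypothetical sample data: one obviously extreme value (250).
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 250]})
s = df["value"]

# Skewness hint: a mean far from the median suggests a skewed distribution.
mean_median_gap = s.mean() - s.median()

# IQR rule: flag points outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# 3-sigma rule: flag points more than 3 standard deviations from the mean.
mu, sigma = s.mean(), s.std()
sigma_outliers = s[(s - mu).abs() > 3 * sigma]

print("mean - median:", mean_median_gap)
print("IQR outliers:", iqr_outliers.tolist())
print("3-sigma outliers:", sigma_outliers.tolist())
```

The 3-sigma rule is sensitive to the outliers themselves, since they inflate both the mean and the standard deviation, so on small samples the IQR rule is often the more robust first check.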
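
For the relation tests named earlier (chi-square, t-test), a rough sketch follows. It assumes SciPy is available, which the page itself does not mention, and the columns `group`, `bought`, and `score` are hypothetical toy data.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: two groups, a categorical outcome and a numeric score.
df = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "bought": ["yes"] * 15 + ["no"] * 5 + ["yes"] * 5 + ["no"] * 15,
    "score": [50 + i for i in range(20)] + [70 + i for i in range(20)],
})

# Categorical vs categorical: chi-square test on the contingency table.
table = pd.crosstab(df["group"], df["bought"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Numeric vs two-level categorical: Welch's t-test (no equal-variance assumption).
score_a = df.loc[df["group"] == "a", "score"]
score_b = df.loc[df["group"] == "b", "score"]
t_stat, p_t = stats.ttest_ind(score_a, score_b, equal_var=False)

print("chi-square p-value:", p_chi2)
print("t-test p-value:", p_t)
```

A small p-value only says the relation is unlikely under the no-relation hypothesis; it does not measure how strong the relation is. With more than two groups, `stats.f_oneway` (one-way ANOVA) plays the role of the t-test.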