Home
Tansu Dasli edited this page Sep 22, 2023
·
25 revisions
general structure of an ML pipeline
- Gathering data | sampling
- EDA (exploratory data analysis) | understanding the data
  - common issues | missing, wrong, null, or duplicate values
  - outliers | IQR, anomaly detection
  - relations | statistical tests
- Preprocessing
  - handling | missing, wrong, null, or duplicate values
  - feature scaling | standardization vs normalization
  - feature selection |
  - feature extraction | dimension reduction
  - encoding | dummy variables for categorical fields
  - discretization | binning continuous fields
- Train-test split | sampling
- Model (fit -> predict -> hyper-tune)
  - Regression | supervised | predict a continuous target
  - Classification | supervised | predict a categorical target (class label)
  - Clustering | unsupervised | discover groups, density estimation, dimension reduction
- Evaluation metrics
  - Regression | cost function, R²
    | cosine similarity (similarity between two vectors)
  - Classification | confusion matrix (accuracy, F1-score, ROC), cost-of-misclassification matrix, accuracy weight matrix
    | cross-entropy (distance between two probability distributions)
  - Clustering | clustering tendency, number of clusters (k), quality (V-measure, silhouette score)
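
Several entries in the outline reduce to short formulas. The following is a minimal sketch of a few of them in standard-library Python: the IQR outlier rule, standardization vs normalization, R², cosine similarity, F1 from confusion-matrix counts, and cross-entropy. The function names and the 1.5 × IQR threshold are assumptions chosen for illustration, not something this page prescribes. In practice a library implementation would usually be preferred.

```python
import math
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def standardize(values):
    """Standardization: shift to zero mean, scale to unit (population) std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def normalize(values):
    """Min-max normalization: rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def r2_score(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def f1_from_counts(tp, fp, fn):
    """F1-score from confusion-matrix counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_entropy(p_true, q_pred):
    """Cross-entropy H(p, q) = -sum(p * log q), using the natural logarithm."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)


print(iqr_outliers([1, 2, 3, 4, 5, 100]))  # → [100]
print(normalize([1, 2, 3, 4, 5]))          # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Standardization keeps the shape of the distribution but is sensitive to the same outliers the IQR rule flags, so outlier handling is usually done before scaling.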
The purpose of the clustering determines how it should be evaluated. If it is
- for another model, check whether that model improves
- for itself, build ground-truth labels to compare against
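
The two quality measures named in the outline differ in what they need: the silhouette score uses only the points and their cluster assignments, while V-measure compares the assignments against ground-truth labels. The sketch below implements both in standard-library Python to make that difference concrete. It assumes Euclidean distance, at least two clusters, and no duplicate points; the function names are hypothetical.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs no ground-truth labels."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if own not in dists:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for c, d in dists.items() if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def _entropy(counts, total):
    return -sum(c / total * math.log(c / total) for c in counts if c)


def v_measure(truth, predicted):
    """Harmonic mean of homogeneity and completeness; needs ground-truth labels."""
    n = len(truth)
    class_sizes, cluster_sizes = Counter(truth), Counter(predicted)
    joint = Counter(zip(truth, predicted))
    h_c = _entropy(class_sizes.values(), n)
    h_k = _entropy(cluster_sizes.values(), n)
    # Conditional entropies H(C|K) and H(K|C) from the joint counts.
    h_c_given_k = -sum(c / n * math.log(c / cluster_sizes[k]) for (_, k), c in joint.items())
    h_k_given_c = -sum(c / n * math.log(c / class_sizes[t]) for (t, _), c in joint.items())
    homogeneity = 1.0 if h_c == 0 else 1 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1 - h_k_given_c / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
truth = ["a", "a", "b", "b"]
good = [0, 0, 1, 1]  # matches the visible structure
bad = [0, 1, 0, 1]   # splits each pair across clusters

print(silhouette_score(points, good), silhouette_score(points, bad))
print(v_measure(truth, good), v_measure(truth, bad))
```

On this toy data the assignment that follows the visible structure scores higher on both measures. V-measure ignores how the clusters are named, so `good` does not need to use the same label values as `truth`.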