Tansu Dasli edited this page Sep 22, 2023 · 25 revisions

general structure of an ML pipeline

- common             | missing, wrong, null, duplicates
- outliers           | IQR, anomaly detection
- relations          | statistical tests
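
A minimal sketch of the IQR rule named above, using only the Python standard library. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample data are assumptions for illustration, not part of the original notes.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]  # made-up sample
print(iqr_outliers(data))  # [95]
```

The multiplier 1.5 is the conventional default; a larger `k` flags fewer points.
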
  • Preprocessing
- handling           | missing, wrong, null, duplicates
- feature scaling    | standardization vs normalization
- feature selection  |
- feature extraction | dimensionality reduction
- encoding           | dummy variables for categorical fields
- discretization     | binning continuous fields
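
The preprocessing steps above can be sketched in plain Python as follows. All function names and sample values are hypothetical, and the sketch assumes the input is not constant (otherwise the divisions below fail). Libraries such as scikit-learn ship ready-made versions of these transforms.

```python
import statistics

def standardize(values):
    """Standardization (z-score): zero mean, unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def normalize(values):
    """Normalization (min-max): rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Dummy-encode a categorical field: one 0/1 column per category."""
    categories = sorted(set(labels))
    rows = [[int(label == c) for c in categories] for label in labels]
    return categories, rows

def bin_equal_width(values, n_bins):
    """Discretize a continuous field into equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # the maximum value is clamped into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, 30, 40, 50, 60]               # made-up continuous field
colors = ["red", "blue", "red", "green"]  # made-up categorical field

print(standardize(ages))         # centered on 0, spread of 1
print(normalize(ages))           # [0.0, 0.25, 0.5, 0.75, 1.0]
print(one_hot(colors))
print(bin_equal_width(ages, 2))  # [0, 0, 1, 1, 1]
```

Standardization keeps outliers far from the bulk of the data, while min-max normalization squeezes everything into a fixed range, so the choice depends on what the downstream model expects.
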
  • Train-Test split sampling
  • Model (fit -> predict -> hyper-tune)
- Regression         | supervised   | predict continuous values
- Classification     | supervised   | predict categorical labels
- Clustering         | unsupervised | discover groups, density estimation, dimensionality reduction
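
A rough sketch of the split -> fit -> predict -> hyper-tune loop, using a tiny hand-written k-nearest-neighbours classifier. The helper `split_rows`, the class `KNNClassifier`, the candidate values of `k`, and the data are all made up for illustration. The data is deliberately trivial, so every candidate scores the same. One assumption worth stating: the hyper-parameter is tuned on a separate validation part, so the test part is only used once at the end.

```python
import random
from collections import Counter

def split_rows(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

class KNNClassifier:
    """k-nearest-neighbours on a single numeric feature."""

    def __init__(self, k):
        self.k = k

    def fit(self, rows):
        self.rows = list(rows)  # k-NN only memorizes the training rows
        return self

    def predict(self, x):
        nearest = sorted(self.rows, key=lambda r: abs(r[0] - x))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)

# made-up, well-separated data: (feature, label)
data = [(x, "low") for x in range(0, 20)] + [(x, "high") for x in range(100, 120)]

train, test = split_rows(data, test_ratio=0.25, seed=42)
fit_part, valid = split_rows(train, test_ratio=0.25, seed=42)

# hyper-tune: keep the k with the best validation accuracy
best_k = max([1, 3, 5], key=lambda k: accuracy(KNNClassifier(k).fit(fit_part), valid))

# refit on the full training part, then score once on the held-out test part
test_accuracy = accuracy(KNNClassifier(best_k).fit(train), test)
print(best_k, test_accuracy)
```
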
  • Evaluation metrics
- Regression         | cost function, R²
                     | cosine similarity (similarity between two vectors)
- Classification     | confusion matrix (accuracy, F1-score, ROC), misclassification cost matrix, accuracy weight matrix
                     | cross-entropy (distance between two probability distributions)
- Clustering         | clustering tendency, number of clusters (k), quality (V-measure, silhouette score)
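
Several of the metrics listed above fit in a few lines each. The sketch below is plain Python with hypothetical function names and made-up inputs. It assumes binary labels for the confusion matrix and strictly positive predicted probabilities for cross-entropy.

```python
import math

def r_squared(y_true, y_pred):
    """R²: 1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def accuracy_and_f1(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log(q_i); p is the true distribution, q the predicted one."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

print(r_squared([1, 2, 3], [2, 2, 2]))            # 0.0: no better than predicting the mean
print(cosine_similarity([1, 0], [0, 1]))          # 0.0: orthogonal vectors
print(accuracy_and_f1([1, 1, 1, 0, 0, 0, 0, 0],
                      [1, 1, 0, 1, 0, 0, 0, 0]))  # accuracy 0.75, F1 about 0.667
print(cross_entropy([1.0, 0.0], [0.5, 0.5]))      # log(2), about 0.693
```
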

The purpose of the clustering determines how it should be evaluated. If the clusters are

  • input for another model, check whether that model improves
  • the end result themselves, build ground-truth labels to compare against
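
Once ground-truth labels exist, the V-measure mentioned earlier can score a clustering against them. The sketch below is a plain-Python rendering of the usual definition (harmonic mean of homogeneity and completeness); the function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b): uncertainty left about a once b is known."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())

def v_measure(truth, clusters):
    h_truth, h_clusters = entropy(truth), entropy(clusters)
    # homogeneity: each cluster contains members of a single class
    homogeneity = 1.0 if h_truth == 0 else 1 - conditional_entropy(truth, clusters) / h_truth
    # completeness: all members of a class land in the same cluster
    completeness = 1.0 if h_clusters == 0 else 1 - conditional_entropy(clusters, truth) / h_clusters
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

truth = ["a", "a", "b", "b"]            # hand-built ground truth
print(v_measure(truth, [1, 1, 0, 0]))   # 1.0: same grouping, cluster ids do not matter
print(v_measure(truth, [0, 0, 0, 0]))   # 0.0: one big cluster tells nothing about the classes
```

The score ignores how clusters are numbered, which is why it suits comparing cluster ids against hand-made labels.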