---
title: Scikit-learn CheatSheet
description: The most important and useful methods and functions of scikit-learn are given here.
created: 2022-10-31
---

## Table of Contents

- [Scikit-learn CheatSheet for Developers](#scikit-learn-cheatsheet-for-developers)

# Scikit-learn CheatSheet for Developers

## Training, Validation, and Test Sets

[train_test_split - scikit-learn.org](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

```python
from sklearn.model_selection import train_test_split

# 70% train; the remaining 30% is then split into validation (~20% of all data)
# and test (~10% of all data)
x_trn, x_other, y_trn, y_other = train_test_split(x, y, train_size=0.7, random_state=0)
x_val, x_tst, y_val, y_tst = train_test_split(x_other, y_other, test_size=0.33, random_state=1)
```

🔼Back to Top

## Preprocessing

For many machine learning models, preprocessing not only improves training efficiency but is often essential for producing meaningful results. Common techniques include:

- StandardScaler (aka Z-score)
- Normalizer (vector normalization)
- Binarizer

The typical process is:

1. Choose an appropriate preprocessing method and import it.
2. Construct a rescale object by fitting the chosen method to the training set only!
3. Transform your training, validation, and test sets using the constructed rescale object.
```python
# Example: Standardization / Z-scoring
#   -- the procedure is the same for Normalizer and Binarizer
from sklearn.preprocessing import StandardScaler

rescale = StandardScaler().fit(x_trn)  # fit on the training set only
xx_trn = rescale.transform(x_trn)
xx_val = rescale.transform(x_val)
xx_tst = rescale.transform(x_tst)
```
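
As a shorthand, fitting and transforming the training set can be combined in one call; a minimal sketch using `fit_transform`:

```python
from sklearn.preprocessing import StandardScaler

rescale = StandardScaler()
xx_trn = rescale.fit_transform(x_trn)  # fit to the training set and transform it in one call
xx_val = rescale.transform(x_val)      # reuse the same fitted scaler for the other sets
```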

🔼Back to Top

## Modeling

Scikit-Learn makes modeling really easy. The recipe is:

  1. Construct
  2. Fit
  3. Predict
  4. Evaluate

In the subsections that follow, we cover the first 3 steps; a generic sketch of the full recipe is shown below. Model evaluation is covered in the next section.
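
As a minimal sketch of all four steps in one place (LogisticRegression and accuracy_score are illustrative choices here, not the only options):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()          # 1. Construct
model.fit(xx_trn, y_trn)              # 2. Fit
y_pred = model.predict(xx_val)        # 3. Predict
print(accuracy_score(y_val, y_pred))  # 4. Evaluate
```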

🔼Back to Top

## Supervised Learning

### Linear Regression

[Function Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(xx_trn, y_trn)
y_pred = model.predict(xx_val)
```

🔼Back to Top

### Logistic Regression

[Function Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(xx_trn, y_trn)
y_pred = model.predict(xx_val)
```
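
Classifiers such as LogisticRegression also expose class probabilities; a quick sketch (the columns of `predict_proba` follow the order of `model.classes_`):

```python
proba = model.predict_proba(xx_val)            # shape: (n_samples, n_classes)
y_pred = model.classes_[proba.argmax(axis=1)]  # recovers the same labels as model.predict
```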

🔼Back to Top

### Support Vector Machines

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

```python
from sklearn.svm import SVC

model = SVC(kernel='linear')  # other kernels: 'poly', 'rbf' (the default), 'sigmoid'
model.fit(xx_trn, y_trn)
y_pred = model.predict(xx_val)
```

🔼Back to Top

### Naive Bayes

```python
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(xx_trn, y_trn)
y_pred = model.predict(xx_val)
```

🔼Back to Top

### KNN

```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(xx_trn, y_trn)
y_pred = model.predict(xx_val)
```

🔼Back to Top

### Decision Tree

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=0)
model.fit(x_trn, y_trn)  # features do not have to be normalized/standardized for DTs!
y_pred = model.predict(x_val)
```

🔼Back to Top

### Gradient Boosting

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(max_depth=5, n_estimators=1000,
    subsample=0.5, random_state=0, learning_rate=0.001)
model.fit(x_trn, y_trn)  # features do not have to be normalized/standardized for DTs!
y_pred = model.predict(x_val)
```

🔼Back to Top

### Random Forest (Bagged Decision Trees)

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=1000, criterion='entropy',
    n_jobs=4, max_depth=10)
model.fit(x_trn, y_trn)  # features do not have to be normalized/standardized for DTs!
y_pred = model.predict(x_val)
```
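
Fitted tree ensembles also report per-feature importance scores, handy for a quick sanity check of the model; a minimal sketch:

```python
import numpy as np

importances = model.feature_importances_  # one score per feature; sums to 1
top5 = np.argsort(importances)[::-1][:5]  # indices of the five most important features
print(top5, importances[top5])
```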

🔼Back to Top

## Unsupervised Learning

### PCA

```python
from sklearn.decomposition import PCA

model = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
model.fit(xx_trn)               # unsupervised: no targets are passed
xx_val_pca = model.transform(xx_val)  # project data onto the kept components
```
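
After fitting, the number of retained components and their variance shares can be inspected:

```python
print(model.n_components_)                    # how many components were kept
print(model.explained_variance_ratio_)        # variance share of each kept component
print(model.explained_variance_ratio_.sum())  # >= 0.95 by construction here
```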

🔼Back to Top

### k-Means Clustering

```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=1)
model.fit(xx_trn)               # unsupervised: no targets are passed
y_pred = model.predict(xx_val)  # cluster index (0..2) for each sample
```
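
The fitted clustering can be inspected through its centroids and training-set assignments:

```python
print(model.cluster_centers_)  # one centroid per cluster, in feature space
print(model.labels_)           # cluster index assigned to each training sample
print(model.inertia_)          # sum of squared distances to the closest centroid
```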

🔼Back to Top


## Additional Notes

### Decision Tree Boosting

This is basically what is happening: each new tree is fit to the residuals left by the trees before it, and the ensemble prediction is the running sum of the individual trees' predictions. A sketch using DecisionTreeRegressor (residuals only make sense for regression):

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

n_estimators = 3
y_true = np.asarray(y_trn, dtype=float)
y_pred = np.zeros_like(y_true)        # running ensemble prediction
for i in range(n_estimators):
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(x_trn, y_true - y_pred)  # fit the next tree to the current residuals
    y_pred += tree.predict(x_trn)     # add its predictions to the running sum
```

🔼Back to Top

### Cross Validation and Grid Search

```python
# StratifiedKFold and GridSearchCV both live in sklearn.model_selection
# (the old sklearn.cross_validation and sklearn.grid_search modules are gone)
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

skf = StratifiedKFold(n_splits=2)
model_type = GradientBoostingClassifier(n_estimators=500, learning_rate=.01)
params = {"max_depth": [3, 5, 7]}

model = GridSearchCV(model_type, param_grid=params, cv=skf, verbose=2)
model.fit(x_trn, y_trn)
```
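
Once fitted, the search object exposes the winning configuration and behaves like the refit best model:

```python
print(model.best_params_)      # e.g. which max_depth won
print(model.best_score_)       # mean cross-validated score of the best setting
y_pred = model.predict(x_val)  # delegates to the refit best estimator
```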

🔼Back to Top