Use this notebook as a starter
We'll use the credit scoring dataset:
- https://github.com/gastonstat/CreditScoring
- Also available here
- Execute the preparation code from the starter notebook
- Split the dataset into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1` (see the sketch below)
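Here is a minimal sketch of the two-step split, assuming the prepared dataframe is called `df` (use whatever name your preparation code produces):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% for the test set; df_full_train keeps the remaining 80%
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# 25% of the remaining 80% equals 20% of the full dataset,
# yielding the 60%/20%/20% distribution
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```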
ROC AUC can also be used to evaluate the feature importance of numerical variables. Let's do that:
- For each numerical variable, use it as score and compute AUC with the "default" variable
- Use the training dataset for that
If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['expenses']`).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable: then the negative correlation becomes positive.
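A sketch of this check, assuming the target is stored as a 0/1 column called `default` in `df_train` (adjust the name if your preparation code encodes it differently):

```python
from sklearn.metrics import roc_auc_score

numerical = ['seniority', 'time', 'income', 'debt']

for col in numerical:
    auc = roc_auc_score(df_train['default'], df_train[col])
    if auc < 0.5:
        # Negatively correlated with the target: negate the variable
        # to flip the direction of the correlation
        auc = roc_auc_score(df_train['default'], -df_train[col])
    print('%s: %.3f' % (col, auc))
```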
Which numerical variable (among the following 4) has the highest AUC?
- seniority
- time
- income
- debt
From now on, use these columns only:
`['seniority', 'income', 'assets', 'records', 'job', 'home']`

Apply one-hot encoding using `DictVectorizer` and train the logistic regression with these parameters:

`LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
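One way this step could look, again assuming a 0/1 `default` column in the split dataframes:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

columns = ['seniority', 'income', 'assets', 'records', 'job', 'home']

# One-hot-encode the selected columns via lists of dicts
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[columns].to_dict(orient='records'))
X_val = dv.transform(df_val[columns].to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, df_train['default'])

# Probability of the positive class is the score for AUC
y_pred = model.predict_proba(X_val)[:, 1]
print(round(roc_auc_score(df_val['default'], y_pred), 3))
```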
What's the AUC of this model on the validation dataset? (round to 3 digits)
- 0.512
- 0.612
- 0.712
- 0.812
Now let's compute precision and recall for our model.
- Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
- For each threshold, compute precision and recall
- Plot them
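A possible implementation, reusing `y_pred` from the previous step; how you handle the division by zero at extreme thresholds is up to you, here it becomes `NaN`:

```python
import numpy as np
import matplotlib.pyplot as plt

y_val = df_val['default'].values
thresholds = np.arange(0.0, 1.01, 0.01)

precisions, recalls = [], []
for t in thresholds:
    predicted_positive = (y_pred >= t)
    tp = (predicted_positive & (y_val == 1)).sum()
    fp = (predicted_positive & (y_val == 0)).sum()
    fn = (~predicted_positive & (y_val == 1)).sum()

    # At high thresholds nothing may be predicted positive (tp + fp == 0)
    precisions.append(tp / (tp + fp) if (tp + fp) > 0 else np.nan)
    recalls.append(tp / (tp + fn))

plt.plot(thresholds, precisions, label='precision')
plt.plot(thresholds, recalls, label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()
```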
At which threshold do the precision and recall curves intersect?
- 0.2
- 0.4
- 0.6
- 0.8
Precision and recall are conflicting: when one grows, the other goes down. That's why they are often combined into the F1 score, a metric that takes both into account.
This is the formula for computing F1:
F1 = 2 * P * R / (P + R)
Where P is precision and R is recall.
Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01
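Reusing the `thresholds`, `precisions`, and `recalls` lists from the previous sketch:

```python
f1_scores = []
for p, r in zip(precisions, recalls):
    # F1 = 2 * P * R / (P + R); fall back to 0 where P + R is 0 or NaN
    f1_scores.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)

print(thresholds[np.argmax(f1_scores)])
```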
At which threshold is F1 maximal?
- 0.1
- 0.3
- 0.5
- 0.7
Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:

`KFold(n_splits=5, shuffle=True, random_state=1)`

- Iterate over different folds of `df_full_train`
- Split the data into train and validation
- Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
- Use AUC to evaluate the model on validation
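A sketch of the cross-validation loop. The `train_and_score` helper is not from the starter notebook, just a convenience wrapping the encoding and training steps above, and it again assumes a 0/1 `default` target:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

columns = ['seniority', 'income', 'assets', 'records', 'job', 'home']

def train_and_score(df_tr, df_va, C=1.0):
    # One-hot-encode, train on the training fold, return AUC on validation
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[columns].to_dict(orient='records'))
    X_va = dv.transform(df_va[columns].to_dict(orient='records'))

    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    model.fit(X_tr, df_tr['default'])

    y_pred = model.predict_proba(X_va)[:, 1]
    return roc_auc_score(df_va['default'], y_pred)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

scores = [train_and_score(df_full_train.iloc[train_idx], df_full_train.iloc[val_idx])
          for train_idx, val_idx in kfold.split(df_full_train)]

print('std: %.3f' % np.std(scores))
```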
How large is the standard deviation of the AUC scores across different folds?
- 0.001
- 0.014
- 0.09
- 0.14
Now let's use 5-Fold cross-validation to find the best parameter C
- Iterate over the following C values: `[0.01, 0.1, 1, 10]`
- Initialize `KFold` with the same parameters as previously
- Use these parameters for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
- Compute the mean score as well as the std (round the mean and std to 3 decimal digits)
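Building on the hypothetical `train_and_score` helper from the previous sketch:

```python
for C in [0.01, 0.1, 1, 10]:
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = [train_and_score(df_full_train.iloc[ti], df_full_train.iloc[vi], C=C)
              for ti, vi in kfold.split(df_full_train)]
    print('C=%s: mean=%.3f std=%.3f' % (C, np.mean(scores), np.std(scores)))
```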
Which C leads to the best mean score?
- 0.01
- 0.1
- 1
- 10
If you have ties, select the score with the lowest std. If you still have ties, select the smallest C
Submit your results here: https://forms.gle/e497sR5iB36mM9Cs5
It's possible that your answers won't match exactly. If it's the case, select the closest one.
The deadline for submitting is 04 October 2021, 17:00 CET. After that, the form will be closed.