Use this notebook as a starter
We'll use the credit scoring dataset:
- https://github.com/gastonstat/CreditScoring
- Also available here
- Execute the preparation code from the starter notebook
- Split the dataset into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1` (see the sketch below)
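Here is a minimal sketch of the two-step split, assuming the prepared dataframe is called `df` (use whatever name your preparation code produces):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% for the test set; df_full_train keeps the remaining 80%
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# 25% of the remaining 80% equals 20% of the full dataset,
# yielding the 60%/20%/20% distribution
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```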
ROC AUC can also be used to evaluate the feature importance of numerical variables. Let's do that:
- For each numerical variable, use it as score and compute AUC with the "default" variable
- Use the training dataset for that
If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['expenses']`).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable: then the negative correlation becomes positive.
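A sketch of this check, assuming the target is stored as a 0/1 column called `default` in `df_train` (adjust the name if your preparation code encodes it differently):

```python
from sklearn.metrics import roc_auc_score

numerical = ['seniority', 'time', 'income', 'debt']

for col in numerical:
    auc = roc_auc_score(df_train['default'], df_train[col])
    if auc < 0.5:
        # Negatively correlated with the target: negate the variable
        # to flip the direction of the correlation
        auc = roc_auc_score(df_train['default'], -df_train[col])
    print('%s: %.3f' % (col, auc))
```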
Which numerical variable (among the following 4) has the highest AUC?
- seniority
- time
- income
- debt
From now on, use these columns only:
`['seniority', 'income', 'assets', 'records', 'job', 'home']`

Apply one-hot encoding using `DictVectorizer` and train the logistic regression with these parameters:

`LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
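One way this step could look, again assuming a 0/1 `default` column in the split dataframes:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

columns = ['seniority', 'income', 'assets', 'records', 'job', 'home']

# One-hot-encode the selected columns via lists of dicts
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[columns].to_dict(orient='records'))
X_val = dv.transform(df_val[columns].to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, df_train['default'])

# Probability of the positive class is the score for AUC
y_pred = model.predict_proba(X_val)[:, 1]
print(round(roc_auc_score(df_val['default'], y_pred), 3))
```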
What's the AUC of this model on the validation dataset? (round to 3 digits)
- 0.512
- 0.612
- 0.712
- 0.812
Now let's compute precision and recall for our model.
- Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
- For each threshold, compute precision and recall
- Plot them
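A possible implementation, reusing `y_pred` from the previous step; how you handle the division by zero at extreme thresholds is up to you, here it becomes `NaN`:

```python
import numpy as np
import matplotlib.pyplot as plt

y_val = df_val['default'].values
thresholds = np.arange(0.0, 1.01, 0.01)

precisions, recalls = [], []
for t in thresholds:
    predicted_positive = (y_pred >= t)
    tp = (predicted_positive & (y_val == 1)).sum()
    fp = (predicted_positive & (y_val == 0)).sum()
    fn = (~predicted_positive & (y_val == 1)).sum()

    # At high thresholds nothing may be predicted positive (tp + fp == 0)
    precisions.append(tp / (tp + fp) if (tp + fp) > 0 else np.nan)
    recalls.append(tp / (tp + fn))

plt.plot(thresholds, precisions, label='precision')
plt.plot(thresholds, recalls, label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()
```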
At which threshold do the precision and recall curves intersect?
- 0.2
- 0.4
- 0.6
- 0.8
Precision and recall are conflicting: when one grows, the other goes down. That's why they are often combined into the F1 score, a metric that takes both into account.
This is the formula for computing F1:
F1 = 2 * P * R / (P + R)
Where P is precision and R is recall.
Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01
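Reusing the `thresholds`, `precisions`, and `recalls` lists from the previous sketch:

```python
f1_scores = []
for p, r in zip(precisions, recalls):
    # F1 = 2 * P * R / (P + R); fall back to 0 where P + R is 0 or NaN
    f1_scores.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)

print(thresholds[np.argmax(f1_scores)])
```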
At which threshold is F1 maximal?
- 0.1
- 0.3
- 0.5
- 0.7
Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:

`KFold(n_splits=5, shuffle=True, random_state=1)`

- Iterate over different folds of `df_full_train`
- Split the data into train and validation
- Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
- Use AUC to evaluate the model on validation
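A sketch of the cross-validation loop. The `train_and_score` helper is not from the starter notebook, just a convenience wrapping the encoding and training steps above, and it again assumes a 0/1 `default` target:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

columns = ['seniority', 'income', 'assets', 'records', 'job', 'home']

def train_and_score(df_tr, df_va, C=1.0):
    # One-hot-encode, train on the training fold, return AUC on validation
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[columns].to_dict(orient='records'))
    X_va = dv.transform(df_va[columns].to_dict(orient='records'))

    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    model.fit(X_tr, df_tr['default'])

    y_pred = model.predict_proba(X_va)[:, 1]
    return roc_auc_score(df_va['default'], y_pred)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

scores = [train_and_score(df_full_train.iloc[train_idx], df_full_train.iloc[val_idx])
          for train_idx, val_idx in kfold.split(df_full_train)]

print('std: %.3f' % np.std(scores))
```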
How large is the standard deviation of the AUC scores across different folds?
- 0.001
- 0.014
- 0.09
- 0.14
Now let's use 5-Fold cross-validation to find the best parameter C
- Iterate over the following C values: `[0.01, 0.1, 1, 10]`
- Initialize `KFold` with the same parameters as previously
- Use these parameters for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
- Compute the mean score as well as the std (round the mean and std to 3 decimal digits)
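Building on the hypothetical `train_and_score` helper from the previous sketch:

```python
for C in [0.01, 0.1, 1, 10]:
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = [train_and_score(df_full_train.iloc[ti], df_full_train.iloc[vi], C=C)
              for ti, vi in kfold.split(df_full_train)]
    print('C=%s: mean=%.3f std=%.3f' % (C, np.mean(scores), np.std(scores)))
```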
Which C leads to the best mean score?
- 0.01
- 0.1
- 1
- 10
If you have ties, select the score with the lowest std. If you still have ties, select the smallest C
Submit your results here: https://forms.gle/e497sR5iB36mM9Cs5
It's possible that your answers won't match exactly. If it's the case, select the closest one.
The deadline for submitting is 04 October 2021, 17:00 CET. After that, the form will be closed.