
6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (the 'price' column).

In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homeworks 2 and 3.

You can get it from Kaggle or download it from here if you don't want to sign up to Kaggle.

For this homework, we prepared a starter notebook.

Solution

Notebook with solution

Loading the data

  • Use only the following columns:
    • 'neighbourhood_group',
    • 'room_type',
    • 'latitude',
    • 'longitude',
    • 'minimum_nights',
    • 'number_of_reviews',
    • 'reviews_per_month',
    • 'calculated_host_listings_count',
    • 'availability_365',
    • 'price'
  • Fill NAs with 0
  • Apply the log transform to price
  • Do train/validation/test split with 60%/20%/20% distribution.
  • Use the train_test_split function and set the random_state parameter to 1
  • Use DictVectorizer to turn the dataframe into matrices
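
A minimal sketch of this preparation step (assuming the Kaggle CSV was saved as AB_NYC_2019.csv - the file name and path are an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews', 'reviews_per_month',
    'calculated_host_listings_count', 'availability_365', 'price',
]

# 'AB_NYC_2019.csv' is the Kaggle file name; adjust the path if needed
df = pd.read_csv('AB_NYC_2019.csv', usecols=columns)
df = df.fillna(0)

# 60%/20%/20%: take 20% for test, then 25% of the remaining 80% for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# log transform of the target (log1p handles price = 0)
y_train = np.log1p(df_train.price.values)
y_val = np.log1p(df_val.price.values)
y_test = np.log1p(df_test.price.values)

# turn the dataframes (without the target) into feature matrices
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.drop(columns='price').to_dict(orient='records'))
X_val = dv.transform(df_val.drop(columns='price').to_dict(orient='records'))
X_test = dv.transform(df_test.drop(columns='price').to_dict(orient='records'))
```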

Question 1

Let's train a decision tree regressor to predict the price variable.

  • Train a model with max_depth=1
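
One possible way to train this model and see the split, assuming X_train, y_train, and dv from the preparation sketch above:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# a depth-1 tree has a single split; print it with readable feature names
# (on scikit-learn < 1.0, use dv.get_feature_names() instead)
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```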

Which feature is used for splitting the data?

  • room_type
  • neighbourhood_group
  • number_of_reviews
  • reviews_per_month

Question 2

Train a random forest model with these parameters:

  • n_estimators=10
  • random_state=1
  • n_jobs=-1 (optional - to make training faster)
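
A sketch of the training and evaluation, assuming the matrices from the preparation step:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(round(rmse, 3))
```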

What's the RMSE of this model on validation?

  • 0.059
  • 0.259
  • 0.459
  • 0.659

Question 3

Now let's experiment with the n_estimators parameter.

  • Try different values of this parameter from 10 to 200 with step 10
  • Set random_state to 1
  • Evaluate the model on the validation dataset
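
One way to run this experiment (a sketch building on the snippet from Question 2):

```python
scores = []

for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
    scores.append((n, round(rmse, 3)))

# look for the point after which RMSE flattens out
for n, rmse in scores:
    print(n, rmse)
```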

After which value of n_estimators does RMSE stop improving?

  • 10
  • 50
  • 70
  • 120

Question 4

Let's select the best max_depth:

  • Try different values of max_depth: [10, 15, 20, 25]
  • For each of these values, try different values of n_estimators from 10 till 200 (with step 10)
  • Fix the random seed: random_state=1
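
A sketch of the grid search, reusing the imports from the Question 2 snippet; comparing, for example, the mean validation RMSE per max_depth is one reasonable way to pick the best value:

```python
import numpy as np

depths = [10, 15, 20, 25]
n_values = range(10, 201, 10)
all_scores = {}

for depth in depths:
    for n in n_values:
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        all_scores[(depth, n)] = np.sqrt(
            mean_squared_error(y_val, rf.predict(X_val)))

for depth in depths:
    mean_rmse = np.mean([all_scores[(depth, n)] for n in n_values])
    print(depth, round(mean_rmse, 3))
```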

What's the best max_depth?

  • 10
  • 15
  • 20
  • 25

Bonus question (not graded):

Will the answer be different if we change the seed for the model?

Question 5

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm finds the best split. When doing this, we can calculate the "gain" - the reduction in impurity before and after the split. This gain is quite useful for understanding which features are important for tree-based models.

In Scikit-Learn, tree-based models contain this information in the feature_importances_ field.

For this homework question, we'll find the most important feature:

  • Train the model with these parameters:
    • n_estimators=10,
    • max_depth=20,
    • random_state=1,
    • n_jobs=-1 (optional)
  • Get the feature importance information from this model
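
A sketch for extracting the importances, again assuming dv and the training matrices from the earlier snippets:

```python
import pandas as pd

rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# pair each importance with its DictVectorizer feature name
# (on scikit-learn < 1.0, use dv.get_feature_names() instead)
importances = pd.Series(rf.feature_importances_,
                        index=dv.get_feature_names_out())
print(importances.sort_values(ascending=False).head())
```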

What's the most important feature?

  • neighbourhood_group=Manhattan
  • room_type=Entire home/apt
  • longitude
  • latitude

Question 6

Now let's train an XGBoost model! For this question, we'll tune the eta parameter.

  • Install XGBoost
  • Create DMatrix for train and validation
  • Create a watchlist
  • Train a model with these parameters for 100 rounds:
```python
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}
```
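
A sketch of this setup, assuming X_train/X_val and y_train/y_val from the preparation step and xgb_params as defined above:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# the watchlist lets xgb.train report RMSE on both sets each round
watchlist = [(dtrain, 'train'), (dval, 'val')]

model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=watchlist, verbose_eval=10)

rmse = np.sqrt(mean_squared_error(y_val, model.predict(dval)))
print(round(rmse, 3))
```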

Now change eta first to 0.1 and then to 0.01.

Which eta leads to the best RMSE score on the validation dataset?

  • 0.3
  • 0.1
  • 0.01

Submit the results

Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.

Deadline

The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

Navigation