In this project, we predict the booking destination of a newly onboarded user on the Airbnb platform. We evaluate the model with NDCG (normalized discounted cumulative gain), a metric that rewards both the relevance of the predicted destinations and their position in the ranking. With this, Airbnb could potentially:
- Provide personalized experience
- Better forecast demand
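As a rough illustration of the NDCG metric mentioned above, the sketch below scores a ranked list of predicted destinations for a single user. It assumes one true destination per user (relevance 1 for a match, 0 otherwise) and a cutoff of k = 5; the cutoff, the function names and the country codes are assumptions made for this example, not details taken from the project.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(predicted, actual, k=5):
    """NDCG for one user, assuming a single true destination."""
    relevances = [1 if country == actual else 0 for country in predicted[:k]]
    ideal = dcg_at_k([1], k)  # best case: the true destination is ranked first
    return dcg_at_k(relevances, k) / ideal

# Hypothetical ranked predictions for one user
predictions = ["US", "FR", "IT", "GB", "ES"]
print(ndcg_at_k(predictions, "US"))  # 1.0, correct destination ranked first
print(ndcg_at_k(predictions, "FR"))  # about 0.631, correct destination ranked second
print(ndcg_at_k(predictions, "CA"))  # 0.0, correct destination not in the list
```

A correct destination placed lower in the list still earns partial credit, which is why the metric suits ranked recommendations better than plain accuracy.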
The problem statement and dataset are derived from Kaggle's Airbnb New User Bookings challenge. The predictor variables include information about the users (user_id, age, gender etc.) and preliminary session data (actions, action types, session time etc.). Our objective is to predict the country (the dependent variable) that the user is most likely to visit. Only a limited set of users have associated session data, so we merge the datasets with an inner join and proceed to modelling with about 5.5 million session observations for close to 73k users.
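The inner join described above can be sketched with pandas on toy data. The frames and most column names here (`action`, `session_time`) are made up for illustration; only `user_id`, `age` and `gender` are named in the text, and the real files may be laid out differently.

```python
import pandas as pd

# Toy user table (hypothetical values)
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "age": [34, 27, 45],
    "gender": ["F", "M", "F"],
})

# Toy session table: u2 has no sessions, u4 has no user record
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u3", "u4"],
    "action": ["search", "view", "search", "view"],
    "session_time": [120.0, 45.0, 300.0, 10.0],
})

# An inner join keeps only users present in both tables
merged = sessions.merge(users, on="user_id", how="inner")

print(len(merged))                  # 3 session rows survive
print(merged["user_id"].nunique())  # 2 users survive (u1 and u3)
```

The result has one row per session rather than per user, which matches the shape described above: many session observations for a smaller set of users.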
To produce the results, we perform data preparation, data preprocessing, feature engineering, model building, evaluation and hyperparameter tuning. We also did some EDA to understand the dataset, which can be found here.
The main challenges with this dataset are handling its imbalance and the limited information available.
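One common way to deal with an imbalanced target is to weight each class inversely to its frequency. The sketch below uses the widely used "balanced" heuristic, `n_samples / (n_classes * class_count)`. It assumes the imbalance is in the destination labels; it is offered only as an illustration and not as the approach taken in this project, and the labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Hypothetical, heavily skewed destination labels
labels = ["US"] * 6 + ["FR"] * 3 + ["IT"] * 1
weights = balanced_class_weights(labels)

# The most frequent class (US) gets the smallest weight, the rarest (IT) the largest
for country, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(country, round(weight, 3))
```

Weights like these can then be passed to a classifier that accepts per-class or per-sample weights, so that mistakes on rare destinations cost more during training.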
The models we try out are as follows:
- Multinomial logistic regression - uses the softmax function with L2 regularization to classify a target with more than two categories, extending the binary case handled by ordinary logistic regression.
- Bernoulli Naive Bayes - well suited to discrete data with binary features, which was the case after we completed feature engineering.
- Decision Trees - can be highly predictive because they map non-linear relationships well. Results are also easily interpretable within the business context.
- XGBoost - allows us to leverage its regularization (both L1 and L2), sparsity awareness (robust learning from missing values) and in-built cross-validation.
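Each of the models above can output a score or probability per destination, which then has to be turned into a ranked list before NDCG can be computed. The sketch below shows that step with a plain softmax over raw class scores. The scores, country codes and the cutoff of 5 are invented for illustration and do not come from any trained model in this project.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=5):
    """Return the k labels with the highest softmax probability."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked[:k]]

# Hypothetical raw scores for one user across six destinations
countries = ["US", "FR", "IT", "GB", "ES", "CA"]
scores = [2.0, 0.5, 1.0, -1.0, 0.0, 0.2]

probs = softmax(scores)
ranking = top_k(countries, scores, k=5)
print(ranking)  # ['US', 'IT', 'FR', 'CA', 'ES']
```

Softmax preserves the order of the raw scores, so the ranking depends only on which destinations the model scores highest; the probabilities matter when comparing or blending models.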
XGBoost gave the best performance, with an NDCG score of 88.323.
- Airbnb can consider data on detailed user demographics, as well as session data (e.g., session time and data, search queries).
- Work with relevant stakeholders to further refine feature selection.
- We can consider novelty as a metric for recommending new travel destinations to users.