This repo intends to be a tour through some recommendation algorithms in Python using various datasets. Companion posts are:
- Recotour: a tour through recommendation algorithms in python
- RecoTour III: Variational Autoencoders for Collaborative Filtering with Mxnet and Pytorch.
The repo is organised as follows:
- recotour: this is the original "tour" through recommendation algorithms using the Ponpare coupon dataset. In particular, the algorithms included in the recotour directory are:
  - Data processing, with a deep dive into feature engineering
  - Most Popular recommendations (the baseline)
  - Item-User similarity based recommendations
  - kNN Collaborative Filtering recommendations
  - GBM based recommendations using lightGBM, with a tutorial on how to optimize GBMs
  - Non-Negative Matrix Factorization recommendations
  - Factorization Machines (Steffen Rendle, 2010) recommendations using xlearn
  - Field Aware Factorization Machines (Yuchin Juan et al., 2016) recommendations using xlearn
  - Deep Learning based recommendations (Wide and Deep, Heng-Tze Cheng et al., 2016) using pytorch
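To give a feel for the simplest entry in the list above, here is a minimal sketch of a Most Popular baseline. It is not the code from the notebooks: the function names `most_popular` and `recommend`, the toy `(user, coupon)` pairs, and the choice to skip items a user has already interacted with are all assumptions made for illustration.

```python
from collections import Counter


def most_popular(interactions, k=3):
    """Return the k most frequently interacted-with items, most popular first."""
    counts = Counter(item for _, item in interactions)
    # Sort by count (descending), then by item id so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


def recommend(user, interactions, k=3):
    """Recommend the most popular items the user has not interacted with yet.

    Excluding already-seen items is an assumption of this sketch.
    """
    seen = {item for u, item in interactions if u == user}
    ranked = most_popular(interactions, k=len(interactions))
    return [item for item in ranked if item not in seen][:k]


# Toy (user, coupon) interactions, invented for illustration only.
interactions = [
    ("u1", "c1"), ("u2", "c1"), ("u3", "c1"),
    ("u1", "c2"), ("u2", "c2"),
    ("u3", "c3"),
]

print(most_popular(interactions, k=2))
print(recommend("u1", interactions, k=2))
```

Every other algorithm in the list has to beat this kind of baseline to justify its extra complexity, which is why it is worth having even though it ignores user preferences entirely.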
I have included a more modular (nicer looking) version of a possible final solution (described in Chapter16_final_solution_Recommendations.ipynb) in the directory final_recommendations.
In addition, I have included an illustration of how to use evaluation metrics other than the one shown in the notebooks (the mean average precision, or MAP), such as the Normalized Discounted Cumulative Gain (NDCG). This can be found in using_ncdg.py in the directory py_scripts.
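For readers unfamiliar with the two metrics, the sketch below computes both for a single user with binary relevance. It is not taken from using_ncdg.py: the function names are hypothetical, the average precision is normalised by `min(len(relevant), k)` (one common convention among several), and the recommended list is assumed to contain no duplicates.

```python
import math


def average_precision_at_k(recommended, relevant, k=10):
    """Average precision at k for one user, with binary relevance."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    # Assumption: normalise by the best achievable number of hits in the top k.
    return score / min(len(relevant), k)


def ndcg_at_k(recommended, relevant, k=10):
    """NDCG at k for one user, with binary relevance and a log2 discount."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # Ideal DCG: all relevant items placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


recommended = ["a", "b", "c", "d"]  # toy ranking, invented for illustration
relevant = {"a", "c"}
print(average_precision_at_k(recommended, relevant, k=4))
print(ndcg_at_k(recommended, relevant, k=4))
```

MAP is then the mean of `average_precision_at_k` over all users. Both metrics reward placing relevant items near the top, but NDCG discounts logarithmically by rank, while average precision only accumulates precision at the ranks where a hit occurs.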
In addition, there are other, DL-based recommendation algorithms that use mainly the Amazon Reviews dataset, in particular the 5-core Movies and TV reviews. These are:
- neural_cf: Neural Collaborative Filtering (Xiangnan He et al., 2017)
- neural_graph_cf: Neural Graph Collaborative Filtering (Xiang Wang et al., 2019)
- mult-vae: Variational Autoencoders for Collaborative Filtering (Dawen Liang et al., 2018)
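The "5-core" subset mentioned above is one in which every remaining user and every remaining item has at least 5 reviews. The dataset is distributed already filtered, so the sketch below is only meant to illustrate what that condition implies; the function name `k_core` and the toy pairs are hypothetical, and each `(user, item)` pair is assumed to appear at most once.

```python
from collections import Counter


def k_core(interactions, k=5):
    """Iteratively drop users and items that have fewer than k interactions.

    Removing an item can push a user below k (and vice versa), so the
    filter is repeated until nothing changes.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


# Toy (user, item) pairs, invented for illustration; k=2 keeps the example small.
pairs = [
    ("u1", "i1"), ("u1", "i2"),
    ("u2", "i1"), ("u2", "i2"),
    ("u3", "i1"), ("u3", "i3"),
]
print(k_core(pairs, k=2))
```

In the toy data, item `i3` is dropped first because it has a single interaction, which leaves `u3` with only one interaction, so `u3` is dropped on the next pass.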
The core of the repo is the set of notebooks in each directory. They are intended to be self-contained and, as a consequence, there is some code repetition. The code is, of course, "notebook-oriented". The notebooks have plenty of explanations and references to relevant papers or packages. My intention was to focus on the code, but you will also find some math.
I hope the code here is useful to someone. If you have any ideas on how to improve the content of the repo, or you want to contribute, let me know.