Scalable PySpark implementation of an algorithm for retrieving similar documents in a corpus.
This project was submitted as the final assignment for the Algorithms for Massive Data class, MSc in Data Science and Economics, University of Milan.
The notebook was run on Google Colab. Commenter privileges have been granted to anyone accessing the notebook via the link.