In this task, we focused on using nearest neighbors and clustering to retrieve documents that interest users by analyzing their text. We explored two document representations: word counts and TF-IDF. We also built a Jupyter notebook for retrieving articles from Wikipedia about famous people.
Then we dug deeper into this application: we compared results from word counts and TF-IDF, explored the retrieval results for various famous people, and familiarized ourselves with the code needed to build a retrieval system.
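To make the comparison concrete, here is a minimal sketch of nearest-neighbor retrieval under both representations, using scikit-learn. It is not the notebook's code: the toy corpus, the placeholder names, the helper `nearest`, and the choice of cosine distance are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for the Wikipedia articles (hypothetical data).
names = ["Politician A", "Politician B", "Musician A", "Musician B"]
texts = [
    "senator elected president law government policy election",
    "governor senate campaign law policy government vote",
    "singer album guitar tour band song record",
    "band drummer album song concert tour record",
]


def nearest(vectorizer, query_index, k=2):
    """Return the names of the k articles closest to texts[query_index]."""
    # Turn every article into a sparse vector (word counts or TF-IDF weights).
    matrix = vectorizer.fit_transform(texts)
    # Cosine distance is an assumption here; other metrics are possible.
    model = NearestNeighbors(n_neighbors=k + 1, metric="cosine", algorithm="brute")
    model.fit(matrix)
    _, indices = model.kneighbors(matrix[query_index])
    # Drop the query article itself, which is always its own nearest neighbor.
    return [names[i] for i in indices[0] if i != query_index][:k]


count_neighbors = nearest(CountVectorizer(), 0, k=1)
tfidf_neighbors = nearest(TfidfVectorizer(), 0, k=1)
print("word counts:", count_neighbors)
print("TF-IDF:     ", tfidf_neighbors)
```

On a corpus this small both representations agree. On real articles, TF-IDF tends to down-weight very common words, so the two neighbor lists can differ.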
- Data: people_wiki.sframe
  Or, if you are using pandas and scikit-learn, you can read people_wiki.csv
- Code: Retrieving Wikipedia articles.ipynb
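
For the pandas route, loading the data might look like the sketch below. The column names `name` and `text` are assumptions about people_wiki.csv, the in-memory sample stands in for the real file so the snippet runs on its own, and `article_for` is a hypothetical helper.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for people_wiki.csv. The column names
# "name" and "text" are assumptions about the real file.
sample_csv = io.StringIO(
    "name,text\n"
    "Person A,first example article about politics\n"
    "Person B,second example article about music\n"
)

# With the real data this would be pd.read_csv("people_wiki.csv").
people = pd.read_csv(sample_csv)


def article_for(people, name):
    """Return the article text for the given person name."""
    rows = people.loc[people["name"] == name, "text"]
    if rows.empty:
        raise KeyError(f"no article found for {name!r}")
    return rows.iloc[0]


article_text = article_for(people, "Person B")
print(article_text)
```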