Retrieves the top 10 documents from the Wikipedia corpus for a user inputted free-text query

manan-paneri-99/Vector-Space-based-Document-Retrieval-system

Vector Space based Document Retrieval system

Completed as an assignment for CS F469, Information Retrieval, at BITS Pilani, Pilani campus, First Semester 2020.

Files

This directory contains the following files:

  • (Sub-directory) Text_corpus- contains the text corpus
  • (Sub-directory) Storage- contains pickle files (.pkl) generated by index_creation.py
  • (Sub-directory) Documents- contains individual documents generated by index_creation.py
  • index_creation.py- Primary code file, constructs the vector space based index
  • query_processer.py- Python file containing functions for query processing
  • test_queries.py- takes a query as input and returns the top 10 retrieved documents
  • WordNetImprovement.py- contains the WordNetimprovement class with methods to perform query relaxation (part 2)
  • README.md
  • requirements.txt
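
To make the idea of a vector space based index concrete, here is a minimal sketch of tf-idf weighting with cosine-style ranking. It is not the code in index_creation.py: the function names (`tokenize`, `build_index`, `rank`), the log-scaled tf-idf weighting, and the whitespace tokenizer are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the real code may differ.
    return text.lower().split()


def build_index(docs):
    """Build L2-normalised tf-idf vectors for docs (dict: doc_id -> text)."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())
    idf = {term: math.log10(n_docs / freq) for term, freq in df.items()}

    vectors = {}
    for doc_id, counts in term_counts.items():
        weights = {t: (1 + math.log10(tf)) * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in weights.items()}
    return idf, vectors


def rank(query, idf, vectors, k=10):
    """Return up to k (doc_id, score) pairs, best first.

    The query vector is left unnormalised: its length is the same for
    every document, so it does not change the ranking.
    """
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_weights = {t: (1 + math.log10(tf)) * idf[t] for t, tf in q_counts.items()}
    scores = {
        doc_id: sum(w * vec.get(t, 0.0) for t, w in q_weights.items())
        for doc_id, vec in vectors.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked[:k] if score > 0]
```

Calling `rank(query, *build_index(docs))` on a small dictionary of documents returns the matching doc ids ordered by score, which mirrors the "top 10 documents" behaviour described for test_queries.py.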

Procedure

  1. Make sure that the files and sub-directories are arranged as specified above. The sub-directories Documents and Storage are created at runtime.
  2. All dependencies are listed in requirements.txt. Make sure that requirements.txt is in the directory containing the code files and that this directory is your present working directory, then execute the following on the command line (Windows):
$ pip install -r requirements.txt
  3. After the installation, run the following code in a Python interpreter:
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
  4. Run index_creation.py first in a Python IDE. It parses the text corpus and creates the index, the .pkl files used during query processing, and a separate document for each doc_id in the parsed corpus.
  5. We are now ready to run test_queries.py to retrieve the documents.
    • Prints the top 10 documents with their scores and titles.
    • The open_web argument allows the user to open the search results in a browser window
    • use_zones enables zonal indexing, suggested as an improvement for ranking and retrieval
    • enable_query_relaxation enables query relaxation using WordNet synsets
      • input 1 as the argument for hypernym-based relaxation
      • input 2 as the argument for synonym-based relaxation
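
The two ranking options above can be sketched roughly as follows. This is an illustration only, not the repository's implementation: `zone_score`, `relax_query`, the zone weights, and the two toy lookup tables are hypothetical, and the tables stand in for the WordNet synsets that the real code queries. The mode values follow the convention above (1 for hypernyms, 2 for synonyms).

```python
# Hypothetical stand-ins for WordNet lookups, so the sketch runs without NLTK data.
TOY_HYPERNYMS = {"car": ["vehicle"], "film": ["show"]}
TOY_SYNONYMS = {"car": ["automobile"], "film": ["movie"]}


def zone_score(query_terms, title_terms, body_terms,
               title_weight=0.7, body_weight=0.3):
    """Weighted zone score: fraction of query terms matched in each zone.

    The weights are assumed values for illustration; they sum to 1 so the
    score stays in [0, 1].
    """
    if not query_terms:
        return 0.0
    title_hits = sum(1 for t in query_terms if t in title_terms)
    body_hits = sum(1 for t in query_terms if t in body_terms)
    n = len(query_terms)
    return title_weight * title_hits / n + body_weight * body_hits / n


def relax_query(terms, mode):
    """Expand a query with related terms.

    mode 1 adds hypernyms, mode 2 adds synonyms. Original terms are kept
    first, and each added term appears only once.
    """
    table = {1: TOY_HYPERNYMS, 2: TOY_SYNONYMS}[mode]
    expanded = list(terms)
    for term in terms:
        for extra in table.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded
```

For example, `relax_query(["fast", "car"], 1)` adds the hypernym from the toy table to the query, while `zone_score` rewards documents whose title, rather than only the body, contains the query terms.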

Note: Since we use multiple data structures to represent the corpus, passing a single location for the inverted index to test_queries.py is not sufficient, and passing multiple locations is inefficient. Therefore, the required files are stored in and read from their default locations.
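
As a rough illustration of the default-location approach described in the note, the sketch below writes several structures to one directory as .pkl files and reads them back by name. The helper names (`save_structures`, `load_structures`) and the structure names used in the usage line are hypothetical; only the Storage directory and the .pkl format come from this README.

```python
import pickle
from pathlib import Path

# Default location, matching the Storage sub-directory listed above.
DEFAULT_STORAGE = Path("Storage")


def save_structures(structures, storage_dir=DEFAULT_STORAGE):
    """Pickle each object in structures (dict: name -> object) to <name>.pkl."""
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in structures.items():
        with open(storage_dir / f"{name}.pkl", "wb") as fh:
            pickle.dump(obj, fh)


def load_structures(names, storage_dir=DEFAULT_STORAGE):
    """Load the named .pkl files from storage_dir into a dict."""
    storage_dir = Path(storage_dir)
    loaded = {}
    for name in names:
        with open(storage_dir / f"{name}.pkl", "rb") as fh:
            loaded[name] = pickle.load(fh)
    return loaded
```

With this layout, a call such as `load_structures(["index", "titles"])` (hypothetical names) needs no path arguments, which is the convenience the note describes.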

All operational details are documented in comments or docstrings in the code files. All discussions are present in the assignment report.
