Completed as an assignment for CS F469 Information Retrieval, BITS Pilani, Pilani Campus, First Semester 2020.

This directory contains the following files:
- Text_corpus (sub-directory)
  - contains the text corpus wiki_00 (download link)
- Storage (sub-directory)
  - contains pickle files (.pkl) generated by index_creation.py
- Documents (sub-directory)
  - contains individual documents generated by index_creation.py
- index_creation.py
  - primary code file; constructs the vector space based index
- query_processer.py
  - Python file containing functions for query processing (see the scoring sketch after this list)
- test_queries.py
  - takes a query as input and returns the top 10 retrieved documents
- WordNetImprovement.py
  - contains the WordNetImprovement class with methods to perform query relaxation (part 2)
- README.md
- requirements.txt
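Retrieval itself follows the standard tf-idf vector space model. The sketch below shows the cosine-style scoring this implies; the names (`score`, `inverted_index`, `doc_lengths`) are illustrative assumptions, not the actual API of query_processer.py:

```python
import math
from collections import Counter

def score(query_terms, inverted_index, doc_lengths, n_docs):
    """Rank documents by tf-idf cosine similarity against the query.

    inverted_index maps term -> {doc_id: term_freq}; doc_lengths holds
    precomputed document vector norms. Query-side normalisation is
    omitted since it does not change the ranking.
    """
    scores = Counter()
    for term, qtf in Counter(query_terms).items():
        postings = inverted_index.get(term, {})
        if not postings:
            continue
        idf = math.log10(n_docs / len(postings))
        w_query = (1 + math.log10(qtf)) * idf
        for doc_id, tf in postings.items():
            scores[doc_id] += w_query * (1 + math.log10(tf))
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return scores.most_common(10)  # top 10, as returned by test_queries.py
```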
- Make sure that the files and sub-directories are in the order specified above; the sub-directories Documents and Storage are created during runtime.
- All the dependencies are listed in requirements.txt. Make sure that requirements.txt is in the directory containing the code files and that this is your present working directory, then execute the following on the command line (Windows):
```
$ pip install -r requirements.txt
```
- After the installation, run the following in a Python interpreter:
```
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
```
- First, run index_creation.py in a Python IDE. This parses the text corpus and creates the index, along with .pkl files to be used during processing and separate documents keyed by their doc_id in the parsed corpus. A rough sketch of this pipeline follows.
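The sketch below assumes wiki_00 is in the usual WikiExtractor format (`<doc id="..." title="...">...</doc>` blocks) and uses illustrative .pkl file names; the actual names are set inside index_creation.py:

```python
import os, re, pickle
from collections import defaultdict
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt')

inverted_index = defaultdict(dict)       # term -> {doc_id: term_freq}
titles = {}

with open('Text_corpus/wiki_00', encoding='utf-8') as f:
    corpus = f.read()

# Documents and Storage are created at runtime
os.makedirs('Documents', exist_ok=True)
os.makedirs('Storage', exist_ok=True)

for match in re.finditer(r'<doc id="(\d+)"[^>]*title="([^"]*)"[^>]*>(.*?)</doc>',
                         corpus, flags=re.S):
    doc_id, title, text = match.groups()
    titles[doc_id] = title
    # write one file per document, keyed by doc_id
    with open(f'Documents/{doc_id}.txt', 'w', encoding='utf-8') as out:
        out.write(text)
    for token in word_tokenize(text.lower()):
        inverted_index[token][doc_id] = inverted_index[token].get(doc_id, 0) + 1

# persist the structures for query time
with open('Storage/index.pkl', 'wb') as out:
    pickle.dump(dict(inverted_index), out)
with open('Storage/titles.pkl', 'wb') as out:
    pickle.dump(titles, out)
```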
- We are now ready to run test_queries.py to retrieve the documents. It prints the top 10 documents with their scores and titles.
  - The open_web argument allows the user to open the search results in a browser window.
  - The use_zones argument implements zonal indexing, suggested as an improvement for ranking and retrieval.
  - The enable_query_relaxation argument implements query relaxation using WordNet synsets (see the sketch after this list):
    - pass 1 as the argument for hypernym-based relaxation
    - pass 2 as the argument for synonym-based relaxation
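The two relaxation modes map naturally onto WordNet's synset API. Below is a minimal sketch under the assumption that expansion terms come from hypernyms (mode 1) or synonym lemmas (mode 2); the function name `relax` is illustrative, and the real logic lives in the WordNetImprovement class:

```python
from nltk.corpus import wordnet  # needs nltk.download('wordnet')

def relax(term, mode):
    """mode 1: hypernym-based relaxation; mode 2: synonym-based relaxation."""
    expansions = set()
    for synset in wordnet.synsets(term):
        if mode == 1:
            for hypernym in synset.hypernyms():
                expansions.update(l.replace('_', ' ') for l in hypernym.lemma_names())
        elif mode == 2:
            expansions.update(l.replace('_', ' ') for l in synset.lemma_names())
    expansions.discard(term)
    return expansions

print(relax('car', 1))  # e.g. motor vehicle, compartment, ...
print(relax('car', 2))  # e.g. automobile, auto, railcar, ...
```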
Note: Since we use multiple data structures to represent the corpus, passing a single location for the inverted index to test_queries.py won't suffice, and passing multiple locations is not efficient. Therefore, we store and read the required files from their default locations.
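For illustration, query time simply reloads the pickled structures from those default locations (the file names here are assumptions matching the sketch above):

```python
import pickle

# reload the persisted structures from their default locations in Storage/
with open('Storage/index.pkl', 'rb') as f:
    inverted_index = pickle.load(f)
with open('Storage/titles.pkl', 'rb') as f:
    titles = pickle.load(f)
```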
All operational details are documented as comments or docstrings in the code files. All discussions are present in the assignment report.