Completed as an assignment for CS F469 Information Retrieval, BITS Pilani, Pilani Campus, First Semester 2020.

This directory contains the following files:
- Text_corpus (sub-directory)
  - contains the text corpus wiki_00 (download link)
- Storage (sub-directory)
  - contains pickle files (.pkl) generated by index_creation.py
- Documents (sub-directory)
  - contains individual documents generated by index_creation.py
- index_creation.py
  - primary code file; constructs the vector space based index
- query_processer.py
  - Python file containing functions for query processing (see the scoring sketch after this list)
- test_queries.py
  - takes a query as input and returns the top 10 retrieved documents
- WordNetImprovement.py
  - contains the WordNetImprovement class with methods to perform query relaxation (part 2)
- README.md
- requirements.txt
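Retrieval itself follows the standard tf-idf vector space model. The sketch below shows the cosine-style scoring this implies; the names (`score`, `inverted_index`, `doc_lengths`) are illustrative assumptions, not the actual API of query_processer.py:

```python
import math
from collections import Counter

def score(query_terms, inverted_index, doc_lengths, n_docs):
    """Rank documents by tf-idf cosine similarity against the query.

    inverted_index maps term -> {doc_id: term_freq}; doc_lengths holds
    precomputed document vector norms. Query-side normalisation is
    omitted since it does not change the ranking.
    """
    scores = Counter()
    for term, qtf in Counter(query_terms).items():
        postings = inverted_index.get(term, {})
        if not postings:
            continue
        idf = math.log10(n_docs / len(postings))
        w_query = (1 + math.log10(qtf)) * idf
        for doc_id, tf in postings.items():
            scores[doc_id] += w_query * (1 + math.log10(tf))
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return scores.most_common(10)  # top 10, as returned by test_queries.py
```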
- Make sure that the files and sub-directories are in the order specified above; the sub-directories Documents and Storage are created during runtime.
- All the dependencies are listed in requirements.txt. Make sure that requirements.txt is in the directory containing the code files and that this is your present working directory, then execute the following on the command line (Windows):
```
$ pip install -r requirements.txt
```
- After the installation, run the following in a Python interpreter:
```
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
```
- First, run index_creation.py in a Python IDE. This parses the text corpus and creates the index, along with .pkl files to be used during processing and separate documents keyed by their doc_id in the parsed corpus. A rough sketch of this pipeline follows.
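The sketch below assumes wiki_00 is in the usual WikiExtractor format (`<doc id="..." title="...">...</doc>` blocks) and uses illustrative .pkl file names; the actual names are set inside index_creation.py:

```python
import os, re, pickle
from collections import defaultdict
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt')

inverted_index = defaultdict(dict)       # term -> {doc_id: term_freq}
titles = {}

with open('Text_corpus/wiki_00', encoding='utf-8') as f:
    corpus = f.read()

# Documents and Storage are created at runtime
os.makedirs('Documents', exist_ok=True)
os.makedirs('Storage', exist_ok=True)

for match in re.finditer(r'<doc id="(\d+)"[^>]*title="([^"]*)"[^>]*>(.*?)</doc>',
                         corpus, flags=re.S):
    doc_id, title, text = match.groups()
    titles[doc_id] = title
    # write one file per document, keyed by doc_id
    with open(f'Documents/{doc_id}.txt', 'w', encoding='utf-8') as out:
        out.write(text)
    for token in word_tokenize(text.lower()):
        inverted_index[token][doc_id] = inverted_index[token].get(doc_id, 0) + 1

# persist the structures for query time
with open('Storage/index.pkl', 'wb') as out:
    pickle.dump(dict(inverted_index), out)
with open('Storage/titles.pkl', 'wb') as out:
    pickle.dump(titles, out)
```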
- We are now ready to run test_queries.py to retrieve the documents. It prints the top 10 documents with their scores and titles.
  - The open_web argument allows the user to open the search results in a browser window.
  - The use_zones argument implements zonal indexing, suggested as an improvement for ranking and retrieval.
  - The enable_query_relaxation argument implements query relaxation using WordNet synsets (see the sketch after this list):
    - pass 1 as the argument for hypernym-based relaxation
    - pass 2 as the argument for synonym-based relaxation
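The two relaxation modes map naturally onto WordNet's synset API. Below is a minimal sketch under the assumption that expansion terms come from hypernyms (mode 1) or synonym lemmas (mode 2); the function name `relax` is illustrative, and the real logic lives in the WordNetImprovement class:

```python
from nltk.corpus import wordnet  # needs nltk.download('wordnet')

def relax(term, mode):
    """mode 1: hypernym-based relaxation; mode 2: synonym-based relaxation."""
    expansions = set()
    for synset in wordnet.synsets(term):
        if mode == 1:
            for hypernym in synset.hypernyms():
                expansions.update(l.replace('_', ' ') for l in hypernym.lemma_names())
        elif mode == 2:
            expansions.update(l.replace('_', ' ') for l in synset.lemma_names())
    expansions.discard(term)
    return expansions

print(relax('car', 1))  # e.g. motor vehicle, compartment, ...
print(relax('car', 2))  # e.g. automobile, auto, railcar, ...
```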
Note: Since we use multiple data structures to represent the corpus, passing a single location for the inverted index to test_queries.py won't suffice, and passing multiple locations is not efficient. Therefore, we store and read the required files from their default locations.
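For illustration, query time simply reloads the pickled structures from those default locations (the file names here are assumptions matching the sketch above):

```python
import pickle

# reload the persisted structures from their default locations in Storage/
with open('Storage/index.pkl', 'rb') as f:
    inverted_index = pickle.load(f)
with open('Storage/titles.pkl', 'rb') as f:
    titles = pickle.load(f)
```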
All operational details are documented as comments or docstrings in the code files. All discussions are present in the assignment report.