A simple application to generate tags for the given text (document) using tf-idf.
Implementation of term frequency - inverse document frequency
Tf-idf stands for term frequency - inverse document frequency. It is used in text mining and information retrieval systems to evaluate how important a word is to a document.
A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the whole corpus.
The term frequency (tf) of a term/word t is given by:
tf = (number of times the term t appears in a document ) / (total number of terms in the same document)
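As a minimal sketch of the formula above (not the code in tagger.py), term frequency can be computed from a list of tokens. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Map each term to (occurrences of the term) / (total number of terms)."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Illustrative input: a naive whitespace split stands in for real tokenization.
tokens = "the cat sat on the mat".split()
tf = term_frequency(tokens)
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
```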
Inverse document frequency (idf) measures how rare a term is across all the documents. The rarer a term, the more weight it is given.
idf = natural_logarithm[ (total number of documents) / (number of documents having the term t) ]
Here, natural_logarithm is the logarithm with base e.
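The idf formula can be sketched in the same way. This is an illustration under assumptions, not the repository's implementation: the names `inverse_document_frequency` and `tf_idf` are hypothetical, and the final score is taken to be the usual product tf * idf.

```python
import math


def inverse_document_frequency(documents):
    """documents is a list of token lists; returns term -> ln(N / df)."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each term once per document, however often it occurs there.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Score each term of one document as tf * idf (assumed combination)."""
    total = len(tokens)
    return {
        term: (tokens.count(term) / total) * idf.get(term, 0.0)
        for term in set(tokens)
    }


# Illustrative corpus of three tiny documents.
docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
idf = inverse_document_frequency(docs)
scores = tf_idf(docs[1], idf)
print(idf["the"], idf["dog"])  # a term found in every document gets idf 0
```

A term that appears in every document, such as "the" here, scores zero, which is why common words tend not to be chosen as tags.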
The only dependency I have used here (from June 25, 2018 onwards) is the nltk library, which is used for lemmatization.
pip install nltk
git clone https://github.com/NISH1001/tag-generator
The tagger uses the documents available inside data/documents/.
So, make sure there is at least one document inside this path.
Run the tagger.py script. [I am assuming you are using Python 3.]
python3 tagger.py
This generates 5 tags by default. A number can be supplied as an argument to get more tags.
python3 tagger.py 7
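Choosing the tags then amounts to keeping the highest-scoring terms. The sketch below assumes the scores are already available as a dictionary; the function name `top_tags` and the score values are made up for illustration and are not taken from tagger.py.

```python
import heapq


def top_tags(scores, n=5):
    """Return the n highest-scoring terms, best first."""
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical tf-idf scores for a single document.
scores = {"cat": 0.41, "dog": 0.37, "the": 0.0, "sat": 0.12}
print(top_tags(scores, 2))  # ['cat', 'dog']
```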
If you want to test a custom text file, put it inside the data folder as data/test.