Uses type/token ratio to calculate lexical diversity.
Pre-processing includes tokenising the input, removing stopwords, and applying nltk's Porter Stemmer to reduce words to their stems.
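The core calculation is the number of distinct types divided by the total number of tokens. A minimal sketch, where a toy regex tokeniser and a hard-coded stopword list stand in for nltk's tokeniser, stopword corpus and Porter Stemmer:

```python
import re

# Hypothetical illustration of the type/token ratio; the real script
# would stem each token with nltk's PorterStemmer before counting.
STOPWORDS = {"the", "a", "an", "and", "on", "of", "to", "in"}

def lexical_diversity(text):
    tokens = re.findall(r"[a-z']+", text.lower())      # crude tokenisation
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)               # types / tokens

print(round(lexical_diversity("the cat sat on the mat and the cat slept"), 4))
# After stopword removal: cat, sat, mat, cat, slept -> 4 types / 5 tokens = 0.8
```

Repeated words (here, "cat") lower the ratio, which is why lyrics with heavy refrains score low.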
Run:
python3 lexical_diversity_calculator.py -n SampleTexts/EdSheeranLyrics.txt
Output:
EdSheeranLyrics.txt lexical diversity: 0.2112
Finds the proportions of adjectives, verbs, nouns and adverbs in a text, categorising remaining types as 'other'.
Preprocessing involves tokenisation of input and removal of stopwords.
Uses nltk's part-of-speech (POS) tagger to assign a part of speech to each input token. Because nltk's POS tagger was trained on the Penn Treebank corpus, it uses the Treebank tag set; this script maps the Treebank tags to WordNet tags before giving the proportions as output.
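The mapping relies on Treebank tag prefixes (JJ* adjectives, VB* verbs, NN* nouns, RB* adverbs). A sketch of that step, with a hard-coded tagged sentence standing in for the output of nltk's pos_tag:

```python
from collections import Counter

def treebank_to_class(tag):
    # Map a Penn Treebank tag to a coarse WordNet-style word class.
    if tag.startswith("JJ"):
        return "Adjectives"
    if tag.startswith("VB"):
        return "Verbs"
    if tag.startswith("NN"):
        return "Nouns"
    if tag.startswith("RB"):
        return "Adverbs"
    return "Other"

# Illustrative tagged tokens; the real script gets these from nltk.pos_tag.
tagged = [("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("quickly", "RB"), ("over", "IN"), ("dog", "NN")]

counts = Counter(treebank_to_class(tag) for _, tag in tagged)
for cls, n in counts.items():
    print(f"{cls}: {round(100 * n / len(tagged), 2)} %")
```

Matching on prefixes covers the inflected variants (e.g. VBD, VBZ, NNS) without enumerating every Treebank tag.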
Run:
python3 word_proportions.py -n SampleTexts/GulliversTravels.txt
Output:
Adjectives: 7.75 %
Verbs: 17.18 %
Nouns: 22.76 %
Adverbs: 5.6 %
Other: 46.7 %