LDA: Latent Dirichlet Allocation is an example of a topic model and is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
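To make the two Dirichlet-distributed models concrete, here is a toy sketch of the generative story LDA assumes. It is not part of the project code: the vocabulary, the number of topics, and the concentration values are made-up illustrations, and only the Python standard library is used.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one probability vector from a symmetric Dirichlet(alpha)."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
vocab = ["bank", "loan", "credit", "drug", "clinic", "patient"]  # hypothetical vocabulary
num_topics = 2  # hypothetical number of topics

# Words-per-topic model: one distribution over the vocabulary for each topic.
topic_word = [sample_dirichlet(0.5, len(vocab), rng) for _ in range(num_topics)]

# Topics-per-document model: one distribution over topics for this document.
doc_topic = sample_dirichlet(0.5, num_topics, rng)

# Each word is generated by first picking a topic, then a word from that topic.
document = []
for _ in range(8):
    topic = rng.choices(range(num_topics), weights=doc_topic)[0]
    word = rng.choices(vocab, weights=topic_word[topic])[0]
    document.append(word)

print(document)
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic mix of each document and the word distribution of each topic.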
1- I observed that, apart from Business Description, the columns in the dataset do not vary much across the labels, so they are not useful for a predictive model.
2- The Business Description feature varies with the Industry Classifications, so I chose this feature.
3- If I can extract the topic from a business description, that could help me tag the business with a particular industry class.
4- The dataset is imbalanced (observed via a bar plot).
5- Cleaning and tokenizing the text:
a- Removing special characters
b- Removing stopwords
c- Using gensim.utils.simple_preprocess, which converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long.
d- Created bigrams and trigrams (references: a blog by Dipanjan, and a video by me explaining language models)
6- Created a dictionary with gensim.corpora.dictionary, which encapsulates the mapping between normalized words and their integer ids
7- Built the LDA model
8- Created topics in the LDA model and viewed the weight of the keywords in each topic
9- Used pyLDAvis for interactive visualization
10- Achieved coherence score: 0.47945523115286265
11- Found the optimal number of topics based on the coherence score
12- Used LDA Mallet, which normally gives better-quality topics
13- Achieved coherence score: 0.5168848885376561