Categorizing Industry on Description

Business

Problem Statement:

Categorize industries into sectors such as Banks, Healthcare, and Oil and Natural Gas.

I used LDA to categorize industries based on their business descriptions.

What the hell is this LDA now?

LDA: Latent Dirichlet Allocation is a topic model that represents each document as a mixture of topics and assigns the document's text to those topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
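To make the two Dirichlet-distributed models concrete, here is a minimal sketch of LDA's generative story using only the standard library. The toy vocabulary, the number of topics, and the concentration values are assumptions chosen for illustration; this is not the model fitted in this project.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, vocab, rng):
    """Generate one toy document following LDA's generative story."""
    # Topics-per-document model: a topic mixture drawn for this document.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["bank", "loan", "hospital", "drug", "oil", "gas"]  # hypothetical vocabulary
# Words-per-topic model: one Dirichlet draw over the vocabulary per topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(3)]
theta, doc = generate_document(
    topic_word, alpha=[0.3, 0.3, 0.3], length=8, vocab=vocab, rng=rng
)
print(theta)
print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and the per-topic word distributions.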


Road map

1- I observed that most columns in the dataset do not vary much across the labels, so they are not useful for a predictive model.

2- The Business Description feature does vary with the Industry Classification, so I chose this feature.

3- If I can extract the topic from a business description, that topic can help me tag the company with a particular industry class.

4- The dataset is imbalanced (observed via a bar plot).
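A quick way to check the imbalance from step 4 without plotting is to count the labels. This is a small sketch with hypothetical labels and counts; the real dataset and its column names are not shown here.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the dataset, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Hypothetical labels standing in for the Industry Classification column.
labels = ["Banks"] * 6 + ["Healthcare"] * 3 + ["Oil and Natural Gas"] * 1
shares = class_shares(labels)

# Text-only bar chart: one '#' per 5% of the dataset.
for label, share in shares.items():
    print(f"{label:<22} {'#' * round(share * 20)} {share:.0%}")
```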

5- Cleaning and tokenizing the text:

a- Removing special characters

b- Removing stopwords

c- Using gensim.utils.simple_preprocess, which converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long

d- Creating bigrams and trigrams (references: a blog by Dipanjan; a video by me explaining language models)

6- Created a dictionary with gensim.corpora.Dictionary, which encapsulates the mapping between normalized words and their integer ids

7- Built the LDA model

8- Created topics in the LDA model and viewed the weight of the keywords in each topic

9- Used pyLDAvis for interactive visualization

10- Achieved a coherence score of 0.47945523115286265

11- Found the optimal number of topics based on the coherence score

12- Used LDA Mallet, which usually gives better-quality topics

13- Achieved a coherence score of 0.5168848885376561 with LDA Mallet