Natural Language Processing for Text Classification and Machine Learning
Text classification is a supervised machine learning task in which a labelled dataset, containing text documents and their labels, is used to train a classifier.
Dataset Preparation step, which includes loading a dataset and performing basic pre-processing. The dataset is then split into training and validation sets.
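A minimal sketch of this step using scikit-learn's `train_test_split`; the toy texts and labels below are invented for illustration (in practice you would load your own documents, e.g. from a CSV file):

```python
from sklearn.model_selection import train_test_split

# Toy labelled dataset (invented for illustration); in practice,
# load your own documents and labels from disk
texts = ["free offer click now", "meeting at noon", "win a prize today",
         "lunch with the team", "claim your free reward", "project status update"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

# Hold out 25% of the documents for validation; stratify preserves
# the label distribution in both splits
train_x, valid_x, train_y, valid_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)
```

The `random_state` argument makes the split reproducible across runs.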
In this step, raw text data will be transformed into feature vectors, and new features will be created using the existing dataset. We will implement the following ideas to obtain relevant features from our dataset.
2.1 Count Vectors as features
2.2 TF-IDF Vectors as features
-Word level
-N-Gram level
-Character level
2.3 Word Embeddings as features
2.4 Text / NLP based features
2.5 Topic Models as features
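The first two ideas can be sketched directly with scikit-learn; word embeddings (2.3) and topic models (2.5) need pretrained vectors or a fitted topic model, so only 2.1 and 2.2 are shown here on an invented two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# 2.1 Count vectors: one column per vocabulary term, raw counts as values
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)

# 2.2 TF-IDF vectors at the three granularities listed above
word_tfidf = TfidfVectorizer(analyzer="word")                       # word level
ngram_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(2, 3))  # n-gram level
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))   # character level

word_features = word_tfidf.fit_transform(corpus)
```

Each vectorizer learns its vocabulary from the corpus with `fit_transform` and returns a sparse document-term matrix.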
Model Building step, in which a machine learning model is trained on the labelled dataset. We will try the following classifiers:
-Naive Bayes Classifier
-Linear Classifier
-Support Vector Machine
-Bagging Models
-Boosting Models
-Shallow Neural Networks
-Deep Neural Networks
-Convolutional Neural Network (CNN)
-Long Short-Term Memory (LSTM)
-Gated Recurrent Unit (GRU)
-Bidirectional RNN
-Recurrent Convolutional Neural Network (RCNN)
-Other Variants of Deep Neural Networks
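As a minimal sketch of the model-building step, the first classifier in the list (Naive Bayes) can be trained on TF-IDF features in a few lines; the training texts below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training data for illustration
train_texts = ["free offer click now", "win a prize today", "claim your free reward",
               "meeting at noon", "lunch with the team", "project status update"]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Chain the feature-extraction step and the classifier into one pipeline,
# so raw text can be passed straight to fit() and predict()
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

prediction = model.predict(["free prize offer"])
```

Any of the other classifiers in the list (e.g. `LinearSVC` or `LogisticRegression`) can be swapped in for `MultinomialNB` without changing the rest of the pipeline.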
Model Evaluation step, in which the trained classifier is assessed using the following metrics:
- accuracy: proportion of test results that are correct
- sensitivity: proportion of true +ves identified
- specificity: proportion of true -ves identified
- positive likelihood: increased probability of true +ve if test +ve
- negative likelihood: reduced probability of true +ve if test -ve
- false positive rate: proportion of false +ves in true -ve patients
- false negative rate: proportion of false -ves in true +ve patients
- positive predictive value: chance of true +ve if test +ve
- negative predictive value: chance of true -ve if test -ve
- precision = positive predictive value
- recall = sensitivity
- f1 = (2 * precision * recall) / (precision + recall)
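All of the metrics above can be derived from the four confusion-matrix counts (true/false positives and negatives); a sketch with invented counts:

```python
# Derive every metric listed above from the confusion-matrix counts
def classification_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)    # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)      # positive predictive value
    fpr = fp / (fp + tn)            # false positive rate
    fnr = fn / (fn + tp)            # false negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive likelihood": sensitivity / fpr,
        "negative likelihood": fnr / specificity,
        "false positive rate": fpr,
        "false negative rate": fnr,
        "precision": precision,
        "negative predictive value": tn / (tn + fn),
        "recall": sensitivity,
        "f1": (2 * precision * sensitivity) / (precision + sensitivity),
    }

# Example confusion matrix: 40 true +ves, 10 false +ves,
# 45 true -ves, 5 false -ves (invented numbers)
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note this sketch assumes no count is zero; with a degenerate confusion matrix (e.g. no negatives at all) some ratios would divide by zero.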
Finally, we will use different ways to improve the performance of the text classifiers.
Using ELI5