target column: Sentiment
- convert date column to pandas datetime format
- drop duplicate tweets
- extract columns OriginalTweet, Sentiment
- clean the OriginalTweet by removing:
- emojis
- newlines and convert text to lowercase
- links and mentions
- non utf8/ascii characters
- hashtags
- & and $ present in words
- multiple spaces
- compute the word count of the cleaned text to check whether any tweets end up empty after cleaning, then drop tweets with fewer than 5 words
- tokenize the tweets using BERT tokenizer
- remove tweets with token lengths >80 as they don't appear to be in English
- map the Sentiment column down to 3 classes: Negative: 0, Neutral: 1, Positive: 2
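The cleaning steps above can be sketched as a small pandas pipeline. The regexes and the sample tweet below are illustrative, not the notebook's exact patterns; the ASCII re-encode also drops emojis as a side effect:

```python
import re
import pandas as pd

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps listed above (regexes are illustrative)."""
    text = text.replace("\n", " ").lower()             # newlines + lowercase
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # links
    text = re.sub(r"[@#]\w+", "", text)                # mentions and hashtags
    text = text.encode("ascii", "ignore").decode()     # non-ASCII (also removes emojis)
    text = re.sub(r"[&$]", "", text)                   # stray & and $ inside words
    text = re.sub(r"\s+", " ", text).strip()           # collapse multiple spaces
    return text

raw = "Check this out! https://t.co/x @user #covid \n GREAT & cheap $tock"
df = pd.DataFrame({"OriginalTweet": [raw], "Sentiment": ["Positive"]})
df["clean"] = df["OriginalTweet"].map(clean_tweet)
df = df[df["clean"].str.split().str.len() >= 5]        # drop tweets with <5 words
```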
A balanced training dataset generally gives a model:
- higher accuracy
- higher balanced accuracy
- a balanced detection rate across classes

We use RandomOverSampler, which:
- randomly duplicates examples in the minority classes.
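As a sketch of what RandomOverSampler does (here re-implemented with plain pandas rather than imblearn; the class counts are made up for illustration), each minority class is resampled with replacement up to the majority-class count:

```python
import pandas as pd

# Toy imbalanced label distribution (counts are illustrative).
df = pd.DataFrame({"label": [0] * 10 + [1] * 4 + [2] * 6})

# Duplicate minority-class rows at random until every class
# matches the majority-class count.
max_count = df["label"].value_counts().max()
balanced = pd.concat(
    [g.sample(max_count, replace=True, random_state=42) for _, g in df.groupby("label")],
    ignore_index=True,
)
```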
We convert each categorical value into a new column and assign it a binary value of 1 or 0. Each integer label is represented as a binary vector: all entries are zero except the one at the label's index, which is set to 1.
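Concretely, with the 3 sentiment classes this one-hot encoding can be done with an identity matrix lookup (the labels below are example values):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])          # Negative=0, Neutral=1, Positive=2
one_hot = np.eye(3, dtype=int)[labels]   # each row: zeros except a 1 at the class index
```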
- Precision : TP / (TP + FP)
- Recall : TP / (TP + FN)
- F1 Score : 2 * Precision * Recall / (Precision + Recall)
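The three formulas above compute directly from the confusion counts; the counts in this example are made up for illustration:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```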
- Naive Bayes Classifier: F1 Score: 70%
- Logistic Regression Classifier: F1 Score: 78%
- Random Forest Classifier: F1 Score: 67%
- K Nearest Neighbours Classifier: F1 Score: 47%
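A baseline of this kind is typically a vectorizer plus classifier pipeline in scikit-learn; here is a minimal sketch with logistic regression (the best classical baseline above) on toy stand-in data, not the actual tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real pipeline would use the cleaned tweets and labels.
texts = ["love this so much", "worst thing ever", "it is okay i guess",
         "really love it", "hate hate hate", "just average overall"]
labels = [2, 0, 1, 2, 0, 1]  # 0=Negative, 1=Neutral, 2=Positive

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
preds = clf.predict(["i love it", "worst ever"])
```

Swapping `LogisticRegression` for `MultinomialNB`, `RandomForestClassifier`, or `KNeighborsClassifier` reproduces the other baselines.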
optimizer: Adam
learning rate: 1e-5
decay rate: 1e-7
epochs: 1
loss: categorical cross-entropy
metric: categorical accuracy
input layer: 128 neurons
output layer: 3 neurons
activation: softmax
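The hyperparameters above translate to a Keras head roughly like the following sketch. The 768-dimensional input (BERT's pooled-output size) and the hidden layer's relu activation are assumptions not stated above, and the `decay` argument belongs to legacy Keras optimizers (newer versions use learning-rate schedules instead):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(768,)),  # assumed BERT pooled-output size
    tf.keras.layers.Dense(3, activation="softmax"),                     # Negative / Neutral / Positive
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-7),
    loss="categorical_crossentropy",
    metrics=["categorical_accuracy"],
)
# model.fit(features, one_hot_labels, epochs=1)
```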
- Precision : TP / (TP + FP)
- Recall : TP / (TP + FN)
- F1 Score : 2 * Precision * Recall / (Precision + Recall)
- Confusion Matrix
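A confusion matrix tallies true-vs-predicted class counts; a minimal hand-rolled version (equivalent to `sklearn.metrics.confusion_matrix`, with made-up example labels) looks like:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([0, 1, 2, 2, 1], [0, 2, 2, 2, 1])
```

The diagonal holds the correct predictions, so per-class TP/FP/FN (and hence precision, recall, and F1) all read directly off this matrix.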
- BERT: F1 Score: 85%
- RoBERTa: F1 Score: 88%