adibyte95/Twittter-sentiment-analysis
Twitter sentiment analysis

Topic - to take Twitter tweets and classify each tweet as positive (reflecting positive sentiment) or negative (reflecting negative sentiment)

1. Dataset

I have used a Kaggle data set (Click here).
Training and testing are done on the provided data set.
The data set has about 50k positive tweets and 40k negative tweets.
[pos_neg chart: class distribution of positive vs negative tweets]
Plot of word frequency against the words:
[freq_vs_words]
This graph follows Zipf's law. Learn more about Zipf's law Here.

2. Preprocessing

To train a classifier, we first have to convert each input tweet into a format that can be given to the classifier; this step is called preprocessing.
It involves several steps:

2.1 Hashtags

A hashtag is a word or phrase preceded by a hash sign (#), used on social media websites and applications, especially Twitter, to identify messages on a specific topic.
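A common way to handle hashtags during preprocessing is to keep the tag word but drop the hash sign, so the word itself stays available as a feature. A minimal sketch (the helper name is hypothetical, not necessarily this repository's implementation):

```python
import re

def strip_hashtags(tweet):
    # Replace "#topic" with the bare word "topic" so the word
    # still contributes to the feature set.
    return re.sub(r'#(\w+)', r'\1', tweet)

print(strip_hashtags("loving the weather #sunny #monday"))
# -> loving the weather sunny monday
```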

2.2 URLs

URLs are used to share links to other sites in tweets. We permanently remove links from the input text, as they do not provide any information about the sentiment of the text.
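Link removal can be done with a simple regular expression; this is a sketch of the idea, not necessarily the exact pattern used in the repository:

```python
import re

# Matches http(s) links and bare "www." links up to the next whitespace.
URL_RE = re.compile(r'(?:https?://|www\.)\S+')

def remove_urls(tweet):
    # Drop the link and collapse any doubled whitespace left behind.
    return ' '.join(URL_RE.sub('', tweet).split())

print(remove_urls("check this out https://t.co/abc123 so cool"))
# -> check this out so cool
```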

2.3 Emoticons

Emoticons are widely used on social networking sites to represent human expressions. Currently we remove them from the text.
How useful emoticons are for the purpose of sentiment analysis remains part of the future work.
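A sketch of how ASCII emoticons can be stripped with a regular expression (Unicode emoji would need a separate codepoint-range filter; this pattern is an illustration, not the repository's exact rule):

```python
import re

# Matches common ASCII emoticons such as :) ;D :-( =P
EMOTICON_RE = re.compile(r"[:;=8][\-o\*\']?[\)\]\(\[dDpP/\\]")

def remove_emoticons(tweet):
    # Strip emoticons and collapse leftover whitespace.
    return ' '.join(EMOTICON_RE.sub('', tweet).split())

print(remove_emoticons("great match :) but sad ending :("))
# -> great match but sad ending
```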

2.4 Punctuation

We remove punctuation from the input text:
input - Arjun said "Aditya is a good boy"
output - Arjun said Aditya is a good boy
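The example above can be reproduced with the standard library; a minimal sketch:

```python
import string

def remove_punctuation(text):
    # str.translate with a deletion table drops every ASCII
    # punctuation character in one pass.
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuation('Arjun said "Aditya is a good boy"'))
# -> Arjun said Aditya is a good boy
```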

2.5 Repeating characters

We trim repeated characters in the text:
input - yayyyyy ! i got the job
output - yayy ! i got the job
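Collapsing elongated words down to at most two repeats, as in the example, can be done with a back-referencing regex; a sketch:

```python
import re

def reduce_repeats(text):
    # "(.)\1{2,}" matches any character repeated three or more
    # times; we replace the run with exactly two copies.
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(reduce_repeats("yayyyyy ! i got the job"))
# -> yayy ! i got the job
```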

2.6 Stemming

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem".
A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
The Porter stemmer is used here.
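The project uses the Porter stemmer (available in NLTK as nltk.stem.PorterStemmer). As a toy illustration of the suffix-stripping idea only (this is not Porter's actual algorithm):

```python
def naive_stem(word):
    # Toy suffix stripper for illustration; a real Porter stemmer
    # applies staged rules with measure conditions.
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem("fishing"), naive_stem("fished"), naive_stem("cats"))
# -> fish fish cat
```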


3. Features

We have used the following as features:

  1. unigrams
  2. bigrams
  3. unigrams + bigrams
  4. unigrams + bigrams + trigrams
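These feature sets can also be produced with scikit-learn's CountVectorizer (e.g. ngram_range=(1, 2) for unigrams + bigrams); as a standard-library sketch of what an n-gram feature extractor does:

```python
def ngrams(tokens, n):
    # All contiguous n-token windows, joined into single feature strings.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i got the job".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
# "unigrams + bigrams" is simply the union of the two feature lists
uni_bi = unigrams + bigrams
print(uni_bi)
```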

4. Experiments

We have used three models with the above-mentioned features. Note that all the results shown here are test results, obtained by submitting the output on the test file to Kaggle.


4.1 Naive Bayes classifier

[naive_bayes_classifier: results chart]
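As a sketch of what this classifier does (on toy data, not the repository's implementation or features), here is a minimal multinomial Naive Bayes with Laplace smoothing:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, label); returns per-class log-priors
    # and Laplace-smoothed log-likelihoods.
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    V = len(vocab)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        model[label] = {
            'prior': math.log(class_counts[label] / len(docs)),
            'likelihood': {w: math.log((word_counts[label][w] + 1) / (total + V))
                           for w in vocab},
            'default': math.log(1 / (total + V)),  # unseen-word fallback
        }
    return model

def predict_nb(model, tokens):
    # Pick the class maximizing log-prior + sum of log-likelihoods.
    def score(label):
        m = model[label]
        return m['prior'] + sum(m['likelihood'].get(w, m['default']) for w in tokens)
    return max(model, key=score)

docs = [("good great love".split(), "pos"),
        ("bad awful hate".split(), "neg"),
        ("love this".split(), "pos"),
        ("hate this".split(), "neg")]
model = train_nb(docs)
print(predict_nb(model, "love it".split()))
# -> pos
```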


4.2 Maximum Entropy Classifier

[maximum_entropy_classifier: results chart]

4.3 XGBoost

[XGBoost classifier: results chart]

5. Results

For all of the classifiers shown above, using only unigrams gives the lowest accuracy, whereas the maximum accuracy is achieved by the Maximum Entropy classifier using unigrams + bigrams + trigrams as features.

Real Data

We used the Sentiment140 data set, which contains nearly 1.6 million tweets with positive, negative and neutral labels.
The data set is also provided in the data folder.
We then used the pull_tweets.py file to pull tweets from Twitter for a particular hashtag and predict their sentiment. Here we used the Maximum Entropy classifier with unigram + bigram + trigram features; we have not tried any other models due to lack of processing power.
We pulled tweets from two hashtags:

  1. ramdaan
  2. SaveDemocracy

Results are shown below.

[ramdaan: results chart]

[SaveDemocracy: results chart]


Note

I am open to pull requests for further modifications to this project

Future Work

  1. to use other sets of features and classifiers to improve accuracy
  2. to use emojis as a feature for sentiment analysis and check how they affect the accuracy of the classifier