Riptwitter was trending on Twitter when Elon Musk took charge. Let's collect tweets under the hashtag using the Twitter API and analyze the tweet sentiment.

nnvij/Twitter-Sentiment-Analysis-BigData

Twitter-Sentiment-Analysis--BigData

Introduction:

  • Sentiment analysis examines text data in order to understand the underlying sentiment (positive or negative).
  • Sentiment analysis uses natural language processing (NLP) and machine learning to determine the emotional intent behind a communication.
  • Twittershutdown and Riptwitter were trending when Elon Musk took charge and hundreds of Twitter employees sent in their resignations.
  • This project performs sentiment analysis on tweets collected for the hashtags #Twittershutdown, #Riptwitter, #Elon Musk and so on.

Problem Statement:

  • Collect tweets using the Twitter API from an AWS EC2 instance, stream them through Kinesis Firehose, and store the data in an AWS S3 bucket.
  • Create a binary classification model to classify the sentiment of each tweet (positive or negative); label = sentiment (0 = negative, 1 = positive).
  • Create a QuickSight dashboard for the collected data and for the predictions from the classification model.

Tools used:

  • AWS, Twitter API, Amazon Kinesis Firehose, PySpark, Amazon QuickSight, Databricks

Data

Data Collection:

  • 399,333 tweets were collected using the Twitter API and stored in AWS S3.
  • In the Databricks environment, connect to the S3 bucket and mount the data by creating a Spark session.
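
As a rough illustration of the collection step, the sketch below serializes one tweet as a newline-terminated JSON record and hands it to a Firehose-style client. It assumes a boto3 Firehose client, whose `put_record` call takes `DeliveryStreamName` and `Record={"Data": ...}`. The stream name `twitter-stream`, the tweet fields, and the helper names are hypothetical and are not taken from this repository.

```python
import json


def to_firehose_record(tweet: dict) -> dict:
    """Serialize a tweet as one newline-terminated JSON line (bytes)."""
    line = json.dumps(tweet, ensure_ascii=False) + "\n"
    return {"Data": line.encode("utf-8")}


def send_tweet(client, stream_name: str, tweet: dict) -> None:
    """Forward one tweet to a Kinesis Data Firehose delivery stream.

    `client` is assumed to behave like boto3.client("firehose").
    """
    client.put_record(DeliveryStreamName=stream_name,
                      Record=to_firehose_record(tweet))


class FakeFirehose:
    """Stand-in client that records calls, so the sketch runs without AWS."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record))


fake = FakeFirehose()
# Hypothetical tweet fields and stream name, for illustration only.
send_tweet(fake, "twitter-stream", {"id": 1, "text": "#RIPTwitter trending"})
```

With real credentials, `boto3.client("firehose")` would take the place of `FakeFirehose`. The trailing newline keeps records separable, since Firehose concatenates records when it writes them to S3.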

Data preprocessing:

  • Created a PySpark DataFrame for the Twitter data.
  • Checked for null values and dropped rows containing them.
  • Converted create_at to a datetime column.
  • Used regular expressions to clean the tweet and location columns.
  • TextBlob, a Python library for text analysis, was used to assign a sentiment to each tweet.
  • Created a Sentiment column with value 0 for tweets with negative sentiment and 1 for tweets with positive sentiment.
  • After cleaning, 135,083 tweets remained: 45,760 with positive sentiment and 89,323 with negative sentiment.
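
The cleaning and labelling steps above might look roughly like the following plain-Python sketch. The regular expressions, the function names, and the rule that a polarity above 0 counts as positive are assumptions for illustration; the README does not state the exact patterns or threshold. In practice the polarity would come from TextBlob (`TextBlob(text).sentiment.polarity`), which returns a value between -1 and 1.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip URLs, @mentions, and non-letter characters (assumed rules)."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)  # drops '#' but keeps the tag word
    return SPACE_RE.sub(" ", text).strip()


def label_sentiment(polarity: float) -> int:
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative).

    Assumption: anything not strictly positive is labelled 0.
    """
    return 1 if polarity > 0 else 0


cleaned = clean_tweet("RT @user: #RipTwitter is trending!! https://t.co/abc123")
```

How neutral tweets (polarity exactly 0) are handled changes the class balance, so the threshold is worth stating explicitly in any real pipeline.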

Model:

Feature Engineering:

  • Tokenizer converts the tweet column to lowercase and splits it on whitespace, outputCol="tokens".
  • StopWordsRemover removes stop words from tokens, outputCol="filtered".
  • CountVectorizer converts the filtered tokens into a matrix of token counts, outputCol="cv".
  • IDF (inverse document frequency) rescales the counts so that words appearing in many tweets carry less weight, which highlights the more relevant words; it can also filter out very rare terms (via minDocFreq), outputCol="1gram_idf".
  • NGram (n=2) is a feature transformer that converts the input array of strings into an array of n-grams, outputCol="2gram".
  • HashingTF maps a sequence of terms to their term frequencies using the hashing trick, numFeatures=20000, outputCol="2gram_tf".
  • IDF is applied again to the bigram term frequencies, outputCol="2gram_idf".
  • VectorAssembler merges the "1gram_idf" and "2gram_tf" columns into a single vector column, "rawFeatures".
  • ChiSqSelector treats the entries of rawFeatures as categorical features, ranks them with a chi-squared test against the label, and reduces the number of features to 16000, outputCol="features".
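
To make the stages above concrete, here is a small plain-Python sketch of what they compute on toy data: tokenizing, removing stop words, building bigrams, hashing terms into a fixed number of buckets, and weighting by inverse document frequency. It is not the PySpark API. The stop-word list and the toy tweets are made up, `zlib.crc32` stands in for the hash function Spark uses, and the IDF formula log((N + 1) / (df + 1)) follows the one documented for Spark's IDF.

```python
import math
import zlib
from collections import Counter

STOP_WORDS = {"is", "the", "a", "to"}  # tiny illustrative list, not Spark's


def tokenize(text):
    """Lowercase and split on whitespace, as the Tokenizer step describes."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    """Join each pair of consecutive tokens with a space (n = 2)."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def hashing_tf(terms, num_features):
    """Hashing trick: count terms per bucket index (crc32 is a stand-in)."""
    return Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)


def idf_weights(docs_tf):
    """IDF per bucket: log((N + 1) / (df + 1)), with N documents."""
    n_docs = len(docs_tf)
    doc_freq = Counter(bucket for tf in docs_tf for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


tweets = ["Twitter is shutting down", "the end of Twitter", "Twitter down again"]
filtered = [remove_stop_words(tokenize(t)) for t in tweets]
two_grams = [bigrams(tokens) for tokens in filtered]
tf = [hashing_tf(grams, num_features=20000) for grams in two_grams]
idf = idf_weights(tf)
tfidf = [{b: count * idf[b] for b, count in doc.items()} for doc in tf]
```

Because hashing maps an unbounded vocabulary of bigrams into a fixed 20000 buckets, no vocabulary has to be stored, at the cost of occasional collisions between unrelated terms.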

Model Development and Evaluation:

  • The data was split into 90% training and 10% test data.
  • The Sentiment column is the label: 0 = negative sentiment, 1 = positive sentiment.
  • We tried RandomForestClassifier and LogisticRegression models to classify whether each tweet in the test data is positive or negative.
  • With RandomForestClassifier we achieved 66% accuracy and a ROC-AUC score of 72.87%.
  • Classification report for RandomForestClassifier: [image]
  • LogisticRegression gave an accuracy of 90.425% and a ROC-AUC score of 92.83%.
  • Classification report for LogisticRegression: [image]

The LogisticRegression model gave better accuracy; its predictions are saved back to the AWS S3 bucket.
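
For readers unfamiliar with the two metrics quoted above, the sketch below computes accuracy and ROC-AUC in plain Python on a handful of made-up labels and scores. The data is purely illustrative and the function names are hypothetical; the project's own numbers come from its PySpark models, not from this code.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half; this pairwise form is O(P * N), fine for toy data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = positive sentiment, 0 = negative sentiment.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.7, 0.6, 0.9]  # model's positive-class scores
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
```

Accuracy depends on the 0.5 cut-off used to turn scores into labels, whereas ROC-AUC only depends on how the scores rank positives against negatives. With about two thirds of the tweets labelled negative, the two metrics can diverge noticeably, so it helps to report both.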

QuickSight Dashboard

Tweets post data preprocessing:

  • 66% of the tweets had negative sentiment.
  • Across the top 10 locations by number of tweets, location does not appear to contribute to tweet sentiment: the locations have almost equal percentages of negative and positive tweets.

Predictions:

  • Of the 8.9K negative tweets, the model correctly predicted 8.19K as having negative sentiment.
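
The figures above correspond to the recall of the negative class. As a quick worked check, the sketch below uses the rounded dashboard values (8.9K negative tweets, 8.19K predicted correctly); these are approximations, not exact counts, and the helper name is hypothetical.

```python
def recall(true_hits: float, actual_total: float) -> float:
    """Share of the actual members of a class that the model identified."""
    return true_hits / actual_total


# Rounded values from the dashboard; exact counts are not given in the README.
negative_total = 8_900
negative_correct = 8_190

negative_recall = recall(negative_correct, negative_total)
print(f"Negative-class recall: {negative_recall:.1%}")
```

With these rounded inputs the negative-class recall comes out at roughly 92%.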
