In this report, we present a study of sentiment analysis on Twitter data, where the task is to predict whether the smiley contained in a tweet is happy :) or sad :(. We experimented with today's most common solutions, such as text preprocessing and supervised classification techniques. We mixed and matched these techniques to evaluate how each combination influenced the accuracy of our predictions. Our predictor currently obtains an accuracy of 0.85.
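To make the task concrete, here is a minimal, dependency-free sketch of bag-of-words classification on toy tweets. It is an illustration only: the naive Bayes model, the function names, the toy tweets, and the label values 1 (happy) and -1 (sad) are all assumptions made for this example, not the code or the model behind the reported score. The project's actual classifiers live in prediction.py and rely on scikit-learn.

```python
import math
from collections import Counter, defaultdict


def tokenize(tweet):
    """Split a tweet into lowercase whitespace-separated tokens."""
    return tweet.lower().split()


def train_naive_bayes(tweets, labels):
    """Collect per-class word counts (the bag-of-words) and class frequencies."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for tweet, label in zip(tweets, labels):
        word_counts[label].update(tokenize(tweet))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def predict(model, tweet):
    """Return the label with the highest log-probability under the model."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(tweet):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Laplace (add-one) smoothing avoids zero probabilities.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


# Toy data, invented for illustration; 1 = happy, -1 = sad (assumed labels).
train_tweets = [
    "i love this sunny day",
    "what a great happy morning",
    "i hate this awful traffic",
    "so sad and tired today",
]
train_labels = [1, 1, -1, -1]
model = train_naive_bayes(train_tweets, train_labels)

test_tweets = ["love this great day", "awful sad traffic"]
test_labels = [1, -1]
predictions = [predict(model, tweet) for tweet in test_tweets]
print("accuracy:", accuracy(predictions, test_labels))
```

Accuracy here is simply the fraction of tweets whose predicted label matches the true one, which is how the 0.85 figure above should be read.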
In order to run the project you will need the following dependencies installed:
- Anaconda3 - Download and install Anaconda with Python 3
- Scikit-Learn - Install the scikit-learn library with conda
  $ conda install scikit-learn
- Pandas - Install the pandas library with conda
  $ conda install pandas
- NLTK - Download all packages of NLTK
  $ python
  >>> import nltk
  >>> nltk.download()
  then download all packages from the GUI
- Matplotlib - Optional - Needed to see the beautiful plots in our notebook!
  $ pip install matplotlib
- Train and Test Data - Download all files here in order to train and test the models, and move them into the data/twitter-datasets/ directory.

The project contains the following files:
- constants.py - all constants used, such as file names and label values.
- data_cleaning.py - methods used for data cleaning.
- data_exploration.py - methods used to explore the data, like extracting and counting hashtags.
- data_loading.py - methods used for data loading and DataFrame creation.
- prediction.py - methods to classify (BoW, TF-IDF), to cross-validate and to create the submission csv.
- run.py - main entry point, uses the above functions to generate the best available submission.
- utils.py - logging utility.
- Run_All_Combinations.ipynb - notebook we used to find the best parameter combinations and to generate plots.
- Data_Exploration.ipynb - notebook we used to explore the data and find out which cleaning methods we needed to apply.
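
As an illustration of the kind of exploration and cleaning described in the list above, the following is a minimal sketch using only the standard library. The function names and the specific cleaning rules (lowercasing, dropping the `#` from hashtags, squeezing repeated characters) are hypothetical choices for this example and may differ from what data_exploration.py and data_cleaning.py actually do.

```python
import re
from collections import Counter

# A hashtag is '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(tweet):
    """Return the hashtags in a tweet, lowercased and without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)]


def count_hashtags(tweets):
    """Count how often each hashtag appears across a collection of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(extract_hashtags(tweet))
    return counts


def clean_tweet(tweet):
    """Apply a few simple, illustrative cleaning steps to one tweet."""
    text = tweet.lower()
    text = HASHTAG_PATTERN.sub(r"\1", text)      # '#sunny' -> 'sunny'
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # 'sooooo' -> 'soo'
    return " ".join(text.split())                # collapse whitespace


# Toy tweets, invented for illustration.
sample_tweets = [
    "Sooooo happy today #Sunny #weekend",
    "stuck in   traffic again #monday #sunny",
]
print(count_hashtags(sample_tweets).most_common(1))
print([clean_tweet(tweet) for tweet in sample_tweets])
```

Counting hashtags before cleaning is one way to judge whether they carry enough signal to keep as plain words rather than drop entirely.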
To reproduce the submission corresponding to our crowdAI ranking, run the following command:
$ python3 run.py
The submission can be found in the file preds/submission_clean_tweet.csv.
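
After the run finishes, a quick sanity check of the generated file can catch an empty or malformed submission before uploading. The sketch below is a hypothetical helper, not part of the project: it assumes only that the CSV has one header row and that the predicted label sits in the last column, and it makes no assumption about column names or label values.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_submission(path):
    """Return (number of data rows, Counter of the values in the last column)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        rows = [row for row in reader if row]  # ignore blank lines
    return len(rows), Counter(row[-1] for row in rows)


# Only inspect the file if run.py has already produced it.
submission = Path("preds/submission_clean_tweet.csv")
if submission.exists():
    n_rows, label_counts = summarize_submission(submission)
    print("rows:", n_rows)
    print("label distribution:", dict(label_counts))
```

A heavily skewed label distribution, or a row count that does not match the test set, usually points to a problem upstream in loading or cleaning.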
The leaderboard can be found on crowdAI.
Our submission - username baraschi / submission id 24870 - can be found here.