Track specific hashtags or keywords on Twitter and analyze the matching tweets in real time.
Create your own src/config.json file with your Twitter API credentials.
{ "asecret": "XXX...XXX",
  "atoken": "XXX...XXX",
  "csecret": "XXX...XXX",
  "ckey": "XXX...XXX" }
Edit the conf/parameters.json file to set the parameters.
{ "hashtag": "#overwatch",
  "DStream": { "batch_interval": "60",
               "window_time": "60",
               "process_times": "60" }
}
Suggestion: Set batch_interval and window_time to multiples of 60.
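As a minimal sketch of how a script might read the two files above: the helper names `load_credentials`, `load_stream_settings`, and `not_multiple_of_60` are illustrative assumptions and are not taken from the project's source. Only the key names and file paths come from the examples shown here.

```python
import json

# Key names taken from the src/config.json example above.
CREDENTIAL_KEYS = ("ckey", "csecret", "atoken", "asecret")


def load_credentials(path="src/config.json"):
    """Read the Twitter API credentials and fail early if a key is missing."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in CREDENTIAL_KEYS if key not in config]
    if missing:
        raise KeyError("missing keys in {}: {}".format(path, ", ".join(missing)))
    return {key: config[key] for key in CREDENTIAL_KEYS}


def load_stream_settings(path="conf/parameters.json"):
    """Return the tracked hashtag and the DStream values converted to integers."""
    with open(path, encoding="utf-8") as f:
        params = json.load(f)
    # The file stores the numbers as strings, so convert them explicitly.
    dstream = {name: int(value) for name, value in params["DStream"].items()}
    return params["hashtag"], dstream


def not_multiple_of_60(dstream):
    """List the settings that do not follow the multiple-of-60 suggestion."""
    return [name for name in ("batch_interval", "window_time")
            if dstream[name] % 60 != 0]
```

With tweepy, the four credential values are typically passed to `tweepy.OAuthHandler(ckey, csecret)` followed by `auth.set_access_token(atoken, asecret)`. Whether src/stream.py does exactly this is an assumption.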
Start a mongod process
sudo mongod
Run Spark jobs to train a Naive Bayes model for later sentiment analysis.
$SPARK_HOME/bin/spark-submit src/model.py > log/model.log
You can check the accuracy of the trained model in log/model.log:
>>> Accuracy
0.959944108057755
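The reported figure is presumably the fraction of held-out examples whose predicted label matches the true label. A minimal sketch of that calculation follows; the function name and the sample labels are hypothetical, and src/model.py may compute the figure differently.

```python
def accuracy(predictions_and_labels):
    """Fraction of (prediction, label) pairs in which the two values agree."""
    pairs = list(predictions_and_labels)
    if not pairs:
        raise ValueError("no examples to evaluate")
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / len(pairs)


# Hypothetical labels: 1.0 = positive, 0.0 = negative.
sample = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]
print(accuracy(sample))  # 3 of 4 predictions are correct
```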
Start the tweet streaming script. It waits for a connection before it starts streaming tweets.
python3.4 src/stream.py
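The wait-for-connection behaviour can be sketched with a plain TCP server socket. This is an illustration under assumptions, not the contents of src/stream.py: the function names, the default host and port, and the newline-delimited framing are all hypothetical. The framing is chosen because Spark Streaming's `socketTextStream` connects as a client and reads newline-separated text.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Bind a TCP server socket and start listening (host and port are assumptions)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_tweets(server, tweets):
    """Block until one client connects, then send each tweet as one UTF-8 line."""
    conn, _addr = server.accept()  # waits here until the consumer connects
    with conn:
        for text in tweets:
            # One tweet per line, so embedded newlines are replaced by spaces.
            line = text.replace("\n", " ") + "\n"
            conn.sendall(line.encode("utf-8"))
```

In a real run, `tweets` would be fed by the tweepy stream instead of a fixed list, and the Spark job would be the connecting client.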
Run Spark jobs to do real-time analysis on the tweets.
$SPARK_HOME/bin/spark-submit src/analysis.py > log/analysis.log
Run the data visualization jobs.
python3.4 web/dashboard.py
- Use the Twitter API through tweepy to stream tweets
- Keep only the tweets that contain the keywords or hashtag being tracked
- Use a TCP/IP socket to send the fetched tweets to the Spark job
- Use Spark Streaming to perform the real-time analysis on the tweets
  - Count the number of related tweets for each time interval
  - Preprocess the tweet text
    - Remove all punctuation
    - Convert capital letters to lower case
    - Remove stop words for better performance
  - Find the most related keywords
  - Find the most related hashtags
  - Sentiment analysis
    - Use Spark MLlib to build a Naive Bayes model
    - Classify each tweet as positive or negative
    - Training examples from Sanders Analytics
- Use MongoDB to store the analysis results
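The preprocessing and ranking steps listed above can be sketched in plain Python. This is a simplified illustration, not the code in src/analysis.py: the function names and the tiny stop-word list are assumptions, and the real job runs these steps on Spark Streaming DStreams, not on Python lists.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list; a real job would use a fuller one.
STOP_WORDS = frozenset(["a", "an", "and", "the", "is", "to", "of", "in", "for", "on", "rt"])

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def hashtags(text):
    """Return the lower-cased hashtags found in a tweet."""
    return re.findall(r"#\w+", text.lower())


def keywords(text):
    """Remove punctuation, lower-case the text, and drop stop words."""
    # Removing punctuation also strips the leading '#', so hashtags count as words here.
    cleaned = text.translate(_PUNCT_TABLE).lower()
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def top_terms(tweets, extract, n=3):
    """Count the terms produced by `extract` over all tweets and return the n most common."""
    counts = Counter()
    for tweet in tweets:
        counts.update(extract(tweet))
    return counts.most_common(n)
```

For example, `top_terms(batch, hashtags)` ranks the hashtags in a batch of tweets and `top_terms(batch, keywords)` ranks the remaining words. In Spark the same counting is usually expressed with `flatMap` followed by `reduceByKey`.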
The Dashboard.
Timeline of related tweet counts, most related hashtags, most related keywords, and the ratio of positive to negative tweets.
- Twitter Trends Analysis using Spark Streaming
- Interactive Data Visualization with D3.js, DC.js, Python, and MongoDB
- spark-twitter-sentiment
See the LICENSE file for license rights and limitations (MIT).