twitter-ml

Dependencies

tensorflow (1.6.0)
tweepy (3.6.0)
keras (2.1.5)

Credit

Character Prediction Adapted from: tensorflow-rnn-shakespeare

Data Collection

If you want to get started with the machine learning part right away, you can skip data collection by downloading twitter text data here. Drop that file into src/shared_data, preprocess the data, and start experimenting with machine learning.

To start collecting twitter text, run collection_program.py:

$ python3 src/data_collection/collection_program.py

For twitter text collection, pick the EngTextStreamTransformer by typing in 2. For the filter, using the word "the" will give you a steady stream of tweets. The sample size is the number of tweets to collect. 50000000 would be roughly 5GB of data and will take a long time to collect, but you can stop the program at any time.
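
As a rough sanity check on those numbers, the arithmetic can be sketched in Python. The figure of roughly 5GB for 50000000 tweets comes from the text above; the decimal gigabyte (10^9 bytes) and the helper names are assumptions made for illustration, so treat the results as estimates only.

```python
# Back-of-the-envelope estimate based on the figures quoted above:
# 50,000,000 tweets is said to be roughly 5 GB of text.
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def bytes_per_tweet(total_gb, tweet_count):
    """Average stored size of one tweet implied by the totals."""
    return total_gb * BYTES_PER_GB / tweet_count


def estimated_size_gb(sample_size, avg_bytes):
    """Approximate size in GB of a collection with `sample_size` tweets."""
    return sample_size * avg_bytes / BYTES_PER_GB


avg = bytes_per_tweet(5, 50_000_000)
print(f"about {avg:.0f} bytes per tweet")
print(f"1,000,000 tweets: about {estimated_size_gb(1_000_000, avg):.1f} GB")
```

Scaling the sample size down this way gives a quick idea of how much disk space a shorter collection run might need.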

Here are all the options that I use for running collection on a small server:

$ python3 src/data_collection/collection_program.py
PICK STREAM TRANSFORMER TYPE:
     0 FUCTStreamTransformer
     1 FHCTStreamTransformer
     2 EngTextStreamTransformer
ENTER CORRESPONDING NUMBER: 2
ENTER FILTER: the
ENTER SAMPLE SIZE: 50000000
ENTER DURATION IN HOURS: 720
ENTER BUFFER SIZE: 25000
SHOULD PRINT ENTRY (0 or 1): 0
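
If you would rather not type these answers each time, one option is to pipe them to the program. The sketch below assumes the prompts are read from standard input in the order shown above, which is not confirmed here; build_answers and run_collection are hypothetical helper names, not part of the repository.

```python
import subprocess


def build_answers(transformer=2, tweet_filter="the", sample_size=50000000,
                  duration_hours=720, buffer_size=25000, print_entry=0):
    """Return one answer per line, in the order of the prompts shown above."""
    answers = [transformer, tweet_filter, sample_size,
               duration_hours, buffer_size, print_entry]
    return "".join(f"{answer}\n" for answer in answers)


def run_collection(answers, script="src/data_collection/collection_program.py"):
    """Start the collection program and feed it the prepared answers.

    Assumption: the program reads its prompts from standard input.
    """
    return subprocess.run(["python3", script], input=answers, text=True)


# Example (run from the repository root):
# run_collection(build_answers(sample_size=100000))
```

If the program turns out not to read from standard input, only run_collection needs to change; build_answers just formats the values.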

Data Preprocessing

To use these examples, replace ../shared_data/THE\ STREAM.csv with the path to your text file.

Preprocessing for character prediction:

$ cd src/character_prediction # needs to run in the character_prediction directory
$ ./create_text.sh ../shared_data/THE\ STREAM.csv

Preprocessing for word embeddings:

$ cd src/word_embeddings # needs to run in the word_embeddings directory
$ ./create_data_package.sh ../shared_data/THE\ STREAM.csv

A convenient script for preprocessing for both cases:

$ cd src # needs to run in the src directory
$ ./preprocessing.sh shared_data/THE\ STREAM.csv
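
The backslash in THE\ STREAM.csv escapes the space so the shell passes the path as a single argument. If you build these commands from Python instead, a minimal sketch like the following shows two ways to handle the space. The variable names are illustrative, and the subprocess call is left commented out because it assumes you are in the src directory.

```python
import shlex

data_file = "shared_data/THE STREAM.csv"  # path contains a space

# shlex.quote wraps the path in single quotes so a POSIX shell sees one argument.
command = "./preprocessing.sh " + shlex.quote(data_file)
print(command)

# Passing a list to subprocess avoids shell quoting altogether:
# import subprocess
# subprocess.run(["./preprocessing.sh", data_file], check=True)
```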

Machine Learning

Character Prediction

After collecting and preprocessing twitter text data, run the character prediction training program with:

$ cd src/character_prediction # needs to run in the character_prediction directory
$ ./run_rnn_train.sh data/THE\ STREAM/\*.txt

Replace data/THE\ STREAM/\*.txt with the paths to all the text batch files.
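
To check which batch files a pattern like this would pick up before starting a long training run, you can expand it yourself. This is only a sketch: list_batch_files is a hypothetical helper, and it assumes the preprocessing step wrote .txt batch files into the directory you pass in.

```python
import glob
import os


def list_batch_files(data_dir):
    """Return the sorted .txt files directly inside `data_dir`."""
    # glob.escape protects any special characters in the directory name itself.
    pattern = os.path.join(glob.escape(data_dir), "*.txt")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Run from src/character_prediction; prints nothing if no files match.
    for path in list_batch_files("data/THE STREAM"):
        print(path)
```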

After a bit of training you can watch the model generate text by running:

$ cd src/character_prediction # needs to run in the character_prediction directory
$ python3 char_rnn_play.py

To launch tensorboard run the command:

$ cd src/character_prediction # needs to run in the character_prediction directory
$ tensorboard --logdir=log

Word Embeddings

After collecting and preprocessing twitter text data, run the "random or not" word embeddings neural net training program with:

$ cd src/word_embeddings # needs to run in the word_embeddings directory
$ ./run_random_or_not_nn.sh data/THE\ STREAM # relative path to the data package

Replace data/THE\ STREAM with the relative path to your data package directory.

To see the word embeddings in tensorboard run the command:

$ cd src/word_embeddings # needs to run in the word_embeddings directory
$ tensorboard --logdir=log
