- tensorflow (1.6.0)
- tweepy (3.6.0)
- keras (2.1.5)
Character Prediction Adapted from: tensorflow-rnn-shakespeare
First of all, if you want to get started with the machine learning part, you can skip data collection by downloading the twitter text data here.
Drop that file into src/shared_data, preprocess the data, and start messing with machine learning.
To start collecting twitter text, run collection_program.py:
$ python3 src/data_collection/collection_program.py
For twitter text collection, pick the EngTextStreamTransformer by typing in 2.
For the filter, using the word "the" will give you a steady stream of tweets.
The sample size is the number of tweets to collect; 50000000 would be roughly 5GB of data and will take a long time to collect, but you can stop the program at any time.
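As a rough sanity check on that estimate (a sketch; the 5GB figure comes from the text above, not from measurement):

```shell
# Implied average size: 5 GB of CSV spread over 50,000,000 tweets.
BYTES_PER_TWEET=$(( 5 * 1024 * 1024 * 1024 / 50000000 ))
echo "$BYTES_PER_TWEET"   # roughly 107 bytes of stored text per tweet
```

That is on the order of a single tweet's text plus a little CSV overhead, so the estimate is plausible.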
Here are all the options that I use for running collection on a small server:
$ python3 src/data_collection/collection_program.py
PICK STREAM TRANSFORMER TYPE:
0 FUCTStreamTransformer
1 FHCTStreamTransformer
2 EngTextStreamTransformer
ENTER CORRESPONDING NUMBER: 2
ENTER FILTER: the
ENTER SAMPLE SIZE: 50000000
ENTER DURATION IN HOURS: 720
ENTER BUFFER SIZE: 25000
SHOULD PRINT ENTRY (0 or 1): 0
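If collection_program.py reads its prompt answers from standard input (an assumption; it may use another input mechanism), the interactive session above could be scripted non-interactively like this:

```shell
# Hypothetical scripted run: one answer per prompt, in the order the
# prompts appear. Assumes the program reads answers from stdin.
ANSWERS='2
the
50000000
720
25000
0'
printf '%s\n' "$ANSWERS"
# printf '%s\n' "$ANSWERS" | python3 src/data_collection/collection_program.py
```

This is convenient for launching long collection runs on a server without keeping a terminal attached to the prompts.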
To use these examples, replace ../shared_data/THE\ STREAM.csv with the path to your text file.
Preprocessing for character prediction:
$ cd src/character_prediction # needs to run in the character_prediction directory
$ ./create_text.sh ../shared_data/THE\ STREAM.csv
Preprocessing for word embeddings:
$ cd src/word_embeddings # needs to run in the word_embeddings directory
$ ./create_data_package.sh ../shared_data/THE\ STREAM.csv
A convenient script for preprocessing for both cases:
$ cd src # needs to run in the src directory
$ ./preprocessing.sh shared_data/THE\ STREAM.csv
After collecting and preprocessing twitter text data, run the character prediction training program with:
$ cd src/character_prediction # needs to run in the character_prediction directory
$ ./run_rnn_train.sh data/THE\ STREAM/\*.txt
Replace data/THE\ STREAM/\*.txt with the paths to all of the text batch files.
After a bit of training you can watch the model generate text by running:
$ cd src/character_prediction # needs to run in the character_prediction directory
$ python3 char_rnn_play.py
To launch tensorboard, run:
$ cd src/character_prediction # needs to run in the character_prediction directory
$ tensorboard --logdir=log
After collecting and preprocessing twitter text data, run the random-or-not word embeddings neural net training program with:
$ cd src/word_embeddings # needs to run in the word_embeddings directory
$ ./run_random_or_not_nn.sh data/THE\ STREAM # relative path to the data package
Replace data/THE\ STREAM with the relative path to your data package directory.
To see the word embeddings in tensorboard, run:
$ cd src/word_embeddings # needs to run in the word_embeddings directory
$ tensorboard --logdir=log