This is an attempt to create a funny bot trained on a corpus of jokes. The goal of the project is to draw samples that are actually funny from a character-level LSTM language model.
Read my blog post for detailed analysis and results.
This project was also covered in TED Ideas: https://ideas.ted.com/can-you-teach-a-computer-to-be-funny/
- Short Jokes: Contains 231,657 short jokes and one-liners. For the language model, the CSV file of the dataset is processed and written to a separate text file `data/shortjokes.txt` with `utils/csv_to_text.py` (a minimal sketch of this step follows the list).
- F.R.I.E.N.D.S: As a fun task, transcripts of all episodes of the TV series F.R.I.E.N.D.S are compiled into a single 4.79 MB text file (`data/friends.txt`) using the `utils/friends.py` script. The intent is to generate funny text similar to the dialogues in the series. The script is ad-hoc as of now, so contributions are welcome.
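For reference, the conversion is a one-pass dump of the CSV into plain text, one joke per line. A minimal sketch of what `utils/csv_to_text.py` could look like, assuming the Short Jokes CSV has a `Joke` column (the column name and file paths are assumptions; the actual script may differ):

```python
# Hedged sketch of a CSV-to-text conversion for the Short Jokes dataset.
# Assumes a "Joke" column in the CSV header; verify against the real file.
import csv

with open("data/shortjokes.csv", newline="", encoding="utf-8") as src, \
     open("data/shortjokes.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        # One joke per line, so the character-level model can treat
        # the newline as a joke boundary.
        dst.write(row["Joke"].strip() + "\n")
```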
- Python: used for preprocessing the dataset.
- Torch: the language model is written in Torch.
Navigate to the `src/` folder and run the following commands:
```
python scripts/preprocess.py --input_txt ../data/shortjokes.txt --output_h5 my_data.h5 --output_json my_data.json
th train.lua -input_h5 my_data.h5 -input_json my_data.json -model_type lstm -num_layers 3 -rnn_size 512
```
This will start a 50-epoch training session on the jokes dataset; checkpoints are saved in the `src/cv/` folder every 1000 iterations, with names like `cv/checkpoint_1000.t7`.
To sample 2000 characters from a trained checkpoint (say, after 3000 iterations), run the following command:

```
th sample.lua -checkpoint cv/checkpoint_3000.t7 -length 2000
```
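torch-rnn's sampler also exposes a softmax temperature: lowering it makes samples more conservative, raising it makes them more surprising (and usually less coherent). For example (the `-temperature` flag is documented in torch-rnn; verify against your version):

```
th sample.lua -checkpoint cv/checkpoint_3000.t7 -length 2000 -temperature 0.7
```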
In case of errors, missing dependencies, or for more information, refer to torch-rnn.
- Data compiled by `utils/friends.py` contains many extra headers, which were removed manually. It would be great if this task could be automated; a possible starting point is sketched below.
- An attempt has been made to keep the dataset as clean as possible, so relevant additions to the jokes dataset are welcome.
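As a starting point for automating the header cleanup, here is a hedged sketch that filters out lines matching common transcript-header patterns. The patterns (episode numbers, writer/transcriber credits) are assumptions about what the extra headers look like and would need to be adjusted to the actual transcript format:

```python
# Hedged sketch: drop probable header lines from the compiled transcripts.
# The patterns below are guesses at the header format; tune them to the data.
import re

HEADER_PATTERNS = [
    re.compile(r"^\s*\d+x\d+\b"),              # e.g. "1x01 The One Where..."
    re.compile(r"^\s*written by\b", re.I),     # writer credits
    re.compile(r"^\s*transcribed by\b", re.I), # transcriber credits
]

with open("data/friends.txt", encoding="utf-8") as src, \
     open("data/friends_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if not any(p.match(line) for p in HEADER_PATTERNS):
            dst.write(line)
```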