A Python project which uses data scraped from Wikipedia's List of Dinosaur Genera to train a neural network that generates new dinosaur names. The new names are then tweeted by a bot using the Twitter API.

jackmcarthur/dino-name-tweetbot

dino-name-tweetbot

A Python project that uses data scraped from Wikipedia's List of Dinosaur Genera to train a neural network that generates new dinosaur names. The new names are then tweeted by @fakedinonames using the Twitter API.

Demonstration

| Real Dinosaurs  | Fake Dinosaurs |
|-----------------|----------------|
| Epicampodon     | Achinops       |
| Clarencea       | Rhaeosaurus    |
| Lusovenator     | Iscrodromeus   |
| Lufengosaurus   | Unesasaurus    |
| Huanghetitan    | Ormalong       |
| Albertavenator  | Ugansaurus     |
| Masiakasaurus   | Arshansaurus   |
| Adeopapposaurus | Nenongo        |
| Beishanlong     | Haplodon       |
| Yamanasaurus    | Erisanosaurus  |

Network design and implementation

In dinobot_network_gen.ipynb, TensorFlow and Keras are used to build a small character-based recurrent neural network (RNN), a type of network commonly used for generating short strings of text. The network was trained on dinonames.csv, a list of 1648 dinosaur names scraped from Wikipedia's list of dinosaur genera with the BeautifulSoup package, so that it can produce realistic dinosaur names of its own. Layer types, loss functions, and optimization methods were tweaked until the output was a reasonable facsimile of the training data. Training was done in a notebook hosted on Google Colab, which provides a free GPU that greatly speeds up training.
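As a rough illustration of what "character-based" means here, the sketch below one-hot encodes a name over a 27-symbol alphabet and builds next-character training pairs. The alphabet order, the helper names, and the pairing scheme are assumptions made for illustration; the notebook may encode names differently.

```python
import string

# Assumed vocabulary order: newline (end-of-name marker) first, then a-z.
ALPHABET = "\n" + string.ascii_lowercase
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """Return a 27-element list with a single 1.0 at the index of ch."""
    vec = [0.0] * len(ALPHABET)
    vec[CHAR_TO_INDEX[ch]] = 1.0
    return vec


def training_pairs(name):
    """Build (input vectors, target indices) for next-character prediction.

    The input starts with an all-zero "blank" vector, and the targets are the
    name's characters followed by a newline, so step t predicts character t.
    """
    chars = name.lower() + "\n"
    inputs = [[0.0] * len(ALPHABET)] + [one_hot(ch) for ch in chars[:-1]]
    targets = [CHAR_TO_INDEX[ch] for ch in chars]
    return inputs, targets


inputs, targets = training_pairs("Rex")
print(len(inputs), targets)
```

Each input vector lines up with the character the network should predict next, which is the usual setup for training a character-level model.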

The generated names are then filtered using the stringdistance package, ensuring that every output is at least two single-character edits away from every real dinosaur name (dinosaur trademark law is not to be messed with). Finally, they are posted to Twitter using the Tweepy package in dino-bot.py.
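The filtering step can be sketched with a plain Levenshtein distance. This is a self-contained stand-in, not the stringdistance package's API; the function names `edit_distance` and `is_novel` and the two-name sample list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_novel(candidate, real_names, min_dist=2):
    """True if candidate is at least min_dist edits from every real name."""
    candidate = candidate.lower()
    return all(edit_distance(candidate, real.lower()) >= min_dist
               for real in real_names)


real_names = ["Diplodocus", "Lufengosaurus"]  # tiny illustrative sample
print(is_novel("Dipladocus", real_names))   # one substitution from a real name
print(is_novel("Unesasaurus", real_names))
```

Comparing in lowercase is a design choice in this sketch, so that capitalization alone never counts as an edit.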

The network itself is stored in dino_model.h5. It can easily be loaded into Keras, and the following code is all that is necessary to interact with the model (more complete scaffolding code can be found in the notebook dino-bot.ipynb).

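import numpy as np
from tensorflow import keras

model = keras.models.load_model('dino_model.h5')

# number of available characters = 26 lowercase letters + '\n' = 27
x = np.zeros(27)
# if Keras expects batch and timestep dimensions, reshape x to match
# model.input_shape first (for example x.reshape(1, 1, 27); this shape is an assumption)
# array of probabilities associated with the first letter of the name, given a blank input:
model.predict(x)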
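
To turn such a probability array into a letter, one option is to sample an index in proportion to its probability. The sketch below uses a made-up probability vector in place of real model output and assumes index 0 is '\n' followed by a-z; the actual ordering in the notebook may differ.

```python
import random
import string

ALPHABET = "\n" + string.ascii_lowercase  # assumed index-to-character order


def sample_char(probs, rng=random):
    """Draw one character, weighting each by its predicted probability."""
    if len(probs) != len(ALPHABET):
        raise ValueError("expected one probability per character")
    return rng.choices(ALPHABET, weights=probs, k=1)[0]


# Made-up distribution standing in for the model's output: only 'a' or 's'.
probs = [0.0] * 27
probs[ALPHABET.index("a")] = 0.7
probs[ALPHABET.index("s")] = 0.3

first_letter = sample_char(probs, random.Random(0))
print(first_letter)
```

Sampling, rather than always taking the most likely letter, is what lets the same model produce a different name on each run.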

Naming function

The function checked_name() in dino-bot.py returns a tuple (name, attempts): the name generated using dino_model.h5, and the number of attempts required to obtain a name whose string distance from every existing dinosaur name in dinonames.csv is at least min_dist. By default, min_dist is 2, so that "Dipladocus" and "Dilodocus" both get rejected for being 1 string edit from "Diplodocus."

for _ in range(10):
    print(checked_name(model)[0], end='')
Euisabia
Ultacephodon
Inkissaceus
Rabiatitan
Edmorosaurus
Unbyrosaurus
Arshanglangosaurus
Erbosaurus
Sindasaurus
Epidemtrus
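
The retry logic described above amounts to rejection sampling: keep generating until a candidate passes the check, and count the tries. In this sketch, `generate` and `is_acceptable` are hypothetical callables standing in for the network and the string-distance test, and `max_attempts` is an added safety limit, not something the original function is known to have.

```python
import itertools


def name_with_attempts(generate, is_acceptable, max_attempts=1000):
    """Call generate() until is_acceptable(name) is true.

    Returns (name, attempts), mirroring the tuple described above.
    """
    for attempts in range(1, max_attempts + 1):
        name = generate()
        if is_acceptable(name):
            return name, attempts
    raise RuntimeError("no acceptable name found within max_attempts")


# Stand-in generator that cycles through fixed candidates.
candidates = itertools.cycle(["Diplodocus", "Stegosaurus", "Ormalong"])
real_names = {"Diplodocus", "Stegosaurus"}

name, attempts = name_with_attempts(
    generate=lambda: next(candidates),
    is_acceptable=lambda n: n not in real_names,  # exact-match check only
)
print(name, attempts)  # prints: Ormalong 3
```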

When min_dist is set to 0, the first name generated by the neural network is accepted every time. In this case, it produces an existing dinosaur name about 20% of the time. As min_dist is increased, the names become more distinct from existing names, but the number of attempts to generate a name increases significantly:

for i in range(4):
    # run 20 trials at each threshold and record the attempts each one needed
    num_attempts = [checked_name(model, min_dist=i)[1] for _ in range(20)]
    print('min_dist =', i, ', average number of attempts = ', sum(num_attempts)/len(num_attempts))
min_dist = 0 , average number of attempts =  1.0
min_dist = 1 , average number of attempts =  1.3
min_dist = 2 , average number of attempts =  1.85
min_dist = 3 , average number of attempts =  2.05
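
One way to read these averages: if each generated name is rejected independently with probability q, the number of attempts follows a geometric distribution. The independence assumption is made here for illustration and is not something the project states.

```latex
\[
P(\text{attempts} = k) = q^{\,k-1}(1 - q), \qquad
\mathbb{E}[\text{attempts}] = \sum_{k=1}^{\infty} k\, q^{\,k-1}(1 - q) = \frac{1}{1 - q}.
\]
```

With q ≈ 0.2, the share of outputs that exactly match a real name, rejecting only exact matches would take about 1/0.8 = 1.25 attempts on average. Stricter thresholds raise q and therefore the expected number of attempts.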

Names can also be restricted to begin with a certain character with the optional argument start_char:

for i in range(5):
    print(checked_name(model, start_char='g')[0],end='') # or 'G', both are accepted
Gyoodon
Gyongosaurus
Gaviraptor
Gdilosaurus
Gyperodon
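
A start character can be enforced by seeding the generation loop with that character instead of sampling the first one. The sketch below uses a hypothetical `toy_next_char_probs(prefix)` function in place of the real network; the loop structure, the `max_len` cap, and the alphabet order are all assumptions, not the project's actual code.

```python
import random
import string

ALPHABET = "\n" + string.ascii_lowercase  # assumed: newline marks end of name


def toy_next_char_probs(prefix):
    """Hypothetical stand-in for the network: never ends a name before three
    letters, then makes ending increasingly likely as the name grows."""
    probs = [1.0] * len(ALPHABET)
    probs[0] = 0.0 if len(prefix) < 3 else float(len(prefix)) ** 2
    total = sum(probs)
    return [p / total for p in probs]


def generate_name(next_char_probs, start_char=None, max_len=20, rng=random):
    """Sample characters until a newline or max_len, optionally fixing the first."""
    name = start_char.lower() if start_char else ""
    while len(name) < max_len:
        probs = next_char_probs(name)
        ch = rng.choices(ALPHABET, weights=probs, k=1)[0]
        if ch == "\n":
            break
        name += ch
    return name.capitalize()


print(generate_name(toy_next_char_probs, start_char="G", rng=random.Random(1)))
```

Lowercasing the seed before generation is what lets both 'g' and 'G' work in this sketch; the toy model produces gibberish, so only the first letter is meaningful.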

This neural network project pulls together web scraping, machine learning, file writing, and API interfacing in Python, making it an excellent programming exercise and an elegant showcase of many of the practical abilities I've developed. In the future I hope to add temperature controls to this neural network to vary the predictability of the model's output, and eventually add more features to the Twitter posts, such as an image or theorized country of origin.
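
Temperature control is commonly implemented by rescaling the predicted probabilities before sampling. The following is a minimal sketch of that general idea, not code from this project (the feature is not yet implemented); `apply_temperature` is a hypothetical name.

```python
import math


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution: p_i^(1/T), then renormalize.

    T < 1 sharpens the distribution (more predictable output);
    T > 1 flattens it (more surprising output); T = 1 leaves it unchanged.
    Assumes at least one probability is greater than zero.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; zero-probability entries stay at zero.
    logits = [math.log(p) / temperature if p > 0 else -math.inf for p in probs]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.7, 0.2, 0.1]
sharper = apply_temperature(probs, temperature=0.5)
flatter = apply_temperature(probs, temperature=2.0)
print(sharper[0] > probs[0] > flatter[0])
```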
