A machine learning project written in Python to perform authorship identification on sample sentences from three horror authors.

shogo54/author-identification

Author Identification

This is a machine learning program that performs authorship identification on sample sentences from three horror authors: H. P. Lovecraft (HPL), Mary Wollstonecraft Shelley (MWS), and Edgar Allan Poe (EAP).

The project was created for a sample contest on Kaggle.

Implementation Details

The classifier is based on Naive Bayes. It is trained on the labeled training data and then predicts the author of each unlabeled sentence.

The project includes the training and testing data files. The training data is labeled with the author of each sentence, while the test data is not labeled.
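To illustrate the train-then-predict flow described above, here is a minimal sketch of a multinomial Naive Bayes classifier over word counts, written with only the standard library. It is not the project's actual code: the function names (`tokenize`, `train_naive_bayes`, `predict_author`), the add-one smoothing, and the toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_sentences):
    """Collect per-author word counts from (sentence, author) pairs."""
    word_counts = defaultdict(Counter)  # author -> word -> count
    sentence_counts = Counter()         # author -> number of sentences
    for sentence, author in labeled_sentences:
        sentence_counts[author] += 1
        word_counts[author].update(tokenize(sentence))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, sentence_counts, vocabulary


def predict_author(sentence, model):
    """Return the author with the highest log posterior for the sentence."""
    word_counts, sentence_counts, vocabulary = model
    total_sentences = sum(sentence_counts.values())
    best_author, best_score = None, -math.inf
    for author in sentence_counts:
        # Log prior: share of training sentences written by this author.
        score = math.log(sentence_counts[author] / total_sentences)
        total_words = sum(word_counts[author].values())
        for word in tokenize(sentence):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[author][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Made-up toy data, NOT taken from the project's training file.
toy_training = [
    ("the ancient horror slept beneath the sea", "HPL"),
    ("nameless cults worshipped the ancient ones", "HPL"),
    ("my heart was filled with misery and love", "MWS"),
    ("i wept for my beloved friend", "MWS"),
    ("the raven tapped upon my chamber door", "EAP"),
    ("the beating of the hideous heart", "EAP"),
]
model = train_naive_bayes(toy_training)
```

Calling `predict_author("the ancient ones slept", model)` on this toy model picks the author whose training sentences share the most words with the input. Working in log space keeps the product of many small probabilities from underflowing.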

The program uses the following features for prediction:

  1. bag of words (all training texts are collected into lists labeled by author, and a bag of words is built from them; each test text is then read and classified against that bag of words)
  2. parts of speech (syntactic features)
  3. lexical features (average number of words per sentence, sentence length variation, and lexical diversity)
  4. punctuation features (commas, semicolons, and colons per sentence)
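The lexical and punctuation features listed above could be computed roughly as follows. This is a hedged sketch, not the project's implementation: the function name `stylometric_features`, the regex-based sentence and word splitting, and the use of population standard deviation for "sentence length variation" are assumptions. Part-of-speech features are left out because they need a tagger such as the one in NLTK.

```python
import re
import statistics


def stylometric_features(text):
    """Compute simple lexical and punctuation features for a passage."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    all_words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not all_words:
        raise ValueError("text must contain at least one word")

    words_per_sentence = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    n = len(sentences)
    return {
        # Lexical features
        "avg_words_per_sentence": statistics.mean(words_per_sentence),
        "sentence_length_stdev": statistics.pstdev(words_per_sentence),
        "lexical_diversity": len(set(all_words)) / len(all_words),
        # Punctuation features, normalized by the number of sentences
        "commas_per_sentence": text.count(",") / n,
        "semicolons_per_sentence": text.count(";") / n,
        "colons_per_sentence": text.count(":") / n,
    }


# Made-up sample passage for illustration only.
sample = "It was dark, cold, and silent. I waited; nothing came."
features = stylometric_features(sample)
```

Lexical diversity here is the ratio of unique words to total words, so a value near 1.0 means the passage rarely repeats a word.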

Prerequisites

To run the code, install the packages the project uses: numpy, nltk, and scikit-learn (imported as sklearn).

Run the following command in your console to install them:

python -m pip install --user numpy nltk scikit-learn

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Shogo Akiyama - [email protected]

Project Link: https://github.com/shogo54/author-identification

Acknowledgements

The implementation is inspired by the following article:

Future References

In the future, a neural-network-based approach could be applied to this project. The articles below might be useful:
