This is my initial submission for the Jigsaw Multilingual Toxic Comment Classification Kaggle competition. I will keep modifying it to try to improve my score. If you're attempting this competition for the first time, feel free to fork this repo and modify the code, and do let me know if you manage to improve the score. Running this on Kaggle gives an accuracy of roughly 91%.
This competition is based on Conversation AI, an initiative by Jigsaw and Google. Its main focus is building machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion.
You can access all datasets and details of each file from here
All the files included in src are sufficient to train and test the model.
The Jupyter notebooks Jigsaw-multilingual-nikhiljohn.ipynb and jigsaw-inference-nikhiljohn.ipynb are my Kaggle notebooks, for training and inference respectively. Feel free to use them too. If you do, make sure to use the TPUs provided by Kaggle. If you need a guide on how to work with TPUs, use this link. It's a video tutorial by Abhishek Thakur, a data scientist I really admire.
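To use Kaggle's TPUs, the notebook needs to detect the TPU and build a distribution strategy before creating the model. Below is a minimal sketch of the standard TensorFlow pattern (the helper name `get_strategy` is my own; it is not taken from the notebooks above). It falls back to the default strategy when no TPU is attached, so the same code also runs on CPU/GPU sessions:

```python
import tensorflow as tf

def get_strategy():
    """Return a TPUStrategy on Kaggle TPU sessions, else the default strategy."""
    try:
        # On Kaggle, TPUClusterResolver.connect() finds the attached TPU,
        # connects to it, and initializes the TPU system in one call.
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
        strategy = tf.distribute.TPUStrategy(tpu)
    except ValueError:
        # No TPU available: fall back to the default (CPU/GPU) strategy.
        strategy = tf.distribute.get_strategy()
    return strategy

strategy = get_strategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# The model must be built inside the strategy scope so its variables
# are placed on the TPU cores (or on the fallback device).
# with strategy.scope():
#     model = build_model()
```

On a Kaggle TPU v3-8 session this reports 8 replicas; elsewhere it reports 1 and training simply runs on the local device.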