andrea-t94/SentimentClassifier

Fine-tuning a RoBERTa model with Twitter sentiment data

This repository contains a set of tools to automatically fine-tune a RoBERTa base model (Hugging Face) with Twitter sentiment data (more info on Kaggle).

High level overview

The overarching goal is to enable the model to correctly classify Twitter sentiment as good or bad. To achieve that, training consists of two steps:

  • Fine-tuning on the masked language modelling (MLM) task (for more info, see Hugging Face)
  • Fine-tuning on the classification task

Please note: both tasks are trained on the same dataset.
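
To make the first step more concrete, here is a minimal sketch of how a masked language modelling example can be built from a tweet. It is an illustration in plain Python, not the code used in this repository: the function name `mask_tokens`, the whitespace tokenisation and the simplified "always replace with `<mask>`" rule are assumptions made for the example. The `-100` label follows the common convention of marking positions that the loss should ignore.

```python
import random

MASK_TOKEN = "<mask>"   # RoBERTa's mask token
IGNORE_INDEX = -100     # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(tokens, mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for one masked-language-modelling example.

    Simplified on purpose: every selected token is replaced by <mask>.
    The usual recipe also leaves some selected tokens unchanged or swaps
    them for random tokens.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            inputs.append(MASK_TOKEN)
            labels.append(token)          # the model must recover the original token
        else:
            inputs.append(token)
            labels.append(IGNORE_INDEX)   # position is not scored by the loss
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical tweet, split on whitespace instead of a real tokenizer.
    tweet = "i love this new phone so much".split()
    inputs, labels = mask_tokens(tweet, mlm_probability=0.3, seed=0)
    print(inputs)
    print(labels)
```

The second step then reuses the model adapted in this way and trains it on the sentiment labels of the same tweets.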

Code Base

The code base consists of three main sections:

  • the main one, which contains all the scripts needed to run it on an Ubuntu 22.04 EC2 machine (with GPU)
  • notebooks, which contains some exploration done in Jupyter notebooks (TO BE DELETED)
  • training_language_model, which contains all the code used to create the Docker images for the two fine-tuning steps

How it works

These instructions assume an EC2 machine (GPU enabled) running Ubuntu 22.04. First, we need to bootstrap the environment:

  • run prepare_docker.sh
  • run prepare_docker_compose.sh
  • run install_NVIDIA_docker_toolkit.sh
  • run install_ubuntu_NVIDIA_drivers.sh
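
The bootstrap steps above can be sketched as a small driver that runs the four scripts in the listed order. This is only an illustration: it assumes the scripts sit in the repository root and can be invoked with `bash`, and the helper names `bootstrap_commands` and `run_bootstrap` are hypothetical. Depending on your machine, the scripts may also need elevated privileges, which this sketch does not handle.

```python
import subprocess
from pathlib import Path

# Script names as listed above, in the listed order.
BOOTSTRAP_SCRIPTS = [
    "prepare_docker.sh",
    "prepare_docker_compose.sh",
    "install_NVIDIA_docker_toolkit.sh",
    "install_ubuntu_NVIDIA_drivers.sh",
]


def bootstrap_commands(repo_root="."):
    """Build one `bash <script>` command per bootstrap script."""
    root = Path(repo_root)
    return [["bash", str(root / name)] for name in BOOTSTRAP_SCRIPTS]


def run_bootstrap(repo_root=".", dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for command in bootstrap_commands(repo_root):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing script.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is installed by accident.
    run_bootstrap(dry_run=True)
```

Calling `run_bootstrap(dry_run=False)` would execute the scripts for real; the dry run is the default so the order can be checked first.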

Then we can call Docker compose:

  • generate a .env file (see the .env file generation section)
  • run 'docker compose up service_name'

The two services are generated by two Docker images, one for each fine-tuning step.

Please note: you can build new images by using the code inside the training_language_model folder.

.env file generation

Generate a .env file with the information below.

Input necessary:

  • HF_USER= your Hugging Face username, to push the model to Hugging Face Hub
  • HF_TOKEN= your Hugging Face token, to push the model to Hugging Face Hub

Need to change only if you want to change the code base (new model, different mount point, new dataset):

  • DIRPATH=data
  • MODEL_VERSION_MLM=roberta-fine-tuned-twitter
  • MODEL_VERSION_CLF=roberta-fine-tuned-twitter-sentiment
  • DATASET_VERSION=TwitterSentiment140
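
As an illustration of the file described above, the following sketch renders the .env content from the two required values plus the defaults listed in this section. The helper name `render_env` and the placeholder credentials are assumptions made for the example; the repository does not ship this script.

```python
# Default values taken from the list above.
DEFAULTS = {
    "DIRPATH": "data",
    "MODEL_VERSION_MLM": "roberta-fine-tuned-twitter",
    "MODEL_VERSION_CLF": "roberta-fine-tuned-twitter-sentiment",
    "DATASET_VERSION": "TwitterSentiment140",
}


def render_env(hf_user, hf_token, **overrides):
    """Return the text of a .env file, one KEY=value pair per line."""
    if not hf_user or not hf_token:
        raise ValueError("HF_USER and HF_TOKEN are required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    # Later entries win, so overrides replace the defaults.
    values = {"HF_USER": hf_user, "HF_TOKEN": hf_token, **DEFAULTS, **overrides}
    return "".join(f"{key}={value}\n" for key, value in values.items())


if __name__ == "__main__":
    # Placeholder credentials; replace them with your own before saving the output as .env.
    print(render_env("your-username", "your-token"), end="")
```

Redirecting the output to a file named .env next to the Docker Compose file is one way to use it; keep that file out of version control, since it holds your token.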
