This repo contains the processing scripts for each dataset included in the TID-8 datasets.
Note that the processed data here explicitly includes the annotations each annotator made on other examples, to help the later modeling process. In the TID-8 datasets, this information is omitted for simplicity.
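
As a rough illustration of what "including the annotations one annotator has on other examples" can look like, here is a minimal sketch. It is not the repo's actual processing code: the field names `example_id`, `annotator_id`, `label`, and `other_annotations`, as well as the function name, are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def attach_other_annotations(records):
    """Add, to each record, the same annotator's labels on all other examples.

    Each record is a dict with the (hypothetical) keys
    "example_id", "annotator_id", and "label".
    """
    # Group all annotation records by annotator.
    by_annotator = defaultdict(list)
    for rec in records:
        by_annotator[rec["annotator_id"]].append(rec)

    enriched = []
    for rec in records:
        # Collect this annotator's labels, skipping the current example.
        others = [
            {"example_id": o["example_id"], "label": o["label"]}
            for o in by_annotator[rec["annotator_id"]]
            if o["example_id"] != rec["example_id"]
        ]
        enriched.append({**rec, "other_annotations": others})
    return enriched


records = [
    {"example_id": "e1", "annotator_id": "a1", "label": 1},
    {"example_id": "e2", "annotator_id": "a1", "label": 0},
    {"example_id": "e1", "annotator_id": "a2", "label": 0},
]
enriched = attach_other_annotations(records)
```

Here annotator `a1` labeled two examples, so each of their records carries the other one, while `a2` labeled a single example and gets an empty `other_annotations` list.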
If you want a cleaned version of these datasets, you can use the TID-8 datasets directly.
Otherwise, download each raw dataset, then rename and process it accordingly. We downloaded the raw datasets from the following links:
- CommitmentBank dataset: https://github.com/mcdm/CommitmentBank
- FriendsQIA dataset: https://github.com/friendsQIA/Friends_QIA, specifically at https://github.com/friendsQIA/Friends_QIA/tree/main/Data/Friends_data
- GoEmotions dataset: https://github.com/google-research/google-research/tree/master/goemotions
- HS-Brexit dataset: https://le-wi-di.github.io/, specifically at https://github.com/Le-Wi-Di/le-wi-di.github.io/blob/main/data_post-competition.zip
- Humor dataset: https://github.com/ukplab/acl2019-GPPL-humour-metaphor, specifically at https://github.com/UKPLab/acl2019-GPPL-humour-metaphor/blob/master/data/pl-humor-full/results.tsv
- MultiDomain Agreement dataset: https://le-wi-di.github.io/, specifically at https://github.com/Le-Wi-Di/le-wi-di.github.io/blob/main/data_post-competition.zip
- Pejorative dataset: https://github.com/t-davidson/hate-speech-and-offensive-language, specifically at https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data
- Sentiment dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/F6EMTS
- Toxicity ratings dataset: This data is not publicly available and its use is subject to many constraints. If you are interested, please contact the authors of this dataset to request access.