Exploring Emoji Senses with a Word2Vec Model and Using Emojis to Modify Short-Text Sentiment Classification
- period: 2022-04-15 to 2022-05-28
- filter: retweet or media
- query: tweets that contain at least one target emoji
Our dataset consists of tweets that contain at least one of the following emoji:
The baseline selection of commonly used emoji follows Ian D. Wood and Sebastian Ruder (2016).
Following the original paper, each emoji's tag is based on human intuition rather than on its meaning in context.
For English tweets, we apply the following steps:
1. remove URLs
2. remove user names
3. remove punctuation
4. remove stopwords
5. convert words to lowercase
6. lemmatize
7. keep only English characters and emojis
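The English steps above can be sketched as a small pipeline. This is a minimal illustration, not the project's actual code: the regular expressions, the tiny `STOPWORDS` set, the rough `EMOJI_RE` code-point ranges, and the pluggable `lemmatize` argument (identity by default) are all assumptions made for the example. In practice a full stopword list and a real lemmatizer from an NLP library would take their place.

```python
import re
import string

# Hypothetical patterns; the source does not specify them.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@[A-Za-z0-9_]+")
# Rough emoji ranges (pictographs and dingbats); not a complete emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[a-z]+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))
# Placeholder stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "it"}


def preprocess_english(text, lemmatize=lambda word: word):
    """Apply the seven English preprocessing steps and return a token list."""
    text = URL_RE.sub(" ", text)                     # 1. remove URLs
    text = USER_RE.sub(" ", text)                    # 2. remove user names
    text = text.translate(PUNCT_TABLE)               # 3. remove punctuation
    # Pad emojis with spaces so each one becomes its own token.
    text = EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # 4. remove stopwords
    tokens = [t.lower() for t in tokens]             # 5. lowercase
    tokens = [lemmatize(t) for t in tokens]          # 6. lemmatize (identity by default)
    # 7. keep only English-letter tokens and emojis
    return [t for t in tokens if WORD_RE.fullmatch(t) or EMOJI_RE.fullmatch(t)]


if __name__ == "__main__":
    print(preprocess_english("@bob The cats are running to https://t.co/xyz 😂😂!!"))
```

User names are removed before punctuation because `@` is itself a punctuation character; stripping punctuation first would leave the handle behind as an ordinary word.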
For Chinese tweets, we:
1. remove URLs
2. remove user names
3. segment the text into words
4. remove stopwords (en & zh)
5. convert to simplified Chinese
6. remove punctuation (en & zh)
7. remove sensitive Chinese words
8. remove non-Chinese words
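The Chinese steps can be sketched in the same way. Everything named here is an assumption for illustration: the default regex segmenter is only a stand-in for a real Chinese word segmenter, `to_simplified` defaults to the identity function where a traditional-to-simplified converter would normally go, and the stopword and sensitive-word sets are placeholders because the source does not list them. The sketch also assumes that emojis survive step 8, since the emoji columns are built afterwards.

```python
import re
import string

EMOJI = "\U0001F300-\U0001FAFF\u2600-\u27BF"   # rough emoji ranges, not exhaustive
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@[A-Za-z0-9_]+")
EMOJI_RE = re.compile(f"[{EMOJI}]")
CJK_RE = re.compile(r"[\u4e00-\u9fff]+")
# Stand-in segmenter: CJK runs, Latin/digit runs, single emojis, other single symbols.
TOKEN_RE = re.compile(f"[\u4e00-\u9fff]+|[A-Za-z0-9]+|[{EMOJI}]|\\S")
PUNCT = set(string.punctuation) | set("，。！？、；：“”‘’（）《》【】…～·")
STOPWORDS = {"的", "了", "是", "the", "a", "is"}   # placeholder (en & zh)
SENSITIVE = set()                                  # placeholder; list not given in the source


def preprocess_chinese(text, segment=TOKEN_RE.findall, to_simplified=lambda s: s,
                       stopwords=STOPWORDS, sensitive=SENSITIVE):
    """Apply the eight Chinese preprocessing steps and return a token list."""
    text = URL_RE.sub(" ", text)                                 # 1. remove URLs
    text = USER_RE.sub(" ", text)                                # 2. remove user names
    tokens = segment(text)                                       # 3. segment into words
    tokens = [t for t in tokens if t.lower() not in stopwords]   # 4. remove stopwords
    tokens = [to_simplified(t) for t in tokens]                  # 5. convert to simplified
    tokens = [t for t in tokens if not all(c in PUNCT for c in t)]  # 6. remove punctuation
    tokens = [t for t in tokens if t not in sensitive]           # 7. remove sensitive words
    # 8. remove non-Chinese words (emojis are kept, by assumption)
    return [t for t in tokens if CJK_RE.fullmatch(t) or EMOJI_RE.fullmatch(t)]


if __name__ == "__main__":
    print(preprocess_chinese("@bob 我 今天 很 开心，的 hello 😂 https://t.co/x"))
```

A dedicated segmenter can be passed through the `segment` argument, and a converter through `to_simplified`, without changing the order of the steps.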
We then generate three new columns, which:
- remove all emojis
- include all emojis
- include only the target emojis
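As a rough sketch, the three columns can be derived from one token list. The function name, the dictionary keys, and the emoji ranges below are hypothetical. The third variant is read here as "keep the words, and keep an emoji only if it is in the target set"; the source does not state this explicitly.

```python
import re

# Rough emoji ranges for illustration; not a complete emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_variants(tokens, target_emojis):
    """Return the three text variants for one tokenized tweet."""
    def is_emoji(token):
        return EMOJI_RE.fullmatch(token) is not None

    return {
        # Variant 1: all emojis removed.
        "text_no_emoji": [t for t in tokens if not is_emoji(t)],
        # Variant 2: every emoji kept.
        "text_all_emoji": list(tokens),
        # Variant 3: words kept, emojis kept only if they are in the target set.
        "text_target_emoji": [t for t in tokens
                              if not is_emoji(t) or t in target_emojis],
    }


if __name__ == "__main__":
    variants = emoji_variants(["good", "😂", "day", "🔥"], target_emojis={"😂"})
    for name, value in variants.items():
        print(name, value)
```

Applied row by row, each dictionary key would map to one of the new columns.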
For English tweets, we apply the following steps:
- remove URLs
- remove user names
- remove punctuation
- remove stopwords
- convert words to lowercase
- lemmatize
We then generate three new columns, which:
- 1) remove all emojis
- 1) remove non-English words
- 2) include all emojis
- 3) include only the target emojis
For Chinese tweets, we apply the following steps:
- remove URLs
- remove user names
- segment the text into words
- remove stopwords (en & zh)
- convert to simplified Chinese
- remove punctuation (en & zh)
- remove sensitive Chinese words
- remove English words
We then generate three new columns, which:
- 1) remove all emojis
- 2) include all emojis
- 3) include only the target emojis
- plot the most frequent emojis
- train Word2Vec embeddings on emojis only
- train Word2Vec embeddings on both emojis and words
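The analysis steps above could start from something like the following sketch. It counts emoji frequencies (the input to a frequency plot) and builds the emoji-only corpus, while the tokenized tweets themselves serve as the mixed corpus. The function names and emoji ranges are hypothetical. The source names Word2Vec but not a library, so `train_word2vec` assumes gensim (version 4 or later) and imports it lazily; its hyperparameters are illustrative defaults, not values from the project.

```python
import re
from collections import Counter

# Rough emoji ranges for illustration; not a complete emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_counts(corpus):
    """Count emoji occurrences over a corpus of tokenized tweets."""
    return Counter(t for sent in corpus for t in sent if EMOJI_RE.fullmatch(t))


def emoji_only_sentences(corpus):
    """Keep only emoji tokens; drop tweets that contain no emoji at all."""
    sentences = [[t for t in sent if EMOJI_RE.fullmatch(t)] for sent in corpus]
    return [s for s in sentences if s]


def train_word2vec(sentences, **overrides):
    """Train a skip-gram model; assumes gensim >= 4 is installed."""
    from gensim.models import Word2Vec  # third-party dependency (assumption)
    params = dict(vector_size=100, window=5, min_count=1, sg=1, seed=0)
    params.update(overrides)
    return Word2Vec(sentences=sentences, **params)


# Toy corpus standing in for the preprocessed tweets.
corpus = [
    ["good", "😂", "😂"],
    ["sad", "😭"],
    ["😂", "🔥", "fun"],
    ["plain", "text"],
]

top_emojis = emoji_counts(corpus).most_common(10)   # (emoji, count) pairs to plot
emoji_corpus = emoji_only_sentences(corpus)         # input for the emoji-only model
mixed_corpus = corpus                               # input for the emoji + word model
print(top_emojis)
```

With gensim available, `train_word2vec(emoji_corpus)` and `train_word2vec(mixed_corpus)` would give the two models, and `model.wv.most_similar("😂")` lists the nearest neighbours of an emoji in either space. The pairs in `top_emojis` can be passed directly to a bar chart.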