Author: Jerin Easo Thomas
Date: Spring 2023
Affiliation: Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington
Contact: [email protected]
Sexist comments on social media propagate harmful stereotypes and gender-based prejudice, harming users psychologically and fostering discrimination. This project aims to identify sexism in Twitter movements using machine learning and deep learning approaches. By analyzing sexist speech, we seek to create safer online environments, combat gender-based discrimination, and gain insights into societal prejudices.
This study focuses on two key questions:
- How reliable and efficient are machine learning methods in spotting sexism in Twitter movements?
- What are the source intentions behind tweeting sexist comments?
Various methodologies were employed in this study:
- Valence Aware Dictionary and Sentiment Reasoner (VADER): Used for sentiment analysis, especially on social media text.
- Neural Network: Deep learning model built from layers of interconnected units, loosely inspired by the organization of the human brain.
- Robustly Optimized BERT Pretraining Approach (RoBERTa): Transformer-based neural network architecture pre-trained on a large text corpus.
- Pytesseract: Python library for extracting text from image-based data.
- Google Translate API: Utilized for language translations.
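As a small illustration of the VADER step: VADER reports a compound score between -1 and 1 for each text. The sketch below maps that score to a coarse label using the cutoffs of 0.05 and -0.05 suggested in VADER's own documentation. The function name `label_from_compound` is hypothetical, and the thresholds actually used in this project are not stated here, so treat both as assumptions.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label.

    The default threshold follows the convention in VADER's documentation;
    it is an assumption, not necessarily the value used in this project.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Toy scores for illustration only, not project results.
    for score in (0.62, 0.0, -0.41):
        print(score, label_from_compound(score))
```

In practice the compound score would come from a VADER analyzer run on each tweet; that call is left out so the sketch has no third-party dependencies.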
Data on the Twitter movements #MeToo, #8M, and #TimesUp was drawn from the EXIST dataset. The dataset includes tweets in English and Spanish, annotated for sexism. Approximately 7,900 rows of data were collected, comprising tweet comments, language, annotators, user details, and labels.
The analysis proceeded in three stages:
- Data Gathering and Preprocessing: Included data analysis, preprocessing, and text normalization techniques.
- Model Building and Evaluation: Utilized VADER for sentiment analysis, neural networks, and RoBERTa for tweet classification.
- Tweet Classification and Data Visualization: Employed models to classify tweets as sexist or non-sexist, analyze source intentions, and visualize the results.
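The text normalization mentioned in the first stage could look roughly like the sketch below. The exact steps used in this project are not listed, so the choices here (lowercasing, dropping URLs and @-mentions, keeping hashtag words, stripping punctuation) are assumptions, and `normalize_tweet` is a hypothetical helper name.

```python
import re

# Patterns for common tweet noise (assumed cleaning steps, not the project's exact ones).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_WORD_RE = re.compile(r"[^\w\s]")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @user mentions
    text = HASHTAG_RE.sub(r"\1", text)   # keep the hashtag word, drop '#'
    text = NON_WORD_RE.sub(" ", text)    # drop remaining punctuation
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_tweet("@user Check THIS out: https://t.co/abc #MeToo!!"))
```

Because `\w` is Unicode-aware for Python strings, accented Spanish characters survive the punctuation step.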
Key findings from the analysis include:
- Distribution of Classified Tweets: Spanish tweets exhibited higher sexism rates compared to English tweets.
- Effectiveness of Image-based Data Classification: Models effectively distinguished sexist and non-sexist tweets from image-based data.
- Source Intention Distribution: The majority of English tweets showed 'Direct' source intention, while Spanish tweets displayed more varied intentions.
- Most Often Used Words: Word clouds revealed specific words more common in sexist tweets, providing insights into language usage.
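The per-language distribution and the word counts behind a word cloud can both be computed with simple tallies. The sketch below is a minimal version under stated assumptions: the label strings `"sexist"` and `"non-sexist"`, the language codes, and the helper names `sexist_rate_by_language` and `top_words` are all hypothetical, and the rows shown are toy data rather than results from this study.

```python
from collections import Counter


def sexist_rate_by_language(rows):
    """Return the share of tweets labeled 'sexist' for each language.

    rows: iterable of (language, label) pairs.
    """
    totals = Counter()
    sexist = Counter()
    for language, label in rows:
        totals[language] += 1
        if label == "sexist":
            sexist[language] += 1
    return {language: sexist[language] / totals[language] for language in totals}


def top_words(texts, n=3, stopwords=frozenset()):
    """Return the n most frequent whitespace-separated words across texts."""
    counts = Counter(
        word for text in texts for word in text.split() if word not in stopwords
    )
    return counts.most_common(n)


if __name__ == "__main__":
    # Toy rows for illustration only.
    rows = [("en", "sexist"), ("en", "non-sexist"), ("es", "sexist"), ("es", "sexist")]
    print(sexist_rate_by_language(rows))
    print(top_words(["a b a", "a c b"], n=2))
```

The `(word, count)` pairs returned by `top_words` are the kind of frequency data a word cloud is drawn from.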
The RoBERTa-large model emerged as the most accurate for tweet classification, demonstrating its efficacy in identifying sexism and detecting source intentions. This study provides valuable insights into combating sexism on social media platforms and fostering inclusive online communities.
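Comparing models on accuracy comes down to scoring each model's predictions against the annotated labels. The sketch below shows accuracy alongside F1 for the sexist class, which is often reported as well when classes are imbalanced. The metric set actually used in this study is not specified, so the choice of F1, the label strings, and the function names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="sexist"):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives means precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for illustration only, not results from this study.
    gold = ["sexist", "sexist", "non-sexist", "non-sexist"]
    pred = ["sexist", "non-sexist", "non-sexist", "sexist"]
    print(accuracy(gold, pred), f1_score(gold, pred))
```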
- Thomas Davidson, Dana Warmsley, Michael Macy, Ingmar Weber. "Automated Hate Speech Detection and the Problem of Offensive Language." Link
- Shimi Gersome and Jerin Mahibha. "Sexism Identification In Social Media Using Deep Learning Models." Link
- EXIST: sEXism Identification in Social Networks. Link
- Francisco Rodriguez-Sanchez, Jorge Carrillo-de-Albornoz, and Laura Plaza. "Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data." Link
- Regina Konig and Angela Heine. "Learning to detect sexism: An evaluation of the effects of a brief video-based intervention using ROC analysis." Link
- Google Translate API. Link
- Training and evaluation with the built-in methods. Link
- Sentiment Analysis using VADER. Link