The Hive Mind Queen is a machine-learning project that predicts the popularity of user-generated cards posted to Reddit. By gathering and processing card data, including card attributes and ability text, the project uncovers patterns and relationships that contribute to a card's success. Using data preprocessing and natural language processing techniques, The Hive Mind Queen provides insight into the factors that make a card resonate with the community, ultimately helping card creators design more appealing and engaging content.
Script for gathering card data and associated Reddit votes. This step involves fetching Reddit votes from a PostgreSQL database and using the card game server's API endpoints to retrieve card attributes. The combined data is then processed and stored in a CSV file.
```sh
py scraper.py
```
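The gathering flow looks roughly like the sketch below. The table name `reddit_votes`, the API URL, the `DATABASE_URL` variable, and the output filename are hypothetical stand-ins; the real script reads its connection settings from the `.env` file.

```python
import os

import pandas as pd
import psycopg2
import requests

# Hypothetical table name; real connection settings come from .env.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
votes = pd.read_sql("SELECT card_id, votes, timestamp FROM reddit_votes", conn)

# Fetch card attributes for each voted-on card from the game server's API
# (hypothetical endpoint; assumes the response includes a card_id field).
cards = [
    requests.get(f"https://example.com/api/cards/{card_id}").json()
    for card_id in votes["card_id"]
]

# Combine votes and attributes, then persist to CSV.
votes.merge(pd.DataFrame(cards), on="card_id").to_csv("cards.csv", index=False)
```

The resulting CSV contains the following columns: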
| Column Name | Description | Type | Example |
|---|---|---|---|
| votes | Number of votes the card has on Reddit | int | 4 |
| timestamp | Time when the card was posted (Unix timestamp) | int | 1580000000 |
| name | Name of the card | str | 'Cat With Frying Pan' |
| type | Type of the card | str | 'Unit' |
| affinity | Affinity of the card | str | 'Neutral' |
| rarity | Rarity of the card | str | 'Common' |
| tribes | Tribes of the card | str | 'Toon Cat' |
| realm | Realm of the card | str | 'Carachar' |
| ability_text | Ability text of the card | str | 'Summon: Deal 1 damage.' |
| cost | Cost of the card | int | 1 |
| hp | Health of the card | int | 1 |
| atk | Attack of the card | int | 1 |
Script for exploring the data. This step involves analyzing the dataset to gain insights and discover patterns. Look out for outliers and missing values.
```sh
jupyter-lab
```
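A few typical starting points for this exploration, assuming the CSV path from the scraping step:

```python
import pandas as pd

df = pd.read_csv("cards.csv")  # assumed output path of the scraper

df.describe()       # summary statistics; extreme min/max values hint at outliers
df.isna().sum()     # missing values per column
df["votes"].hist()  # vote distribution (typically heavy-tailed)
```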
Script for preparing the data. This step involves cleaning the data and transforming it into a format that can be used by the model.
```sh
py preparator.py
```
The data is cleaned by filling NaN values and removing unnecessary columns.
The data is converted into a binary classification problem by creating a new column `popular` and assigning it `1` or `0` depending on whether the card has more than 2 votes. A card with more than 2 votes is considered popular.
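A minimal sketch of both the cleaning and the labeling step, assuming the CSV path from the scraping step and an illustrative choice of columns to drop:

```python
import pandas as pd

df = pd.read_csv("cards.csv")  # assumed input path

# Fill missing values and drop columns the model doesn't need
# (the dropped columns here are an illustrative choice).
df["ability_text"] = df["ability_text"].fillna("")
df = df.drop(columns=["timestamp", "name"])

# A card with more than 2 votes is considered popular.
df["popular"] = (df["votes"] > 2).astype(int)
```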
A method to convert categorical variables into binary vectors by creating a separate column for each category. Each row has a 1 in the column corresponding to its category and 0 in all other columns. Suitable for the `type`, `affinity`, and `rarity` columns due to their small number of unique values.
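This is one-hot encoding; a minimal sketch using pandas `get_dummies` with toy rows (category values beyond those in the table above are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["Unit", "Spell"],
    "affinity": ["Neutral", "Neutral"],
    "rarity": ["Common", "Rare"],
})

# One binary column per category; each row has a 1 in its own category's column.
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
```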
A technique to transform categorical variables into numerical data by mapping categories to a fixed number of columns using a hashing function. Suitable for high-dimensional data with many categories, as it reduces dimensionality and computational cost, but it can introduce collisions. Applied to the `realm` and `tribes` columns due to their large number of unique values.
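This is the hashing trick; a minimal sketch using scikit-learn's FeatureHasher, the object later saved as `hasher.pkl`. `n_features=16` is an illustrative choice, not the project's actual setting:

```python
from sklearn.feature_extraction import FeatureHasher

# Each card's realm/tribe strings are hashed into a fixed-width vector.
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform([["Carachar", "Toon Cat"]])
print(hashed.toarray().shape)  # (1, 16)
```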
Scaling brings numerical features onto a comparable scale. This is done using the StandardScaler, which standardizes features by removing the mean and scaling to unit variance. Applied to the `cost`, `hp`, and `atk` columns.
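A minimal sketch with toy values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"cost": [1, 3, 7], "hp": [1, 2, 5], "atk": [1, 4, 4]})

# Remove each column's mean and scale it to unit variance.
scaler = StandardScaler()
df[["cost", "hp", "atk"]] = scaler.fit_transform(df[["cost", "hp", "atk"]])
```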
A method to convert text data into numerical vectors. The script uses the TF-IDF (Term Frequency-Inverse Document Frequency) technique, which weights each term by how often it appears in a card's text relative to how common it is across all cards, to convert the `ability_text` column into numerical vectors.
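A minimal sketch using scikit-learn's TfidfVectorizer (the object later saved as `tfidf_vectorizer.pkl`), with toy ability texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Summon: Deal 1 damage.", "Attack: Draw a card."]  # toy examples

# Each text becomes a sparse vector of TF-IDF weighted term scores.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
```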
During the data preparation stage for training, the script saves the hasher, scaler, and TF-IDF vectorizer objects, along with the feature names, to ensure consistency when preparing new data for predictions. These saved components are used to preprocess new data in the same way as the training data, allowing for a seamless integration with the trained model.
The following components are saved:
- `hasher.pkl`: The FeatureHasher object used for hashing the `realm` and `tribes` columns.
- `scaler.pkl`: The StandardScaler object used for scaling the `cost`, `hp`, and `atk` columns.
- `tfidf_vectorizer.pkl`: The TfidfVectorizer object used for embedding the `ability_text` column.
- `feature_names.csv`: A CSV file containing the feature names of the training data after preprocessing.
These components are stored in the `train_misc` directory and loaded when preparing new data for predictions.
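A sketch of the save/load round trip for one component, using pickle (the `.pkl` extension suggests pickle or joblib; the exact mechanism is an assumption):

```python
import pickle
from pathlib import Path

from sklearn.preprocessing import StandardScaler

Path("train_misc").mkdir(exist_ok=True)
scaler = StandardScaler().fit([[1], [3], [7]])  # stand-in for the fitted scaler

# Persist the fitted object after training...
with open("train_misc/scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# ...and load it back at prediction time so new data is scaled identically.
with open("train_misc/scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
```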
Script for training the machine learning model. This step involves loading and preprocessing the data, training a RandomForestClassifier, and evaluating its performance.
```sh
py trainer.py
```
The data is split into a training set (80%) and a testing set (20%).
A RandomForestClassifier model is instantiated with a random state of 42 for reproducibility. The hyperparameters are tuned using RandomizedSearchCV with cross-validation.
Feature importances are calculated, and the top 10 most important features are selected for the final model.
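A sketch of this training flow; the synthetic data and the search space are illustrative stand-ins, not the script's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in for the prepared card features and popular/not-popular labels.
X, y = make_classification(n_samples=500, n_features=30, random_state=42)

# 80/20 split; random_state=42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Tune hyperparameters with cross-validated random search.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [100, 200, 500],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10, cv=5, scoring="f1", random_state=42)
search.fit(X_train, y_train)
model = search.best_estimator_

# Indices of the ten most important features, most important first.
top10 = model.feature_importances_.argsort()[::-1][:10]
```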
The model is evaluated on the training and testing sets. The evaluation metrics include accuracy, precision, recall, and F1-score. Bootstrap resampling is used to estimate the mean and standard deviation of the test F1-score.
Bootstrap resampling is used during the model evaluation stage to estimate the mean and standard deviation of the test F1-score. This technique involves creating multiple resampled test datasets by sampling with replacement from the original test set. The model is then evaluated on each of these resampled datasets, and the F1-scores are calculated. The mean and standard deviation of these F1-scores provide an estimate of the model's performance and its variability when applied to new, unseen data. By using bootstrap resampling, we can better understand the model's generalization ability and account for the randomness in the dataset.
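Continuing from the training sketch above (reusing `model`, `X_test`, and `y_test`), with an illustrative number of replicates:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
scores = []
for _ in range(1000):  # number of bootstrap replicates is an illustrative choice
    # Sample test indices with replacement and score the model on the replicate.
    idx = rng.integers(0, len(y_test), size=len(y_test))
    scores.append(f1_score(y_test[idx], model.predict(X_test[idx])))

print(f"test F1: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```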
Script for predicting the popularity of new cards using the trained machine learning model.
```sh
py predictor.py
```
Use the `prepare_data()` function from the "Preparation" section with `is_training=False` to preprocess the new card data.
Use the trained model to make predictions for the new card data. The output is printed to the console.
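A sketch of the flow; the input path is assumed, and `prepare_data()`'s exact signature may differ from what is shown here:

```python
import pandas as pd

new_cards = pd.read_csv("new_cards.csv")  # assumed input path

# prepare_data() is the project's helper from the Preparation section;
# is_training=False makes it reuse the saved preprocessing components.
X_new = prepare_data(new_cards, is_training=False)

print(model.predict(X_new))  # 1 = predicted popular, 0 = not popular
```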
- Python 3.10
- pip
- JupyterLab
- Clone the repository
- Install the required packages: `pip install -r requirements.txt`
- Copy the `.env.example` file and rename it to `.env`
- Fill in the required fields in the `.env` file