Multiple years' worth of Reddit Comments are publicly available in Google Cloud BigQuery. We will use a subset of the data and some SQL manipulation to create training data for predicting the score of a Reddit thread.
The Reddit sample demonstrates the capability of both linear and deep models on a Reddit Dataset.
- Make sure you follow the Google Cloud ML setup here before trying the sample. More documentation about Cloud ML is available here.
- Make sure your Google Cloud project has sufficient quota.
Install dependencies by running pip install -r requirements.txt
This sample consists of two parts:
Data pre-processing step involves reading data from Google Cloud BigQuery and converting it to TFRecords format.
Model training step involves taking the pre-processed TFRecords data and training a linear classifier using Stochastic Dual Coordinate Ascent (SDCA) optimizer, or a deep neural network classifier.
Above dataset is available in BigQuery and need to be transformed to TFRecords format for the sample code to work. Make sure to run the data through the pre-processing step before you proceed to training.
The pre-processing step can be performed either locally or on cloud depending upon the size of input data.
First you need to separate your input into training and evaluation sets. We can use one month's worth of data (December 2015) which amounts to approximately 20GB for training and we can then evaluate ourselves on a month's worth of "future" data (January 2016). Finally we can issue predictions for data even "further in the future" (February 2016).
We use the appropriate table names as the input flags --training_data
,
--eval_data
and --predict_data
respectively.
In order to run pre-processing on the Cloud run the commands below.
PROJECT=$(gcloud config list project --format "value(core.project)")
BUCKET="gs://${PROJECT}-ml"
GCS_PATH="${BUCKET}/${USER}/reddit_comments"
PREPROCESS_OUTPUT="${GCS_PATH}/reddit_$(date +%Y%m%d_%H%M%S)"
python preprocess.py --training_data fh-bigquery.reddit_comments.2015_12 \
--eval_data fh-bigquery.reddit_comments.2016_01 \
--predict_data fh-bigquery.reddit_comments.2016_02 \
--output_dir "${PREPROCESS_OUTPUT}" \
--project_id "${PROJECT}" \
--cloud
The sample implements a linear model trained with SDCA, as well a deep neural network model. The code can be run either locally or on cloud.
python -m trainer.task -h
To train the linear model (with crosses):
JOB_ID="reddit_comments_linear_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--config config-small.yaml \
-- \
--model_type linear \
--l2_regularization 3000 \
--eval_steps 1000 \
--output_path "${GCS_PATH}/model/${JOB_ID}" \
--raw_metadata_path "${PREPROCESS_OUTPUT}/raw_metadata" \
--transformed_metadata_path "${PREPROCESS_OUTPUT}/transformed_metadata" \
--transform_savedmodel "${PREPROCESS_OUTPUT}/transform_fn" \
--eval_data_paths "${PREPROCESS_OUTPUT}/features_eval*" \
--train_data_paths "${PREPROCESS_OUTPUT}/features_train*"
To train the deep model:
JOB_ID="reddit_comments_deep_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--config config-small.yaml \
-- \
--model_type deep \
--hidden_units 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 \
--batch_size 512 \
--eval_steps 250 \
--output_path "${GCS_PATH}/model/${JOB_ID}" \
--raw_metadata_path "${PREPROCESS_OUTPUT}/raw_metadata" \
--transformed_metadata_path "${PREPROCESS_OUTPUT}/transformed_metadata" \
--transform_savedmodel "${PREPROCESS_OUTPUT}/transform_fn" \
--eval_data_paths "${PREPROCESS_OUTPUT}/features_eval*" \
--train_data_paths "${PREPROCESS_OUTPUT}/features_train*"