
Commit

major refactor from top reddit posts to new viral reddit posts github org. stripped out things not related to lambda function. improved testing, fixed issues with deployment.
kennethjmyers committed Apr 14, 2024
1 parent ea36e16 commit e6cd71c
Showing 58 changed files with 660 additions and 14,253 deletions.
2 changes: 0 additions & 2 deletions .gitattributes

This file was deleted.

11 changes: 7 additions & 4 deletions .github/workflows/workflow.yml
@@ -5,17 +5,20 @@ on: [push, pull_request]
 jobs:
   build:
     runs-on: ubuntu-latest
-    name: Test python API
+    name: Run tests
     steps:
-      - uses: actions/checkout@v1
+      - uses: actions/checkout@v3

+      - name: local files
+        run: ls -al
+
       - name: Set up Python
         uses: actions/[email protected]
         with:
-          python-version: '3.7'
+          python-version: '3.12.3'

       - name: Install requirements
-        run: pip install -r requirements.txt
+        run: pip install .

       - name: Run tests and collect coverage
         run: pytest --cov .
52 changes: 52 additions & 0 deletions .gitignore
@@ -207,6 +207,58 @@ dmypy.json
# Pyre type checker
.pyre/

# Local .terraform directories
**/.terraform/*

# .tfstate files
*.tfstate
*.tfstate.*

# plan files
*-plan.out

# Crash log files
crash.log
crash.*.log

# Exclude all .tfvars files, which are likely to contain sensitive data, such as
# passwords, private keys, and other secrets. These should not be part of version
# control as they are data points which are potentially sensitive and subject
# to change depending on the environment.
*.tfvars
*.tfvars.json

# Ignore override files as they are usually used to override resources locally and so
# are not checked in
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Include override files you do wish to add to version control using negated pattern
# !example_override.tf

# Include tfplan files to ignore the plan output of command: terraform plan -out=tfplan
# example: *tfplan*

# Ignore CLI configuration files
.terraformrc
terraform.rc

.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets

# Local History for Visual Studio Code
.history/

# Built Visual Studio Code Extensions
*.vsix


#######################
# Unique to this repo #
#######################
5 changes: 0 additions & 5 deletions .pre-commit-config.yaml

This file was deleted.

63 changes: 63 additions & 0 deletions .terraform.lock.hcl

Some generated files are not rendered by default.

126 changes: 40 additions & 86 deletions README.md
@@ -1,86 +1,40 @@
# Top Reddit Posts

[![codecov](https://codecov.io/gh/kennethjmyers/Top-Reddit-Posts/branch/main/graph/badge.svg?token=ACZEU30AHM)](https://codecov.io/gh/kennethjmyers/Top-Reddit-Posts)

![Python](https://img.shields.io/badge/python-3.7-blue.svg)

![](./images/architecture-diagram.png)

This project intends to demonstrate knowledge of:

1. Data Engineering and ETL - the collection and cleaning of data
2. Working within the AWS ecosystem - tools used include DynamoDB, Lambda functions, S3, RDS, EventBridge, IAM (managing permission sets, users, roles, policies, etc), ECR, and Fargate.
3. Data Science and Analysis - building out a simple model using collected data

**Make sure to read the [Wiki](https://github.com/kennethjmyers/Top-Reddit-Posts/wiki) for set-up instructions.**

## What is this?

This project collects data from rising posts on Reddit and identifies features that predict an upcoming viral post. It then automates the prediction process via a Docker container deployed on AWS Fargate and notifies users of potentially viral posts.

The current model is a GBM, and it steps up the top 3% of posts based on testing data.
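
For illustration only, "stepping up" the top 3% roughly corresponds to fitting a classifier and flagging posts whose predicted probability clears a quantile cutoff. The sketch below is an assumption of that idea, not this project's actual training code, and the feature matrices are placeholders.

```python
# Hedged sketch of the "step up the top 3%" idea. The data, features, and
# hyperparameters are placeholders, not this project's training pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_and_choose_threshold(X_train, y_train, X_score, top_frac=0.03):
    """Fit a GBM and pick the score cutoff that flags the top `top_frac` of posts."""
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_score)[:, 1]     # probability of going viral
    threshold = np.quantile(scores, 1 - top_frac)   # cutoff for the top 3%
    return model, threshold

# A post is "stepped up" (the user is notified) when its score clears the cutoff:
# notify = model.predict_proba(new_post_features)[:, 1] >= threshold
```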

### Why?

It takes a lot of time to scroll Reddit for new and rising posts in order to contribute to the conversation early enough that people might see it. The "Rising" view of r/pics typically has over 250 different posts reach it per day, and many more than that are posted to the "New" view. However, only a handful of posts make it to the front page.

Furthermore, by the time a post reaches the front page it can have hundreds of comments and very few people will scroll past the top few posts. Therefore it is important to get to a post early to increase the likelihood that your comment will rise to the top.

This project removes much of the time and effort of continuously searching Reddit for the next big post and lets one limit comments to only posts with a high probability of virality. The goal is to increase the chance that your voice will be heard.

### Example

Below is a sample of [the first post](https://www.reddit.com/r/pics/comments/132ueaa/the_first_photo_of_the_chernobyl_plant_taken_by/) the bot found when it was deployed. When it first sent the notification, the post had only 29 comments but it went on to garner over 57k upvotes and over 1.4k comments.

It is easy to see how advantageous it can be to find future viral posts before they hit the front page.

![](./images/bot-example.png)

### Results

When I started using the bot, my account of 12 years had 8,800 comment karma. For the first 10 days the model only predicted on r/pics posts and in those 10 days I had increased my comment karma to over 21,300, a 142% increase. To reiterate, **I more than doubled my comment karma of 12 years in just 10 days**.

On the 10th day I expanded the model to 8 more of the top subreddits (though the model had not been trained on these) and I used this regularly for days 10-20. During this time **I was averaging about 5,000 karma per day**. Used continuously this would be about 150k karma per month and 1.8MM karma/year. On multiple occasions my replies were given Reddit awards (which was also not something I'd received in the past) and I was given "achievement" flairs for being in the top 1% of karma earners.

![](./images/Karma_growth_vs_day.png)

User results can vary, however, as the approach requires the user to be available when new notifications arrive, and it depends on the user's replies and their understanding of Reddit and each subreddit's rules and userbase.


#### Performance Monitoring

Performance monitoring is not yet a live process but can be found in the [monitoring notebook](model/Monitoring.ipynb).

![](./images/monitoring-precisions-recalls20230512.png)

The above plot shows precisions and recalls for the various subreddits the model has been operating on. The current version of the model was only trained on r/pics; as such, it tends to perform best there, with about 45% recall and >80% precision. This is in line with the step-up threshold (top 1%) that was selected. However, it performs worse for the other subreddits due to different subscriber sizes and activity levels. Data collection has been updated to plan for the next version of the model, which will account for these variables.

Furthermore, the next version of the model will attempt to be more flexible about the time of prediction. The current model mainly relies on the 40-60 minute time block, and most predictions only come in during the last 10 minutes of a post's first hour. But often these posts already have dozens or, in rare cases, hundreds of comments by then, so the model must become more agile.
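
For reference, a minimal sketch of how per-subreddit precision and recall like the above could be computed is shown below; the column names are illustrative assumptions, not this repo's monitoring code.

```python
# Hedged sketch of per-subreddit precision/recall. Column names are
# assumptions for illustration, not the actual monitoring pipeline.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def per_subreddit_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Expects columns: subreddit, wentViral (0/1 truth), predictedViral (0/1)."""
    rows = []
    for subreddit, grp in df.groupby("subreddit"):
        rows.append({
            "subreddit": subreddit,
            "precision": precision_score(grp["wentViral"], grp["predictedViral"], zero_division=0),
            "recall": recall_score(grp["wentViral"], grp["predictedViral"], zero_division=0),
        })
    return pd.DataFrame(rows)
```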


## More Information

### Requirements

1. python == 3.7 (I have not tested this on other versions and cannot guarantee it will work)
2. [AWS CLI v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)

### Components

1. Check out the [Getting Started](https://github.com/kennethjmyers/Top-Reddit-Posts/wiki/Getting-Started) section of the wiki for setting up your AWS account and local environment.
2. [Lambda function](./lambdaFunctions/getRedditDataFunction/) to collect data and store in DynamoDB. See [the Wiki](https://github.com/kennethjmyers/Top-Reddit-Posts/wiki/Lambda-Function---getRedditDataFunction) for setup instructions.
3. [ETL](model/ModelETL.py), [Analysis](./model/univariateAnalysis.ipynb) and [Model creation](model/model-GBM.ipynb).
   1. Currently EMR is not utilized for the ETL process, but the ETL was written in PySpark so that it can scale on EMR as data grows.
   2. DynamoDB is not really meant for bulk reads and writes, so it is not ideal for large ETL processes. It was chosen to demonstrate knowledge of an additional datastore and because it is available on the AWS free tier. When reading data from DynamoDB into Spark, I implemented data chunking to gather multiple DynamoDB partitions before they are distributed with Spark, improving read efficiency (see the sketch after this list).
   3. Model data and the model are stored on S3.
4. [Docker image](model/Dockerfile) hosted on ECR and deployed on ECS via Fargate that automates the [prediction ETL process](model/PredictETL.py), stages predicted results to a Postgres database on RDS, and sends notifications to the user via Discord.
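
For illustration, below is a minimal sketch of the chunked read described above: paginate the DynamoDB scan, accumulate several pages into a larger chunk, and only then hand each chunk to Spark. The boto3 usage, table name, and chunk size are assumptions, not values from this project.

```python
# Minimal sketch of chunked DynamoDB -> Spark reads. The table name and chunk
# size are illustrative assumptions, not this project's actual ETL settings.
import boto3
from pyspark.sql import Row, SparkSession

def scan_in_chunks(table_name, chunk_size=5000):
    """Yield lists of items, accumulating multiple DynamoDB scan pages per chunk."""
    table = boto3.resource("dynamodb").Table(table_name)
    chunk, kwargs = [], {}
    while True:
        page = table.scan(**kwargs)
        chunk.extend(page["Items"])
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    if chunk:
        yield chunk

spark = SparkSession.builder.getOrCreate()
# Each accumulated chunk becomes one DataFrame; union them for the full dataset.
dfs = [spark.createDataFrame([Row(**item) for item in chunk])
       for chunk in scan_in_chunks("rising-posts-table")]
```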

### A Note on Costs

This project tries to maximize the variety of AWS tools used while keeping costs low, particularly on the AWS Free Tier.

If you are on the AWS free tier, then the primary cost is the use of Fargate. Currently, while returning results for a single subreddit every 10 minutes, the cost is about $0.20/day or about $6/month.

Keep an eye on the costs though as this project uses S3, RDS, Lambda, and other services which are free within limits but will start to incur costs if you go beyond their trial limits or continue past the trial period.

In [April 2023 Reddit announced](https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html) that they would be charging for API access in the future. This did not affect this project at the time of creation but could affect others in the future.
![Python](https://img.shields.io/badge/python-3.12.3-blue.svg)

# Reddit Scraping

The purpose of this repo is to deploy an AWS Lambda function that scrapes rising and hot Reddit posts.
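
As a rough sketch of what such a function might look like (the PRAW usage, environment variable names, table schema, and hard-coded subreddit below are illustrative assumptions, not this repo's actual code), it pulls rising and hot submissions and writes them to DynamoDB:

```python
# Hedged sketch of the Lambda's job. Env var names, the table schema, and the
# hard-coded subreddit are assumptions for illustration only.
import os

import boto3
import praw

def lambda_handler(event, context):
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="viral-reddit-posts-scraper",
    )
    table = boto3.resource("dynamodb").Table(os.environ["DYNAMODB_TABLE"])
    with table.batch_writer() as batch:
        for view in ("rising", "hot"):
            for post in getattr(reddit.subreddit("pics"), view)(limit=25):
                batch.put_item(Item={
                    "postId": post.id,
                    "view": view,
                    "title": post.title,
                    "score": post.score,
                    "numComments": post.num_comments,
                    "createdUTC": int(post.created_utc),
                })
    return {"statusCode": 200}
```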

# How to use

1. First ensure the DynamoDB tables are set up via [DynamoDB-Setup](https://github.com/ViralRedditPosts/DynamoDB-Setup).
2. Install prerequisites - see the [prerequisites section on this page](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/aws-build#prerequisites) for additional information; the steps are essentially:
   1. Install the Terraform CLI
   2. Install the AWS CLI, then run `aws configure` and enter your AWS credentials.
3. Clone this repository
4. You can run the tests locally by doing the following (it is recommended that you manage your Python environments with something like [asdf](https://asdf-vm.com/) and use python==3.12.3 as your local runtime):

```sh
python -m venv venv # sets up a local virtual env using the current python runtime
source ./venv/bin/activate # activates the virtual env
pip install -e . # installs this package in the local env with dependencies
pytest . -r f -s # -r f shows extra info for failures, -s disables capturing
```

5. From within this repository run the following:

```sh
terraform init
terraform workspace new dev # creates the dev workspace and switches to it
terraform plan -var-file="dev.tfvars" -out=dev-plan.out
terraform apply -var-file="dev.tfvars" dev-plan.out
```

For deploying to prd:

```sh
terraform workspace new prd # or terraform workspace select prd if already created
terraform plan -var-file="prd.tfvars" -out=prd-plan.out
terraform apply -var-file="prd.tfvars" prd-plan.out
```

On subsequent updates you don't need to `init` or make a new workspace again.
65 changes: 0 additions & 65 deletions configUtils.py

This file was deleted.

2 changes: 2 additions & 0 deletions dev.tfvars
@@ -0,0 +1,2 @@
env = "dev"
cloudwatch_state = "DISABLED"
Binary file removed images/Karma_growth_vs_day.png
Binary file removed images/Karma_pct_Growth_vs_Day.png
Binary file removed images/architecture-diagram.png
Binary file removed images/bot-example.png
Binary file removed images/lambda_function_setup.png
Binary file removed images/layer_setup.png
Binary file removed images/monitoring-precisions-recalls20230512.png