Venmo Research

The code used in the paper Contact Tracing With Venmo, written as part of UT Austin's Computational Media Lab.

[methods overview figure]

Usage & Replication

Database Setup

You'll need a fresh postgres database hosted on a server with at least ~100GB of free storage. If you have the disk space and reasonable hardware, you could also just download and run postgres locally. I also heavily recommend using pgAdmin for debugging and exploring the database.

You'll need to have the following environment variables set when running all the commands below: POSTGRES_PASS, POSTGRES_ADDR, POSTGRES_USER, POSTGRES_DB. For example:

export POSTGRES_PASS=password
export POSTGRES_ADDR=127.0.0.1:5432
export POSTGRES_USER=postgres
export POSTGRES_DB=venmo

Download Research Code

Download and extract the latest binaries from releases.

If you're familiar with Go you could also clone this repo and go run things.

Venmo Collection

  1. Create a Venmo account
  2. Use your Venmo login to generate an API key with scripts/login.py. This only has to be done once as the API key does not expire.
  3. Collect data
Randomly scrape transactions by user
./scrape -mode transactions -token <your API key here> -random
Scrape transactions of users with IDs between 0 and 95000000 using 5 parallel workers.
./scrape -mode transactions -token <your API key here> -start_id 0 -end_id 95000000 -workers 5
As machine 2 of 10 (0-indexed), scrape transactions of users with IDs between 0 and 95000000 using 5 parallel workers (see the sketch after this list for how the ID range is split across shards).
./scrape -mode transactions -token <your API key here> -start_id 0 -end_id 95000000 -workers 5 -shard_idx 2 -shard_cnt 10
Continuously scrape the latest transactions from https://venmo.com/api/v5/public.
./scrape -mode transactions2 -token <your API key here>
View help
./scrape -h
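
For intuition, here is a minimal Python sketch of how -start_id/-end_id together with -shard_idx/-shard_cnt could divide the ID range across machines. The actual Go scraper may split the range differently (e.g. striding rather than contiguous chunks), so treat the helper below as an illustrative assumption rather than the real implementation.

def shard_range(start_id, end_id, shard_idx, shard_cnt):
    """Return the (lo, hi) slice of user IDs one shard would scan.

    Assumes contiguous chunks; the real scraper may partition differently.
    """
    chunk = (end_id - start_id) // shard_cnt
    lo = start_id + shard_idx * chunk
    # The last shard absorbs any remainder from the integer division.
    hi = end_id if shard_idx == shard_cnt - 1 else lo + chunk
    return lo, hi

# Machine 2 of 10 scanning IDs 0..95000000:
print(shard_range(0, 95_000_000, 2, 10))  # (19000000, 28500000)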

Name Search (finding social media profiles)

Randomly sample Venmo users from the database and look them up on Bing, DuckDuckGo, and PeekYou.
./scrape -mode namesearch -workers 1
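
As a rough illustration of the sampling half of this step, the snippet below pulls a few random users out of the users table (see the schema further down) with psycopg2; the search-engine lookups themselves are omitted, and the actual Go implementation may sample differently.

import os
import psycopg2

# Uses the same POSTGRES_* environment variables as above.
host, port = os.environ["POSTGRES_ADDR"].split(":")
conn = psycopg2.connect(
    host=host, port=port,
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASS"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    # ORDER BY random() is slow on millions of rows but fine as a sketch.
    cur.execute("SELECT id, name, username FROM users ORDER BY random() LIMIT 10")
    for user_id, name, username in cur.fetchall():
        print(user_id, name, username)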

Geotag Extraction (scraping Facebook)

  1. Create a Facebook account (the account must be created with a phone number to avoid being blocked)
  2. Install Chrome
  3. Download chromedriver as well as the latest Selenium server
  4. Collect data
Randomly sample users with PeekYou matches and extract geotags
./scrape -mode peekyoulocs -fb_user <facebook phone number> -fb_pass <facebook password> -sel_driver chromedriver -sel_headless -workers 3
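
For context, this is roughly what a headless Chrome session driven through Selenium looks like in Python; the scraper itself is Go talking to the Selenium server, so everything below (including the login URL and field names) is an illustrative assumption rather than the actual implementation.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Headless Chrome via chromedriver, roughly what -sel_headless implies.
opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)

# Hypothetical Facebook login flow; the selectors here are assumptions.
driver.get("https://www.facebook.com/login")
driver.find_element(By.ID, "email").send_keys("<facebook phone number>")
driver.find_element(By.ID, "pass").send_keys("<facebook password>")
driver.find_element(By.NAME, "login").click()
driver.quit()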

Analysis & Visualization

  1. Open a Jupyter notebook in this repo
  2. Pip install the necessary dependencies
  3. Edit the connect() function to match the parameters for your database (a sketch of such a function is shown after this list)
  4. Run the notebook cells in order
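
For reference, a minimal connect() along these lines (using psycopg2 and the same POSTGRES_* environment variables as above; the exact function in the notebooks may differ) would be:

import os
import psycopg2

def connect():
    # Adjust these to match your own postgres installation.
    host, port = os.environ.get("POSTGRES_ADDR", "127.0.0.1:5432").split(":")
    return psycopg2.connect(
        host=host,
        port=port,
        user=os.environ.get("POSTGRES_USER", "postgres"),
        password=os.environ["POSTGRES_PASS"],
        dbname=os.environ.get("POSTGRES_DB", "venmo"),
    )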

Our Dataset

Creating our dataset took several months, and with several API changes since then, Venmo collection may no longer be possible at this scale (135M transactions, 22.1M users). Open an issue here or contact us if you would like to receive a copy of our dataset (note: we'll need to verify your use case and intentions beforehand, and additional restrictions may apply).

Use the following command (with parameters adjusted for your postgres installation) to replicate the database used when running our notebooks:

$ pg_restore --host "127.0.0.1" --port "5432" --username "postgres" --no-password --dbname "venmo" --verbose "dataset.sql"

Schema

{
	'users':{
		'created':'timestamp without time zone',
		'bing_results':'json',
		'facebook_results':'json',
		'facebook_profile':'json',
		'peek_you_results':'json',
		'is_business':'boolean',
		'cancelled':'boolean',
		'id':'bigint',
		'last_name':'character varying',
		'username':'character varying',
		'picture_url':'character varying',
		'name':'character varying',
		'ddg_results':'text',
		'external_id':'character varying',
		'first_name':'character varying'
	},
	'transactions':{
		'created':'timestamp without time zone',
		'updated':'timestamp without time zone',
		'actor_user_id':'bigint',
		'recipient_id':'bigint',
		'id':'bigint',
		'message':'character varying',
		'story':'character varying',
		'type':'character varying'
	},
	'user_to_transactions':{
		'user_id':'bigint',
		'transaction_id':'bigint',
		'is_actor':'boolean'
	}
}
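
user_to_transactions appears to be a join table linking each user to the transactions they appear in, with is_actor presumably marking whether that user initiated the transaction. A query along these lines (an illustrative assumption, not taken from our notebooks) pulls one user's most recent transactions using the connect() sketch above:

# Illustrative join over the schema above (not from the notebooks).
QUERY = """
    SELECT t.id, t.created, t.type, t.message, ut.is_actor
    FROM user_to_transactions ut
    JOIN transactions t ON t.id = ut.transaction_id
    WHERE ut.user_id = %s
    ORDER BY t.created DESC
    LIMIT 20
"""

user_id = 12345  # any id from the users table
with connect() as conn, conn.cursor() as cur:  # connect() as sketched above
    cur.execute(QUERY, (user_id,))
    for row in cur.fetchall():
        print(row)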

TACC Suggestions

TACC can be a huge pain compared to any cloud provider but it can be useful as a free (for us at UT) compute resource. Personally, I only used it for jobs running in the transactions and namesearch modes. You can use scripts/scrape.tacc.job as a template for doing this. Keep in mind that you'll need to download and extract the latest release, update the environment variables (see placeholders in the script), and run $ sbatch scrape.tacc.job while on a stampede2.tacc.utexas.edu login node.

It would be extremely useful to run postgres directly on TACC, but running a database as a job is pretty weird (I contacted them and that's the only way of doing it now) as it will only run for a fixed amount of time (e.g. 6 hours) before shutting down, and you'll have to wait in the job queue before it even starts. If you do still want to try this, I've left some snippets below that may be useful.

# after starting an interactive job w/ idev
# use a container (TACC uses a Docker alternative called Singularity) to run the postgres server
module load tacc-singularity
singularity pull docker://postgres
SINGULARITYENV_POSTGRES_PASSWORD=pgpass SINGULARITYENV_PGDATA=$SCRATCH/pgdata singularity run --cleanenv --bind $SCRATCH:/var postgres_latest.sif

# port forwarding with ssh magic (copied from the VNC demo script); you could maybe ngrok tcp 5432 instead (?)
NODE_HOSTNAME=`hostname -s`
for i in `seq 4`; do
    ssh -q -f -g -N -R 15426:$NODE_HOSTNAME:15426 login$i
done
ssh -f -N -L 15426:stampede2.tacc.utexas.edu:15426 <your username>@stampede2.tacc.utexas.edu