1-getting-projects.Rmd

---
title: "1 - Getting Github project urls"
output: html_notebook
---

```{r setup, echo=F, results='hide'}
Sys.setenv(R_NOTEBOOK_HOME = getwd())
```

> Do not execute this notebook on the Virtual Machine during artifact evalation, all necessary tools have already been setup for you in advance.

Download a snapshot of [GHTorrent MySQL dump](http://ghtorrent.org/downloads.html), and extract the projects table, such as:

```{bash}
cd $R_NOTEBOOK_HOME
mkdir ghtorrent
cd ghtorrent
curl https://ghtstorage.blob.core.windows.net/downloads/mysql-2017-01-19.tar.gz -o ghtorrent.tar.gz
tar --extract --file=ghtorrent.tar.gz mysql-2017-01-19/projects.csv
tar --extract --file=ghtorrent.tar.gz mysql-2017-01-19/project_commits.csv
mv mysql-2017-01-19/projects.csv projects.csv
mv mysql-2017-01-19/project_commits.csv project_commits.csv
rm -rf mysql-2017-01-19
cd ..
```    

> Note that getting the entire ghtorrent and extracting the files may take a lot of time, depending on the connection speed. 
    
Note that the exact paths will differ based on the exact mysql snapshot used, but at the end of this step you should have the files `ghtorrent/projects.csv` and `ghtorrent/project_commits.csv` ready.

> The file `project_commits.csv` will be required for metadata acquisition in step 7. 

For the paper, 2016-11-01 (no longer available at GHTorrent site, archived at ginger). 

## Next Steps

[Downloading and tokenizing the projects](2-tokenizing.nb.html) in file [`2-tokenizing.Rmd`](2-tokenizing.Rmd).