readme-full-dataset.Rmd

---
title: "Entire Dataset"
output: html_notebook
---

This machine contains the entire datasets presented in the paper for all four languages so that you can inspect the entire data we report. Read access to the database is provided for user `oopsla` with password `vancouver`, i.e. you can connect from commandline using the following command:

    mysql -u oopsla -p
    
and then typing the password `vancouver` in the command prompt. 

## Database Layout

Each language has its own database, i.e. `cpp` for C++, `java` for Java, `python` for Python and `jsHalf` for JavaScript. You can also find the CSV files the respective subdirectories of the `~/datasets` folder of your username. 

All datasets contain at least the basic tables & columns required for the graphs, but certain languages may provide extra tables, columns or csv files. Consult the paper and the artifact pipelines in the VM for more details. 

## Producing the graphs

Calculating all the graphs for the entire datasets is very time consuming process (hours) so we do not provide an executable notebook. Instead we recommend you execute the following in the shell:

    tmux
    cd ~/pipeline
    Rscript -e "source('config.R'); source('helpers.R'); artifactFullDataset();"
    
This will run the `artifactFullDataset` function we provide that calculates the outputs described in the VM in steps 8 and 9. By using `tmux` you can close the connection and return to it any time later by executing the following command when logged on `ginger`:

    tmux attach
    
For details, see the `tmux` command help. 

When done, the output of the script will contain all numeric values calculated for the four languages. The script will also create the following directories, which would contain the graphs as presented in the paper:

    ~/pipeline/graphs/cpp
    ~/pipeline/graphs/java
    ~/pipeline/graphs/python
    ~/pipeline/graphs/jsHalf

## Dataset snapshot

You have full read access to the datasets presented in the paper under the following:

    ~/pipeline/datasets/cpp
    ~/pipeline/datasets/java
    ~/pipeline/datasets/python
    ~/pipeline/datasets/jsHalf

These contain the snapshots of the database after the processing steps.