8-reporting.Rmd

---
title: "Data Analysis & Reporting"
output:
  html_notebook: default
  html_document: default
---

```{r setup, echo=F, results='hide'}
Sys.setenv(R_NOTEBOOK_HOME = getwd())
library(ggplot2)
source("config.R")
source("helpers.R")
```

> This R notebook can be executed from within R, you can update the database connection properties and the dataset settings in the `config.R` file, or run the commands specified here interactively if you want to experiment.

In the last section of the artifact, all the graphs, tables & values used in the paper will be recreated. 

## Fig 1 - Heatmap

```{r}
createHeatmap(DATASET_NAME, DATASET_PATH, "commits")
```

## Table 1 - Corpus

For total, non-forked and unique URL project counts, run the scc preprocessor (replace the second argument with other language of your choice):

```{bash}
cd $R_NOTEBOOK_HOME
cd tools/sccpreprocessor/src
java SccPreprocessor stats ../../../ghtorrent/projects.csv JavaScript
```    

The rest of data is provided in the following snippet:

```{r}
tableCorpus(DATASET_NAME)
```

## Table 2 - File Level Duplication

```{r}
fileLevelDup(DATASET_NAME)
```

## Fig 3 - File Level Duplication

This graph has been created in Excel from the data in Table 2 above.

## Table 3 - File Level Duplication Excluding Small Files

```{r}
fileLevelDupNoSmall(DATASET_NAME)
```

## Fig 4 - File Level Duplication Excluding Small Files

This graph has been created in Excel from the data in Table 3 above.

## Table 4 - Inter Project Cloning

```{r}
interProjectCloning(DATASET_NAME)
```

## Fig 5 - Percentage of project clones at various levels of overlap.

This graph has been created in Excel from the data in table 4 above.

## Table 5 -  Number of tokens per file within certain percentiles of the distribution of file size.

```{r}
tokensPerFileQuantiles(DATASET_NAME)
```

## Table 6 - Corpus for Metadata Analysis.

```{r}
metadataCorpus(DATASET_NAME)
```

## Fig 10 - Files per project distributions.

```{r}
filesPerProjectDist(DATASET_NAME, DATASET_PATH, "Java")
```


## Fig 11 - SLOC per file distributions.

```{r}
slocPerFileDist(DATASET_NAME, "Java")
```


## Fig 12 - Stars per project distributions.

```{r}
starsPerProjectDist(DATASET_NAME, DATASET_PATH, "Java")
```


## Fig 13 - Commits per project distributions.

```{r}
commitsPerProjectDist(DATASET_NAME, DATASET_PATH, "Java")
```

## Table 7 -  Summary statistics for the entire dataset.

```{r}
summaryStats(DATASET_NAME)
```

## Table 8 - Summary statistics for the minimum set of files (distinct token hashes).

```{r}
summaryStatsTokenHash(DATASET_NAME)
```

## Table 9 - Summary statistics for the minimum set of files (distinct file hashes).

```{r}
summaryStatsFileHash(DATASET_NAME)
```

## Next Steps

[Language specific reporting](9-reporting-other.nb.html) in file [`9-reporting-other.Rmd`](9-reporting-other.Rmd).