---
title: "Data Analysis & Reporting"
output:
  html_notebook: default
  html_document: default
---
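```{r setup, echo=FALSE, results='hide'}
# Record the notebook's directory so the bash chunk below can return to it
Sys.setenv(R_NOTEBOOK_HOME = getwd())
library(ggplot2)
# Database connection properties and dataset settings
source("config.R")
# Helper functions for this notebook
source("helpers.R")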
```
> This R notebook can be executed from within R. To adapt it, update the database connection properties and the dataset settings in the `config.R` file, or run the commands shown here interactively if you want to experiment.

This last section of the artifact recreates all the graphs, tables, and values used in the paper.
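For orientation, a `config.R` along the following lines would provide the variables this notebook refers to. This is only a sketch: `DATASET_NAME` and `DATASET_PATH` are the names the chunks in this notebook use, while the `DB_*` variable names and all values are hypothetical placeholders, not the artifact's actual settings.

```r
# Hypothetical sketch of config.R; every value is a placeholder.

# Dataset settings (variable names taken from the chunks in this notebook)
DATASET_NAME <- "my_dataset"           # placeholder: dataset identifier
DATASET_PATH <- "/path/to/my_dataset"  # placeholder: directory with the dataset files

# Database connection properties (variable names are assumptions)
DB_HOST     <- "localhost"
DB_USER     <- "user"
DB_PASSWORD <- "password"
```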
## Fig 1 - Heatmap
```{r}
createHeatmap(DATASET_NAME, DATASET_PATH, "commits")
```
## Table 1 - Corpus
For the total, non-forked, and unique-URL project counts, run the scc preprocessor (replace the final argument, `JavaScript`, with another language of your choice):
```{bash}
# Start from the notebook directory recorded in the setup chunk
cd "$R_NOTEBOOK_HOME"
cd tools/sccpreprocessor/src
java SccPreprocessor stats ../../../ghtorrent/projects.csv JavaScript
```
The rest of the data is provided by the following snippet:
```{r}
tableCorpus(DATASET_NAME)
```
## Table 2 - File Level Duplication
```{r}
fileLevelDup(DATASET_NAME)
```
## Fig 3 - File Level Duplication
This graph has been created in Excel from the data in Table 2 above.
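One way to get tabular data from R into Excel is to write it to a CSV file first. The sketch below assumes the table is available as a data frame. Whether `fileLevelDup()` returns one or only prints it is not stated here, so the example uses a stand-in data frame with made-up column names and values. `exportForExcel` is a hypothetical helper, not part of `helpers.R`.

```r
# Hypothetical helper: write a data frame to a CSV file that Excel can open.
exportForExcel <- function(df, path) {
  write.csv(df, file = path, row.names = FALSE)
  invisible(path)
}

# Stand-in data: names and numbers are placeholders, not results from the paper.
example_table <- data.frame(
  category = c("a", "b", "c"),
  count    = c(1, 2, 3),
  stringsAsFactors = FALSE
)

csv_path <- exportForExcel(example_table, tempfile(fileext = ".csv"))

# Read the file back to confirm the round trip preserved the table.
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
```

The same pattern should apply to the other tables that were charted in Excel, provided their data can be obtained as a data frame.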
## Table 3 - File Level Duplication Excluding Small Files
```{r}
fileLevelDupNoSmall(DATASET_NAME)
```
## Fig 4 - File Level Duplication Excluding Small Files
This graph has been created in Excel from the data in Table 3 above.
## Table 4 - Inter Project Cloning
```{r}
interProjectCloning(DATASET_NAME)
```
## Fig 5 - Percentage of project clones at various levels of overlap.
This graph has been created in Excel from the data in Table 4 above.
## Table 5 - Number of tokens per file within certain percentiles of the distribution of file size.
```{r}
tokensPerFileQuantiles(DATASET_NAME)
```
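As a rough illustration of what a per-file token percentile table contains, base R's `quantile()` computes such cut points from a vector of token counts. The vector and the chosen percentiles below are toy assumptions; `tokensPerFileQuantiles()` may compute its values differently (for example, inside the database).

```r
# Toy token counts, one entry per file (placeholder values).
tokens_per_file <- c(10, 20, 30, 40, 50)

# Percentiles to report; this particular choice is an assumption.
probs <- c(0.25, 0.50, 0.75)

# By default quantile() interpolates linearly between order statistics (type 7).
token_quantiles <- quantile(tokens_per_file, probs = probs)
print(token_quantiles)
```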
## Table 6 - Corpus for Metadata Analysis.
```{r}
metadataCorpus(DATASET_NAME)
```
## Fig 10 - Files per project distributions.
```{r}
filesPerProjectDist(DATASET_NAME, DATASET_PATH, "Java")
```
## Fig 11 - SLOC per file distributions.
```{r}
slocPerFileDist(DATASET_NAME, "Java")
```
## Fig 12 - Stars per project distributions.
```{r}
starsPerProjectDist(DATASET_NAME, DATASET_PATH, "Java")
```
## Fig 13 - Commits per project distributions.
```{r}
commitsPerProjectDist(DATASET_NAME, DATASET_PATH, "Java")
```
## Table 7 - Summary statistics for the entire dataset.
```{r}
summaryStats(DATASET_NAME)
```
## Table 8 - Summary statistics for the minimum set of files (distinct token hashes).
```{r}
summaryStatsTokenHash(DATASET_NAME)
```
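The "minimum set of files" keeps one representative file per distinct hash. A minimal sketch of that idea in base R follows; the data frame, its column names (`file_id`, `token_hash`, `sloc`), and its values are hypothetical, and the notebook's own functions may perform this deduplication differently.

```r
# Toy file table: files 1 and 3 share a token hash (placeholder data).
files <- data.frame(
  file_id    = c(1, 2, 3, 4),
  token_hash = c("h1", "h2", "h1", "h3"),
  sloc       = c(100, 50, 100, 10),
  stringsAsFactors = FALSE
)

# Keep the first file seen for each distinct token hash.
distinct_files <- files[!duplicated(files$token_hash), ]

# Summary statistics over the deduplicated set.
sloc_summary <- summary(distinct_files$sloc)
print(sloc_summary)
```

Swapping `token_hash` for a file-content hash column would give the analogous set for distinct file hashes.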
## Table 9 - Summary statistics for the minimum set of files (distinct file hashes).
```{r}
summaryStatsFileHash(DATASET_NAME)
```
## Next Steps
Continue with [language-specific reporting](9-reporting-other.nb.html), in the file [`9-reporting-other.Rmd`](9-reporting-other.Rmd).