-
Notifications
You must be signed in to change notification settings - Fork 0
/
09-reproducibility.Rmd
232 lines (171 loc) · 9.99 KB
/
09-reproducibility.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
# Reproducibility
::: {.goals}
- How to organize your project directory
- Turning an RMarkdown notebook into a shareable document.
- Using Git to track changes.
- Using `{renv}` snapshots to manage package versions.
- Other ways to improve reproducibility
:::
Workflows that are based on graphical user interfaces are difficult to reproduce! In
contrast, when you code your analysis with R, all the steps are laid out in your script.
Well-written code allows someone else to retrace everything you have done and to
understand how you got from your data to your results.
To ensure that your results are truly reproducible, you must take additional measures. You
need to make sure that your code produces the same results on another computer and at a
later point in time. Moreover you should provide enough documentation to allow others to
understand the motivation behind your analysis.
You may be thinking that nobody will care to reproduce your analysis. Even if this is
true, it is quite possible that *you* will at some later point get back to it, and your
future self will certainly be grateful for making their job as easy as possible.
## Self-contained projects with RStudio
You have already learned how to store your analysis as a self-contained *RStudio project*.
This was the first step to reproducibility, because RStudio projects are portable and
self-contained.
In addition, you should get familiar with best practices for organizing and naming the
files in your project folder. As a primer, I recommend reading *Good Enough Practices for
Scientific Computing* [@wilsonGoodEnoughPractices2017]. A few of the key points are:
- Store your data in a `data/` directory. Keep your raw data as raw as possible.
- ...
- ...
## Reproducible reports with RMarkdown
Another tool for reproducible analyses that we have already introduced is *RMarkdown*. As
you know, an *RMarkdown* document lets you mix text, R code, and the output of that R
code.
Now here is something amazing about RMarkdown: With the click of a button, you can turn an
RMarkdown document into something that you can share with others, even if they do not have
R! Let's try this with our file `analysis.Rmd`
::: {.exercise}
Click the knit button.
:::
OK, but it still looks rather .... computerish. How do I make something that a normal
person would want to read, or a journal would be willing to accept? As so often n R, there
are packages that can help you with that. Packages to make pretty APA [tables for
models](http://www.danieldsjoberg.com/gtsummary/index.html),
[summary](https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html)
[stats](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html),
...
Note that RStudio has a visual markdown editor that makes authoring longer documents more
convenient. You can activate it with the button on the top right of the script panel.
And if the data changes or a journal reviewer asks you to exclude a subject? Update the
code and click the button again --- all the figures and stats update! No more copy and
paste!
### Other output formats
You can use RMarkdown to make a wide range of documents:
- Manuscripts. The [`{rticles}`](https://github.com/rstudio/rticles) package has
templates for many popular journals and publishers. The
[`{papaja}`](https://github.com/crsh/papaja) package lets you create manuscript in the
style of the American Psychological Association (APA).
- Books with [`{bookdown}`](https://bookdown.org/) (such as this one)
- websites, blogs with [`{blogdown}`](https://github.com/rstudio/blogdown),
[`{hugodown}`](https://hugodown.r-lib.org/),
[`{distill}`](https://rstudio.github.io/distill/)
- slides with [`{xaringan}`](https://github.com/yihui/xaringan), (also check out
[`{xaringanthemer}`](https://github.com/gadenbuie/xaringanthemer)`)`
- [teaching with
{learnR}](https://tinystats.github.io/teacups-giraffes-and-statistics/02_bellCurve.html)
- Interface with Microsoft Office - Excel, Word, PowerPoint with
[`{officeR}`](https://davidgohel.github.io/officer/)
::: {.more}
HTML is probably the most convenient and easy to use output format. If you want to use PDF
or Word, you will have to worry about page sizes and page breaks. It's best to avoid this
until later in the process. For PDF in particular, you will also have to install a
$\LaTeX$ distribution. I recommend `{tinytex}`.
:::
We will leave it at this short teaser for RMarkdown. There's much more that RMarkdown can
do, for example:
- insert references and bibliographies
- embed videos and interactive graphics
- [parameterized reports](https://rmarkdown.rstudio.com/lesson-6.html) for recurring
reports.
To learn more, check out one of the two big books of RMarkdown [@rmarkdown2018;
@rmarkdown2020], or RStudios tutorial on RMarkdown
(<https://rmarkdown.rstudio.com/lesson-1.html>). If you are a researcher, you might like
[RMarkdown for Scientists](https://rmd4sci.njtierney.com/) -- A guide to RMarkdown aimed
at researchers
## Version control with Git
Version control is an essential tool for any programming endeavour. It is like a time
machine for your code. Among other things, version control allows you to
- store the state of your project and restore a previous state.
- create parallel versions (branches) of your code, switch between them and merge them
back together
- collaborate with others on the same code, review and merge the changes that others
have made
Using version control is very liberating because it gives you the freedom to experiment
and make mistakes. A popular version control tool is [Git](https://git-scm.com/). To start
using Git with your project you will need to do four things:
1. **Install Git**
2. **initialize a new Git repository** in your project directory.
3. **Decide which files and changes you want to save** --- there may be some files that
you do not really need to track. For example, large intermediate data files may not be
worth tracking if they can be easily recreated from the code.
4. **Commit the current state** --- A *Git commit* is to code what the 💾 is to a Word
document.
In principle you could use Git in the terminal, but there are many graphical user
interfaces for Git that make things easier, and RStudio has a Git interface built
in.[^reproducibility-1]
[^reproducibility-1]: Personally, I prefer Github Desktop (<https://desktop.github.com/>)
because it is prettier, snappier, and arguably more user friendly.
``` {.bash}
git init
git add .
git commit -m "initial commit"
```
Congratulations! You have made your first Git commit! Now you can rest assured, that you
can always go back to the current state of the analysis, see the changes you have made
since, figure out where things broke, and undo changes if necessary. Isn't that
comforting?
To utilize the full power of Git, you will have to learn a few more things. This is
outside of the scope of this book but you can find a good introduction to Git for R users
in the book *Happy Git with R* (<https://happygitwithr.com/>). Learn to use Git and use it
often!
## Make your development environment reproducible
### Snapshot packages with {renv}
Unfortunately, using Git to track changes in your *own* code is not enough, because the
packages that you use may get updated too and sometimes, the developers decide to change
the way a function work. This can cause your code to break and it can be really
frustrating to find that your analysis from a few months ago suddenly does not run
anymore, or that it fails to run on your collaborators computer, who uses different
versions of the same packages. You can avoid this frustration by storing a snapshot of
your package library with the [`{renv}`](https://rstudio.github.io/renv/) package
[@R-renv].
Install the `{renv}` package, then -- in your project initialize `{renv}` for your project
and take a snapshot of the packages you use in your project:
```{r, eval=FALSE}
install.packages("renv")
renv::init()
renv::snapshot()
```
This creates a private, per-project library, in which new packages will be installed. This
library is isolated from other R libraries on your system. Later, you or your
collaborators can recreate this library, with the exact package versions you used by
running `renv::restore()`.
### Switching between R versions
R itself receives updates, too. While the updates are often minor and breaking changes are
rare, it is not guaranteed that today's code will work on a future version of R. It is
possible to install multiple versions of R in parallel on the same computer and switch
between them. On Windows you can do this in the RStudio interface,
`Tools ▸ Global Options ▸ General ▸ Basic ▸ Use RStudio Version ...`. On MacOs, it is a
little more complicated, but you can use a tool called *RSwitch*
(<https://rud.is/rswitch/>) that makes it a bit easier.
### Snapshot everything with Docker
Finally, you can also take the idea of snapshotting your environment a step further by
using [Docker Containers](https://www.rocker-project.org/). This is akin to creating an
entire virtual computer just for your analysis. Instead of just your analysis files, you
then track, store, and share that entire virtual computer. This is a lot more effort, but
it may be the only way to guarantee that your analysis can be reproduced out of the box.
However, this is beyond the scope of this book.
## More
R has many more tools for improving reproducibility, and as always there's more than one
option.
### Pipelines
Once your analysis becomes more complex it often contains many different outputs (models,
figures, intermediate data sets). Some of them can take long to compute and it can be
difficult to stay on top of the different steps of your analysis. There are packages that
try to solve that.
- [`{targets}`](https://books.ropensci.org/targets/) - smart make-files for your
analysis
- [`{orderly}`](https://www.vaccineimpact.org/orderly/) - makes sure your reproducible
reports are reproducible
- `{rrtols}`
### Test-driven development
with `{testthat}, {tinytest}. {testrmd}`