09-reproducibility.Rmd

# Reproducibility

::: {.goals}
-   How to organize your project directory
-   Turning an RMarkdown notebook into a shareable document.
-   Using Git to track changes.
-   Using `{renv}` snapshots to manage package versions.
-   Other ways to improve reproducibility
:::

Workflows that are based on graphical user interfaces are difficult to reproduce! In
contrast, when you code your analysis with R, all the steps are laid out in your script.
Well-written code allows someone else to retrace everything you have done and to
understand how you got from your data to your results.

To ensure that your results are truly reproducible, you must take additional measures. You
need to make sure that your code produces the same results on another computer and at a
later point in time. Moreover you should provide enough documentation to allow others to
understand the motivation behind your analysis.

You may be thinking that nobody will care to reproduce your analysis. Even if this is
true, it is quite possible that *you* will at some later point get back to it, and your
future self will certainly be grateful for making their job as easy as possible.

## Self-contained projects with RStudio

You have already learned how to store your analysis as a self-contained *RStudio project*.
This was the first step to reproducibility, because RStudio projects are portable and
self-contained.

In addition, you should get familiar with best practices for organizing and naming the
files in your project folder. As a primer, I recommend reading *Good Enough Practices for
Scientific Computing* [@wilsonGoodEnoughPractices2017]. A few of the key points are:

-   Store your data in a `data/` directory. Keep your raw data as raw as possible.
-   ...
-   ...

## Reproducible reports with RMarkdown

Another tool for reproducible analyses that we have already introduced is *RMarkdown*. As
you know, an *RMarkdown* document lets you mix text, R code, and the output of that R
code.

Now here is something amazing about RMarkdown: With the click of a button, you can turn an
RMarkdown document into something that you can share with others, even if they do not have
R! Let's try this with our file `analysis.Rmd`

::: {.exercise}
Click the knit button.
:::

OK, but it still looks rather .... computerish. How do I make something that a normal
person would want to read, or a journal would be willing to accept? As so often n R, there
are packages that can help you with that. Packages to make pretty APA [tables for
models](http://www.danieldsjoberg.com/gtsummary/index.html),
[summary](https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html)
[stats](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html),
...

Note that RStudio has a visual markdown editor that makes authoring longer documents more
convenient. You can activate it with the button on the top right of the script panel.

And if the data changes or a journal reviewer asks you to exclude a subject? Update the
code and click the button again --- all the figures and stats update! No more copy and
paste!

### Other output formats

You can use RMarkdown to make a wide range of documents:

-   Manuscripts. The [`{rticles}`](https://github.com/rstudio/rticles) package has
    templates for many popular journals and publishers. The
    [`{papaja}`](https://github.com/crsh/papaja) package lets you create manuscript in the
    style of the American Psychological Association (APA).

-   Books with [`{bookdown}`](https://bookdown.org/) (such as this one)

-   websites, blogs with [`{blogdown}`](https://github.com/rstudio/blogdown),
    [`{hugodown}`](https://hugodown.r-lib.org/),
    [`{distill}`](https://rstudio.github.io/distill/)

-   slides with [`{xaringan}`](https://github.com/yihui/xaringan), (also check out
    [`{xaringanthemer}`](https://github.com/gadenbuie/xaringanthemer)`)`

-   [teaching with
    {learnR}](https://tinystats.github.io/teacups-giraffes-and-statistics/02_bellCurve.html)

-   Interface with Microsoft Office - Excel, Word, PowerPoint with
    [`{officeR}`](https://davidgohel.github.io/officer/)

::: {.more}
HTML is probably the most convenient and easy to use output format. If you want to use PDF
or Word, you will have to worry about page sizes and page breaks. It's best to avoid this
until later in the process. For PDF in particular, you will also have to install a
$\LaTeX$ distribution. I recommend `{tinytex}`.
:::

We will leave it at this short teaser for RMarkdown. There's much more that RMarkdown can
do, for example:

-   insert references and bibliographies

-   embed videos and interactive graphics

-   [parameterized reports](https://rmarkdown.rstudio.com/lesson-6.html) for recurring
    reports.

To learn more, check out one of the two big books of RMarkdown [@rmarkdown2018;
@rmarkdown2020], or RStudios tutorial on RMarkdown
(<https://rmarkdown.rstudio.com/lesson-1.html>). If you are a researcher, you might like
[RMarkdown for Scientists](https://rmd4sci.njtierney.com/) -- A guide to RMarkdown aimed
at researchers

## Version control with Git

Version control is an essential tool for any programming endeavour. It is like a time
machine for your code. Among other things, version control allows you to

-   store the state of your project and restore a previous state.

-   create parallel versions (branches) of your code, switch between them and merge them
    back together

-   collaborate with others on the same code, review and merge the changes that others
    have made

Using version control is very liberating because it gives you the freedom to experiment
and make mistakes. A popular version control tool is [Git](https://git-scm.com/). To start
using Git with your project you will need to do four things:

1.  **Install Git**

2.  **initialize a new Git repository** in your project directory.

3.  **Decide which files and changes you want to save** --- there may be some files that
    you do not really need to track. For example, large intermediate data files may not be
    worth tracking if they can be easily recreated from the code.

4.  **Commit the current state** --- A *Git commit* is to code what the 💾 is to a Word
    document.

In principle you could use Git in the terminal, but there are many graphical user
interfaces for Git that make things easier, and RStudio has a Git interface built
in.[^reproducibility-1]

[^reproducibility-1]: Personally, I prefer Github Desktop (<https://desktop.github.com/>)
    because it is prettier, snappier, and arguably more user friendly.

``` {.bash}
git init
git add . 
git commit -m "initial commit"
```

Congratulations! You have made your first Git commit! Now you can rest assured, that you
can always go back to the current state of the analysis, see the changes you have made
since, figure out where things broke, and undo changes if necessary. Isn't that
comforting?

To utilize the full power of Git, you will have to learn a few more things. This is
outside of the scope of this book but you can find a good introduction to Git for R users
in the book *Happy Git with R* (<https://happygitwithr.com/>). Learn to use Git and use it
often!

## Make your development environment reproducible

### Snapshot packages with {renv}

Unfortunately, using Git to track changes in your *own* code is not enough, because the
packages that you use may get updated too and sometimes, the developers decide to change
the way a function work. This can cause your code to break and it can be really
frustrating to find that your analysis from a few months ago suddenly does not run
anymore, or that it fails to run on your collaborators computer, who uses different
versions of the same packages. You can avoid this frustration by storing a snapshot of
your package library with the [`{renv}`](https://rstudio.github.io/renv/) package
[@R-renv].

Install the `{renv}` package, then -- in your project initialize `{renv}` for your project
and take a snapshot of the packages you use in your project:

```{r, eval=FALSE}
install.packages("renv")
renv::init()
renv::snapshot()
```

This creates a private, per-project library, in which new packages will be installed. This
library is isolated from other R libraries on your system. Later, you or your
collaborators can recreate this library, with the exact package versions you used by
running `renv::restore()`.

### Switching between R versions

R itself receives updates, too. While the updates are often minor and breaking changes are
rare, it is not guaranteed that today's code will work on a future version of R. It is
possible to install multiple versions of R in parallel on the same computer and switch
between them. On Windows you can do this in the RStudio interface,
`Tools ▸ Global Options ▸ General ▸ Basic ▸ Use RStudio Version ...`. On MacOs, it is a
little more complicated, but you can use a tool called *RSwitch*
(<https://rud.is/rswitch/>) that makes it a bit easier.

### Snapshot everything with Docker

Finally, you can also take the idea of snapshotting your environment a step further by
using [Docker Containers](https://www.rocker-project.org/). This is akin to creating an
entire virtual computer just for your analysis. Instead of just your analysis files, you
then track, store, and share that entire virtual computer. This is a lot more effort, but
it may be the only way to guarantee that your analysis can be reproduced out of the box.
However, this is beyond the scope of this book.

## More

R has many more tools for improving reproducibility, and as always there's more than one
option.

### Pipelines

Once your analysis becomes more complex it often contains many different outputs (models,
figures, intermediate data sets). Some of them can take long to compute and it can be
difficult to stay on top of the different steps of your analysis. There are packages that
try to solve that.

-   [`{targets}`](https://books.ropensci.org/targets/) - smart make-files for your
    analysis
-   [`{orderly}`](https://www.vaccineimpact.org/orderly/) - makes sure your reproducible
    reports are reproducible
-   `{rrtols}`

### Test-driven development

with `{testthat}, {tinytest}. {testrmd}`