intro_to_ggplot2.Rmd

---
title: "Introduction to ggplot2"
author: "Elika Bergelson"
date: "6/24/2018"
output: 
  html_document:
    toc: true
    toc_float: true
---

# Intro and preliminaries

If you haven't already, install these packages (you don't have to do this every time), so if you need to, uncomment the line below and run it.

```{r}
#install.packages(c("tidyverse","knitr", "Hmisc"))
```

Then you load them with the "library" command. Confusingly, when you load the tidyverse library, some of its sub-libraries automatically load, and others need to be separately loaded (e.g. `broom`).

```{r}
library(tidyverse)
library(knitr)
library(broom)
library(Hmisc)

# some useful settings (options) ------------------------------------
options(tibble.width = 300,
        dplyr.width = 300)
# these make datasets easier to see when they get displayed on screen
# later, you can mess with them and see what they do if you want.
```


Reminder: how to get help from R. Put a question mark in front of a function or built-in/loaded dataset, and help will appear!

```{r}
?mean
?diamonds
#(mean is a function, diamonds is a dataset)
```


You can get in-line help by pressing tab as you go: R will autocomplete what you're typing within the function it will give you hints about the arguments the function takes try it out by typing `mean(` in the console below and then hitting tab.

# Data preprocessing

## Reading in data

you probably already read in the data in the intro script, but if you're just jumping in 


```{r}
ma_data <- read_csv("datasets/mental_abacus_data.csv")
ps_data <- read_csv("datasets/pragmatic_scales_data.csv")

```

remember, you can use summary, glimpse, and View to remind yourself what these data files look like (always good to be very careful with that!)

```{r}
summary(ma_data)
# View(ps_data)
glimpse(ma_data)
```

let's just make 2 simple aggregated version of this dataset, by subj & items

## Review of `dyplr`

```{r}
ps_data_bysubj_cond <- ps_data %>% #take your dataset
  group_by(subid, condition) %>% #retain subject and condition, collapse everything else, i.e. item 
  summarise(mean_corr = mean(correct, na.rm = TRUE),#create mean for each subj,cond
            sum_corr = sum(correct, na.rm = TRUE))# create sum for each subj,cond

ps_data_byitem_cond <- ps_data %>% #take your dataset
  group_by(item, condition) %>% #retain subject and item, collapse everything else, i.e. subj 
  summarise(mean_corr = mean(correct, na.rm = TRUE),#create mean for each item,cond
            sum_corr = sum(correct, na.rm = TRUE))# create sum for each item,cond

```


> protip: commenting what every single line does is great practice when you're stuck!

# On to graphing!

## Scatterplots!

Check out iris. 

```{r}
?iris
```

Let's plot.

```{r}
ggplot(data = iris)+
   geom_point(mapping = aes(x=Sepal.Length,
                            y =Species))
```

Now use the same approach on `ps_data`.

```{r}
ggplot(data = ps_data_byitem_cond)+
  geom_point(mapping = aes(x=condition,
                           y =sum_corr))
ps_data_byitem_cond
```

What's wrong with this graph?

```{r}
ggplot(data = ma_data)+
  geom_point(mapping = aes(x=grade,
                           y =arithmeticAverage))
ma_data
```


## jitter those points! 

one thing you should always ask yourself is: how many points,bars, lines should i be seeing?

```{r}
ggplot(data = ma_data)+
  geom_jitter(mapping = aes(x = grade,
                            y = arithmeticAverage))
```

hm, that's better, but now it feels a little TOO spread out, let's reign it in

```{r}
ggplot(data = ma_data)+
  geom_jitter(mapping = aes(x=grade,
                            y =arithmeticAverage),
              width = .2,
              height = 0)
```

##**Exercise.** Task 1. First scatterplots.
+ a) using the ps_data_byitem_cond, make a scatterplot of mean correct (x axis) by condition (y axis)
+ b) using the built-in iris dataset, make a scatterplot of Species by Petal.Width
+ c) using ps_data_bysubj_condition, make a scatterplot of sum correct by condition where you can appropriately see the dots

## aesthetics 

```{r}
ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width,
                           color = Species))
```

```{r}
ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width,
                           alpha = Species))

```

```{r}
ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width,
                           shape = Species,
                           alpha = Species))
```

```{r}
ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width,
                           size = Species))

```

```{r}
ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width),
             size = 4)
```

```{r}
ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width))+
  facet_wrap(facets = ~Species)
```


## **Exercise.** For Task 2, use ma_data and a scatterplot of your choosing (jittered if needed!).
+ a) set the shape of all the dots in a scatterplot to an asterisk
+ b) map a continuous variable onto color (hint: use 'summary' to see what's continuous!)
+ c) map a discrete variable (a factor or character) onto shape
+ d) map a continuous variable (an integer or double) onto shape
+ e) make a graph of your choosing using facet_wrap
+ f) advanced: make a graph of your choosing using facet_wrap AND one with facet_grid:what's the difference?

# Moving forward: Other geoms 

`geom_line` graph (we refer to this graph below in task 3c)

```{r}
ggplot(data = ma_data, aes(x= factor(year),#this just makes it treat year as a factor
                           y= arithmeticAverage, 
                           group = subid))+# group keeps the 'unit' at subid
  geom_point()+
  geom_line()
```

geom_hline
geom_text

```{r}
ggplot(data = ps_data_byitem_cond, mapping = aes(x=condition,
                                                 y=mean_corr))+
  geom_point()+

  geom_hline(yintercept = .5)+ #hey, this adds a line!
  geom_text(label = "1b", x = .7, y= .2, color = "purple")# this but '1b' in the corner!
```

the x and y tell it where to put the text, here `label` is 1 on the x axis

## Visualizing distributions 

in 1d; we refer to this graph in task 3d

```{r}
ggplot(data = ma_data, aes(x=gonogo))+
  geom_histogram(binwidth=.10)

```

in 2d

```{r}
ggplot(data = ps_data_bysubj_cond, aes(x=condition, y = mean_corr))+
  geom_boxplot()

```

with density info:

```{r}
ggplot(data = ps_data_bysubj_cond, aes(x=condition, y = mean_corr))+
  geom_violin()

```

with density AND dots!

```{r}
ggplot(data = ps_data_bysubj_cond, aes(x=condition, y = mean_corr))+
  geom_violin()+
  geom_jitter(width=.1, height=.01, shape =1)# i like shape #1 for legibility
```

## statistical transformation: smoothers

(and examples of 'local' vs. 'global' variable setting)

global x and y, color just for geom_point

```{r}
ggplot(data = ma_data, mapping = aes(x = arithmeticTotal, y = gonogo)) + 
  geom_point(mapping = aes(color = grade)) + 
  stat_smooth()
```

all vars global: what's the difference?

```{r}
ggplot(data = ma_data, mapping = aes(x = arithmeticTotal, y = gonogo, color = grade)) + 
  geom_point() + 
  stat_smooth()
```

filter the data for a layer

```{r}
ggplot(data = ma_data, mapping = aes(x = arithmeticTotal, y = gonogo, color = grade)) + 
  geom_point() + 
  stat_smooth(data = filter(ma_data,grade=="first grade"))# the smoother only gets grade 1 data!
```

take out a class, remove confidence bnd

```{r}
ggplot(data = ma_data, mapping = aes(x = arithmeticTotal, y = gonogo, color = grade)) + 
  geom_point() + #the points include everyone
  stat_smooth(data = filter(ma_data,group != "MA"), 
              se = FALSE) # but the smoother doesn't see MA group
```

what does `se = FALSE` do?

`stat_smooth` default is `loess` (local estimator)

```{r}
ggplot(data = ma_data, mapping = aes(x = arithmeticTotal, y = gonogo)) + 
  geom_point(aes(color = grade)) + 
  stat_smooth()
```

but you can make it fit a line

```{r}
ggplot(data = ma_data, mapping = aes(x = arithmeticTotal, y = gonogo)) + 
  geom_point(aes(color = grade)) + 
  stat_smooth( method="lm")
```

## **Exercise.** Task 3. Geoms, distributions, and smoothers.
+ a): go back to one of the scatter plots from #1 and add a loess smooth, and a line
+ b): using ma_data, make a boxplot of swm for every value of woodcockTotal
+ c): go back to the geom_line graph above and separate the data by grade (multiple solutions!)
+ d): more advanced: come up with a solution so that the histogram only has each subject represented 1x

# Adding error bars

mean by condition, no error bars yet

```{r}
ggplot(data = ps_data, aes(x = condition, y = correct)) + 
  stat_summary(fun.y=mean, 
               na.rm=T, 
               geom = "bar")
```

barbarplots? [cf twitter]

95% confidence interval

```{r}
ggplot(data = ps_data, aes(x = condition, y = correct)) + 
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") #fun.data, not fun.y!

```

`mean_cl_boot` is boostrapped confidence intervals, you can google what regular normal CIs would be!

both:

```{r}
ggplot(data = ps_data, aes(x = condition, y = correct)) + 
  stat_summary(fun.y = mean, na.rm=T, geom = "bar")+
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange")
```

## errors bars with violins

same as violin plot above, but now with an errorbar!

```{r}
ggplot(data = ps_data_bysubj_cond, aes(x=condition, y = mean_corr))+
  geom_violin()+
  stat_summary(fun.data=mean_cl_normal, geom = "pointrange")
```


## stack and dodge 

> protip: use `fill` with bars not colour!

bonus question: what does colour do for bars?

```{r}
ggplot(data = ma_data) + 
  geom_bar(mapping = aes(x = woodcockTotal, fill = grade), position = "fill")

ggplot(data = ma_data) + 
  geom_bar(mapping = aes(x = woodcockTotal, fill = grade), position = "stack")

ggplot(data = ma_data) + 
  geom_bar(mapping = aes(x = woodcockTotal, fill = grade), position = "dodge")
```

## **Exercise**. Task 4: error bars, and stack & dodge 
+ a): using the ps dataset, graph means for each item & add error bars
+ b): make a bargraph of the `ps_data_byitem_cond` showing the mean_corr for each condition using geom_bar 
 (hint, you'll need to use "stat=" insde your `geom_bar()` call
+ c): when would it be most appropriate to use `fill`, `stack`, or `dodge`?

## Saving your graph 

```{r}
?ggsave()
```

ggsave will save your *last* plot by default, but you can also tell it save a plot you've assigned.

```{r}
mygraph <- ggplot(data = iris)+
  geom_point(mapping = aes(x=Sepal.Length,
                           y =Petal.Width,
                           color = Species))

mygraph
ggsave("mygraph.pdf",plot = mygraph,dpi = 100)
```

even better than saving your graph: add it to your R Markdown! the awesome thing about using your `.Rmd` file is that you can render graphs there, and they get saved for you!

there are LOTS of settings you can muck with. (details here https://yihui.name/knitr/options/#plots). we'll do this back in our .rmd file

# Graph Wishes

### **Exercise.** Task 5. Split into groups for task wishes.
+ Group A: Individual datapoints + summary stats. 
+ Group B: Distribution-based Wishes. 
+ Group C: Time-course graph based wishes.

hint for group a

```{r}
ggplot(data = ma_data, aes(x= factor(year),#this just makes it treat year as a factor
                           y= arithmeticAverage, 
                           group = subid))+# this keeps the 'unit' at subid
  geom_point()+
  geom_line()+facet_wrap(~grade)+
  stat_summary(color = "red", size = 3, geom="line", fun.y=mean, aes(group =grade))
```

hint 1 for group b

```{r}
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)
ggplot(xy, aes(xvar, yvar)) + geom_point() + geom_rug(col = "darkred", alpha = 0.1)
```

further hints: [here](http://felixfan.github.io/ggplot2-cheatsheet/) and [here](https://stackoverflow.com/questions/35366499/ggplot2-how-to-combine-histogram-rug-plot-and-logistic-regression-prediction)

hint 2 for Group b

all the code for this graph appears to be here, BUT this person did not do things the tidy way!
https://micahallen.org/2018/03/15/introducing-raincloud-plots/. exercise for the reader: do his wrangling the tidy way!

but using our `ps_data_bysubj_cond` and sourcing this:

```{r}
source("https://gist.githubusercontent.com/benmarwick/2a1bb0133ff568cbe28d/raw/fb53bd97121f7f9ce947837ef1a4c65a73bffb3f/geom_flat_violin.R")

```

you should be able to make your raincloud:)

hint for group c: here's a sample dataset and a graph to get you started in the right direction

```{r}
library(feather)# this is part of tidyverse, but not auto-loaded
```

feather is a format that's convenient for various reasons

```{r}
coart<- read_feather("datasets/coart_test")
summary(coart)
```

do you know what each of these lines do? can you make errorbars?

```{r}
ggplot(subset(coart, Nonset<5000 & Nonset>-1500),
       aes(Nonset, propt, color = TrialType))+
  geom_hline(yintercept=.5)+
  ylab("proportion of target looking")+
  xlab("time from target onset")+
  geom_vline(xintercept=0)+
  stat_smooth(geom="point")+
  theme_bw(base_size=18)
```

# Extras for the curious

## adding regression line

if all we wanted to do was add a regression line, we'd just use `stat_smooth`:
note this is like the graph we did with the errorbars above, just edited a little

```{r}
ggplot(ToothGrowth, aes(x=dose, y=len, colour=supp)) + 
  stat_summary(fun.y = mean, geom = "point",  size = 4) +
  geom_point( size = 1)+
  stat_smooth(method="lm")
```

but if we want to know what the actual formula for that line is, we have to calculate some things:

first we need a linear model

```{r}
ourmodel <- lm(data = ToothGrowth, len~dose*supp)
```

if you want to know more about the results you do a summary of the model

```{r}
summary(ourmodel)
```

if you want the summary results to look prettier you tidy the model

```{r}
tidy(ourmodel)
```

in our case, we can use the results of the model to manually put in a line, but there are fancier ways to do this that are beyond the scope of this tutorial

```{r}
ggplot(ToothGrowth, aes(x=dose, y=len, colour=supp)) + 
  stat_summary(fun.y = mean, geom = "point",  size = 4) +
  geom_point( size = 1)+
  stat_smooth(method="lm")+
  annotate(x = 1, y = 30, "text", label = "y = 11.55 + 7.8 *dose + -8.26*suppVC + 3.9 * dose * suppVC")
```

note: for annotate, the x and y is where on the graph you want your text to go

if you want to check this formula, you can plug in some values:

```{r}
#11.55 + 7.8 *dose + -8.26*suppVC + 3.9 * dose * suppVC
11.55 + 7.8*0 + -8.26*0 + 3.9*0*0 # dose of 0 for oj
11.55 + 7.8*1 + -8.26*0 + 3.9*1*0# dose of 1 for oj
11.55 + 7.8*2 + -8.26*0 + 3.9*2*0# dose of 2 for oj

11.55 + 7.8*0 + -8.26*1 + 3.9*0*1# dose of 0 for vc
11.55 + 7.8*1 + -8.26*1 + 3.9*1*1# dose of 1 for vc
11.55 + 7.8*2 + -8.26*1 + 3.9*2*1# dose of 1 for vc
```

## manually specified errorbars 

From: [http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/]()

```{r}
tgc <- ToothGrowth%>%
  group_by(supp, dose)%>%
  summarise(n = n(),
            mean_len = mean(len),
            sd_len = sd(len),
            se_len = sd_len/sqrt(n),
            ci_len = qt((.95/2 +.5),
                        df= n-1)*se_len) # looking up 95%'s 2 tails in t-dist

ggplot(tgc, aes(x=dose, y=mean_len, colour=supp)) + 
  geom_errorbar(aes(ymin=mean_len - ci_len, ymax = mean_len + ci_len)) +
  geom_line() +
  geom_point()

ggplot(ToothGrowth, aes(x=dose, y=len, colour=supp)) + 
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar") +
  stat_summary(fun.y = mean, geom = "point") +
  stat_summary(fun.y = mean, geom = "line") 
```