data_manipulation_syntax.Rmd

---
title: "Data Manipulation Tutorial - Datafest 2019"
author: "Vlad Achimescu"
date: "May 3, 2019"
output: html_document
---

## LOAD PACKAGES

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# install.packages(c("readr","magrittr","dplyr","tidyr"))
### load packages
library(readr)
library(magrittr)
library(dplyr)
library(tidyr)
### make sure filter and select are assigned to dplly
filter = dplyr::filter
select = dplyr::select
```


## IMPORT DATA

Read data as a tibble, using read_csv or read_tsv.

Read employment dataset
```{r cars}
EMP = read_tsv("empl_rates_reg.tsv")
EMP
```

Read education&training dataset
```{r}
TRAIN = read_csv("training_participation.csv")
TRAIN
```


## CHAINING FUNCTIONS - PIPE OPERATOR

Pipe operator simplifies reading and writing chained functions.
Shortcut for pipe operator "%>%" is <Ctrl+Shift+M>.
If previous value is not used as a first argument, but later, use "." to specify the position

Example: summarise unemployment rate in 2018

```{r}
# Extract unemployment rates
unempl2018  = 100 - EMP$Y2018
summary(unempl2018)

# Base R - chaining
paste("Result:", round(quantile(log(unempl2018,2),.25, na.rm = TRUE), 1))
      
# Base R - separate operations
log_unempl   = log(unempl2018,2)
quant_unempl = quantile(log_unempl, .25, na.rm = TRUE)
round_quant = round(quant_unempl, 1)
paste("Result:", round_quant)

# Magrittr - chaining using pipe
unempl2018 %>% 
  log(2) %>% 
  quantile(.25, na.rm = TRUE) %>% 
  round(1) %>% 
  paste("Result:",.)

# Magrittr - one line
unempl2018 %>% log(2) %>% quantile(.25, na.rm = TRUE) %>% round(1) %>% paste("Result:",.)
```

EXERCISE: Create a table of relative frequencies, sorted by descending:
```{r}
#Base R:
cbind(round(rev(sort(prop.table(table(TRAIN$SEX)))),2))

#Translate code above using piping:

```

```{r}
### additional functions (aliases) useful for piping
?magrittr::extract
```


## Slice and filter

Slice

```{r pressure, echo=FALSE}
# Base R
TRAIN[1:5,]

# dplyr
slice(TRAIN, 1:5)

# With piping
TRAIN %>% slice(1:5)
```


Subset rows / filter(logical_condition)

```{r}
### Base R - select cases from 2018 with more than 30% training
TRAIN[TRAIN$Value>30 & TRAIN$TIME == 2018,]

### dplyr
TRAIN %>% filter(Value>30, TIME == 2018)
```

Filter by strings - regular expressions / sets

```{r}
### Select all countries starting with the letter "C"
TRAIN %>% filter(grepl("^C",GEO))

### Select Slovakia and Slovenia
TRAIN %>% filter(GEO %in% c("Slovakia","Slovenia"))
```

EXERCISE: select only data from 2018 and 2019 from Romania - 
total and save in new variable FILT
```{r}
#FILT = TRAIN %>% filter(insert_conditions_here)
```

## Subset variables/columns - select

Select function makes code similar to Stata

```{r}
### Base R - select first three variables
EMP[,1:3]
EMP[,c("unit", "sex", "age" )]

### dplyr
EMP %>% select(1:3)
EMP %>% select("unit", "sex", "age")
EMP %>% select(unit, sex, age)
EMP %>% select(unit:age)
```

Can also use helper functions (starts_with, one_of, contains)

```{r}
### select by skipping
EMP %>% select(geo, age, sex, Y2016:Y2018)

### unselect
EMP %>% select(-unit)

### select by variable name
EMP %>% select(starts_with("Y"))
```

EXERCISE: select from TRAIN only year, 2-letter country code, sex and value and save in new variable SEL
```{r}
#SEL = TRAIN %>% select(your_selection_here)
```


## Create new variables / mutate, transmute

```{r}
### Base R
TRAIN$low = TRAIN$Value<10

### dplyr
TRAIN %>% mutate(low = Value<10)
```

Mutate keeps all variables, transmute only keeps new variables

```{r}
### mutate
TRAIN %>% mutate(low = Value<10,
                 cumval = cumsum(Value),
                 period = ifelse(TIME<2010, "before 2010","2010 or after"))

### transmute
TRAIN %>% transmute(GEO,
                    low = Value<10,
                    cumval = cumsum(Value),
                    period = ifelse(TIME<2010, "before 2010","2010 or after"))

### can use window functions in : lag, cumsum, min_rank, etc. (see cheat sheet)
```


Mutate specific variables with mutate_if and mutate_at

```{r}
### Change variables that follow a certain condition
## mutate_if(condition, function)
# e.g.make all strings into factors
TRAIN %>% mutate_if(is.character, as.factor)

### Change specific variables
## mutate_at(vars(), funs())
# e.g. round all Y variables
EMP %>% mutate_at(vars(Y2007:Y2018), funs(round))
```


Mutate_all replaces sapply or tapply

```{r}
### apply same functions to all variables

## apply predefined function
TRAIN %>% mutate_all(nchar)

## apply custom function
TRAIN %>% mutate_all(function(x) paste0(x,"_suffix"))
```


EXERCISE: 
Use the piping operator and dplyr functions to
select from EMP the following variables: sex, geo and Y2017 and Y2018,
filter only cases from Austria, 
create new variable eployment_change as the difference between 
employment rates in 2018 and 2017
and save in new data frame called NEW
```{r}
#NEW = TRAIN %>% add_code_here
```


## Aggregate data

Let us select from EMP for each gender the median training participation level

```{r}
### Base R 
aggregate(Value ~ SEX, data = TRAIN, FUN = median)

### dplyr 
TRAIN  %>% group_by(SEX) %>% 
           summarise(training_median = median(Value, na.rm = TRUE))
```

Advantage of dplyr is that multiple summaries can be created in the same syntax

```{r}
### get mean, median and standard deviation for each gender
TRAIN_AGG = TRAIN  %>% 
            group_by(SEX) %>% 
            summarise(tr_median = median(Value, na.rm = TRUE),
                      tr_mean = mean(Value, na.rm = TRUE),
                      tr_sd = sd(Value, na.rm = TRUE))
TRAIN_AGG
```

Can also group by multiple variables.
Example: get mean and standard deviation by gender and year

```{r}
TRAIN  %>% group_by(TIME, SEX)
TRAIN_AGG2 =TRAIN  %>% 
            group_by(TIME, SEX) %>% 
            summarise(tr_median = median(Value, na.rm = TRUE),
                      tr_mean = mean(Value, na.rm = TRUE),
                      tr_sd = sd(Value, na.rm = TRUE))
TRAIN_AGG2
# This is a tidy dataset!
```

Add aggregate variable to non-aggregated dataset.
Example: average training per country

```{r}
TRAIN2 = TRAIN %>% 
         filter(SEX == "Total") %>% 
         group_by(COUNTRY) %>% 
         mutate(avg_train = mean(Value)) %>% 
         ungroup() %>% 
         mutate(dif_from_avg = Value - avg_train) 
#in last command we compute the difference in percentages from average country value
TRAIN2
```

EXERCISE: 
From TRAIN dataset calculate the average training level per country, 
then rank them by the aggregated value using the function arrange().
What is the country with the highest average training level?
```{r}
#TRAIN %>% group_by(add_value) %>% summarise(add_value) %>% arrange(add_value)
```


## Reshape datasets

Separate columns
In EMP, var geo contains both country and region.

```{r}
EMP2 = EMP %>% separate(geo, into = c("COUNTRY","region"), sep="_")
EMP2
```

From wide to long format
Gather all Y variables in EMP

```{r}
# give new names for key and value
EMP_LONG = EMP2 %>% 
           select(COUNTRY, region, sex, contains("Y")) %>% 
           gather(key = year,
                  value = perc,
                  -COUNTRY, -region, -sex) %>% 
           mutate(year = as.numeric(substr(year, 2,5)))
EMP_LONG
```

From long to wide format.
Useful to create crosstabs. For example, put gender as columns. 

```{r}
# you should keep only the necessary variables
TRAIN %>% select(TIME, COUNTRY, SEX, Value) %>% 
          spread(key = SEX,
                 value = Value)
```

## Combine/merge datasets

First let us select only the important variables from both datasets

```{r}
EMP_2018 = EMP_LONG %>% filter(year == 2018, sex == "T") %>% 
                        select(COUNTRY, region, empl_perc = perc)
TRAIN_2018 = TRAIN %>% filter(SEX == "Total", TIME == 2018) %>% 
                       select(COUNTRY, training_perc = Value)
EMP_2018 
TRAIN_2018
```


Inner join keeps values that appear in both data frames
EMP is at regional level, TRAIN at country level.
```{r}
INNER_UNITE = inner_join(EMP_2018, TRAIN_2018) %>% 
              mutate_if(is.character, as.factor) %>% 
              filter(complete.cases(.))
summary(INNER_UNITE)
INNER_UNITE
```


## finally plotting the data

```{r}
library(ggplot2)
INNER_UNITE %>% 
  ggplot(aes(x = training_perc, y = empl_perc, col = COUNTRY)) + 
    geom_text(aes(label = COUNTRY)) + geom_smooth(aes(col = NULL), alpha = 0, col = "black") +
    guides(col = "none") 
```