-
Notifications
You must be signed in to change notification settings - Fork 0
/
data_manipulation_syntax.Rmd
364 lines (273 loc) · 8.3 KB
/
data_manipulation_syntax.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
---
title: "Data Manipulation Tutorial - Datafest 2019"
author: "Vlad Achimescu"
date: "May 3, 2019"
output: html_document
---
## LOAD PACKAGES
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# install.packages(c("readr","magrittr","dplyr","tidyr"))
### load packages
library(readr)
library(magrittr)
library(dplyr)
library(tidyr)
### make sure filter and select are assigned to dplly
filter = dplyr::filter
select = dplyr::select
```
## IMPORT DATA
Read data as a tibble, using read_csv or read_tsv.
Read employment dataset
```{r cars}
EMP = read_tsv("empl_rates_reg.tsv")
EMP
```
Read education&training dataset
```{r}
TRAIN = read_csv("training_participation.csv")
TRAIN
```
## CHAINING FUNCTIONS - PIPE OPERATOR
Pipe operator simplifies reading and writing chained functions.
Shortcut for pipe operator "%>%" is <Ctrl+Shift+M>.
If previous value is not used as a first argument, but later, use "." to specify the position
Example: summarise unemployment rate in 2018
```{r}
# Extract unemployment rates
unempl2018 = 100 - EMP$Y2018
summary(unempl2018)
# Base R - chaining
paste("Result:", round(quantile(log(unempl2018,2),.25, na.rm = TRUE), 1))
# Base R - separate operations
log_unempl = log(unempl2018,2)
quant_unempl = quantile(log_unempl, .25, na.rm = TRUE)
round_quant = round(quant_unempl, 1)
paste("Result:", round_quant)
# Magrittr - chaining using pipe
unempl2018 %>%
log(2) %>%
quantile(.25, na.rm = TRUE) %>%
round(1) %>%
paste("Result:",.)
# Magrittr - one line
unempl2018 %>% log(2) %>% quantile(.25, na.rm = TRUE) %>% round(1) %>% paste("Result:",.)
```
EXERCISE: Create a table of relative frequencies, sorted by descending:
```{r}
#Base R:
cbind(round(rev(sort(prop.table(table(TRAIN$SEX)))),2))
#Translate code above using piping:
```
```{r}
### additional functions (aliases) useful for piping
?magrittr::extract
```
## Slice and filter
Slice
```{r pressure, echo=FALSE}
# Base R
TRAIN[1:5,]
# dplyr
slice(TRAIN, 1:5)
# With piping
TRAIN %>% slice(1:5)
```
Subset rows / filter(logical_condition)
```{r}
### Base R - select cases from 2018 with more than 30% training
TRAIN[TRAIN$Value>30 & TRAIN$TIME == 2018,]
### dplyr
TRAIN %>% filter(Value>30, TIME == 2018)
```
Filter by strings - regular expressions / sets
```{r}
### Select all countries starting with the letter "C"
TRAIN %>% filter(grepl("^C",GEO))
### Select Slovakia and Slovenia
TRAIN %>% filter(GEO %in% c("Slovakia","Slovenia"))
```
EXERCISE: select only data from 2018 and 2019 from Romania -
total and save in new variable FILT
```{r}
#FILT = TRAIN %>% filter(insert_conditions_here)
```
## Subset variables/columns - select
Select function makes code similar to Stata
```{r}
### Base R - select first three variables
EMP[,1:3]
EMP[,c("unit", "sex", "age" )]
### dplyr
EMP %>% select(1:3)
EMP %>% select("unit", "sex", "age")
EMP %>% select(unit, sex, age)
EMP %>% select(unit:age)
```
Can also use helper functions (starts_with, one_of, contains)
```{r}
### select by skipping
EMP %>% select(geo, age, sex, Y2016:Y2018)
### unselect
EMP %>% select(-unit)
### select by variable name
EMP %>% select(starts_with("Y"))
```
EXERCISE: select from TRAIN only year, 2-letter country code, sex and value and save in new variable SEL
```{r}
#SEL = TRAIN %>% select(your_selection_here)
```
## Create new variables / mutate, transmute
```{r}
### Base R
TRAIN$low = TRAIN$Value<10
### dplyr
TRAIN %>% mutate(low = Value<10)
```
Mutate keeps all variables, transmute only keeps new variables
```{r}
### mutate
TRAIN %>% mutate(low = Value<10,
cumval = cumsum(Value),
period = ifelse(TIME<2010, "before 2010","2010 or after"))
### transmute
TRAIN %>% transmute(GEO,
low = Value<10,
cumval = cumsum(Value),
period = ifelse(TIME<2010, "before 2010","2010 or after"))
### can use window functions in : lag, cumsum, min_rank, etc. (see cheat sheet)
```
Mutate specific variables with mutate_if and mutate_at
```{r}
### Change variables that follow a certain condition
## mutate_if(condition, function)
# e.g.make all strings into factors
TRAIN %>% mutate_if(is.character, as.factor)
### Change specific variables
## mutate_at(vars(), funs())
# e.g. round all Y variables
EMP %>% mutate_at(vars(Y2007:Y2018), funs(round))
```
Mutate_all replaces sapply or tapply
```{r}
### apply same functions to all variables
## apply predefined function
TRAIN %>% mutate_all(nchar)
## apply custom function
TRAIN %>% mutate_all(function(x) paste0(x,"_suffix"))
```
EXERCISE:
Use the piping operator and dplyr functions to
select from EMP the following variables: sex, geo and Y2017 and Y2018,
filter only cases from Austria,
create new variable eployment_change as the difference between
employment rates in 2018 and 2017
and save in new data frame called NEW
```{r}
#NEW = TRAIN %>% add_code_here
```
## Aggregate data
Let us select from EMP for each gender the median training participation level
```{r}
### Base R
aggregate(Value ~ SEX, data = TRAIN, FUN = median)
### dplyr
TRAIN %>% group_by(SEX) %>%
summarise(training_median = median(Value, na.rm = TRUE))
```
Advantage of dplyr is that multiple summaries can be created in the same syntax
```{r}
### get mean, median and standard deviation for each gender
TRAIN_AGG = TRAIN %>%
group_by(SEX) %>%
summarise(tr_median = median(Value, na.rm = TRUE),
tr_mean = mean(Value, na.rm = TRUE),
tr_sd = sd(Value, na.rm = TRUE))
TRAIN_AGG
```
Can also group by multiple variables.
Example: get mean and standard deviation by gender and year
```{r}
TRAIN %>% group_by(TIME, SEX)
TRAIN_AGG2 =TRAIN %>%
group_by(TIME, SEX) %>%
summarise(tr_median = median(Value, na.rm = TRUE),
tr_mean = mean(Value, na.rm = TRUE),
tr_sd = sd(Value, na.rm = TRUE))
TRAIN_AGG2
# This is a tidy dataset!
```
Add aggregate variable to non-aggregated dataset.
Example: average training per country
```{r}
TRAIN2 = TRAIN %>%
filter(SEX == "Total") %>%
group_by(COUNTRY) %>%
mutate(avg_train = mean(Value)) %>%
ungroup() %>%
mutate(dif_from_avg = Value - avg_train)
#in last command we compute the difference in percentages from average country value
TRAIN2
```
EXERCISE:
From TRAIN dataset calculate the average training level per country,
then rank them by the aggregated value using the function arrange().
What is the country with the highest average training level?
```{r}
#TRAIN %>% group_by(add_value) %>% summarise(add_value) %>% arrange(add_value)
```
## Reshape datasets
Separate columns
In EMP, var geo contains both country and region.
```{r}
EMP2 = EMP %>% separate(geo, into = c("COUNTRY","region"), sep="_")
EMP2
```
From wide to long format
Gather all Y variables in EMP
```{r}
# give new names for key and value
EMP_LONG = EMP2 %>%
select(COUNTRY, region, sex, contains("Y")) %>%
gather(key = year,
value = perc,
-COUNTRY, -region, -sex) %>%
mutate(year = as.numeric(substr(year, 2,5)))
EMP_LONG
```
From long to wide format.
Useful to create crosstabs. For example, put gender as columns.
```{r}
# you should keep only the necessary variables
TRAIN %>% select(TIME, COUNTRY, SEX, Value) %>%
spread(key = SEX,
value = Value)
```
## Combine/merge datasets
First let us select only the important variables from both datasets
```{r}
EMP_2018 = EMP_LONG %>% filter(year == 2018, sex == "T") %>%
select(COUNTRY, region, empl_perc = perc)
TRAIN_2018 = TRAIN %>% filter(SEX == "Total", TIME == 2018) %>%
select(COUNTRY, training_perc = Value)
EMP_2018
TRAIN_2018
```
Inner join keeps values that appear in both data frames
EMP is at regional level, TRAIN at country level.
```{r}
INNER_UNITE = inner_join(EMP_2018, TRAIN_2018) %>%
mutate_if(is.character, as.factor) %>%
filter(complete.cases(.))
summary(INNER_UNITE)
INNER_UNITE
```
## finally plotting the data
```{r}
library(ggplot2)
INNER_UNITE %>%
ggplot(aes(x = training_perc, y = empl_perc, col = COUNTRY)) +
geom_text(aes(label = COUNTRY)) + geom_smooth(aes(col = NULL), alpha = 0, col = "black") +
guides(col = "none")
```