-
Notifications
You must be signed in to change notification settings - Fork 0
/
DataCamp.Course_008_Introduction_to_Function_Writing_in_R
1436 lines (1046 loc) · 40.3 KB
/
DataCamp.Course_008_Introduction_to_Function_Writing_in_R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
######################################################################
######################################################################
######################################################################
# COURSE 008_Introduction to Function Writing in R
######################################################################
######################################################################
######################################################################
######## How to write a function (Module 01-008)
Calling functions
One way to make your code more readable is to be careful about the order you pass arguments when you call functions, and whether you pass the arguments by position or by name.
gold_medals, a numeric vector of the number of gold medals won by each country in the 2016 Summer Olympics, is provided.
For convenience, the arguments of median() and rank() are displayed using args(). Setting rank()'s na.last argument to "keep" means "keep the rank of NA values as NA".
Best practice for calling functions is to include them in the order shown by args(), and to only name common arguments.
# Note the arguments to rank()
args(rank)
#args sirve para mirar el argumento de una función
# Rewrite this function call, following best practices
rank(-gold_medals, na.last = "keep", ties.method = "min")
###Basics
my_fun <- function(arg1, arg2) {
# Do something
}
1. Make a template
2. Paste in the script
3. Choose the arguments
4. Replace specific values with argument names
5. Make specific variable names more general
6. Remove a final assignment
#####
Your first function: tossing a coin
Time to write your first function! It's a really good idea when writing functions to start simple. You can always make a function more complicated later if it's really necessary, so let's not worry about arguments for now.
coin_sides <- c("head", "tail")
# Sample from coin_sides once
sample(coin_sides, size = 1)
# Your functions, from previous steps
toss_coin <- function() {
coin_sides <- c("head", "tail")
sample(coin_sides, 1)
}
# Call your function
toss_coin()
Inputs to functions
Most functions require some sort of input to determine what to compute. The inputs to functions are called arguments. You specify them inside the parentheses after the word "function."
coin_sides <- c("head", "tail")
n_flips <- 10
# Sample from coin_sides n_flips times with replacement
sample(coin_sides, size = n_flips, replace = TRUE)
# Update the function to return n coin tosses
toss_coin <- function(n_flips) {
coin_sides <- c("head", "tail")
sample(coin_sides, size = n_flips, replace = TRUE)
}
# Generate 10 coin tosses
toss_coin(10)
Multiple inputs to functions
If a function should have more than one argument, list them in the function signature, separated by commas.
To solve this exercise, you need to know how to specify sampling weights to sample(). Set the prob argument to a numeric vector with the same length as x. Each value of prob is the probability of sampling the corresponding element of x, so their values add up to one. In the following example, each sample has a 20% chance of "bat", a 30% chance of "cat" and a 50% chance of "rat".
sample(c("bat", "cat", "rat"), 10, replace = TRUE, prob = c(0.2, 0.3, 0.5))
coin_sides <- c("head", "tail")
n_flips <- 10
p_head <- 0.8
# Define a vector of weights
weights <- c(p_head, 1 - p_head)
# Update so that heads are sampled with prob p_head
sample(coin_sides, n_flips, replace = TRUE, prob = weights)
# Update the function so heads have probability p_head
toss_coin <- function(n_flips, p_head) {
coin_sides <- c("head", "tail")
# Define a vector of weights
weights <- c(p_head, 1 - p_head)
# Modify the sampling to be weighted
sample(coin_sides, n_flips, replace = TRUE, prob = weights)
}
# Generate 10 coin tosses
toss_coin(10, 0.8)
Renaming GLM
R's generalized linear regression function, glm(), suffers the same usability problems as lm(): its name is an acronym, and its formula and data arguments are in the wrong order.
To solve this exercise, you need to know two things about generalized linear regression:
glm() formulas are specified like lm() formulas: response is on the left, and explanatory variables are added on the right.
To model count data, set glm()'s family argument to poisson, making it a Poisson regression.
Here's you'll use data on the number of yearly visits to Snake River at Jackson Hole, Wyoming, snake_river_visits.
snake_river_visits <- readRDS("~/snake_river_visits.rds")
# Run a generalized linear regression
glm(
# Model no. of visits vs. gender, income, travel
n_visits ~ gender + income + travel,
# Use the snake_river_visits dataset
data = snake_river_visits,
# Make it a Poisson regression
family = poisson
)
# Write a function to run a Poisson regression
run_poisson_regression <- function(data, formula) {
glm(formula, data, family = poisson)
}
# From previous step
run_poisson_regression <- function(data, formula) {
glm(formula, data, family = poisson)
}
# Re-run the Poisson regression, using your function
model <- snake_river_visits %>%
run_poisson_regression(n_visits ~ gender + income + travel)
# Run this to see the predictions
snake_river_explanatory %>%
mutate(predicted_n_visits = predict(model, ., type = "response"))%>%
arrange(desc(predicted_n_visits))
######################################################################
######################################################################
######################################################################
######## All about arguments (Module 02-008)
############### 1. defaults arguments
Numeric defaults
cut_by_quantile() converts a numeric vector into a categorical variable where quantiles define the cut points. This is a useful function, but at the moment you have to specify five arguments to make it work. This is too much thinking and typing.
By specifying default arguments, you can make it easier to use. Let's start with n, which specifies how many categories to cut x into.
A numeric vector of the number of visits to Snake River is provided as n_visits.
# Set the default for n to 5
cut_by_quantile <- function(x, n = 5, na.rm, labels, interval_type) {
probs <- seq(0, 1, length.out = n + 1)
qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}
# Remove the n argument from the call
cut_by_quantile(
n_visits,
na.rm = FALSE,
labels = c("very low", "low", "medium", "high", "very high"),
interval_type = "(lo, hi]"
)
Logical defaults
cut_by_quantile() is now slightly easier to use, but you still always have to specify the na.rm argument. This removes missing values - it behaves the same as the na.rm argument to mean() or sd().
Where functions have an argument for removing missing values, the best practice is to not remove them by default (in case you hadn't spotted that you had missing values). That means that the default for na.rm should be FALSE.
# Set the default for na.rm to FALSE
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels, interval_type) {
probs <- seq(0, 1, length.out = n + 1)
qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}
# Remove the na.rm argument from the call
cut_by_quantile(
n_visits,
labels = c("very low", "low", "medium", "high", "very high"),
interval_type = "(lo, hi]"
)
NULL defaults
The cut() function used by cut_by_quantile() can automatically provide sensible labels for each category. The code to generate these labels is pretty complicated, so rather than appearing in the function signature directly, its labels argument defaults to NULL, and the calculation details are shown on the ?cut help page.
# Set the default for labels to NULL
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels = NULL, interval_type) {
probs <- seq(0, 1, length.out = n + 1)
qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}
# Remove the labels argument from the call
cut_by_quantile(
n_visits,
interval_type = "(lo, hi]"
)
Categorical defaults
When cutting up a numeric vector, you need to worry about what happens if a value lands exactly on a boundary. You can either put this value into a category of the lower interval or the higher interval. That is, you can choose your intervals to include values at the top boundary but not the bottom (in mathematical terminology, "open on the left, closed on the right", or (lo, hi]). Or you can choose the opposite ("closed on the left, open on the right", or [lo, hi)). cut_by_quantile() should allow these two choices.
The pattern for categorical defaults is:
function(cat_arg = c("choice1", "choice2")) {
cat_arg <- match.arg(cat_arg)
}
Free hint: In the console, type head(rank) to see the start of rank()'s definition, and look at the ties.method argument.
# Set the categories for interval_type to "(lo, hi]" and "[lo, hi)"
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels = NULL,
interval_type = c("(lo, hi]", "[lo, hi)")) {
# Match the interval_type argument
interval_type <- match.arg(interval_type)
probs <- seq(0, 1, length.out = n + 1)
qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}
# Remove the interval_type argument from the call
cut_by_quantile(n_visits)
Clever categorical default setting! As a bonus, match.arg() handles throwing an error if the user types a value that wasn't specified.
###############################2. Passing arguments between functions
Harmonic mean
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocal of the data. That is
harmonic_mean(x)=1/arithmetic_mean(1/x)
The harmonic mean is often used to average ratio data. You'll be using it on the price/earnings ratio of stocks in the Standard and Poor's 500 index, provided as std_and_poor500. Price/earnings ratio is a measure of how expensive a stock is.
The dplyr package is loaded.
STEP 01
# Look at the Standard and Poor 500 data
glimpse(std_and_poor500)
# Write a function to calculate the reciprocal
get_reciprocal <- function(x) {
1/x
}
get_reciprocal
STEP 02
# From previous step
get_reciprocal <- function(x) {
1 / x
}
# Write a function to calculate the harmonic mean
calc_harmonic_mean <- function(x) {
x %>%
get_reciprocal %>%
mean %>%
get_reciprocal
}
STEP 03
# From previous steps
get_reciprocal <- function(x) {
1 / x
}
calc_harmonic_mean <- function(x) {
x %>%
get_reciprocal() %>%
mean() %>%
get_reciprocal()
}
std_and_poor500 %>%
# Group by sector
group_by(sector) %>%
# Summarize, calculating harmonic mean of P/E ratio
summarize(hmean_pe_ratio = calc_harmonic_mean(pe_ratio))
Dealing with missing values
In the last exercise, many sectors had an NA value for the harmonic mean. It would be useful for your function to be able to remove missing values before calculating.
Rather than writing your own code for this, you can outsource this functionality to mean().
The dplyr package is loaded.
# Add an na.rm arg with a default, and pass it to mean()
calc_harmonic_mean <- function(x, na.rm = FALSE) {
x %>%
get_reciprocal() %>%
mean(na.rm = na.rm) %>%
get_reciprocal()
}
or
# Add an na.rm arg with a default, and pass it to mean()
calc_harmonic_mean <- function(x, ...) {
x %>%
get_reciprocal() %>%
mean(...) %>%
get_reciprocal()
}
STEP 02. Cambiando el argymento NA true
# From previous step
calc_harmonic_mean <- function(x, na.rm = FALSE) {
x %>%
get_reciprocal() %>%
mean(na.rm = na.rm) %>%
get_reciprocal()
}
std_and_poor500 %>%
# Group by sector
group_by(sector) %>%
# Summarize, calculating harmonic mean of P/E ratio
summarize(hmean_pe_ratio = calc_harmonic_mean(pe_ratio, na.rm = TRUE))
Passing arguments with ...
Rather than explicitly giving calc_harmonic_mean() and na.rm argument, you can use ... to simply "pass other arguments" to mean().
calc_harmonic_mean <- function(x, ...) {
x %>%
get_reciprocal() %>%
mean(...) %>%
get_reciprocal()
}
std_and_poor500 %>%
# Group by sector
group_by(sector) %>%
# Summarize, calculating harmonic mean of P/E ratio
summarize(hmean_pe_ratio = calc_harmonic_mean(pe_ratio, na.rm = TRUE))
The dplyr package is loaded.
#####
______________________________
x %>%
log() %>%
mean() %>%
exp()
set this into a function, you have tu pass the argument
calc_geometric_mean <- function(x, na.rm = FALSE) {
x %>%
log() %>%
mean(na.rm = na.rm) %>%
exp()
}
____
calc_geometric_mean <- function(x, ...) {
x %>%
log() %>%
mean(...) %>%
exp()
}
__________________________________________
checking for arguments
calc_geometric_mean <- function(x, ...) {
if(!is.numeric(x)) {
stop("x is not of class 'numeric'; it has class '", class(x), "'.")
}
x %>%
log() %>%
mean(...) %>%
exp()
}
______
Ckecking types of inputs
assert package
assert_is_numeric()
assert_is_character()
is_data.frame()
...
is_two_sided_formula()
is_tskernel()
_________________
checking for arguments
calc_geometric_mean <- function(x, ...) {
assert_is_numeric(x)
assert_all_are_positive(x)
x %>%
log() %>%
mean(...) %>%
exp()
}
__________________
custom checks
checking for arguments
calc_geometric_mean <- function(x, ...) {
assert_is_numeric(x)
if(any(is_non_positive(x), na.rm = TRUE)) {
stop("x contains non-positive values, so the geometric mean makes no sense.")
}
x %>%
log() %>%
mean(...) %>%
exp()
}
_____________________________________
######
Throwing errors with bad arguments
If a user provides a bad input to a function, the best course of action is to throw an error letting them know. The two rules are
Throw the error message as soon as you realize there is a problem (typically at the start of the function).
Make the error message easily understandable.
You can use the assert_*() functions from assertive to check inputs and throw errors when they fail.
library(assertive)
calc_harmonic_mean <- function(x, na.rm = FALSE) {
# Assert that x is numeric
assert_is_numeric(x)
x %>%
get_reciprocal() %>%
mean(na.rm = na.rm) %>%
get_reciprocal()
}
# See what happens when you pass it strings
calc_harmonic_mean(std_and_poor500$sector)
Custom error logic
Sometimes the assert_*() functions in assertive don't give the most informative error message. For example, the assertions that check if a number is in a numeric range will tell the user that a value is out of range, but the won't say why that's a problem. In that case, you can use the is_*() functions in conjunction with messages, warnings, or errors to define custom feedback.
The harmonic mean only makes sense when x has all positive values. (Try calculating the harmonic mean of one and minus one to see why.) Make sure your users know this!
calc_harmonic_mean <- function(x, na.rm = FALSE) {
assert_is_numeric(x)
# Check if any values of x are non-positive
if(any(is_non_positive(x), na.rm = TRUE)) {
# Throw an error
stop("x contains non-positive values, so the harmonic mean makes no sense.")
}
x %>%
get_reciprocal() %>%
mean(na.rm = na.rm) %>%
get_reciprocal()
}
# See what happens when you pass it negative numbers
calc_harmonic_mean(std_and_poor500$pe_ratio - 20)
Fixing function arguments
The harmonic mean function is almost complete. However, you still need to provide some checks on the na.rm argument. This time, rather than throwing errors when the input is in an incorrect form, you are going to try to fix it.
na.rm should be a logical vector with one element (that is, TRUE, or FALSE).
The assertive package is loaded for you.
# Update the function definition to fix the na.rm argument
calc_harmonic_mean <- function(x, na.rm = FALSE) {
assert_is_numeric(x)
if(any(is_non_positive(x), na.rm = TRUE)) {
stop("x contains non-positive values, so the harmonic mean makes no sense.")
}
# Use the first value of na.rm, and coerce to logical
na.rm <- coerce_to(use_first(na.rm), target_class = "logical")
x %>%
get_reciprocal() %>%
mean(na.rm = na.rm) %>%
get_reciprocal()
}
# See what happens when you pass it malformed na.rm
calc_harmonic_mean(std_and_poor500$pe_ratio, na.rm = 1:5)
######################################################################
######################################################################
######################################################################
######## Return values and scope (Module 03-008)
Returning values from functions
Returning values from functions
Reason for returning early
1. You already know the answer
2. The input is an edge case.
#Whitout
simple_sum <- function(x) {
total <- 0
for(value in x) {
total <- total + value
}
total
}
#With
simple_sum <- function(x) {
if(anyNA(x)) {
return(NA)
}
total <- 0
for(value in x) {
total <- total + value
}
total
}
####
calc_geometric_mean <- function(x, ...) {
assert_is_numeric(x)
if(any(is_non_positive(x), na.rm = TRUE)) {
warning("x contains non-positive values, so the geometric mean makes no sense.")
return(NaN)
}
na.rm <- coerce_to(use_first(na.rm), target_class = "logical")
x %>%
log() %>%
mean(...) %>%
exp()
}
##### Hiding the return value ---> useful for plot functions
#With
simple_sum <- function(x) {
if(anyNA(x)) {
return(NA)
}
total <- 0
for(value in x) {
total <- total + value
}
invisible(total)
}
Returning early
Sometimes, you don't need to run through the whole body of a function to get the answer. In that case you can return early from that function using return().
To check if x is divisible by n, you can use is_divisible_by(x, n) from assertive.
Alternatively, use the modulo operator, %%. x %% n gives the remainder when dividing x by n, so x %% n == 0 determines whether x is divisible by n. Try 1:10 %% 3 == 0 in the console.
To solve this exercise, you need to know that a leap year is every 400th year (like the year 2000) or every 4th year that isn't a century (like 1904 but not 1900 or 1905).
assertive is loaded.
is_leap_year <- function(year) {
# If year is div. by 400 return TRUE
if(year %% 400 == 0) {
return(TRUE)
}
# If year is div. by 100 return FALSE
if(year %% 100 == 0) {
return(FALSE)
}
# If year is div. by 4 return TRUE
if(is_divisible_by(year, 4)) {
return(TRUE)
}
# Otherwise return FALSE
else {
return(FALSE)
}
}
Returning invisibly
When the main purpose of a function is to generate output, like drawing a plot or printing something in the console, you may not want a return value to be printed as well. In that case, the value should be invisibly returned.
The base R plot function returns NULL, since its main purpose is to draw a plot. This isn't helpful if you want to use it in piped code: instead it should invisibly return the plot data to be piped on to the next step.
Recall that plot() has a formula interface (though the arguments are the wrong way round, like lm(), because the detail argument, formula, comes before the data argument.).
plot(y ~ x, data = data)
STEP 1
# Using cars, draw a scatter plot of dist vs. speed
plt_dist_vs_speed <- plot(dist ~ speed, data = cars)
# Oh no! The plot object is NULL
plt_dist_vs_speed
STEP 2
# Define a scatter plot fn with data and formula args
pipeable_plot <- function(data, formula) {
# Call plot() with the formula interface
plot(formula, data = data)
# Invisibly return the input dataset
invisible(data)
}
# Draw the scatter plot of dist vs. speed again
plt_dist_vs_speed <- cars %>%
pipeable_plot(formula = dist ~ speed)
# Now the plot object has a value
plt_dist_vs_speed
######################################################################
Returning multiple values from functions
R.version.string
Sys.info()[c("sysname", "release")]
loadedNamespaces()
________________
#ERROR
session <- function() {
r_version <- R.version.string,
operating_system <- Sys.info()[c("sysname", "release")],
loaded_pkgs <- loadedNamespaces()
# ???
}
#Defining session()
session <- function() {
list(
r_version <- R.version.string,
operating_system <- Sys.info()[c("sysname", "release")],
loaded_pkgs <- loadedNamespaces()
)
}
session()
_______
zeallot package
install.package(zeallot)
library(zeallot)
c(vrsn, os, pkgs) %<-% session()
### attributes
month_no <- setNames(1:12, month.abb)
month_no
attributes(month_no)
attr(month_no, "names")
attr(month_no, "names") <- month.name
month_no
#names change to be the full month names
Ex: dataframe
orange_trees
attributes(orange_trees)
library(dplyr)
orange_trees %>%
group_by(Tree) %>%
attributes()
___________________
excersice
broom package
install.package(broom)
library(broom)
Model objects are converted into 3 data frames
function/level/example
glance()/model/degrees of freedom
tidy()/coefficient/p-values
augment()/observation/residuals
Returning many things
Functions can only return one value. If you want to return multiple things, then you can store them all in a list.
If users want to have the list items as separate variables, they can assign each list element to its own variable using zeallot's multi-assignment operator, %<-%.
glance(), tidy(), and augment() each take the model object as their only argument.
The Poisson regression model of Snake River visits is available as model. broom and zeallot are loaded.
library(zeallot)
library(broom)
STEP 01
# Look at the structure of model (it's a mess!)
str(model)
# Use broom tools to get a list of 3 data frames
list(
# Get model-level values
model = glance(model),
# Get coefficient-level values
coefficients = tidy(model),
# Get observation-level values
observations = augment(model)
)
STEP 02
# Wrap this code into a function, groom_model
groom_model <- function(model) {
list(
model = glance(model),
coefficients = tidy(model),
observations = augment(model)
)
}
formals(groom_model)
# From previous step
groom_model <- function(model) {
list(
model = glance(model),
coefficients = tidy(model),
observations = augment(model)
)
}
# Call groom_model on model, assigning to 3 variables
c(mdl, cff, obs) %<-% groom_model(model)
# See these individual variables
mdl; cff; obs
Returning metadata
Sometimes you want the return multiple things from a function, but you want the result to have a particular class (for example, a data frame or a numeric vector), so returning a list isn't appropriate. This is common when you have a result plus metadata about the result. (Metadata is "data about the data". For example, it could be the file a dataset was loaded from, or the username of the person who created the variable, or the number of iterations for an algorithm to converge.)
In that case, you can store the metadata in attributes. Recall the syntax for assigning attributes is as follows.
attr(object, "attribute_name") <- attribute_value
pipeable_plot <- function(data, formula) {
plot(formula, data)
# Add a "formula" attribute to data
attr(data, "formula") <- formula
invisible(data)
}
# From previous exercise
plt_dist_vs_speed <- cars %>%
pipeable_plot(dist ~ speed)
# Examine the structure of the result
str(plt_dist_vs_speed)
################################################################################################################################################################################################################################################
Environments
Environments are like an special list
#This is a list
datacamp_lst <- list(
name = "DataCamp"
founding_year = 2003
website = "https://www.datacamp.com"
)
ls.str(datacamp_lst)
#Let's convert the list into an enviroment
datacamp_env <- list2env(datacamp_lst)
ls.str(datacamp_env)
#Enviroment has a parents, like matroska dolls
parent <- parent.env(datacamp_env)
environmentName(parent)
..."R_GlobalEnv"
grandparent <- parent.env(parent)
environmentName(grandparent)
..."package:stats"
search()
###ex
datacamp_lst <- list(
name = "DataCamp"
website = "https://www.datacamp.com"
)
datacamp_env <- list2env(datacamp_lst)
founding_year <- 2013
exists("founding_year", envir = datacamp_env)
...TRUE
if the envieroment doesn't have the question{
then will ask the parent, then the grandparent
and so.. util it find it
if you don't want "exist" function to be
so greedy you put inherits = FALSE
exists("founding_year", envir = datacamp_env, inherits = FALSE)
if the envieroment doesn't have the question{
then will ask the parent, then the grandparent
and so.. util it find it
if you don't want "exist" function to be
so greedy you put inherits = FALSE
exist("founding_year", envir = datacamp_env, inherits = FALSE)
####
Creating and exploring environments
Environments are used to store other variables. Mostly, you can think of them as lists, but there's an important extra property that is relevant to writing functions. Every environment has a parent environment (except the empty environment, at the root of the environment tree). This determines which variables R know about at different places in your code.
Facts about the Republic of South Africa are contained in capitals, national_parks, and population.
STEP 01
# Add capitals, national_parks, & population to a named list
rsa_lst <- list(
capitals = capitals,
national_parks = national_parks,
population = population
)
# List the structure of each element of rsa_lst
ls.str(rsa_lst)
STEP 02
# From previous step
rsa_lst <- list(
capitals = capitals,
national_parks = national_parks,
population = population
)
# Convert the list to an environment
rsa_env <- list2env(rsa_lst)
# List the structure of each variable
ls.str(rsa_env)
STEP 03
# From previous steps
rsa_lst <- list(
capitals = capitals,
national_parks = national_parks,
population = population
)
rsa_env <- list2env(rsa_lst)
# Find the parent environment of rsa_env
parent <- parent.env(rsa_env)
# Print its name
environmentName(parent)
Do variables exist?
If R cannot find a variable in the current environment, it will look in the parent environment, then the grandparent environment, and so on until it finds it.
rsa_env has been modified so it includes capitals and national_parks, but not population.
# Compare the contents of the global environment and rsa_env
ls.str(globalenv())
ls.str(rsa_env)
# Does population exist in rsa_env?
exists("population", envir = rsa_env)
# Does population exist in rsa_env, ignoring inheritance?
exists("population", envir = rsa_env, inherits = FALSE)
########################
########################
########################
########################
#Ex
x_times_7 <- function(x) {
x * y
}
x_times_y(10)
!!!ERROR NO Y arg
# we defined y outside the function
x_times_7 <- function(x) {
x * y
}
y <- 4
x_times_y(10)
40
# when it doesn't found the y in the
function env, it look in the parent env
#continue ex
print(x)
!!!Error, x it search in parentenv, so x is a child,
and ypu can look inside the funcitoon env from putside
#continue ex
x_times_7 <- function(x) {
y <- 6
x + y
}
y <- 4
x_times_y(10)
16
#continue ex
x_times_7 <- function(x) {
x <- 9
y <- 6
x + y
}
y <- 4
x_times_y(10)
15
######################################################################
######################################################################
######################################################################
######## Case study on grain yields (Module 04-008)
magrittr's pipeable operators replacement
operator // Functional alternative
x * y // x %>% multiply_by(y)
x ^ y // x %>% raise_to_power(y)
x[y] // x %>% extract(y)
magrittr packages
install.packages(magrittr)
library(magrittr)
Converting areas to metric 1
In this chapter, you'll be working with grain yield data from the United States Department of Agriculture, National Agricultural Statistics Service. Unfortunately, they report all areas in acres. So, the first thing you need to do is write some utility functions to convert areas in acres to areas in hectares.
To solve this exercise, you need to know the following: