# Learning Check Solutions {#appendixD}
<!-- Albert will make sure the exercises here match up with exercises in the
reordering of the book. -->
```{r setup_lc_solutions, include=FALSE, purl=FALSE}
knitr::opts_chunk$set(tidy = FALSE, out.width = '\\textwidth')
# This bit of code works around a bug in asis blocks, which we use to show/hide LC solutions that are written as markdown text. In theory, it shouldn't be necessary for knitr versions >= 1.11.6, but I've found I still need it for everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
  if (options$echo && options$eval) knit_child(text = options$code)
})
```
## Chapter 1 Solutions
```{r, include=FALSE, purl=FALSE}
chap <- 1
lc <- 0
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('2-1', '2-2'), including the null vector c('') to show no solutions.
solutions_shown <- c('1-1', '1-2', '1-3')
# solutions_shown <- c('')
show_solutions <- function(section){return(any(solutions_shown == "ALL") || section %in% solutions_shown)}
```
```{r message=FALSE}
library(dplyr)
library(ggplot2)
library(nycflights13)
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Repeat the above installing steps, but for the `dplyr`, `nycflights13`, and `knitr` packages. This will install the earlier mentioned `dplyr` package, the `nycflights13` package containing data on all domestic flights leaving a NYC airport in 2013, and the `knitr` package for writing reports in R.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** "Load" the `dplyr`, `nycflights13`, and `knitr` packages as well by repeating the above steps.
**Solution**: If the following code runs with no errors, you've succeeded!
```{r, eval=FALSE}
library(dplyr)
library(nycflights13)
library(knitr)
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does any *ONE* row in this `flights` dataset refer to?
- A. Data on an airline
- B. Data on a flight
- C. Data on an airport
- D. Data on multiple flights
**Solution**: This is data on a flight. Not a flight path! Example:
* a flight path would be United 1545 to Houston
* a flight would be United 1545 to Houston at a specific date/time. For example: 2013/1/1 at 5:15am.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some examples in this dataset of **categorical** variables? What makes them different than **quantitative** variables?
**Solution**: Hint: Type `?flights` in the console to see what all the variables mean!
* Categorical:
    + `carrier` the company
    + `dest` the destination
    + `flight` the flight number. Even though this is a number, it's simply a label. For example, United 1545 is not less than United 1714.
* Quantitative:
    + `distance` the distance in miles
    + `time_hour` time
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What properties of the observational unit do each of `lat`, `lon`, `alt`, `tz`, `dst`, and `tzone` describe for the `airports` data frame? Note that you may want to use `?airports` to get more information.
**Solution**: `lat` and `lon` represent the airport's geographic coordinates, `alt` is the altitude of the airport above sea level (run `airports %>% filter(faa == "DEN")` to see the altitude of Denver International Airport), `tz` is the time zone offset with respect to GMT in London, UK, `dst` is the daylight savings time zone, and `tzone` is the time zone label.
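The lookup mentioned in the solution can be sketched as follows, assuming the `dplyr` and `nycflights13` packages are installed. The `select()` step and the name `den` are our own additions, used only to keep the columns discussed here.

```r
library(dplyr)
library(nycflights13)

# Look up Denver International Airport by its FAA code and keep
# only the columns discussed in this learning check
den <- airports %>%
  filter(faa == "DEN") %>%
  select(faa, name, lat, lon, alt, tz, dst, tzone)
den
```

The `alt` column of the returned row gives the altitude referred to above.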
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy dataset that matches these conditions.
**Solution**:
* In the `weather` example in LC2.8, the combination of `origin`, `year`, `month`, `day`, `hour` are identification variables as they identify the observation in question.
* Anything else pertains to observations: `temp`, `humid`, `wind_speed`, etc.
***
## Chapter 2 Solutions
```{r, include=FALSE, purl=FALSE}
chap <- 2
lc <- 0
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('2-1', '2-2'), including the null vector c('') to show no solutions.
solutions_shown <- c('2-1', '2-2', '2-3', '2-4', '2-5', '2-6', '2-7', '2-8', '2-9', '2-10', '2-11', '2-12', '2-13', '2-14')
# solutions_shown <- c('')
show_solutions <- function(section){return(any(solutions_shown == "ALL") || section %in% solutions_shown)}
```
```{r message=FALSE}
library(nycflights13)
library(ggplot2)
library(dplyr)
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look at both the `flights` and `alaska_flights` data frames by running `View(flights)` and `View(alaska_flights)` in the console. In what respect do these data frames differ? For example, think about the number of rows in each dataset.
**Solution**: `flights` contains data on all flights, while `alaska_flights` contains only flights operated by Alaska Airlines (carrier code `"AS"`). We can see that `flights` has `r nrow(flights)` rows while `alaska_flights` has only `r nrow(alaska_flights)`.
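To compare the two data frames without opening the viewer, one option is to count their rows. This is a sketch that assumes `alaska_flights` is built by filtering `flights` on carrier code `"AS"`, as the solution describes.

```r
library(dplyr)
library(nycflights13)

# Assumption: alaska_flights is the subset of flights operated by
# Alaska Airlines (carrier code "AS")
alaska_flights <- flights %>%
  filter(carrier == "AS")

nrow(flights)         # all flights that departed NYC in 2013
nrow(alaska_flights)  # only the Alaska Airlines subset
```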
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some practical reasons why `dep_delay` and `arr_delay` have a positive relationship?
**Solution**: The later a plane departs, typically the later it will arrive.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What variables in the `weather` data frame would you expect to have a negative correlation (i.e. a negative relationship) with `dep_delay`? Why? Remember that we are focusing on numerical variables here. Hint: Explore the `weather` dataset by using the `View()` function.
**Solution**: An example in the `weather` dataset is `visib`, which measures visibility in miles. As visibility increases, we would expect departure delays to decrease.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?
**Solution**: The point (0, 0) means no delay in departure or arrival. From the point of view of Alaska Airlines, this means the flight was on time. It seems most flights are at least close to being on time.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some other features of the plot that stand out to you?
**Solution**: Different people will answer this one differently. One answer is most flights depart and arrive less than an hour late.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Create a new scatterplot using different variables in the `alaska_flights` data frame by modifying the example above.
**Solution**: Many possibilities for this one; see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seem to be only two blocks of time where flights depart.
```{r, include=show_solutions('3-2'), echo=show_solutions('3-2')}
ggplot(data = alaska_flights, mapping = aes(x = dep_time, y = dep_delay)) +
  geom_point()
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is setting the `alpha` argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?
**Solution**: It makes the points semi-transparent, which addresses overplotting. More importantly, it hints at the (statistical) **density** and **distribution** of the points: where the points are concentrated and where they occur.
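A minimal sketch of a scatterplot with transparency, assuming `alaska_flights` is the `"AS"` subset of `flights` and using `alpha = 0.2` as in the text. The name `p_alpha` is ours.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Assumption: alaska_flights is the Alaska Airlines subset of flights
alaska_flights <- flights %>%
  filter(carrier == "AS")

# With alpha = 0.2 each point is mostly transparent, so darker regions
# mark places where many points overlap
p_alpha <- ggplot(data = alaska_flights,
                  mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)
p_alpha
```

Lower values of `alpha` make individual points fainter, so dense regions stand out more against isolated points.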
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** After viewing the Figure \@ref(fig:alpha) above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in Figure \@ref(fig:noalpha)?
**Solution**: The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look at both the `weather` and `early_january_weather` data frames by running `View(weather)` and `View(early_january_weather)` in the console. In what respect do these data frames differ?
**Solution**: *The rows of `early_january_weather` are a subset of `weather`.*
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** `View()` the `flights` data frame again. Why does the `time_hour` variable uniquely identify the hour of the measurement whereas the `hour` variable does not?
**Solution**: Because to uniquely identify an hour, we need the `year`/`month`/`day`/`hour` sequence, whereas there are only 24 possible values of `hour`.
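One way to see the difference is to count the distinct values of each variable. This is a sketch assuming `dplyr` and `nycflights13` are installed. The name `hour_counts` is ours.

```r
library(dplyr)
library(nycflights13)

# hour repeats every day, whereas time_hour encodes the full date and hour
hour_counts <- flights %>%
  summarize(
    distinct_hour = n_distinct(hour),
    distinct_time_hour = n_distinct(time_hour)
  )
hour_counts
```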
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?
**Solution**: Because lines suggest connectedness and ordering.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why are linegraphs frequently used when time is the explanatory variable?
**Solution**: Because time is sequential: subsequent observations are closely related to each other.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Plot a time series of a variable other than `temp` for Newark Airport in the first 15 days of January 2013.
**Solution**: Humidity is a good one to look at, since it is very closely related to the cycles of a day.
```{r, include=show_solutions('3-5'), echo=show_solutions('3-5')}
ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = humid)) +
  geom_line()
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?
**Solution**: The distribution doesn't change much. But by refining the bin width, we see that the temperature data has a high degree of precision. What do I mean by precision? Looking at the `temp` variable via `View(weather)`, we see that each temperature is recorded to 2 decimal places.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Would you classify the distribution of temperatures as symmetric or skewed?
**Solution**: It is rather symmetric, i.e. there are no __long tails__ on only one side of the distribution.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What would you guess is the "center" value in this distribution? Why did you make that choice?
**Solution**: The center is around `r round(mean(weather$temp, na.rm=TRUE), 1)`°F. By running the `summary()` command, we see that the mean and median are very similar. In fact, when a distribution is perfectly symmetric, the mean equals the median.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Is this data spread out greatly from the center or is it close? Why?
**Solution**: This can only be answered relatively speaking! Let's pick things to be relative to Seattle, WA temperatures:
```{r SEATAC, out.width='40%', echo=FALSE, fig.cap="Annual temperatures at SEATAC Airport.", purl=FALSE}
knitr::include_graphics("images/SEATAC_temperatures.png")
```
While it appears that Seattle weather has a similar center of 55°F, its
temperatures are almost entirely between 35°F and 75°F, for a range of
about 40°F. Seattle temperatures are much less spread out than New York's, i.e.
much more consistent over the year. New York, on the other hand, has much colder
days in the winter and much hotter days in the summer. Expressed differently,
for New York the middle 50% of values, as delineated by the interquartile range, spans about 30°F.
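The interquartile range quoted above can be checked directly. This is a minimal sketch, assuming the `nycflights13` package is installed and that the figure refers to the hourly `temp` recordings in `weather`. The name `temp_iqr` is ours.

```r
library(nycflights13)

# Five-number summary plus the mean of the hourly temperatures (in °F)
summary(weather$temp)

# Interquartile range: the spread of the middle 50% of values
temp_iqr <- IQR(weather$temp, na.rm = TRUE)
temp_iqr
```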
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?
**Solution**:
* Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.
* Because we see `temp` recordings split by `month`, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?
**Solution**:
* They correspond to the month of the temperature recording. While month is technically a number between 1-12, we're viewing it as a categorical variable here. Specifically, this is an **ordinal categorical** variable since there is an ordering to the categories.
* 25, 50, 75, 100 are temperatures in °F.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.
**Solution**:
* It would not work if we had a very large number of facets. For example, if we faceted by individual days rather than months, we would have 365 facets to look at. When considering all days in 2013, it could be argued that we shouldn't care about day-to-day fluctuations in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Does the `temp` variable in the `weather` dataset have a lot of variability? Why do you say that?
**Solution**: Again, like in `r paste0("(LC", chap, ".", (lc-4), ")")`, this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather, whereas in Seattle, WA and Portland, OR, you have two seasons: summer and rain!
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
**Solution**: It appears to be an outlier. Let's revisit the use of the `filter` command to home in on it. We want all data points where the `month` is 5 and `temp < 25`:
``` {r, include=show_solutions('3-9'), echo=show_solutions('3-9')}
weather %>%
  filter(month == 5 & temp < 25)
```
There appears to be only one hour, and only at JFK, that recorded 13.1°F (-10.5°C) in the month of May. This is probably a data entry mistake! Why wasn't the weather at least similar at EWR (Newark) and LGA (LaGuardia)?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which months have the highest variability in temperature? What reasons do you think this is?
**Solution**: We are now interested in the **spread** of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):
* The distance from the 1st to the 3rd quartiles i.e. the length of the boxes
* You can also think of this as the spread of the **middle 50%** of the data
Just from eyeballing it, it seems:
* November has the biggest IQR, i.e. the widest box, so has the most variation in temperature
* August has the smallest IQR, i.e. the narrowest box, so is the most consistent temperature-wise
Here's how we compute the exact IQR values for each month (we'll see this in more depth in Chapter \@ref(wrangling) of the text):
1. `group_by` the observations by `month`, then
1. for each group, i.e. each `month`, `summarize` it by applying the summary statistic function `IQR()`, while making sure to skip over missing data via `na.rm=TRUE`, then
1. `arrange` the table in `desc`ending order of `IQR`
```{r, echo=show_solutions('3-9'), eval=FALSE}
weather %>%
  group_by(month) %>%
  summarize(IQR = IQR(temp, na.rm=TRUE)) %>%
  arrange(desc(IQR))
```
```{r, echo=FALSE, include=show_solutions('3-9')}
weather %>%
  group_by(month) %>%
  summarize(IQR = IQR(temp, na.rm=TRUE)) %>%
  arrange(desc(IQR)) %>%
  kable()
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** We looked at the distribution of the numerical variable `temp` split by the numerical variable `month` that we converted to a categorical variable using the `factor()` function. Why would a boxplot of `temp` split by the numerical variable `pressure` similarly converted to a categorical variable using the `factor()` not be informative?
**Solution**: Because there are 12 unique values of `month` yielding only 12 boxes in our boxplot. There are many more unique values of `pressure` (`r weather$pressure %>% unique() %>% length()` unique values in fact), because values are to the first decimal place. This would lead to `r weather$pressure %>% unique() %>% length()` boxes, which is too many for people to digest.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
**Solution**: In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, outliers are explicitly drawn as separate points.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why are histograms inappropriate for visualizing categorical variables?
**Solution**: Histograms are for numerical variables: the horizontal extent of each histogram bar represents an interval of values, whereas for a categorical variable each bar represents only one level of the categorical variable.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the difference between histograms and barplots?
**Solution**: See above.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How many Envoy Air flights departed NYC in 2013?
**Solution**: Envoy Air is carrier code `MQ` and thus 26397 flights departed NYC in 2013.
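The count can be reproduced by filtering on the carrier code and counting rows. This is a sketch assuming `dplyr` and `nycflights13` are installed. The name `envoy_count` is ours.

```r
library(dplyr)
library(nycflights13)

# Keep only the rows whose carrier code is "MQ" (Envoy Air) and count them
envoy_count <- flights %>%
  filter(carrier == "MQ") %>%
  nrow()
envoy_count
```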
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?
**Solution**: The answer is US, AKA U.S. Airways, with 20536 flights. However, picking out the seventh highest airline when the rows are sorted alphabetically by carrier code is difficult. This would be easier to do if the rows were sorted by number. We'll learn how to do this in Chapter \@ref(wrangling) on data wrangling.
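As a preview of that sorting, here is one possible sketch using `count()` with `sort = TRUE`, which orders the carriers from most to fewest flights. The name `carrier_counts` is ours, and `count()` is one of several ways to build this table.

```r
library(dplyr)
library(nycflights13)

# Count flights per carrier, sorted in descending order of count
carrier_counts <- flights %>%
  count(carrier, sort = TRUE)
carrier_counts

# The seventh row is the seventh highest airline
carrier_counts %>%
  slice(7)
```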
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why should pie charts be avoided and replaced by barplots?
**Solution**: In our **opinion**, comparisons using horizontal lines are easier than comparing angles and areas of circles.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is your opinion as to why pie charts continue to be used?
**Solution**: In our **opinion**, pie charts are generally poorer at communicating data than barplots. People are not as good at comparing the sizes of angles because there is no scale, whereas it is much easier to compare the heights of bars in a barplot. However, pie charts do have an advantage in some circumstances. For example, when representing 25% and 75% of a sample with two bars, where one bar is three times as tall as the other, it is difficult to tell what share of the whole each bar represents without labels. In a pie chart, it is easy to see that a circle is divided into 75% and 25%. (Read more at: https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/)
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What kinds of questions are not easily answered by looking at the above figure?
**Solution**: Because the red, green, and blue bars don't all start at 0 (only red does), it makes comparing counts hard.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?
**Solution**: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn't prefer airports, each color would be roughly one third of each bar.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case?
**Solution**: We can easily compare the different airports for a given carrier using a single comparison line, i.e. things are lined up.
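For reference, a side-by-side barplot of this kind can be sketched as follows, assuming the figure maps `carrier` to the horizontal axis and `origin` to the fill color. The name `p_dodge` is ours.

```r
library(ggplot2)
library(nycflights13)

# position = "dodge" places the bars for each origin next to each other
# within a carrier, so they all start from the same baseline at 0
p_dodge <- ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge")
p_dodge
```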
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general?
**Solution**: It is hard to get totals for each airline.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?
**Solution**: Not that different than using side-by-side; depends on how you want to organize your presentation.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What information about the different carriers at different airports is more easily seen in the faceted barplot?
**Solution**: Now we can also compare the different carriers **within** a particular airport easily too. For example, we can read off who the top carrier for each airport is easily using a single horizontal line.
***
## Chapter 3 Solutions
```{r, include=FALSE, purl=FALSE}
chap <- 3
lc <- 0
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('2-1', '2-2'), including the null vector c('') to show no solutions.
solutions_shown <- c('3-1', '3-2', '3-3', '3-4', '3-5', '3-6', '3-7')
# solutions_shown <- c('')
show_solutions <- function(section){return(any(solutions_shown == "ALL") || section %in% solutions_shown)}
```
```{r message=FALSE}
library(dplyr)
library(ggplot2)
library(nycflights13)
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What's another way using the "not" operator `!` to filter only the rows that are not going to Burlington, VT nor Seattle, WA in the `flights` data frame? Test this out using the code above.
**Solution**:
```{r, eval=FALSE, echo=show_solutions('5-1'), purl=FALSE}
# Original in book
not_BTV_SEA <- flights %>%
  filter(!(dest == "BTV" | dest == "SEA"))
# Alternative way: `!dest == "BTV"` is parsed as `!(dest == "BTV")`
not_BTV_SEA <- flights %>%
  filter(!dest == "BTV" & !dest == "SEA")
# Yet another way
not_BTV_SEA <- flights %>%
  filter(dest != "BTV" & dest != "SEA")
```
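A further variant, not in the original solution, uses the `%in%` operator. The sketch below also checks that it keeps the same rows as the version from the book. The name `not_BTV_SEA_in` is ours.

```r
library(dplyr)
library(nycflights13)

# The version from the book
not_BTV_SEA <- flights %>%
  filter(!(dest == "BTV" | dest == "SEA"))

# A more compact variant: negate membership in a vector of destinations
not_BTV_SEA_in <- flights %>%
  filter(!dest %in% c("BTV", "SEA"))

# Both approaches should keep the same number of rows
nrow(not_BTV_SEA) == nrow(not_BTV_SEA_in)
```

The `%in%` form scales better when more than two destinations need to be excluded.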
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor's approach?
**Solution**: The missing patients may have died of lung cancer! So to ignore them might seriously **bias** your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself:
* Is there a systematic reason why certain values are missing? If so, you might be biasing your results!
* If there isn't, then it might be ok to "sweep missing values under the rug."
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Modify the above `summarize` function to create `summary_temp` to also use the `n()` summary function: `summarize(count = n())`. What does the returned value correspond to?
**Solution**: It corresponds to a count of the number of observations/rows:
```{r, eval=show_solutions('5-2'), echo=show_solutions('5-2'), purl=FALSE}
weather %>%
  summarize(count = n())
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why doesn't the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run `summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE))` first.
```{r eval=FALSE}
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE)) %>%
  summarize(std_dev = sd(temp, na.rm = TRUE))
```
**Solution**: Consider the output of only running the first two lines:
```{r, eval=show_solutions('5-2'), echo=show_solutions('5-2'), purl=FALSE}
weather %>%
  summarize(mean = mean(temp, na.rm = TRUE))
```
After the first `summarize()`, the variable `temp` disappears because it has been collapsed to the single value `mean`. So when we try to run the second `summarize()`, it can't find the variable `temp` to compute the standard deviation of.
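One way to get both summaries is to compute them inside a single `summarize()` call, while `temp` is still available. This is a sketch assuming `dplyr` and `nycflights13` are installed.

```r
library(dplyr)
library(nycflights13)

# Both statistics are computed in one summarize(), so temp still exists
# when sd() is evaluated
summary_temp <- weather %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp
```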
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Recall from Chapter \@ref(viz) when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the `summary_monthly_temp` data frame tell us about temperatures in New York City throughout the year?
**Solution**:
```{r, echo=FALSE, include=show_solutions('5-3'), purl=FALSE}
summary_temp_by_month <- weather %>%
  group_by(month) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
kable(summary_temp_by_month) %>%
  kable_styling(font_size = ifelse(knitr::is_latex_output(), 10, 16),
                latex_options = c("hold_position"))
```
The standard deviation is a quantification of **spread** and **variability**. We
see that the period in November, December, and January has the most variation in
temperature, so you can expect very different temperatures on different days.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?
**Solution**:
```{r, echo=show_solutions('5-3'), include=show_solutions('5-3'), purl=FALSE}
summary_temp_by_day <- weather %>%
  group_by(year, month, day) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp_by_day
```
Note: `group_by(day)` is not enough, because `day` is a value between 1-31. We need to `group_by(year, month, day)`.
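The effect of the two groupings can be compared by counting the rows each one produces. This is a sketch in which the names `by_day_only` and `by_full_date` are ours. The `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(nycflights13)

# Grouping by day alone pools, for example, the 1st of every month together
by_day_only <- weather %>%
  group_by(day) %>%
  summarize(mean = mean(temp, na.rm = TRUE))

# Grouping by year, month, and day gives one row per calendar date
by_full_date <- weather %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(temp, na.rm = TRUE), .groups = "drop")

nrow(by_day_only)
nrow(by_full_date)
```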
```{r, echo=show_solutions('5-2'), purl=FALSE}
library(dplyr)
library(nycflights13)
summary_temp_by_month <- weather %>%
  group_by(month) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Recreate `by_monthly_origin`, but instead of grouping via `group_by(origin, month)`, group variables in a different order `group_by(month, origin)`. What differs in the resulting dataset?
**Solution**:
```{r, echo=show_solutions('5-3'), include=show_solutions('5-3'), purl=FALSE}
by_monthly_origin <- flights %>%
  group_by(month, origin) %>%
  summarize(count = n())
```
```{r, eval=FALSE, echo=show_solutions('5-3'), purl=FALSE}
by_monthly_origin
```
```{r, echo=FALSE, include=show_solutions('5-3'), purl=FALSE}
kable(by_monthly_origin) %>%
  kable_styling(font_size = ifelse(knitr::is_latex_output(), 10, 16),
                latex_options = c("hold_position"))
```
In `by_monthly_origin` the `month` column is now first and the rows are sorted by `month` instead of `origin`. If you compare the values of `count` in `by_origin_monthly` and `by_monthly_origin` using the `View()` function, you'll see that the values are actually the same, just presented in a different order.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How could we identify how many flights left each of the three airports for each `carrier`?
**Solution**: We could summarize the count from each airport using the `n()` function, which *counts rows*.
```{r, echo=show_solutions('5-3'), include=show_solutions('5-3'), purl=FALSE}
count_flights_by_airport <- flights %>%
  group_by(origin, carrier) %>%
  summarize(count = n())
```
```{r, eval=FALSE, include=show_solutions('5-3'), purl=FALSE}
count_flights_by_airport
```
```{r, echo=FALSE, include=show_solutions('5-3'), purl=FALSE}
kable(count_flights_by_airport) %>%
  kable_styling(font_size = ifelse(knitr::is_latex_output(), 10, 16),
                latex_options = c("hold_position"))
```
All remarkably similar! Note: the `n()` function counts rows, whereas the `sum(VARIABLE_NAME)` function sums all values of a certain numerical variable `VARIABLE_NAME`.
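The distinction between `n()` and `sum()` is easiest to see on a tiny made-up data frame. The data frame `toy` and its columns below are hypothetical and serve only as an illustration.

```r
library(dplyr)

# Hypothetical toy data: three rows in two groups
toy <- data.frame(
  group = c("a", "a", "b"),
  value = c(10, 20, 5)
)

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(
    count = n(),        # number of rows in each group
    total = sum(value)  # sum of the value column in each group
  )
toy_summary
```

Group `a` has two rows whose values add up to 30, so `count` is 2 and `total` is 30. Group `b` has a single row, so `count` is 1 and `total` is 5.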
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How does the `filter` operation differ from a `group_by` followed by a `summarize`?
**Solution**:
* `filter` picks out rows from the original dataset without modifying them, whereas
* `group_by %>% summarize` computes summaries of numerical variables, and hence
reports new values.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What do positive values of the `gain` variable in `flights` correspond to? What about negative values? And what about a zero value?
**Solution**:
* Say a flight departed 20 minutes late, i.e. `dep_delay = 20`
* Then arrived 10 minutes late, i.e. `arr_delay = 10`.
* Then `gain = dep_delay - arr_delay = 20 - 10 = 10` is positive, so the flight "made up/gained time in the air."
* Conversely, a negative `gain` means the arrival delay exceeded the departure delay, so the flight lost time in the air.
* A `gain` of 0 means the departure and arrival delays were the same, so no time was made up in the air. We see in most cases that the `gain` is near 0 minutes.
* I never understood this. If the pilot says "we're going to make up time in the air" because of a delay by flying faster, why don't you always just fly faster to begin with?
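The three cases can be sketched with a small made-up data frame. `toy_gain` and its delay values are hypothetical; the `gain` definition is the one used in the solution above.

```r
library(dplyr)

# Hypothetical toy data: three flights with delays in minutes
toy_gain <- tibble(
  dep_delay = c(20, 5, 30),
  arr_delay = c(10, 20, 30)
) %>%
  mutate(gain = dep_delay - arr_delay)

toy_gain
```

The first flight has a positive `gain` of 10 (time made up in the air), the second a negative `gain` of -15 (time lost in the air), and the third a `gain` of 0.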
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Could we create the `dep_delay` and `arr_delay` columns by simply subtracting `dep_time` from `sched_dep_time` and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in `flights`.
**Solution**: No, because `dep_time` and `sched_dep_time` are stored as numbers in HHMM format, so you can't do direct arithmetic on them. The difference in time between 12:03 and 11:59 is 4 minutes, but `1203 - 1159 = 44`.
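One way to see the problem is to convert such clock times to minutes since midnight before subtracting. The helper `hhmm_to_minutes()` below is a hypothetical function written for this illustration, assuming times are stored as HHMM numbers; it is not part of `dplyr` or `nycflights13`.

```r
# Hypothetical helper: convert a time stored as an HHMM number (e.g. 1203)
# into the number of minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Naive subtraction treats the times as ordinary numbers
naive_difference <- 1203 - 1159
naive_difference  # 44

# Converting to minutes first gives the true difference
true_difference <- hhmm_to_minutes(1203) - hhmm_to_minutes(1159)
true_difference   # 4
```

This sketch ignores flights that cross midnight, which would need extra handling.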
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can we say about the distribution of `gain`? Describe it in a few sentences using the plot and the `gain_summary` data frame values.
**Solution**: Most of the time the gain is a little under zero, and it usually falls between -50 and 50 minutes. There are some extreme cases, however!
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Looking at Figure \@ref(fig:reldiagram), when joining `flights` and `weather` (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of `year`, `month`, `day`, `hour`, and `origin`, and not just `hour`?
**Solution**: Because `hour` is simply a value between 0 and 23; to identify a *specific* hour, we need to know which year, month, day and at which airport.
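A minimal sketch of what goes wrong when joining on `hour` alone, using two made-up data frames. `toy_flights_hourly` and `toy_weather` and their values are hypothetical, and `year` is left out for brevity.

```r
library(dplyr)

# Hypothetical toy data: two flights at 6am on two different days
toy_flights_hourly <- tibble(
  month = c(1, 1), day = c(1, 2), hour = c(6, 6),
  origin = c("JFK", "JFK"), flight = c(11, 22)
)

# Hypothetical toy data: the 6am weather recording on each of those days
toy_weather <- tibble(
  month = c(1, 1), day = c(1, 2), hour = c(6, 6),
  origin = c("JFK", "JFK"), temp = c(39, 28)
)

# Joining on hour alone pairs every 6am flight with every 6am weather row
by_hour_only <- toy_flights_hourly %>%
  inner_join(toy_weather, by = "hour")

# Joining on all the keys matches each flight to its own weather recording
by_all_keys <- toy_flights_hourly %>%
  inner_join(toy_weather, by = c("month", "day", "hour", "origin"))

nrow(by_hour_only)  # 4
nrow(by_all_keys)   # 2
```

Recent versions of `dplyr` may also warn about a many-to-many relationship in the first join, which is a hint that the key does not uniquely identify the rows.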
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What surprises you about the top 10 destinations from NYC in 2013?
**Solution**: This question is subjective! What surprises me is the high number of flights to Boston. Wouldn't it be easier and quicker to take the train?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some advantages of data in normal forms? What are some disadvantages?
**Solution**: When datasets are in normal form, we can easily `_join` them with other datasets! For example, we can join the `flights` data with the `planes` data.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some ways to select all three of the `dest`, `air_time`, and `distance` variables from `flights`? Give the code showing how to do this in at least three different ways.
**Solution**:
```{r, echo=show_solutions('5-6'), include=show_solutions('5-6'), purl=FALSE}
# The regular way:
flights %>%
  select(dest, air_time, distance)
# Since they are sequential columns in the dataset:
flights %>%
  select(dest:distance)
# Less efficient: removing every other column
flights %>%
  select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
         -sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin,
         -hour, -minute, -time_hour)
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How could one use `starts_with`, `ends_with`, and `contains` to select columns from the `flights` data frame? Provide three different examples in total: one for `starts_with`, one for `ends_with`, and one for `contains`.
**Solution**:
```{r, echo=show_solutions('5-6'), include=show_solutions('5-6'), purl=FALSE}
# Anything that starts with "d":
flights %>%
  select(starts_with("d"))
# Anything related to delays:
flights %>%
  select(ends_with("delay"))
# Anything related to departures:
flights %>%
  select(contains("dep"))
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why might we want to use the `select()` function on a data frame?
**Solution**: To narrow the data frame down to the variables of interest, making it easier to look at, for example with `View()`.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.
**Solution**:
```{r, echo=show_solutions('5-6'), include=show_solutions('5-6'), purl=FALSE}
top_five <- flights %>%
  group_by(dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  # Keep the 5 destinations with the largest average arrival delay
  top_n(n = 5, wt = avg_delay)
top_five
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Using the datasets included in the `nycflights13` package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:
1. **Crucial**: Unless you are very confident in what you are doing, it is worthwhile not to start coding right away, but rather to first sketch out on paper all the necessary data wrangling steps, not using exact code, but rather high-level *pseudocode* that is informal yet detailed enough to articulate what you are doing. This way you won't confuse *what* you are trying to do (the algorithm) with *how* you are going to do it (writing `dplyr` code).
1. Take a close look at all the datasets using the `View()` function: `flights`, `weather`, `planes`, `airports`, and `airlines` to identify which variables are necessary to compute available seat miles.
1. Figure \@ref(fig:reldiagram) above showing how the various datasets can be joined will also be useful.
1. Consider the data wrangling verbs in Table \@ref(tab:wrangle-summary-table) as your toolbox!
**Solution**: Here are some examples of student-written [pseudocode](https://twitter.com/rudeboybert/status/964181298691629056). Based on our own pseudocode, let's first display the entire solution.
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  group_by(carrier) %>%
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
  arrange(desc(ASM))
```
Let's now break this down step-by-step. To compute the available seat miles for a given flight, we need the `distance` variable from the `flights` data frame and the `seats` variable from the `planes` data frame, necessitating a join by the key variable `tailnum` as illustrated in Figure \@ref(fig:reldiagram). To keep the resulting data frame easy to view, we'll `select()` only these two variables and `carrier`:
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance)
```
Now for each flight we can compute the available seat miles `ASM` by multiplying the number of seats by the distance via a `mutate()`:
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  # Added:
  mutate(ASM = seats * distance)
```
Next we want to sum the `ASM` for each carrier. We achieve this by first grouping by `carrier` and then summarizing using the `sum()` function:
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  # Added:
  group_by(carrier) %>%
  summarize(ASM = sum(ASM))
```
However, because for certain carriers certain flights have missing `NA` values, the resulting table also returns `NA`'s. We can eliminate these by adding a `na.rm = TRUE` argument to `sum()`, telling R that we want to remove the `NA`'s in the sum. We saw this in Section \@ref(summarize):
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  group_by(carrier) %>%
  # Modified:
  summarize(ASM = sum(ASM, na.rm = TRUE))
```
Finally, we `arrange()` the data in `desc()`ending order of `ASM`.
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  group_by(carrier) %>%
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
  # Added:
  arrange(desc(ASM))
```
While the above data frame is correct, the IATA `carrier` code is not always useful. For example, what carrier is `WN`? We can address this by joining with the `airlines` dataset using `carrier` as the key variable. While this step is not absolutely required, it goes a long way toward making the table easier to make sense of. It is important to be empathetic with the ultimate consumers of your presented data!
```{r, echo=show_solutions('5-7'), include=show_solutions('5-7'), purl=FALSE}
flights %>%
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  group_by(carrier) %>%
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
  arrange(desc(ASM)) %>%
  # Added:
  inner_join(airlines, by = "carrier")
```
***
## Chapter 4 Solutions
```{r, include=FALSE, purl=FALSE}
chap <- 4
lc <- 0
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('2-1', '2-2'), including the null vector c('') to show no solutions.
solutions_shown <- c('4-1', '4-2', '4-3', '4-4')
# solutions_shown <- c('')
# Returns a single TRUE/FALSE: show the section if "ALL" is requested or the section is listed
show_solutions <- function(section){return("ALL" %in% solutions_shown || section %in% solutions_shown)}
```
```{r message=FALSE}
library(dplyr)
library(ggplot2)
library(nycflights13)
library(tidyr)
library(readr)
library(fivethirtyeight)
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are common characteristics of "tidy" datasets?
**Solution**: Rows correspond to observations, while columns correspond to variables.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What makes "tidy" datasets useful for organizing data?
**Solution**: Tidy datasets are an organized way of viewing data. This format is required for the `ggplot2` and `dplyr` packages for data visualization and wrangling.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look at the `airline_safety` data frame included in the `fivethirtyeight` data package. Run the following:
```{r, eval=FALSE}
airline_safety
```
After reading the help file by running `?airline_safety`, we see that `airline_safety` is a data frame containing information on different airline companies' safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver's article ["Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?"](https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/). Let's ignore the `incl_reg_subsidiaries` and `avail_seat_km_per_week` variables for simplicity:
```{r}
airline_safety_smaller <- airline_safety %>%
  select(-c(incl_reg_subsidiaries, avail_seat_km_per_week))
airline_safety_smaller
```
This data frame is not in "tidy" format. How would you convert this data frame to be in "tidy" format, in particular so that it has a variable `incident_type_years` indicating the incident type/year and a variable `count` of the counts?
**Solution**:
This can be done using the `pivot_longer()` function from the `tidyr` package:
```{r}
airline_safety_smaller_tidy <- airline_safety_smaller %>%
  pivot_longer(names_to = "incident_type_years",
               values_to = "count",
               cols = -airline)
airline_safety_smaller_tidy
```
If you look at the resulting `airline_safety_smaller_tidy` data frame in the spreadsheet viewer, you'll see that the variable `incident_type_years` has 6 possible values: `"incidents_85_99", "fatal_accidents_85_99", "fatalities_85_99",
"incidents_00_14", "fatal_accidents_00_14", "fatalities_00_14"` corresponding to the 6 columns of `airline_safety_smaller` we tidied.
Note that prior to `tidyr` version 1.0.0, released to CRAN in September 2019, this would have been done using the `gather()` function from the `tidyr` package, which still works:
```{r}
airline_safety_smaller_tidy <- airline_safety_smaller %>%
  gather(key = incident_type_years, value = count, -airline)
airline_safety_smaller_tidy
```
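As a smaller illustration of the same reshaping, here is a sketch on a made-up wide data frame. `toy_wide` and its values are hypothetical; only the column-naming pattern mirrors `airline_safety_smaller`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per airline, one column per incident type/period
toy_wide <- tibble(
  airline = c("A", "B"),
  incidents_85_99 = c(2, 76),
  incidents_00_14 = c(0, 6)
)

# Every column except airline is stacked into name/value pairs
toy_long <- toy_wide %>%
  pivot_longer(names_to = "incident_type_years",
               values_to = "count",
               cols = -airline)
toy_long
```

The 2 rows and 2 value columns of `toy_wide` become 2 x 2 = 4 rows in `toy_long`, each identified by the pair of `airline` and `incident_type_years`.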
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Convert the `dem_score` data frame into
a tidy data frame and assign the name of `dem_score_tidy` to the resulting long-formatted data frame.
**Solution**: Running the following in the console:
```{r, include=show_solutions('4-3')}
dem_score_tidy <- dem_score %>%
  pivot_longer(names_to = "year", values_to = "democracy_score",
               cols = -country)
  # gather(key = year, value = democracy_score, -country)
```
Let's now compare the `dem_score` and `dem_score_tidy`. `dem_score` has democracy score information for each year in columns, whereas in `dem_score_tidy` there are explicit variables `year` and `democracy_score`. While both representations of the data contain the same information, we can only use `ggplot()` to create plots using the `dem_score_tidy` data frame.
```{r, include=show_solutions('4-3')}
dem_score
dem_score_tidy
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Read in the life expectancy data stored at <https://moderndive.com/data/le_mess.csv> and convert it to a tidy data frame.
**Solution**: The code is similar:
```{r, eval=FALSE,include=show_solutions('4-3'), echo=show_solutions('4-3')}
life_expectancy <- read_csv("https://moderndive.com/data/le_mess.csv")
life_expectancy_tidy <- life_expectancy %>%
  pivot_longer(names_to = "year",
               values_to = "life_expectancy",
               cols = -country)
  # gather(key = year, value = life_expectancy, -country)
```
```{r, echo=FALSE, purl=FALSE, message=FALSE, warning=FALSE}
life_expectancy <- read_csv('data/le_mess.csv')
life_expectancy_tidy <- life_expectancy %>%
  pivot_longer(names_to = "year",
               values_to = "life_expectancy",
               cols = -country)
  # gather(key = year, value = life_expectancy, -country)
```
We observe the same structure with respect to `year` in `life_expectancy` vs `life_expectancy_tidy` as we did in `dem_score` vs `dem_score_tidy`:
```{r, lc4-2solutions-4, include=show_solutions('4-3')}
life_expectancy
life_expectancy_tidy
```
***
## Chapter 5 Solutions
To come!
```{r, include=FALSE, purl=FALSE}
chap <- 5
lc <- 0
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('2-1', '2-2'), including the null vector c('') to show no solutions.
solutions_shown <- c('5-1', '5-2', '5-3', '5-4', '5-5', '5-6', '5-7')
# solutions_shown <- c('')
# Returns a single TRUE/FALSE: show the section if "ALL" is requested or the section is listed
show_solutions <- function(section){return("ALL" %in% solutions_shown || section %in% solutions_shown)}
```
```{r message=FALSE}
library(ggplot2)
library(dplyr)
library(moderndive)
library(gapminder)
#library(skimr)
```