# Basics of statistics {#basics}
This chapter introduces some important terms useful for doing data analyses.
It also introduces the essentials of classical frequentist tests such as the t-test. Even though we will not use null hypothesis tests later [@Amrhein.2019], we introduce them here because we need to understand the scientific literature. For each classical test, we provide a suggestion for how to present the statistical results without using null hypothesis tests. We further discuss some differences between Bayesian and frequentist statistics.
## Variables and observations
Empirical research involves data collection. Data are collected by recording measurements of variables for observational units. An observational unit may be, for example, an individual, a plot, a time interval or a combination of those. The collection of all units ideally builds a random sample of the entire population of units in which we are interested. The measurements (or observations) of the random sample are stored in a data table (sometimes also called a data set, but a data set may include several data tables; a collection of data tables belonging to the same study or system is normally bundled and stored in a database). A data table is a collection of variables (columns). Data tables are normally handled as objects of class `data.frame` in R. All measurements on a row of a data table belong to the same observational unit. The variables can be of different scales (Table \@ref(tab:scalemeasurement)).
Table: (\#tab:scalemeasurement) Scales of measurements
Scale | Examples | Properties | Coding in R |
:-------|:------------------|:------------------|:--------------------|
Nominal | Sex, genotype, habitat | Identity (values have a unique meaning) | `factor()` |
Ordinal | Elevational zones | Identity and magnitude (values have an ordered relationship) | `ordered()` |
Numeric | Discrete: counts; continuous: body weight, wing length | Identity, magnitude, and intervals or ratios | `integer()` `numeric()` |
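As a minimal sketch (using hypothetical variable names), the three scales could be coded in R as follows:
```{r}
# hypothetical example illustrating the three scales of measurement
sex <- factor(c("male", "female", "female"))          # nominal
zone <- ordered(c("low", "high", "mid"),
                levels=c("low", "mid", "high"))        # ordinal
counts <- as.integer(c(3, 0, 7))                       # numeric, discrete
weight <- c(12.3, 10.8, 11.5)                          # numeric, continuous
str(data.frame(sex, zone, counts, weight))             # class of each column
```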
The aim of many studies is to describe how a variable of interest ($y$) is related to one or more predictor variables ($x$). How these variables are named differs between authors. The y-variable is called "outcome variable", "response" or "dependent variable". The x-variables are called "predictors", "explanatory variables" or "independent variables". The choice of terms for x and y is a matter of taste. We avoid the terms "dependent" and "independent" variables because often we do not know whether the variable $y$ in fact depends on the $x$ variables, and also because the x-variables are often not independent of each other. In this book, we try to use "outcome" and "predictor" variables because these terms sound most neutral to us in that they refer to how the statistical model is constructed rather than to a real-life relationship.
## Displaying and summarizing data
### Histogram
While nominal and ordinal variables are summarized by giving the absolute number or the proportion of observations for each category, numeric variables are normally summarized by a location and a scatter statistic, such as the mean and the standard deviation or the median and some quantiles. The distribution of a numeric variable is graphically displayed in a histogram (Fig. \@ref(fig:histogram)).
```{r histogram, echo=FALSE, fig.cap='Histogram of the length of ell of statistics course participants.'}
load("RData/datacourse.RData")
hist(dat$ell, las=1, xlab="Length of ell [cm]", ylab="Number of students", main=NA,
col="tomato")
box()
```
To draw a histogram, the variable is displayed on the x-axis and the $x_i$-values are assigned to classes. The edges of the classes are called 'breaks'. They can be set with the argument `breaks=` within the function `hist`. The values given in the `breaks=` argument must at least span the values of the variable. If the argument `breaks=` is not specified, R searches for break values that make the histogram look smooth. The number of observations falling into each class is given on the y-axis. The y-axis can be re-scaled so that the area of the histogram equals 1 by setting the argument `freq=FALSE`. In that case, the values on the y-axis correspond to the density values of a probability distribution (Chapter \@ref(distributions)).
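As a brief sketch using the course data loaded above, breaks and density scaling can be set explicitly:
```{r}
# self-defined breaks spanning the data, y-axis scaled to densities
br <- seq(floor(min(dat$ell)), ceiling(max(dat$ell)), by=1)
hist(dat$ell, breaks=br, freq=FALSE,
     xlab="Length of ell [cm]", ylab="Density", main=NA)
```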
### Location and scatter
Location statistics are mean, median or mode. A common mean is the
- arithmetic mean: $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (R function `mean`),
where $n$ is the sample size. The parameter $\mu$ is the (unknown) true mean of the entire population of which the $1,...,n$ measurements are a random sample. $\bar{x}$ is called the sample mean and is used as an estimate for $\mu$. The hat ($\hat{}$) above a parameter indicates that the parameter value is obtained from a sample and, therefore, may differ from the true value.
The median is the 50% quantile. We find 50% of the measurements below and 50% above the median. If $x_1,..., x_n$ are the ordered measurements of a variable, then the median is:
- median $= x_{(n+1)/2}$ for odd $n$, and median $= \frac{1}{2}(x_{n/2} + x_{n/2+1})$ for even $n$ (R function `median`).
The mode is the value that occurs with the highest frequency or that has the highest density.
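R has no built-in function for the mode of a sample; a minimal sketch (for a discrete variable, using made-up values) tabulates the values and picks the most frequent one:
```{r}
# mode of a discrete variable: the most frequent value
x <- c(1, 2, 2, 3, 2, 4, 1)
tab_x <- table(x)                            # frequency of each value
as.numeric(names(tab_x)[which.max(tab_x)])   # the mode, here 2
```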
Scatter also is called spread, scale or variance. Variance parameters describe how far away from the location parameter single observations can be found, or how the measurements are scattered around their mean. The variance is defined as the average squared difference between the observations and the mean:
- variance $\hat{\sigma^2} = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
The term $(n-1)$ is called the degrees of freedom. It is used in the denominator of the variance formula instead of $n$ to prevent underestimating the variance. Because $\bar{x}$ is on average closer to the $x_i$ than the unknown true mean $\mu$ would be, the variance would be underestimated if $n$ were used in the denominator.
<!-- <font size="1"> The maximum likelihood estimate of the variance corresponds to the variance formula using $n$ instead of $n-1$ in the denominator, see, e.g., @Royle.2008b.</font> -->
The variance is used to compare the degree of scatter among different groups. However, its values are difficult to interpret because of the squared unit. Therefore, the square root of the variance, the standard deviation, is normally reported:
- standard deviation $\hat{\sigma} = s = \sqrt{s^2}$ (R Function `sd`)
The standard deviation is approximately the average deviation of an observation from the sample mean. In the case of a [normal distribution](#normdist), about two thirds (68%) of the data are expected within one standard deviation around the mean.
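A small sketch with simulated values, showing the relationship between variance and standard deviation and the roughly two-thirds proportion within one standard deviation for normally distributed data:
```{r}
set.seed(1)
z <- rnorm(1000, mean=10, sd=2)      # simulated, normally distributed data
var(z)                                # variance (denominator n-1)
sd(z)                                 # standard deviation = sqrt(var(z))
mean(abs(z - mean(z)) < sd(z))        # proportion within one SD, roughly 0.68
```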
The variance and standard deviation each describe the scatter with a single value. Thus, we have to assume that the observations are scattered symmetrically around their mean in order to get a picture of the distribution of the measurements. When the measurements are spread asymmetrically (skewed distribution), then it may be more precise to describe the scatter with more than one value. Such statistics could be quantiles from the lower and upper tail of the data.
Quantiles inform us about both the location and the spread of a distribution. The $p$th quantile is the value with the property that a proportion $p$ of all values are less than or equal to the value of the quantile. The median is the 50% quantile. The 25% quantile and the 75% quantile are also called the lower and upper quartiles, respectively. The range between the 25% and the 75% quartiles is called the interquartile range. This range includes 50% of the distribution and is also used as a measure of scatter. The R function `quantile` extracts sample quantiles. The median, the quartiles, and the interquartile range can be graphically displayed using box-and-whisker plots (boxplots for short, R function `boxplot`). The fat horizontal bars are the medians (Fig. \@ref(fig:boxplot)). The boxes mark the interquartile range. The whiskers reach out to the last observation within 1.5 times the interquartile range from the quartile. Circles mark observations beyond 1.5 times the interquartile range from the quartile.
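A short sketch, using the course data loaded above, extracting the quartiles and the interquartile range:
```{r}
quantile(dat$ell, probs=c(0.25, 0.5, 0.75))   # lower quartile, median, upper quartile
IQR(dat$ell)                                   # interquartile range
```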
```{r boxplot, fig.cap="Boxplot of the length of ell of statistics course participants who do or do not own a car."}
par(mar=c(5,4,1,1))
boxplot(ell~car, data=dat, las=1, ylab="Length of ell [cm]",
        col="tomato", xaxt="n")
axis(1, at=c(1,2), labels=c("Not owning a car", "Car owner"))
n <- table(dat$car)
axis(1, at=c(1,2), labels=paste("n=", n, sep=""), mgp=c(3,2, 0))
```
The boxplot is an appealing tool for comparing location, variance and distribution of measurements among groups.
### Correlations
A correlation measures the strength with which characteristics of two variables are associated with each other (co-occur). If both variables are numeric, we can visualize the correlation using a scatterplot.
```{r scatterplot, fig.cap="Scatterplot of the length of ell and the comfort temperature of statistics course participants."}
par(mar=c(5,4,1,1))
plot(temp~ell, data=dat, las=1, xlab="Length of ell [cm]",
ylab="Comfort temperature [°C]",
pch=16)
```
The covariance between variable $x$ and $y$ is defined as:
- covariance $q = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ (R function `cov`)
As with the variance, the unit of the covariance is a product of the original units (squared if $x$ and $y$ share the same unit) and therefore covariance values are difficult to interpret. A standardized covariance is the Pearson correlation coefficient:
- Pearson correlation coefficient: $r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$ (R function `cor`)
Means, variances, standard deviations, covariances and correlations are sensitive to outliers. Single observations containing extreme values normally have a disproportionate influence on these statistics. When outliers are present in the data, we may prefer a more robust correlation measure such as the Spearman correlation or Kendall's tau. Both are based on the ranks of the measurements instead of the measurements themselves (see the code sketch after the following list).
- Spearman correlation coefficient: correlation between rank(x) and rank(y) (R function `cor(x,y, method="spearman")`)
- Kendall's tau: $\tau = 1-\frac{4I}{(n(n-1))}$, where $I$ = number of pairs $(i,k)$ for which $(x_i < x_k)$ & $(y_i > y_k)$ or vice versa (R function `cor(x,y, method="kendall")`).
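A minimal sketch comparing the three correlation measures for the course data used above:
```{r}
cor(dat$ell, dat$temp, method="pearson")    # Pearson (default)
cor(dat$ell, dat$temp, method="spearman")   # rank-based, robust to outliers
cor(dat$ell, dat$temp, method="kendall")    # Kendall's tau
```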
### Principal components analysis (PCA)
The principal components analysis (PCA) is a multivariate correlation analysis. A multidimensional data set with $k$ variables can be seen as a cloud of points (observations) in a $k$-dimensional space. Imagine we could move around in that space and look at the cloud from different locations. From some locations the data look highly correlated, whereas from others we cannot see the correlation. That is what PCA does: it rotates the coordinate system (defined by the original variables) of the data cloud so that the correlations are no longer visible. The axes of the new coordinate system are linear combinations of the original variables. They are called principal components. There are as many principal components as there are original variables, i.e. $k$: $p_1, ..., p_k$. The principal components meet further requirements:
- the first component explains most variance
- the second component explains most of the remaining variance and is perpendicular (= uncorrelated) to the first one
- third component explains most of the remaining variance and is perpendicular to the first two
- ...
For example, in a two-dimensional data set $(x_1, x_2)$ the principal components become
$pc_{1i} = b_{11}x_{1i} + b_{12}x_{2i}$
$pc_{2i} = b_{21}x_{1i} + b_{22}x_{2i}$ with $b_{jk}$ being loadings of principal component $j$ and original variable $k$. Fig. \@ref(fig:principal) shows the two principal components for a two-dimensional data set. They can be calculated using matrix algebra: principal components are eigenvectors of the covariance or correlation matrix.
```{r principal, echo=FALSE, fig.cap='Principal components of a two dimensional data set based on the covariance matrix (green) and the correlation matrix (brown).'}
set.seed(2353)
x1 <- rnorm(100, 5, 1)
x2 <- rnorm(100, 0.8*x1, 1)
par(mar=c(3,3,0.3,0.3))
plot(x1-mean(x1),x2-mean(x2), asp=1, pch=16, col="blue")
# PCA based on covariance matrix
pc1 <- eigen(cov(cbind(x1,x2)))$vectors[,1]
pc2 <- eigen(cov(cbind(x1,x2)))$vectors[,2]
abline(0, pc1[2]/pc1[1], col="green", lwd=2)
abline(0, pc2[2]/pc2[1], col="green", lwd=2)
# PCA based on correlation matrix
pc1 <- eigen(cor(cbind(x1,x2)))$vectors[,1]
pc2 <- eigen(cor(cbind(x1,x2)))$vectors[,2]
abline(0, pc1[2]/pc1[1], col="brown", lwd=2)
abline(0, pc2[2]/pc2[1], col="brown", lwd=2)
```
The choice between the correlation and the covariance matrix is essential. The covariance matrix is an unstandardized correlation matrix; therefore, when the PCA is based on the covariance matrix, the results change when we change the unit of a variable, e.g., from cm to m. Basing the PCA on the covariance matrix only makes sense when the variances are comparable among the variables, i.e., if all variables are measured in the same unit and we would like to weight each variable according to its variance. If this is not the case, the PCA should be based on the correlation matrix.
```{r}
pca <- princomp(cbind(x1,x2)) # PCA based on covariance matrix
pca <- princomp(cbind(x1,x2), cor=TRUE) # PCA based on correlation matrix
loadings(pca)
```
The loadings measure the correlation of each variable with the principal components. They inform us about what aspects of the data each component is measuring. The signs of the loadings are arbitrary, thus we can multiply them by -1 without changing the PCA. Sometimes this can be handy for describing the meaning of a principal component in a paper. For example, @Zbinden.2018 combined the number of hunting licenses, the duration of the hunting period and the number of black grouse cocks that were allowed to be hunted per hunter into a principal component in order to measure hunting pressure. All three variables had a negative loading on the first component, so that high values of the component meant low hunting pressure. Before the subsequent analyses, for which a measure of hunting pressure was of interest, the authors changed the signs of the loadings so that this component measured hunting pressure.
The proportion of variance explained by each component is, besides the loadings, an important piece of information. If the first few components explain the main part of the variance, maybe not all $k$ variables are necessary to describe the data, or, in other words, the original $k$ variables contain a lot of redundant information.
```{r}
# extract the variance captured by each component
summary(pca)
```
<font size="1">Ridge regression is similar to doing a PCA within a linear model, whereby components with low variance are shrunk to a higher degree than components with high variance.</font>
## Inferential statistics
### Uncertainty
> there is never a "yes-or-no" answer
> there will always be uncertainty
> [Amrhein (2017)](https://peerj.com/preprints/26857)
The decision whether an effect is important or not cannot be made based on data alone. For making a decision we should, besides the data, carefully consider the consequences of each decision, the aims we would like to achieve, and the risk, i.e. how bad it is to make the wrong decision. Structured decision making or decision analyses provide methods to combine consequences of decisions, objectives of different stakeholders and risk attitudes of decision makers to make optimal decisions [@Hemming2022; @Runge2020]. In most data analyses, particularly in basic research and when working on case studies, we normally do not consider consequences of decisions. However, the results will be more useful when presented in a way that other scientists can use them for a meta-analysis, or stakeholders and politicians can use them for making better decisions. Useful results always include information on the size of a parameter of interest, e.g. the effect of a drug or an average survival, together with an uncertainty measure.
Therefore, statistics describes patterns of the process that presumably generated the data, and quantifies the uncertainty of the described patterns that arises because the data are just a random sample from the larger population whose patterns we would like to know.
Quantification of uncertainty is only possible if:
1. the mechanisms that generated the data are known
2. the observations are a random sample from the population of interest
Most studies aim at understanding the mechanisms that generated the data, thus they are most likely not known beforehand. To overcome that problem, we construct models, e.g. statistical models, that are (strong) abstractions of the data generating process. And we report the model assumptions. All uncertainty measures are conditional on the model we used to analyze the data, i.e., they are only reliable, if the model describes the data generating process realistically. Because most statistical models do not describe the data generating process well, the true uncertainty almost always is much higher than the one we report.
In order to obtain a random sample from the population under study, a good study design is a prerequisite.
To illustrate how inference about a big population is drawn from a small sample, we here use simulated data. The advantage of using simulated data is that the mechanism that generated the data is known as well as the big population.
Imagine there are 300000 PhD students in the world and we would like to know how many statistics courses they have taken on average before they started their PhD (Fig. \@ref(fig:histtruesample)). We use random number generators (`rpois` and `rgamma`) to simulate a number for each of the 300000 virtual students. We use these 300000 numbers as the big population that, in real life, we almost never can sample in total. Normally, we know the number of courses only for a small sample of students. To simulate that situation we draw 12 numbers at random from the 300000 (R function `sample`). Then, we estimate the average number of statistics courses students take before starting a PhD from the sample of 12 students, and we compare that mean to the true mean of the 300000 students.
```{r}
# simulate the virtual true population
set.seed(235325) # set seed for random number generator
# simulate fake data of the whole population
# using an overdispersed Poisson distribution,
# i.e. a Poisson distribution of which the mean
# has a gamma distribution
statscourses <- rpois(300000, rgamma(300000, 2, 3))
# draw a random sample from the population
n <- 12 # sample size
y <- sample(statscourses, n, replace=FALSE)
```
```{r histtruesample, fig.cap='Histogram of the number of statistics courses that 300000 virtual PhD students have taken before starting their PhD. The rugs on the x-axis indicate the random sample of 12 out of the 300000 students. The black vertical line indicates the mean of the 300000 students (true mean) and the blue line indicates the mean of the sample (sample mean).', echo=FALSE}
par(mar=c(4,5,1,1))
# draw a histogram of the 300000 students
hist(statscourses, breaks=seq(-0.5, 10.5, by=1), main=NA,
xlab="Number of statistics courses", ylab="Number of students")
box(bty="l")
# add the sample
rug(jitter(y))
# add the mean of the sample
abline(v=mean(y), col="blue", lwd=2)
# add the true mean of the population
abline(v=mean(statscourses), lwd=2)
text(4, 150000, "Sample mean", col="blue", adj=c(0,1))
text(4, 120000, "True mean", adj=c(0,1))
```
We observe the sample mean; what do we know about the population mean? There are two different approaches to answer this question. 1) We could ask how much the sample mean would scatter if we repeated the study many times. This approach is called frequentist statistics. 2) We could ask, for any possible value, what the probability is that it is the true population mean. To do so, we use probability theory, and that is called Bayesian statistics.
Both approaches use (essentially similar) models. Only the mathematical techniques to calculate the uncertainty measures differ between the two approaches. In cases when, besides the data, no other information is used to construct the model, the results are approximately identical (at least for large enough sample sizes).
A frequentist 95% confidence interval (blue horizontal segment in Fig. \@ref(fig:CImean)) is constructed such that, if you were to (hypothetically) repeat the experiment or sampling many many times, 95% of the intervals constructed would contain the true value of the parameter (here the mean number of courses). From the Bayesian posterior distribution (pink in Fig. \@ref(fig:CImean)) we could construct a 95% interval (e.g., by using the 2.5% and 97.5% quantiles). This interval has traditionally been called credible interval. It can be interpreted that we are 95% sure that the true mean is inside that interval.
Both the confidence interval and the posterior distribution correspond to the statistical uncertainty of the sample mean, i.e., they measure how far away the sample mean could be from the true mean. In this virtual example, we know that the true mean is `r round(mean(statscourses),2)`, thus somewhere in the lower part of the 95% CI or in the lower quantiles of the posterior distribution. In real life, we do not know the true mean. The grey histogram in Fig. \@ref(fig:CImean) shows how the means of many different virtual samples of 12 students scatter around the true mean. The 95% interval of these virtual means corresponds to the 95% CI, and the variance of these virtual means corresponds to the variance of the posterior distribution. This virtual example shows that the posterior distribution and the 95% CI correctly measure the statistical uncertainty (variance, width of the interval); however, we never know exactly how far the sample mean is from the true mean.
```{r CImean, fig.cap='Histogram of means of repeated samples from the true population. The scatter of these means visualizes the true uncertainty of the mean in this example. The blue vertical line indicates the mean of the original sample. The blue segment shows the 95% confidence interval (obtained by frequentist methods) and the violet line shows the posterior distribution of the mean (obtained by Bayesian methods).',echo=FALSE, message=FALSE}
library(arm)
nsim <- 10000
# frequentist theoretical uncertainty of mean
# we draw many new samples from the true population
my <- numeric(nsim)
for(i in 1:nsim) my[i] <- mean(sample(statscourses, 12, replace=FALSE))
# Bayesian uncertainty of the mean
mod <- lm(y~1) # define the model (Normal(mu, sigma2))
bsim <- sim(mod, n.sim=nsim) # apply Bayes theorem
# frequentist uncertainty of the mean
mean_freqCI <- predict(mod, interval="confidence")[1,]
par(mar=c(4,4,0.1,6), mgp=c(2,0.5,0), tck=-0.01)
hist(my, main=NA, xlab="Mean(y)", freq=FALSE, ylim=c(0, 2.5), col="grey",
cex.axis=0.8, las=1)
bm <- density(bsim@coef[,1])
lines(bm$x, bm$y, lwd=2, col="violet")
abline(v= coef(mod), lwd=2, col="blue")
segments(mean_freqCI["lwr"], 0, mean_freqCI["upr"], lwd=3, col="blue")
text(0, 2.5, "True repetitions of the study", adj=c(0,1), cex=0.9, xpd=NA)
text(1.2, 2.2, "Sample mean with 95% CI", col="blue", adj=c(0,1), cex=0.9, xpd=NA)
text(1.2, 1.5, "Posterior distribution of the mean", col="violet", adj=c(0,1), cex=0.9, xpd=NA)
```
Uncertainty intervals only are reliable if the model is a realistic abstraction of the data generating process (or if the model assumptions are realistic).
Because both terms, confidence and credible interval, suggest that the interval indicates confidence or credibility, whereas the intervals actually show uncertainty, it has been suggested to rename the interval as compatibility or uncertainty interval [@Gelman.2019].
### Standard error
The standard error (SE) is, besides the uncertainty interval, an alternative way to measure uncertainty. It measures an average deviation of the sample mean from the (unknown) true population mean. The frequentist method for obtaining the SE is based on the central limit theorem. According to the central limit theorem, the sum of independent, not necessarily normally distributed random numbers is approximately normally distributed when the sample size is large enough (Chapter \@ref(distributions)). Because the mean is a sum (divided by a constant, the sample size), the distribution of many sample means can be assumed to be normal. The standard deviation (SD) of the many means is called the standard error (SE). It can be mathematically shown that the standard error SE equals the standard deviation SD of the sample divided by the square root of the sample size (a numeric check follows the list below):
- frequentist SE = SD/sqrt(n) = $\frac{\hat{\sigma}}{\sqrt{n}}$
- Bayesian SE: Using Bayesian methods, the SE is the SD of the posterior distribution.
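As a minimal sketch (the object names `mod_se` and `bsim_se` are just for illustration), both versions can be computed for the simulated sample `y` of 12 students from above:
```{r}
sd(y)/sqrt(length(y))                # frequentist SE of the mean
mod_se <- lm(y~1)                    # model of the mean
bsim_se <- sim(mod_se, n.sim=10000)  # simulate from the posterior
sd(bsim_se@coef[,1])                 # Bayesian SE: SD of the posterior of the mean
```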
It is very important to keep the difference between SE and SD in mind! The SD measures the scatter of the data, whereas the SE measures the statistical uncertainty of the mean (or of another estimated parameter, Fig. \@ref(fig:sesd)). The SD is a descriptive statistic describing a characteristic of the data, whereas the SE is an inferential statistic showing us how far away the sample mean possibly is from the true mean. When the sample size increases, the SE becomes smaller, whereas the SD does not change (given the added observations are drawn at random from the same big population as the ones already in the sample).
```{r sesd, fig.cap='Illustration of the difference between SD and SE. The SD measures the scatter in the data (sample, tickmarks on the x-axis). The SD is an estimate for the scatter in the big population (grey histogram, normally not known). The SE measures the uncertainty of the sample mean (in blue). The SE measures approximately how far, on average, the sample mean (blue) is from the true mean (brown).',echo=FALSE, message=FALSE}
par(mar=c(4,4,0.1,6), mgp=c(2,0.5,0), tck=-0.01, cex.lab=0.8, cex.axis=0.7)
hist(statscourses, breaks=seq(-0.5, 10.5, by=1), main=NA, freq=FALSE, ylim=c(0, 0.8),
las=1) # draw histogram of the population
rug(jitter(y), col="black", lwd=2) # add the sample
#hist(my, add=TRUE, col=rgb(0,0,0,0.5), freq=FALSE)
abline(v= coef(mod), lwd=2, col="blue")
#abline(v= coef(mod), lwd=2, lty=2, col="green")
segments(coef(mod)-sd(y), 0, coef(mod)+sd(y),
lwd=5, col="green", lend="butt")
segments(coef(mod)-summary(mod)$coef[2], 0, coef(mod)+summary(mod)$coef[2],
lwd=5, col="blue", lend="butt")
abline(v=mean(statscourses), col="brown")
#segments(mean(statscourses)-sd(statscourses), 1.5, mean(statscourses)+sd(statscourses), 1.5)
#segments(mean(statscourses)-sd(my), 1.49, mean(statscourses)+sd(my), 1.49, lwd=3)
#text(2, 1.5, "True mean with SD and SE", adj=c(0,1), cex=0.9, xpd=NA)
text(2, 0.2, "Estimated SD around the mean", adj=c(0,1), col="green", cex=0.9, xpd=NA)
text(2, 0.3, "Estimated mean with SE", adj=c(0,1), col="blue", cex=0.9, xpd=NA)
box()
```
## Bayes theorem and the common aim of frequentist and Bayesian methods
### Bayes theorem for discrete events
```{r, echo=FALSE}
tab <- table(dat$car, dat$birthday)
```
The Bayes theorem describes the probability of event A conditional on event B (the probability of A after B has already occurred) from the probability of B conditional on A and the two probabilities of the events A and B:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
Imagine, event A is "The person likes wine as a birthday present." and event B is "The person has no car.". The conditional probability of A given B is the probability that a person not owning a car likes wine. Answers from students about whether they have a car and what they like as a birthday present are summarized in Table \@ref(tab:winecar).
Table: (\#tab:winecar) Cross table of the student's birthday preference and car ownership.
car/birthday | flowers | wine | **sum** |
:------|:---------------|:---------------|:------------------|
no car | `r tab[1,1]` | `r tab[1,2]` | **`r sum(tab[1,])`** |
car | `r tab[2,1]` | `r tab[2,2]` | **`r sum(tab[2,])`** |
**sum** | **`r sum(tab[,1])`**| **`r sum(tab[,2])`**| **`r sum(tab)`** |
We can apply the Bayes theorem to obtain the probability that a student likes wine given that they have no car, $P(A|B)$. Let's assume that only the ones who prefer wine go together for a glass of wine at the bar after the statistics course. While they drink wine, they tell each other about their cars and so obtain the probability that a student who likes wine has no car, $P(B|A) = `r round(tab[1,2]/sum(tab[,2]),2)`$. During the statistics class the teacher asked the students about their car ownership and birthday preference. Therefore, they know that $P(A) =$ likes wine $= `r round(sum(tab[,2])/sum(tab),2)`$ and $P(B) =$ no car $= `r round(sum(tab[1,])/ sum(tab),2)`$. With this information, they can obtain the probability that a student likes wine given that they have no car, even if not all students without cars went to the bar: $P(A|B) = \frac{`r round(tab[1,2]/sum(tab[,2]),2)`*`r round(sum(tab[,2])/sum(tab),2)`}{`r round(sum(tab[1,])/ sum(tab),2)`} = `r round((tab[1,2]/sum(tab[,2]))*(sum(tab[,2])/sum(tab))/ (sum(tab[1,])/ sum(tab)), 2)`$.
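The same calculation can be written out in R, using the cross table `tab` computed above:
```{r}
pBgivenA <- tab[1,2]/sum(tab[,2])   # P(no car | likes wine)
pA <- sum(tab[,2])/sum(tab)         # P(likes wine)
pB <- sum(tab[1,])/sum(tab)         # P(no car)
pBgivenA*pA/pB                      # P(likes wine | no car), by the Bayes theorem
```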
### Bayes theorem for continuous parameters
When we use the Bayes theorem for analyzing data, then the aim is to make probability statements for parameters. Because most parameters are measured at a continuous scale we use probability density functions to describe what we know about them. The Bayes theorem can be formulated for probability density functions denoted with $p(\theta)$, e.g. for a parameter $\theta$ (for example probability density functions see Chapter \@ref(distributions)).
What we are interested in is the probability of the parameter $\theta$ given the data, i.e., $p(\theta|y)$. This probability density function is called the posterior distribution of the parameter $\theta$. Here is the Bayes theorem formulated for obtaining the posterior distribution of a parameter from the data $y$, the prior distribution of the parameter $p(\theta)$ and assuming a model for the data generating process. The data model is defined by the likelihood that specifies how the data $y$ is distributed given the parameters $p(y|\theta)$:
$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int p(y|\theta)p(\theta) d\theta}$
The probability of the data, $p(y)$, is also called the scaling constant because it is a constant. It is the integral of the likelihood times the prior over all possible values of the parameter(s) of the model.
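A minimal numeric sketch of this formula (a hypothetical grid approximation for a single parameter, not the method used later in the book): the posterior is proportional to likelihood times prior, and the summed product over a fine grid approximates the scaling constant $p(y)$.
```{r}
# hypothetical grid approximation of a posterior for a single mean theta
# data model: normal with known sigma=2; prior: normal(0, 10)
yy <- c(3.1, 2.5, 4.0)                       # some made-up data
theta_grid <- seq(-5, 10, length=1000)       # grid of possible parameter values
delta <- diff(theta_grid)[1]                 # grid spacing
lik <- sapply(theta_grid, function(t) prod(dnorm(yy, mean=t, sd=2)))
prior <- dnorm(theta_grid, mean=0, sd=10)
py <- sum(lik*prior)*delta                   # approximates the scaling constant p(y)
post <- lik*prior/py                         # posterior density on the grid
sum(post)*delta                              # the posterior integrates to 1
```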
### Estimating a mean assuming that the variance is known
For illustration, we first describe a simple (unrealistic) example for which it is almost possible, even for non-mathematicians, to follow the mathematical steps of solving the Bayes theorem. Even if we cannot follow all steps, this example illustrates the principle of how the Bayes theorem works for continuous parameters. The example is unrealistic because we assume that the variance $\sigma^2$ in the data $y$ is known.
We construct a data model by assuming that $y$ is normally distributed:
$p(y|\theta) = normal(\theta, \sigma)$, with $\sigma$ known. The function $normal$ defines the probability density function of the normal distribution (Chapter \@ref(distributions)).
The parameter, for which we would like to get the posterior distribution is $\theta$, the mean. We specify a prior distribution for $\theta$. Because the normal distribution is a conjugate prior for a normal data model with known variance, we use the normal distribution. Conjugate priors have nice mathematical properties (see Chapter \@ref(priors)) and are therefore preferred when the posterior distribution is obtained algebraically.
That is the prior:
$p(\theta) = normal(\mu_0, \tau_0)$
With the above data, data model and prior, the posterior distribution of the mean $\theta$ is defined by:
$p(\theta|y) = normal(\mu_n, \tau_n)$, where
$\mu_n= \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}}$ and
$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$
$\bar{y}$ is the arithmetic mean of the data. Because only this value is needed in order to obtain the posterior distribution, it is called the sufficient statistic.
From the mathematical formulas above and also from Fig. \@ref(fig:triplot) we see that the mean of the posterior distribution is a weighted average between the prior mean and $\bar{y}$ with weights equal to the precisions ($\frac{1}{\tau_0^2}$ and $\frac{n}{\sigma^2}$).
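A small numeric sketch of these two formulas, plugging in the values of the hypothetical example shown in Fig. \@ref(fig:triplot) (prior mean 0, prior variance 10, known variance 2, $n = 3$, $\bar{y} = 3$):
```{r}
mu0 <- 0; tau02 <- 10                 # prior mean and prior variance
sigma2 <- 2; n_ex <- 3; ybar <- 3     # known variance, sample size, sample mean
taun2 <- 1/(1/tau02 + n_ex/sigma2)              # posterior variance
mun <- (mu0/tau02 + n_ex*ybar/sigma2)*taun2     # posterior mean (weighted average)
c(mun, sqrt(taun2))                             # compare with the posterior in the figure
```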
```{r triplot, fig.cap='Hypothetical example showing the result of applying the Bayes theorem for obtaining a posterior distribution of a continuous parameter. The likelihood is defined by the data and the model, the prior expresses the knowledge about the parameter before looking at the data. Combining the two distributions using the Bayes theorem results in the posterior distribution.',echo=FALSE, results='hide', dpi=700}
library(blmeco)
par(mar=c(4,4,0.8,0.1), mgp=c(2,0.5,0), tck=-0.01)
triplot.normal.knownvariance(theta.data=3, variance.known=2, n=3,
prior.theta=0, prior.variance=10)
```
### Estimating the mean and the variance
We now move to a more realistic example, which is estimating the mean and the variance of a sample of weights of Snowfinches *Montifringilla nivalis* (Fig. \@ref(fig:ssp)). To analyze those data, a model with two parameters (the mean and the variance or standard deviation) is needed. The data model (or likelihood) is specified as $p(y|\theta, \sigma) = normal(\theta, \sigma)$.
```{r ssp, fig.align='center', echo=FALSE, fig.cap='Snowfinches stay above the treeline for winter. They come to feeders.'}
knitr::include_graphics('images/snowfinch2.JPG', dpi = 150)
```
```{r,echo=TRUE}
# weight (g)
y <- c(47.5, 43, 43, 44, 48.5, 37.5, 41.5, 45.5)
n <- length(y)
```
Because there are two parameters, we need to specify a two-dimensional prior distribution. We looked up in @Gelman2014 that the conjugate prior distribution in our case is a Normal-Inverse-Chisquare distribution:
$p(\theta, \sigma) = N-Inv-\chi^2(\mu_0, \sigma_0^2/\kappa_0; v_0, \sigma_0^2)$
From the same reference we looked up what the posterior distribution looks like in our case:
$p(\theta,\sigma|y) = \frac{p(y|\theta, \sigma)p(\theta, \sigma)}{p(y)} = N-Inv-\chi^2(\mu_n, \sigma_n^2/\kappa_n; v_n, \sigma_n^2)$, with
$\mu_n= \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y}$,
$\kappa_n = \kappa_0+n$,
$v_n = v_0 +n$,
$v_n\sigma_n^2=v_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0n}{\kappa_0+n}(\bar{y}-\mu_0)^2$
For this example, we need the arithmetic mean $\bar{y}$ and the variance $s^2$ of the sample for obtaining the posterior distribution. Therefore, these two statistics are the sufficient statistics.
The above formulas look intimidating, but we never really do these calculations by hand. In most cases we let `R` do that for us by simulating many numbers from the posterior distribution, e.g., using the function `sim` from the package arm [@Gelman.2007]. In the end, we can visualize the distribution of these many numbers to have a look at the posterior distribution.
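For the Snowfinch weights, this boils down to two lines of code; the sketch below (not evaluated here) is essentially what is done behind the scenes for Fig. \@ref(fig:jointpdist):
```{r, eval=FALSE}
library(arm)
mod <- lm(y~1)                  # data model: normal(theta, sigma)
bsim <- sim(mod, n.sim=20000)   # draw 20000 values from the joint posterior
head(bsim@coef)                 # simulated values of the mean theta
head(bsim@sigma)                # simulated values of the standard deviation sigma
```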
In Fig. \@ref(fig:jointpdist) the two-dimensional $(\theta, \sigma)$ posterior distribution is visualized using simulated values. The two-dimensional distribution is called the joint posterior distribution. The mountain of dots in Fig. \@ref(fig:jointpdist) visualizes the Normal-Inverse-Chisquare distribution that we calculated above. When all values of one parameter are displayed in a histogram, ignoring the values of the other parameter, this is called the marginal posterior distribution. Algebraically, the marginal distribution is obtained by integrating one of the two parameters out of the joint posterior distribution. This step is definitively much easier when simulated values from the posterior distribution are available!
```{r jointpdist, fig.cap='Visualization of the joint posterior distribution for the mean and standard deviation of Snowfinch weights. The lower left panel shows the two-dimensional joint posterior distribution, whereas the upper and right panel show the marginal posterior distributions of each parameter separately.', echo=FALSE, fig.height=7}
mod <- lm(y~1)
bsim <- sim(mod, n.sim=20000)
xrange <- range(bsim@coef)
yrange <- range(bsim@sigma)
layout(matrix(c(2,0,1,3), ncol=2, byrow=TRUE), heights=c(1,2),
width=c(2,1))
par(mar=c(5.1, 4.1, 0.2, 0.2), cex.lab=0.8, cex.axis=0.8)
plot(bsim@coef, bsim@sigma, xlab=expression(theta), ylab=expression(sigma), las=1, cex=0.5, xlim=xrange, ylim=yrange, col=rgb(0,0,0,0.05), pch=16)
par(mar=c(0, 4.1, 1, 0.2))
histtheta <- hist(bsim@coef, plot=FALSE)
plot(histtheta$breaks, seq(0, max(histtheta$density)+0.01, length=
length(histtheta$breaks)), type="n", xaxt="n", xlim = xrange,
las=1, ylab = "Density", yaxs="i")
rect(histtheta$breaks[1:(length(histtheta$breaks)-1)], 0,
histtheta$breaks[2:(length(histtheta$breaks))], histtheta$density,
col=grey(0.5))
mtext(paste(expression(p(theta|y)), "= t-distribution"), side=3, adj=0, cex=0.6)
par(mar=c(5.1, 0, 0.2, 0.2))
histsigma<- hist(bsim@sigma, plot=FALSE)
plot(seq(0, max(histsigma$density)+0.01, length=length(histsigma$breaks)),
histsigma$breaks, type="n", yaxt="n", ylim=yrange,
xlab="Density", xaxs="i",
xaxt="n")
axis(1, at=seq(0, 0.4, by=0.1), labels=c(NA, 0.1, 0.2, 0.3, 0.4))
rect(0, histsigma$breaks[2:(length(histsigma$breaks))], histsigma$density,
histsigma$breaks[1:(length(histsigma$breaks)-1)],col=grey(0.5))
text(0.04, 18, "p(sigma|y) = \nInv-Chisq-dist.", adj=c(0,1), cex=0.6)
```
The marginal posterior distribution of each parameter is what we normally report in a paper to describe the statistical uncertainty.
In our example, the marginal distribution of the mean is a t-distribution (Chapter \@ref(distributions)). Frequentist statistical methods also use a t-distribution to describe the uncertainty of an estimated mean when the variance is not known. Thus, frequentist methods arrive at the same solution using a completely different approach and different techniques. Doesn't that dramatically increase our trust in statistical methods?
## Classical frequentist tests and alternatives
### Null hypothesis testing
Null hypothesis testing means constructing a model that describes how the data would look if what we expect to be were not true. Then, the data are compared to how this null model thinks the data should look. If the data do not look like the model thinks they should, we reject that model and accept that our expectation may be true.
To decide whether the data look like the null model thinks the data should look, the p-value is used. The p-value is the probability of observing the data or more extreme data given the null hypothesis is true.
Small p-values indicate that it is rather unlikely to observe the data or more extreme data given the null hypothesis $H_0$ is true.
Null hypothesis testing is problematic. We discuss some of the problems after having introduced the most commonly used classical tests.
### Comparison of a sample with a fixed value (one-sample t-test)
In some studies, we would like to compare the data to a theoretical value. The theoretical value is a fixed value, e.g. calculated based on physical, biochemical, ecological or any other theory. The statistical task is then to compare the mean of the data including its uncertainty with the theoretical value. The result of such a comparison may be an estimate of the mean of the data with its uncertainty or an estimate of the difference of the mean of the data to the theoretical value with the uncertainty of this difference.
<!-- fk: CONTINUE HERE: include a figure that shows the data in a box and the fixed value and the mean with CI, and one only showing data with stars-->
For example, a null hypothesis could be $H_0:$"The mean of Snowfinch weights is exactly 40g."
A normal distribution with a mean of $\mu_0=40$ and a variance equal to the estimated variance in the data $s^2$ is then assumed to describe how we would expect the data to look if the null hypothesis were true. From that model it is possible to calculate the distribution of hypothetical means of many different hypothetical samples of sample size $n$. The result is a t-distribution (Fig. \@ref(fig:nht)). In classical tests, the distribution is standardized so that its variance is one. Then the sample mean, or in classical tests a standardized difference between the sample mean and the hypothetical mean of the null hypothesis (here 40g), called the test statistic $t = \frac{\bar{y}-\mu_0}{\frac{s}{\sqrt{n}}}$, is compared to that (standardized) t-distribution. If the test statistic falls well within the expected distribution, the null hypothesis is accepted: the data are well compatible with the null hypothesis. However, if the test statistic falls in the tails or outside the distribution, then the null hypothesis is rejected and we could write that the mean weight of Snowfinches is statistically significantly different from 40g. Unfortunately, we cannot infer the probability of the null hypothesis nor how relevant the result is.
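As a minimal sketch, the test statistic and the two-sided p-value can be computed by hand for the Snowfinch weights `y` introduced above; the result should match the output of `t.test` shown further below.
```{r}
tval <- (mean(y) - 40)/(sd(y)/sqrt(n))   # test statistic t
2*pt(-abs(tval), df=n-1)                 # two-sided p-value
```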
```{r nht, fig.cap='Illustration of a one-sample t-test. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the t-distribution that shows how hypothetical sample means are expected to be distributed if the big population of Snowfinches has a mean weight of 40g (i.e., if the null hypothesis was true). The orange area shows the part of the t-distribution that lies equally far or farther away from 40g than the sample mean. The orange area is the p-value.', fig.width=8, fig.height=5, fig.align='center',echo=FALSE}
# use density function for t-distribution as given in Gelman et al. (2014) BDA-book
dt_BDA <- function(x,v,m,s) gamma((v+1)/2)/(gamma(v/2)*sqrt(v*pi)*s)*(1+1/v*((x-m)/s)^2)^(-(v+1)/2)
par(mar=c(4.5, 5, 2, 2))
hist(y, col="blue", xlim=c(30, 52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3))
abline(v=mean(y), lwd=3, col="lightblue")
abline(v=40, lwd=2)
text(41, 0.31, "H0", xpd=NA)
text(30, 0.25, "t(7, 40, SE(mean(y)))", adj=c(0,0), cex=0.8)
x <- seq(20, 60, length=100)
dy <- dt_BDA(x, v=7, m=40, s=sd(y)/sqrt(n))
lines(x, dy)
index <- x>=mean(y)
polygon(c(x[index], rev(x[index])), c(rep(0, sum(index)), rev(dy[index])), border=NA, col="orange")
index <- x<= 40- (mean(y)-40)
polygon(c(x[index], rev(x[index])), c(rep(0, sum(index)), rev(dy[index])), border=NA, col="orange")
```
We can use the R function `t.test` to calculate the p-value of a one-sample t-test.
```{r,echo=TRUE}
t.test(y, mu=40)
```
The output of the R function `t.test` also includes the mean and the 95% confidence interval (or compatibility or uncertainty interval) of the mean. The CI could, alternatively, be obtained as the 2.5% and 97.5% quantiles of a t-distribution with a mean equal to the sample mean, degrees of freedom equal to the sample size minus one, and a standard deviation equal to the standard error of the mean.
```{r}
# lower limit of 95% CI
mean(y) + qt(0.025, df=length(y)-1)*sd(y)/sqrt(n)
# upper limit of 95% CI
mean(y) + qt(0.975, df=length(y)-1)*sd(y)/sqrt(n)
```
When applying the Bayes theorem for obtaining the posterior distribution of the mean we end up with the same t-distribution as described above, in case we use flat prior distributions for the mean and the standard deviation. Thus, the two different approaches end up with the same result!
```{r, fig.cap='Illustration of the posterior distribution of the mean. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the posterior distribution that shows what we know about the mean after having looked at the data. The area under the posterior density function that is larger than 40 is the posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g.',echo=TRUE, fig.height=6}
par(mar=c(4.5, 5, 2, 2))
hist(y, col="blue", xlim=c(30,52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3))
abline(v=mean(y), lwd=2, col="lightblue")
abline(v=40, lwd=2)
lines(density(bsim@coef))
text(45, 0.3, "posterior distribution\nof the mean of y", cex=0.8, adj=c(0,1), xpd=NA)
```
The posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g, $P(H:\mu>40)$, is equal to the proportion of simulated random values from the posterior distribution, saved in the vector `bsim@coef`, that are larger than 40.
```{r}
# Two ways of calculating the proportion of values
# larger than a specific value within a vector of values
round(sum(bsim@coef[,1]>40)/nrow(bsim@coef),2)
round(mean(bsim@coef[,1]>40),2)
# Note: logical values TRUE and FALSE become
# the numeric values 1 and 0 within the functions sum() and mean()
```
We, thus, can be pretty sure that the mean Snowfinch weight (in the big world population) is larger than 40g. Such a conclusion is not very informative, because it does not tell us how much larger we can expect the mean Snowfinch weight to be. Therefore, we prefer reporting a credible interval (or compatibility interval or uncertainty interval) that tells us what values for the mean Snowfinch weight are compatible with the data (given the data model we used realistically reflects the data generating process). Based on such an interval, we can conclude that we are pretty sure that the mean Snowfinch weight is between 40 and 48g.
```{r}
# 80% credible interval, compatibility interval, uncertainty interval
quantile(bsim@coef[,1], probs=c(0.1, 0.9))
# 95% credible interval, compatibility interval, uncertainty interval
quantile(bsim@coef[,1], probs=c(0.025, 0.975))
# 99% credible interval, compatibility interval, uncertainty interval
quantile(bsim@coef[,1], probs=c(0.005, 0.995))
```
### Comparison of the locations between two groups (two-sample t-test)
Many research questions aim at measuring differences between groups. For example, we could be curious to know how different in size car owners are from people not owning a car.
A boxplot is a nice possibility to visualize the ell length measurements of two (or more) groups (Fig. \@ref(fig:boxplt)). From the boxplot, we do not see how many observations are in the two samples. We can add that information to the plot. The boxplot visualizes the samples but it does not show what we know about the big (unmeasured) population mean. To show that, we need to add a compatibility interval (or uncertainty interval, credible interval, confidence interval, in brown in Fig. \@ref(fig:boxplt)).
```{r boxplt, fig.cap='Ell length of car owners (Y) and people not owning a car (N). Horizontal bar = median, box = interquartile range, whiskers = most extreme observation within 1.5 times the interquartile range from the quartile, circles = observations farther than 1.5 times the interquartile range from the quartile. Filled brown circles = means, vertical brown bars = 95% compatibility interval.',echo=FALSE}
par(mar=c(4.5,4.5,2,0.2))
boxplot(ell~car, data=dat, las=1, ylab="Length of ell (cm)", col="grey", boxwex=0.2,
ylim=c(35, 50))
means <- tapply(dat$ell, dat$car, mean)
semeans <- tapply(dat$ell, dat$car, function(x) sd(x)/sqrt(length(x)))
segments(c(1.2,2.2), means-2*semeans, c(1.2,2.2), means+2*semeans, lwd=2, col="brown")
points(c(1.2,2.2), means, pch=21, bg="brown")
text(x=c(1,2), y=c(36, 36), labels=paste("n=", table(dat$car)))
```
When we add the two means with a compatibility interval, we see what we know about the two means, but we still do not see what we know about the difference between the two means. The uncertainties of the means do not show the uncertainty of the difference between the means. To see that, we need to extract the difference between the two means from a model that describes (abstractly) how the data have been generated. Such a model is a linear model, which we will introduce in Chapter \@ref(lm). Its second parameter measures the difference in the means of the two groups, and from the simulated posterior distribution we can extract a 95% compatibility interval for this difference.
Thus, we can conclude that the average ell length of car owners is, with high probability, between 0.5 cm smaller and 2.5 cm larger than the average ell length of people not owning a car.
```{r, echo=TRUE}
mod <- lm(ell~car, data=dat)
mod
bsim <- sim(mod, n.sim=nsim)
quantile(bsim@coef[,"carY"], prob=c(0.025, 0.5, 0.975))
```
The corresponding two-sample t-test gives a p-value for the null hypothesis "The difference between the two means equals zero.", a confidence interval for the difference, and the two means. While the function `lm` gives the difference Y minus N, the function `t.test` gives the difference N minus Y. Luckily, the two means are also given in the output, so we know which group mean is the larger one.
```{r, echo=TRUE}
t.test(ell~car, data=dat, var.equal=TRUE)
```
For both possibilities we used to compare the two means, the Bayesian posterior distribution of the difference and the t-test (or the confidence interval of the difference), we used a data model. We thus assumed that the observations are normally distributed. In some cases, such an assumption is not reasonable, and then the result is not reliable. In such cases, we can either search for a more realistic model or use non-parametric (also called distribution-free) methods. Nowadays, we have almost infinite possibilities to construct data models (e.g. generalized linear models and beyond). Therefore, we normally start looking for a model that fits the data better. However, in former days, all these possibilities did not exist (or were not easily available for non-mathematicians). Therefore, we here introduce two such non-parametric methods, the Wilcoxon-test (or Mann-Whitney-U-test) and the randomisation test.
Some of the distribution-free statistical methods are based on the ranks instead of the values of the observations. The principle of the Wilcoxon-test is to rank the observations and sum the ranks per group. It is not completely true that the non-parametric methods do not have a model. The model of the Wilcoxon-test "knows" how the difference in the sum of the ranks between two groups is distributed given the means of the two groups do not differ (null hypothesis). Therefore, it is possible to get a p-value, e.g. by the function `wilcox.test`.
```{r, echo=TRUE}
wilcox.test(ell~car, data=dat)
```
The note in the output tells us that ranking is ambiguous when some values are equal. Equal values that should be ranked are called ties.
The result of the Wilcoxon-test tells us how probable it is to observe the difference in the rank sum between the two samples, or a more extreme difference, given the means of the two groups are equal. That is at least something.
A similar result is obtained by using a randomisation test. This test is not based on ranks but on the original values. The aim of the randomisation is to simulate a distribution of the difference in the arithmetic mean between the two groups assuming this difference would be zero. To do so, the observed values are randomly distributed among the two groups. Because of the random distribution among the two groups, we expect that, if we repeat that virtual experiment many times, the average difference between the group means would be zero (both virtual samples are drawn from the same big population).
We can use a loop in R for repeating the random re-assignment to the two groups and, each time, extracting the difference between the group means. As a result, we have a vector of many (`nsim`) values that all are possible differences between the group means given the two samples were drawn from the same population. The proportion of these values that have an equal or larger absolute value than the observed difference gives the probability that the observed or a larger difference between the group means is observed given the null hypothesis is true; thus, that is a p-value.
```{r, echo=TRUE}
diffH0 <- numeric(nsim)
for(i in 1:nsim){
randomcars <- sample(dat$car)
rmod <- lm(ell~randomcars, data=dat)
diffH0[i] <- coef(rmod)["randomcarsY"]
}
mean(abs(diffH0)>abs(coef(mod)["carY"])) # p-value
```
Visualizing the possible differences between the group means given the null hypothesis were true shows that the observed difference is well within what is expected if the two groups did not differ in their means (Fig. \@ref(fig:ranhist)).
```{r ranhist, fig.cap='Histogram of differences between the means of randomly assigned groups (grey) and the difference between the means of the two observed groups (red).', echo=FALSE}
par(mar=c(5.5, 4.5, 0.5, 0), cex=0.8)
hist(diffH0, main=NA); abline(v=coef(mod)[2], lwd=2, col="red")
```
The randomization test results in a p-value, and we could also report the observed difference between the group means. However, it does not tell us which values of the difference would all be compatible with the data. We do not get an uncertainty measure for the difference.
In order to get a compatibility interval without assuming a distribution for the data (thus non-parametric) we could bootstrap the samples.
Bootstrapping is sampling observations from the data with replacement. For example, if we have a sample of 8 observations, we draw 8 times a random observation from these 8 observations, each time assuming that all 8 observations are available. Thus a bootstrapped sample may include some observations several times, whereas others are missing. In this way, we simulate the variance in the data that arises because our data do not contain the whole big population.
Also bootstrapping can be programmed in R using a loop.
```{r, echo=TRUE}
diffboot <- numeric(nsim)
for(i in 1:nsim){
ngroups <- 1
while(ngroups==1){
bootrows <- sample(1:nrow(dat), replace=TRUE)
ngroups <- length(unique(dat$car[bootrows]))
}
rmod <- lm(ell~car, data=dat[bootrows,])
diffboot[i] <- coef(rmod)[2]
}
quantile(diffboot, prob=c(0.025, 0.975))
```
The resulting values for the difference between the two group means can be interpreted as the distribution of those differences, if we had repeated the study many times (Fig. \@ref(fig:histboot)).
A 95% interval of the distribution corresponds to a 95% compatibility interval (or confidence interval or uncertainty interval).
```{r histboot, fig.cap='Histogram of the bootstrapped differences between the group means (grey) and the observed difference (red vertical line).', echo=TRUE}
hist(diffboot); abline(v=coef(mod)[2], lwd=2, col="red")
```
For both methods, the randomisation test and bootstrapping, we have to assume that all observations are independent. Randomisation and bootstrapping become complicated or even unfeasible when data are structured.
## Summary
Bayesian data analysis is applying the Bayes theorem for summarizing knowledge based on data, priors and the model assumptions.
Frequentist statistics is quantifying uncertainty by hypothetical repetitions.