```{r ch040-setup, include=FALSE}
library(FangPsychometric)
library(tidyverse)
library(kableExtra)
library(patchwork)
library(rstan)
library(bayesplot)
library(loo)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
av_dat <- local({
dat <- audiovisual_binomial %>%
filter(block %in% c("baseline", "adapt1"),
rid != "av-adapt1-O-f-CE") %>%
mutate(x = soa / 1000,
rid = factor(rid),
sid = factor(sid),
block = factor(block)) %>%
as.list()
dat$N <- length(dat$x)
dat$N_G <- length(levels(dat$age_group))
dat$N_S <- length(levels(dat$sid))
dat$N_T <- length(levels(dat$trial))
dat
})
vis_dat <- local({
dat <- visual_binomial %>%
filter(block %in% c("baseline", "adapt1")) %>%
mutate(x = soa / 1000,
rid = factor(rid),
sid = factor(sid),
block = factor(block)) %>%
as.list()
dat$N <- length(dat$x)
dat$N_G <- length(levels(dat$age_group))
dat$N_S <- length(levels(dat$sid))
dat$N_T <- length(levels(dat$trial))
dat
})
logit <- function(p) qlogis(p)
inv_logit <- function(x) plogis(x)
logistic <- function(x) inv_logit(x)
fn <- function(x, a, b) logistic(b * (x - a))
stan_summary <- function(object, ...) {
round(summary(object, ...)$summary, 4)[,c(1,2,3,4,8,9,10)]
}
```
# Bayesian Multilevel Modeling of the Psychometric Function {#application}
Now we apply the workflow to our data set, seeking to build a model that satisfies the four criteria in [chapter 2](#methods). During model refinement, we pick up on the regular features of the data while assessing the significance of covariates through predictive comparison. In our final model, we account for well-known data quality issues (from participant lapses in judgment) and show that doing so improves model prediction.
As we iterate through our model development workflow, we have a preference for multilevel models (for the reasons discussed in [chapter 2](#methods)). Hierarchical models are a specific kind of multilevel model where one or more groups are nested within a larger one. In the case of the psychometric data, there are three age groups, and within each age group are individual subjects.
## Modeling psychometric quantities {#psych-quant}
Before formally conducting the workflow, we motivate our choice of using the logistic function to model the psychometric function as well as our choice for the parameterization of the linear predictor. Three common sigmoid functions are displayed in figure \@ref(fig:ch040-pf-assortment).
```{r ch040-pf-assortment, fig.cap="Assortment of psychometric functions."}
ggplot(data.frame(x = c(-6, 6), y = c(0, 1)), aes(x, y)) +
stat_function(aes(linetype = "Logistic"), fun = plogis) +
stat_function(aes(linetype = "Gaussian"), fun = pnorm) +
stat_function(aes(linetype = "Weibull"),
fun = pweibull, args = list(shape = 3)) +
labs(title = "Assortment of Psychometric Functions") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.title = element_blank(),
legend.position = c(0.17, 0.75))
```
The Weibull psychometric function is more common when it comes to 2-alternative forced choice (2-AFC) psychometric experiments where the independent variable is a stimulus intensity (non-negative) and the goal is signal detection. The data in this paper includes both positive and negative SOA values, so the Weibull is not a natural choice. Our first choice is the logistic function as it is the canonical choice for Binomial count data. The data in this study are exchangeable, meaning that the label of a positive response can be swapped with the label of a negative response and the inferences would remain the same. Since there is no natural ordering, it makes more sense for the psychometric function to be symmetric, e.g. the logistic function and Gaussian CDF. We use symmetric loosely to mean that the probability density function (PDF) has zero skewness. In practice, there is little difference in inferences between the logit and probit links, but computationally the logit link is more efficient.
It is worth providing some additional background on GLMs and their role in modeling psychometric functions. A GLM relates the linear model to the outcome variable via a link function. An example of this is the logit link -- the inverse of the logistic function. The logistic function, $F$, takes $x \in \mathbb{R}$ and constrains the output to be in $(0, 1)$.
\begin{equation}
F(\theta) = \frac{1}{1 + \exp\left(-\theta\right)}
(\#eq:logistic)
\end{equation}
Since $F$ is a strictly increasing and continuous function, it has an inverse, and the link for \@ref(eq:logistic) is the log-odds or logit function.
\begin{equation}
F^{-1}(\pi) = \mathrm{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right)
(\#eq:logit)
\end{equation}
By taking $(F^{-1} \circ F)(\theta)$ we can arrive at a relationship that is linear in $\theta$.
\setstretch{1.0}
\begin{align*}
\pi = F(\theta) \Longleftrightarrow F^{-1}(\pi) &= F^{-1}(F(\theta)) \\
&= \ln\left(\frac{F(\theta)}{1 - F(\theta)}\right) \\
&= \ln(F(\theta)) - \ln(1 - F(\theta)) \\
&= \ln\left(\frac{1}{1 + \exp(-\theta)}\right) - \ln\left(\frac{\exp(-\theta)}{1 + \exp(-\theta)}\right) \\
&= - \ln(1 + \exp(-\theta)) - \ln(\exp(-\theta)) + \ln(1 + \exp(-\theta)) \\
&= - \ln(\exp(-\theta)) \\
&= \theta
\end{align*}
\setstretch{2.0}
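Numerically, the logistic function and its inverse are available in `R` as the built-ins `plogis` and `qlogis`, and a quick sanity check confirms the identity above:

\setstretch{1.0}
```{r ch040-logit-identity, echo=TRUE}
theta <- seq(-4, 4, by = 0.5)
all.equal(qlogis(plogis(theta)), theta)  # the logit undoes the logistic
```
\setstretch{2.0}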
The motivation for this background is to show that a model for the psychometric function can be specified using a linear predictor, $\theta$. Given a simple slope-intercept model, the linear predictor would typically be written as:
\begin{equation}
\theta = \alpha + \beta x
(\#eq:linearform1)
\end{equation}
This isn't the only possible form; it could be written in the slope-location parameterization:
\begin{equation}
\theta = \beta(x - \alpha)
(\#eq:linearform2)
\end{equation}
Both parameterizations describe the same geometry, so why should it matter which form is chosen? The interpretation of the parameters changes between the two models, and the reason it matters becomes clear when we consider how the linear model relates back to the physical quantities that the psychometric model describes. Take equation \@ref(eq:linearform1), substitute it into \@ref(eq:logistic), and then take the logit of both sides:
\begin{equation}
\mathrm{logit}(\pi) = \alpha+\beta x
(\#eq:pfform1)
\end{equation}
Now recall that the PSS is defined as the SOA value such that the response probability, $\pi$, is $0.5$. Substituting $\pi = 0.5$ into \@ref(eq:pfform1) and solving for $x$ yields:
$$pss = -\frac{\alpha}{\beta}$$
Similarly, the JND is defined as the difference between the SOA value at the 84% level and the PSS. Substituting $\pi = 0.84$ into \@ref(eq:pfform1), solving for $x$, and subtracting off the PSS yields:
\begin{equation}
jnd = \frac{\mathrm{logit}(0.84)}{\beta}
(\#eq:jnd1)
\end{equation}
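As a concrete sketch of these two formulas, the PSS and JND can be computed in `R` from hypothetical slope-intercept values (the numbers below are for illustration only):

\setstretch{1.0}
```{r ch040-pss-jnd-example, echo=TRUE}
alpha <- 0.25  # hypothetical intercept
beta  <- 10    # hypothetical slope
c(pss = -alpha / beta, jnd = qlogis(0.84) / beta)
```
\setstretch{2.0}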
From the conceptual analysis, it is easy to define priors for the PSS and JND, but then how does one set the priors for $\alpha$ and $\beta$? Let's say the prior for the just noticeable difference is $jnd \sim \pi_j$. Then the prior for $\beta$ would be:
$$\beta \sim \frac{\mathrm{logit}(0.84)}{\pi_j}$$
The log-normal distribution has a nice property where its multiplicative inverse is still a log-normal distribution. If we let $\pi_j = \mathrm{Lognormal}(\mu, \sigma^2)$, then $\beta$ would be distributed as:
$$
\beta \sim \mathrm{Lognormal}(-\mu + \ln(\mathrm{logit}(0.84)), \sigma^2)
$$
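This reciprocal-and-scaling property is easy to verify by simulation; the sketch below uses arbitrary values $\mu = -2$ and $\sigma = 1$:

\setstretch{1.0}
```{r ch040-lognormal-reciprocal, echo=TRUE}
set.seed(1)
mu <- -2; sigma <- 1                 # arbitrary values for illustration
jnd  <- rlnorm(1e5, mu, sigma)
beta <- qlogis(0.84) / jnd
# the log of beta should have mean -mu + log(logit(0.84)), about 2.51
c(simulated = mean(log(beta)), analytic = -mu + log(qlogis(0.84)))
```
\setstretch{2.0}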
Such a log-normal prior is acceptable for the slope, as $\beta$ must always be positive for this psychometric data and a log-normal distribution constrains the support to the positive real numbers. Next suppose that the prior distribution for the PSS is $pss \sim \pi_p$. Then the prior for $\alpha$ is:
$$\alpha \sim -\pi_p \cdot \beta$$
If $\pi_p$ is set to a log-normal distribution as well, then $\pi_p \cdot \beta$ would also be log-normal, but there is still the problem of the negative sign. Since a log-normal $\pi_p$ only takes positive values, $\alpha$ would always be negative, and in turn the PSS ($-\alpha/\beta$) would always be positive, which is certainly not always true. Furthermore, we don't want to _a priori_ put more weight on positive PSS values compared to negative ones.
Let's now consider using equation \@ref(eq:linearform2) and repeat the above process.
\begin{equation}
\mathrm{logit}(\pi) = \beta(x - \alpha)
(\#eq:pfform2)
\end{equation}
The just noticeable difference is still given by \@ref(eq:jnd1), and so the same method for choosing a prior can be used. However, the PSS is now given by:
$$pss = \alpha$$
This is a fortunate consequence of using \@ref(eq:linearform2): the JND depends only on $\beta$ and the PSS depends only on $\alpha$, so $\alpha$ can be interpreted directly as the PSS of the estimated psychometric function. It also permits a prior for $\alpha$ that is symmetric around $0$, such as a Gaussian distribution.
This also highlights the benefit of using a modeling language like `Stan` over others. For fitting GLMs in `R`, one can use `stats::glm` which utilizes MLE, or others such as `rstanarm::stan_glm` and `arm::bayesglm` that use Bayesian methods [@R-rstanarm; @R-arm]. Each of these functions requires the linear predictor to be in the form of \@ref(eq:linearform1). The `stan_glm` function uses Stan in the back-end to fit a model, but is limited to priors from the Student-t family of distributions. By writing the model directly in `Stan`, the linear model can be parameterized in any way and with any prior distribution, and so allows for much more expressive modeling.
With the consideration for the choice of sigmoid function and linear parameterization complete, we begin to develop a multilevel model for the psychometric function.
## Iteration 1: base model {#iter1}
**Pre-Model, Pre-Data**
_Conceptual Analysis_
In section \@ref(toj-task) we discussed the experimental setup and data collection. Subjects are presented with two stimuli separated by some temporal delay, and they are asked to respond as to their perception of the temporal order. There are 45 subjects with 15 each in the young, middle, and older age groups. As the SOA becomes larger in the positive direction, subjects are expected to give more "positive" responses, and as the SOA becomes larger in the negative direction, more "negative" responses are expected. Given the way the experiment and responses are constructed, we do not expect to see a reversal of this trend unless a subject misunderstood the directions given to them or there was an error in the recording process.
After the first experimental block the subjects go through a temporal recalibration period, and repeat the experiment. The interest is in seeing if the recalibration has an effect on temporal sensitivity and perceptual synchrony, and if the effect is different for each age group.
_Define Observational Space_
The response that subjects give during a TOJ task is recorded as a zero or a one, and their relative performance is determined by the SOA value. Let $y$ represent the binary outcome of a trial and let $x$ be the SOA value.
\setstretch{1.0}
\begin{align*}
y_i &\in \lbrace 0, 1\rbrace \\
x_i &\in \mathbb{R}
\end{align*}
\setstretch{2.0}
If the SOA values are fixed as in the audiovisual task, then the responses can be aggregated into Binomial counts, $k$.
$$k_i, n_i \in \mathbb{Z}_0^+, k_i \le n_i$$
In the above expression, $\mathbb{Z}_0^+$ represents the set of non-negative integers. Notice that the number of trials $n$ has an index variable $i$. This is because the number of trials per SOA is not fixed between blocks. In the pre-adaptation block, there are five trials per SOA compared to three in the post-adaptation block. So if observation $32$ is recorded during a "pre" block, $n_{32} = 5$, and if observation $1156$ is during a "post" block, $n_{1156} = 3$.
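This aggregation is mechanical: trial-level Bernoulli responses can be rolled up into Binomial counts with a grouped summary. The sketch below assumes a hypothetical data frame `trials` with columns `soa` and `response`:

\setstretch{1.0}
```{r ch040-aggregate-sketch, echo=TRUE, eval=FALSE}
trials %>%
  group_by(soa) %>%
  summarise(k = sum(response),  # number of positive responses
            n = n())            # number of trials at this SOA
```
\setstretch{2.0}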
Then there are three categorical variables: age group, subject ID, and trial (block). The first two are treated as factor variables (also known as index or categorical variables). Rather than using one-hot encoding or dummy variables, the age levels are left as categories, and a coefficient is fit for each level. If dummy variables were used for all 45 subjects, there would be 44 dummy variables to work with, multiplied by the number of coefficients that make estimates at the subject level. The number of parameters in the model grows rapidly as the model complexity grows.
Age groups and individual subjects can be indexed in the same way that the number of trials is indexed. $S_i$ refers to the subject in record $i$, and similarly $G_i$ refers to the age group of that subject. Observation $63$ is for record ID `r FangPsychometric::audiovisual_binomial$rid[63]`, so then $S_{63}$ is `r FangPsychometric::audiovisual_binomial$sid[63]` and $G_{63}$ is `r FangPsychometric::audiovisual_binomial$age_group[63]`. Under the hood of `R`, these factor levels are represented as integers (e.g. middle age group level is stored internally as the number 2).
\setstretch{1.0}
```{r ch041-Olive Longitude, echo=TRUE}
(x <- factor(c("a", "a", "b", "c")))
storage.mode(x)
```
\setstretch{2.0}
This data storage representation can later be exploited for the `Stan` model.
The pre- and post-adaptation categories are treated as a binary indicator referred to as $trt$ (short for treatment) since there are only two levels in the category. In this setup, a value of $1$ indicates a post-adaptation block. This encoding is chosen over the reverse because the pre-adaptation block is like the baseline performance, and it is more appropriate to interpret the post-adaptation block as turning on some effect. Using a binary indicator in a regression setting may not be the best practice as we discuss in section \@ref(iter2).
_Construct Summary Statistics_
A set of summary statistics are constructed to help answer questions about domain expertise consistency and model adequacy. We are studying the effects of age and temporal recalibration through the PSS and JND (see section \@ref(psycho-experiments)), so it is natural to define summary statistics around these quantities to verify model consistency.
It is impossible that a properly conducted block would result in a JND less than 0 (the psychometric function is always non-decreasing), so zero can serve as a lower limit. It is also unlikely that the just noticeable difference would be more than a second. Some studies show that we cannot perceive time differences below 30 ms, and others show that an input lag as small as 100 ms can impair a person's typing ability. A time delay of 100 ms is enough to notice, and so a just noticeable difference should be much less than one second -- much closer to 100 ms. We will continue to use one second as an extreme estimate indicator, but will incorporate this knowledge when it comes to selecting priors.
The point of subjective simultaneity can be either positive or negative, with the belief that larger values are less likely. Some studies suggest that for audiovisual temporal order judgment tasks, the separation between stimuli needs to be as little as 20 ms for subjects to be able to determine which modality came first [@vatakis2007influence]. Other studies suggest that our brains can detect temporal differences as small as 30 ms. If these values are to be believed, then we should be skeptical of PSS estimates larger than, say, 150 ms in absolute value, just to be safe.
A histogram of computed PSS and JND values will suffice for summary statistics. We can estimate the proportion of values that fall outside of our limits defined above, and use them as indications of problems with the model fitting or conceptual understanding.
**Post-Model, Pre-Data**
_Develop Model_
We begin with the simplest model that captures the structure of the data without including information about age group, treatment, or subject. Here is a simple model that draws information from the conceptual analysis:
\setstretch{1.0}
\begin{align*}
k_i &\sim \mathrm{Binomial}(n_i, p_i) \\
\mathrm{logit}(p_i) &= \beta ( x_i - \alpha )
\end{align*}
\setstretch{2.0}
Recall that we are using the linear model from \@ref(eq:linearform2). The PSS can be positive or negative without any expected bias towards either, so a symmetric distribution such as the Gaussian is a reasonable choice for $\alpha$. We determined earlier that a PSS value more than 150ms in absolute value is unlikely, so we can define a Gaussian prior such that $P(|pss| > 0.150) \approx 0.01$. Since the prior does not need to be exact, the following mean and variance suffice:
$$
pss \sim \mathcal{N}(0, 0.06^2) \Longleftrightarrow \alpha \sim \mathcal{N}(0, 0.06^2)
$$
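A one-line check confirms that this prior places roughly one percent of its mass outside of $\pm 150$ ms:

\setstretch{1.0}
```{r ch040-pss-prior-check, echo=TRUE}
2 * pnorm(-0.150, mean = 0, sd = 0.06)  # P(|pss| > 0.150)
```
\setstretch{2.0}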
For the just noticeable difference, we continue to use the log-normal distribution because it is constrained to positive values and has the reciprocal property. The JND is expected to be close to 100 ms and unlikely to exceed $1$ second. This implies a prior whose mean is around 100 ms with the bulk of the distribution below 1 second; that is, $E[X] \approx 0.100$ and $P(X < 1) \approx 0.99$. This requires solving a system of nonlinear equations in two variables:
\setstretch{1.0}
$$
\begin{cases}
0.100 &= \exp\left(\mu + \sigma^2 / 2\right) \\
0.99 &= 0.5 + 0.5 \cdot \mathrm{erf}\left[\frac{\ln (1) - \mu}{\sqrt{2} \cdot \sigma}\right]
\end{cases}
$$
\setstretch{2.0}
This nonlinear system can be solved using `Stan`'s algebraic solver (code provided in the [appendix](#code)).
```{stan ch041-Electron Black, output.var="prior_jnd"}
functions {
vector system(vector y, vector theta, real[] x_r, int[] x_i) {
vector[2] z;
z[1] = exp(y[1] + y[2]^2 / 2) - theta[1];
z[2] = 0.5 + 0.5 * erf(-y[1] / (sqrt(2) * y[2])) - theta[2];
return z;
}
}
transformed data {
vector[2] y_guess = [1, 1]';
real x_r[0];
int x_i[0];
}
transformed parameters {
vector[2] theta = [0.100, 0.99]';
vector[2] y;
y = algebra_solver(system, y_guess, theta, x_r, x_i);
}
```
\setstretch{1.0}
```{r ch041-Dreaded Creek, echo=TRUE}
fit <- sampling(prior_jnd,
iter=1, warmup=0, chains=1, refresh=0,
seed=31, algorithm="Fixed_param")
sol <- extract(fit)
sol$y
```
\setstretch{2.0}
The solver has determined that $\mathrm{Lognormal}(`r round(sol$y[1], 1)`, `r round(sol$y[2], 1)`^2)$ is the appropriate prior. However, simulating values from this distribution produces many extremely small values ($<10^{-5}$) and a few extremely large values ($\approx 10^2$). This is because the expected value of a log-normal random variable depends on both the mean and standard deviation parameters.
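These heavy tails can be inspected directly by simulating from the solved prior (a quick sketch using the solution stored in `sol$y`):

\setstretch{1.0}
```{r ch040-heavy-tails, echo=TRUE, eval=FALSE}
set.seed(12)
draws <- rlnorm(1e4, sol$y[1], sol$y[2])
mean(draws < 1e-5)  # a noticeable share of implausibly small JNDs
max(draws)          # extreme values far beyond one second
```
\setstretch{2.0}

If the median is used in place of the mean, then a more acceptable prior may be determined.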
```{stan ch041-Dirty Crayon, output.var="prior_jnd_using_median"}
functions {
vector system(vector y, vector theta, real[] x_r, int[] x_i) {
vector[2] z;
z[1] = exp(y[1]) - theta[1];
z[2] = 0.5 + 0.5 * erf(-y[1] / (sqrt(2) * y[2])) - theta[2];
return z;
}
}
transformed data {
vector[2] y_guess = [1, 1]';
real x_r[0];
int x_i[0];
}
transformed parameters {
vector[2] theta = [0.100, 0.99]';
vector[2] y;
y = algebra_solver(system, y_guess, theta, x_r, x_i);
}
```
\setstretch{1.0}
```{r ch041-Aimless Skunk, echo=TRUE}
fit <- sampling(prior_jnd_using_median,
iter=1, warmup=0, chains=1, refresh=0,
seed=31, algorithm="Fixed_param")
sol <- extract(fit)
sol$y
```
\setstretch{2.0}
Sampling from a log-normal distribution with these parameters and plotting the histogram shows no inconsistency with the domain expertise.
```{r ch041-Risky-Lion}
set.seed(12)
x <- rlnorm(10000, sol$y[1], sol$y[2])
ggplot(data.frame(x), aes(x = x, y = ..density..)) +
geom_histogram(bins = 50) +
labs(x = "Just Noticeable Difference",
title = "Prior Distribution for the JND") +
theme_minimal()
```
With a prior for the JND, the prior for $\beta$ can be determined:
\setstretch{1.0}
$$
jnd \sim \mathrm{Lognormal}(`r round(sol$y[1], 1)`, `r round(sol$y[2], 2)`^2) \Longleftrightarrow \frac{1}{jnd} \sim \mathrm{Lognormal}(`r round(-1*sol$y[1], 1)`, `r round(sol$y[2], 2)`^2)
$$
and
$$
\beta = \frac{\mathrm{logit}(0.84)}{jnd} \sim \mathrm{Lognormal}(`r round(-1*sol$y[1] + log(qlogis(0.84)), 1)`, `r round(sol$y[2], 2)`^2)
$$
The priors do not need to be too exact. Rounding the parameters for $\beta$, the simple model is:
\begin{align*}
k_i &\sim \mathrm{Binomial}(n_i, p_i) \\
\mathrm{logit}(p_i) &= \beta ( x_i - \alpha ) \\
\alpha &\sim \mathcal{N}(0, 0.06^2) \\
\beta &\sim \mathrm{Lognormal}(3, 1^2)
\end{align*}
and in `Stan`, the model code is:
```{stan ch041-Maroon Vulture, output.var="m041_stan", echo=TRUE}
data {
int N;
int n[N];
int k[N];
vector[N] x;
}
parameters {
real alpha;
real<lower=0> beta;
}
model {
vector[N] p = beta * (x - alpha);  // linear predictor on the log-odds scale
alpha ~ normal(0, 0.06);
beta ~ lognormal(3.0, 1.0);
k ~ binomial_logit(n, p);
}
generated quantities {
vector[N] log_lik;
vector[N] k_pred;
vector[N] theta = beta * (x - alpha);
vector[N] p = inv_logit(theta);
for (i in 1:N) {
log_lik[i] = binomial_logit_lpmf(k[i] | n[i], theta[i]);
k_pred[i] = binomial_rng(n[i], p[i]);
}
}
```
\setstretch{2.0}
Notice that the model block is nearly identical to the mathematical model specified above.
_Construct Summary Functions_
The next step is to construct any relevant summary functions. Since the distributions of posterior PSS and JND values are needed for the summary statistics, it will be convenient to have a function that takes in the posterior samples of $\alpha$ and $\beta$ and returns the PSS and JND values. We define $Q$ as a more general function that takes in the two parameters and a target probability, $\pi$, and returns the SOA value at which the psychometric function attains $\pi$.
\begin{equation}
Q(\pi; \alpha, \beta) = \frac{\mathrm{logit}(\pi)}{\beta} + \alpha
(\#eq:summfun1)
\end{equation}
The function can be defined in `R` as
```{r ch041-Furious Crossbow, echo=TRUE}
Q <- function(p, a, b) qlogis(p) / b + a
```
With $Q$, the PSS and JND can be calculated as
\setstretch{1.0}
\begin{align*}
pss &= Q(0.5) \\
jnd &= Q(0.84) - Q(0.5)
\end{align*}
\setstretch{2.0}
_Simulate Bayesian Ensemble_
During this step, we simulate the Bayesian model and later feed the prior values into the summary functions in order to verify that there are no inconsistencies with domain knowledge. Since the model is fairly simple, we simulate directly in `R`.
\setstretch{1.0}
```{r ch041-Random Serpent, echo=TRUE}
set.seed(124)
n <- 10000
a <- rnorm(n, 0, 0.06)
b <- rlnorm(n, 3.0, 1)
dat <- with(av_dat, list(N = N, x = x, n = n))
n_obs <- length(dat$x)
idx <- sample(1:n, n_obs, replace = TRUE)
probs <- logistic(b[idx] * (dat$x - a[idx]))
sim_k <- rbinom(n_obs, dat$n, probs)
```
\setstretch{2.0}
_Prior Checks_
This step pertains to ensuring that prior estimates are consistent with domain expertise. We already did that in the model construction step by sampling values for the just noticeable difference. The first prior chosen was not producing JND estimates that were consistent with domain knowledge, so we adjusted accordingly.
Figure \@ref(fig:ch041-prior-pf-plot) shows the distribution of prior psychometric functions derived from the simulated ensemble. There are a few very steep and very shallow curves, but the majority fall within a range that appears likely.
```{r ch041-prior-pf-plot, fig.cap="Prior distribution of psychometric functions using the priors for alpha and beta."}
set.seed(124)
n <- 100
a2 <- rnorm(n, 0, 0.06)
b2 <- rlnorm(n, 3.0, 1)
p <- data.frame(x = c(-0.5, 0.5), y = c(0, 1)) %>% ggplot(aes(x, y)) +
labs(x = "SOA", title = "Prior distribution of Psychometric Functions") +
theme(axis.title.y = element_blank())
for (i in 1:n) {
p <- p +
geom_function(fun = fn, args = list(a = a2[i], b = b2[i]),
color = "steelblue4", alpha = 0.25)
}
p + theme_minimal()
```
Additionally, most of the PSS values are within $\pm 0.1$, with room to allow for some larger values. Let's check the prior distribution of PSS and JND values.
```{r ch041-prior-pss-plot, fig.cap="PSS prior distribution."}
ggplot(data.frame(x = Q(0.5, a, b)), aes(x = x, y = ..density..)) +
geom_vline(xintercept = c(-0.15, 0.15), color = "red", linetype = "dashed") +
geom_histogram(bins = 50) +
labs(x = "seconds",
title = "Prior Distribution for the PSS") +
theme_minimal() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```
```{r ch041-prior-jnd-plot, fig.cap="JND prior distribution."}
ggplot(data.frame(x = Q(0.84, a, b) - Q(0.5, a, b)), aes(x = x, y = ..density..)) +
geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
geom_histogram(bins = 50) +
labs(x = "seconds",
title = "Prior Distribution for the JND") +
scale_x_continuous(limits = c(0, 1.5)) +
theme_minimal() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```
We are satisfied with the prior coverage of the PSS and JND values, and there are only a few samples that go beyond the extremes that were specified in the summary statistics step.
_Configure Algorithm_
There are a few parameters that can be set for `Stan`. On the user side, the main parameters are the number of iterations, the number of warm-up iterations, the target acceptance rate, and the number of chains to run. By default, `Stan` will use half the number of iterations for warm-up and the other half for actual sampling. For now we use the default algorithm parameters in `Stan`, and will tweak them later if and when issues arise.
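For reference, these settings are passed through the `iter`, `warmup`, `chains`, and `control` arguments of `rstan::sampling`; the call below is purely illustrative (hypothetical values, not the defaults used here):

\setstretch{1.0}
```{r ch040-control-sketch, echo=TRUE, eval=FALSE}
sampling(m041_stan, data = sim_dat,
         chains = 4, iter = 4000, warmup = 2000,
         control = list(adapt_delta = 0.95, max_treedepth = 12))
```
\setstretch{2.0}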
_Fit Simulated Ensemble_
We now fit the model to the simulated data.
\setstretch{1.0}
```{r ch041-Minimum Artificial, echo=TRUE}
sim_dat <- with(av_dat, list(N = N, x = x, n = n, k = sim_k))
m041 <- rstan::sampling(m041_stan, data = sim_dat,
chains = 4, cores = 4, refresh = 0)
```
\setstretch{2.0}
_Algorithmic Calibration_
To check the basic diagnostics of the model, we run the following code.
\setstretch{1.0}
```{r ch041-Persistent Hook, message=TRUE, echo=TRUE}
check_hmc_diagnostics(m041)
```
\setstretch{2.0}
There is no undesirable behavior from this model, so next we check the summary statistics of the estimated parameters.
```{r ch041-Cloudy-Toupee}
stan_summary(m041, c("alpha", "beta")) %>%
as_tibble(rownames = "parameter") %>%
kable(digits = 4, booktabs = TRUE,
caption = "Summary statistics of the fitted Bayesian ensemble.") %>%
kable_styling(latex_options = c("hold_position"))
```
The $\hat{R}$ and $N_{\mathrm{eff}}$ values look fine for both $\alpha$ and $\beta$, though it is slightly concerning that $\alpha$ is centered relatively far from zero. This could just be due to sampling variance, so we will continue on to the next step.
**Post-Model, Post-Data**
_Fit Observed Data_
All of the work up until now has been done without peeking at the observed data. We now go ahead and run the observed data through the model.
```{r ch041-Green Trombone}
obs_dat <- with(av_dat, list(N = N, x = x, n = n, k = k))
if (file.exists("models/m041.rds")) {
m041 <- readRDS("models/m041.rds")
} else {
<<ch041-Homeless-Tea>>
saveRDS(m041, "models/m041.rds")
}
```
\setstretch{1.0}
```{r ch041-Homeless-Tea, echo=TRUE, eval=FALSE}
m041 <- sampling(m041_stan, data = obs_dat,
chains = 4, cores = 4, refresh = 200)
```
\setstretch{2.0}
_Diagnose Posterior Fit_
Here we repeat the diagnostic checks that were used after fitting the simulated data.
\setstretch{1.0}
```{r ch041-Risky Winter, message=TRUE, echo=TRUE}
check_hmc_diagnostics(m041)
```
\setstretch{2.0}
```{r ch041-Maroon-Oyster}
stan_summary(m041, c("alpha", "beta")) %>%
as_tibble(rownames = "parameter") %>%
kable(digits = 4, booktabs = TRUE,
caption = "Summary statistics of the fitted Bayesian ensemble.") %>%
kable_styling(latex_options = c("hold_position"))
```
There are no indications of an ill-behaved posterior fit. Let's also check the posterior distributions of $\alpha$ and $\beta$ against their respective prior densities (figure \@ref(fig:ch041-m041-posterior-alpha-beta)).
```{r ch041-m041-posterior-alpha-beta, fig.cap="Comparison of posterior distributions for alpha and beta to their respective prior distributions."}
p041 <- extract(m041)
pa <- tibble(x = p041$alpha) %>%
ggplot(aes(x = x, y = ..density..)) +
geom_density(aes(color = "Posterior")) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 0.06),
inherit.aes = FALSE, aes(color = "Prior")) +
scale_x_continuous(limits = c(-0.15, 0.15)) +
scale_color_manual(values = c("steelblue4", "orangered3")) +
theme_minimal() +
theme(legend.title = element_blank(),
legend.position = c(0.25, 0.8),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
labs(title = "alpha")
pb <- tibble(x = p041$beta) %>%
ggplot(aes(x = x, y = ..density..)) +
geom_density(aes(color = "Posterior")) +
stat_function(fun = dlnorm, args = list(meanlog=3, sdlog=1),
inherit.aes = FALSE, aes(color = "Prior")) +
scale_x_continuous(limits = c(0, 15)) +
scale_color_manual(values = c("steelblue4", "orangered3")) +
theme_minimal() +
theme(legend.title = element_blank(),
legend.position = c(0.25, 0.8),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
labs(title = "beta")
pa + pb
```
The posterior distributions for $\alpha$ and $\beta$ are well within the range determined by domain knowledge, and highly concentrated due to both the large amount of data and the fact that this is a completely pooled model -- all subject data is used to estimate the parameters. In hindsight, the prior for the JND could have been tighter, with more weight below half a second than the one-second limit used, but this insight comes from the posterior rather than from prior knowledge, so it is not prudent to change the prior on that basis. As a rule of thumb, priors should only be updated as motivated by domain expertise and not by the posterior distribution.
_Posterior Retrodictive Checks_
It is time to run the posterior samples through the summary functions and then perform retrodictive checks. A retrodiction uses the posterior model to predict the observed data; the predictions are then compared to the observations. This is done by drawing parameter samples from the posterior, feeding in the observational data, and simulating outcomes, which may be repeated to obtain posterior predictive samples.
\setstretch{1.0}
```{r ch041-Donut Maximum, echo=TRUE}
posterior_pss <- Q(0.5, p041$alpha, p041$beta)
posterior_jnd <- Q(0.84, p041$alpha, p041$beta) - posterior_pss
```
\setstretch{2.0}
```{r ch041-posterior-pss-jnd-plot, fig.cap="Posterior distribution of the PSS and JND."}
p_pss <- ggplot(data.frame(x = 1000*posterior_pss),
aes(x = x, y = ..density..)) +
geom_histogram(bins = 50) +
theme_minimal() +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(x = "milliseconds",
title = "Posterior Distribution for the PSS")
p_jnd <- ggplot(data.frame(x = 1000*posterior_jnd),
aes(x = x, y = ..density..)) +
geom_histogram(bins = 50) +
theme_minimal() +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(x = "milliseconds",
title = "Posterior Distribution for the JND")
p_pss + p_jnd
```
Neither of the posterior estimates for the PSS or JND exceed the extreme cutoffs set in the earlier steps, so we can be confident that the model is consistent with domain expertise. Note how simple it is to visualize and summarize the distribution of values for these measures. Using classical techniques like MLE might require using bootstrap methods to estimate the distribution of parameter values, or one might approximate the parameter distributions using the mean and standard error of the mean to simulate new values.
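For contrast, a classical analysis might bootstrap a pooled MLE fit to approximate the sampling distribution of the PSS. The sketch below is hypothetical and not part of the analysis; it reuses the aggregated observations in `obs_dat` and the slope-intercept parameterization:

\setstretch{1.0}
```{r ch040-bootstrap-sketch, echo=TRUE, eval=FALSE}
boot_pss <- replicate(1000, {
  idx <- sample(seq_len(obs_dat$N), replace = TRUE)
  fit <- glm(cbind(obs_dat$k[idx], obs_dat$n[idx] - obs_dat$k[idx])
             ~ obs_dat$x[idx], family = binomial("logit"))
  unname(-coef(fit)[1] / coef(fit)[2])  # pss = -alpha / beta in this form
})
quantile(boot_pss, c(0.025, 0.5, 0.975))
```
\setstretch{2.0}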
Next is to create the posterior predictive samples. We do this in two steps to better show how the distribution of posterior psychometric functions relates to the observed data, and then compare the observed data to the retrodictions. Figure \@ref(fig:ch041-posterior-pf-plot) shows the result of the first step.
```{r ch041-posterior-pf-plot, fig.cap="Posterior distribution of psychometric functions using pooled observations."}
set.seed(124)
n <- 100
idx <- sample(1:length(p041$alpha), n, replace = TRUE)
a2 <- p041$alpha[idx]
b2 <- p041$beta[idx]
p <- data.frame(x = c(-0.5, 0.5), y = c(0, 1)) %>% ggplot(aes(x, y)) +
labs(x = "SOA (seconds)", title = "Posterior distribution of Psychometric Functions") +
theme_minimal() +
theme(axis.title.y = element_blank())
for (i in 1:n) {
p <- p +
geom_function(fun = fn, args = list(a = a2[i], b = b2[i]),
color = "steelblue4", alpha = 0.15)
}
df <- with(av_dat, data.frame(x = x, y = k/n))
p +
geom_point(data = df, aes(x = x, y = y), inherit.aes = FALSE,
alpha = 0.25)
```
Next we sample parameter values from the posterior distribution and use them to simulate a new data set. In the next iteration we show how to get `Stan` to automatically produce posterior predictive samples from the model fitting step. The results of the posterior predictions are shown in figure \@ref(fig:ch041-obs-vs-retro-plot).
\setstretch{1.0}
```{r ch041-Intensive Trendy, echo=TRUE}
alpha <- sample(p041$alpha, n_obs, replace = TRUE)
beta <- sample(p041$beta, n_obs, replace = TRUE)
logodds <- beta * (av_dat$x - alpha)
probs <- logistic(logodds)
sim_k <- rbinom(n_obs, av_dat$n, probs)
```
\setstretch{2.0}
```{r ch041-obs-vs-retro-plot, fig.cap="Observed data compared to the posterior retrodictions. The data is post-stratified by block for easier visualization."}
bind_rows(tibble(x = av_dat$x, n = av_dat$n, k = sim_k, p = k/n,
block = av_dat$block, Data = "Retrodictions"),
tibble(x = av_dat$x, n = av_dat$n, k = av_dat$k, p = k/n,
block = av_dat$block, Data = "Observations")) %>%
ggplot(aes(x = x, y = p)) +
geom_point(alpha = 0.25) +
scale_x_continuous(limits = c(-0.5, 0.5), breaks = seq(-0.5, 0.5, 0.25)) +
scale_y_continuous(limits = c(0, 1), breaks = c(0, 0.5, 1)) +
labs(x = "SOA (seconds)", y = "response probability",
title = "Observed Data with Posterior Retrodictions") +
facet_grid(block ~ Data) +
theme_bw()
```
Let's be clear about what the first iteration of this model describes: the average distribution of underlying psychometric functions across all subjects and blocks. It cannot tell us what the differences between pre- and post-adaptation blocks are, or what the variation between subjects is. It is useful for determining if the average value for the PSS is different from 0, or if the average JND is different from some other predetermined level. This model is still useful given the right question, but it cannot answer questions about group-level effects.
Figure \@ref(fig:ch041-obs-vs-retro-plot) shows that the model captures the broad structure of the observed data, but is perhaps a bit under-dispersed in the tail ends of the SOA values. Besides this one issue, we are satisfied with the first iteration of this model and are ready to proceed to the next iteration.
## Iteration 2: adding age and block {#iter2}
In this iteration we add the treatment and age groups into the model. There are no changes to the conceptual understanding of the experiment, and nothing to change about the observational space, so we skip the first three steps and go straight to model development. As we iterate, the number of changes from one round to the next should shrink as the model expands to become only as complex as necessary to answer the research questions.
**Post-Model, Pre-Data**
_Develop Model_
To start, let's add in the treatment indicator and put off adding the age group levels. In classical statistics, it would be added as an indicator variable -- a zero or one -- for both the slope and intercept (a varying slopes, varying intercepts model). Let $trt$ be $0$ if the observation comes from the pre-adaptation block and $1$ if it comes from the post-adaptation block.
\setstretch{1.0}
$$\theta = \alpha + \alpha_{trt} \times trt + \beta \times x + \beta_{trt}\times trt \times x$$
\setstretch{2.0}
Now when an observation comes from the pre-adaptation block ($trt=0$) the linear predictor is given by
\setstretch{1.0}
$$\theta_{pre} = \alpha + \beta \times x$$
\setstretch{2.0}
and when an observation comes from the post-adaptation block ($trt=1$) the linear predictor is
\setstretch{1.0}
$$\theta_{post} = (\alpha + \alpha_{trt}) + (\beta + \beta_{trt}) \times x$$
\setstretch{2.0}
This may seem like a natural way to introduce an indicator variable, but it comes with serious implications. This model implies that there is more uncertainty about the post-adaptation block compared to the baseline block, and this is not necessarily true.
\setstretch{1.0}
\begin{align*}
\mathrm{Var}(\theta_{post}) &= \mathrm{Var}((\alpha + \alpha_{trt}) + (\beta + \beta_{trt}) \times x) \\
&= \mathrm{Var}(\alpha) + \mathrm{Var}(\alpha_{trt}) + x^2 \mathrm{Var}(\beta) + x^2\mathrm{Var}(\beta_{trt})
\end{align*}
\setstretch{2.0}
On the other hand, the variance of $\theta_{pre}$ is:
\setstretch{1.0}
$$
\mathrm{Var}(\theta_{pre}) = \mathrm{Var}(\alpha) + x^2 \mathrm{Var}(\beta) \le \mathrm{Var}(\theta_{post})
$$
\setstretch{2.0}
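A small simulation illustrates this inequality; the sketch below uses arbitrary standard-normal priors (purely for illustration) and a fixed SOA value:

\setstretch{1.0}
```{r ch040-variance-sketch, echo=TRUE}
set.seed(1)
x0 <- 0.1                                  # an arbitrary SOA value
a0 <- rnorm(1e5); a1 <- rnorm(1e5)         # alpha and alpha_trt draws
b0 <- rnorm(1e5); b1 <- rnorm(1e5)         # beta and beta_trt draws
c(pre  = var(a0 + b0 * x0),
  post = var((a0 + a1) + (b0 + b1) * x0))  # post is roughly twice pre
```
\setstretch{2.0}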
Furthermore, the intercept, $\alpha$, is no longer the average response probability at $x=0$ for the entire data set, but is instead exclusively the average for the pre-adaptation block. This may not matter in certain analyses, but one nice property of multilevel models is the separation of population-level estimates (fixed effects) and group-level estimates (random effects).
Instead, we introduce the treatment variable into the linear model as a factor variable. Each level in the treatment gets its own parameter, which makes it easier to set priors when a group has many levels (such as the subject level). The linear model, using equation \@ref(eq:linearform2), with the treatment is written as:
\setstretch{1.0}
\begin{equation}
\theta = (\beta + \beta_{trt[i]}) \left[x_i - (\alpha + \alpha_{trt[i]})\right]
(\#eq:linearmodel2)
\end{equation}
\setstretch{2.0}
As more predictors and groups are added in, equation \@ref(eq:linearmodel2) becomes more difficult to read. What we do is break up the slope and intercept parameters and write the linear model as:
\setstretch{1.0}
\begin{align*}
\mu_\alpha &= \alpha + \alpha_{trt[i]} \\
\mu_\beta &= \beta + \beta_{trt[i]} \\
\theta &= \mu_\beta (x - \mu_\alpha)
\end{align*}
\setstretch{2.0}
In this way the combined parameters can be considered separately from the linear parameterization. Now we consider the priors for $\alpha_{trt}$. Equation \@ref(eq:alpha-three-forms) shows three ways of adding in the block variable for $\alpha$. The left equation is a standard single-level predictor, the center equation is the centered parameterization for a multilevel predictor, and the right equation is the non-centered parameterization for a multilevel predictor.
\setstretch{1.0}
\begin{equation}
\begin{split}
\mu_\alpha &= \alpha_{trt[i]} \\
\alpha_{trt} &\sim \mathcal{N}(0, 0.06^2)
\end{split}
\qquad
\begin{split}
\mu_\alpha &= \alpha_{trt[i]} \\
\alpha &\sim \mathcal{N}(0, 0.06^2) \\
\alpha_{trt} &\sim \mathcal{N}(\alpha, \sigma_{trt}^2) \\
\sigma_{trt} &\sim \pi_{\sigma}
\end{split}
\qquad
\begin{split}
\mu_\alpha &= \alpha + \alpha_{trt[i]} \\
\alpha &\sim \mathcal{N}(0, 0.06^2) \\
\alpha_{trt} &\sim \mathcal{N}(0, \sigma_{trt}^2) \\
\sigma_{trt} &\sim \pi_{\sigma}
\end{split}
(\#eq:alpha-three-forms)
\end{equation}
\setstretch{2.0}
In the center and right models of equation \@ref(eq:alpha-three-forms), $\alpha$ gets a fixed prior (the same as in the first iteration), and $\alpha_{trt}$ gets a Gaussian prior with an adaptive variance term that is learned from the data. This notation is compact, but note that $\alpha_{trt}$ is two parameters -- one for each block -- and both share the same variance term $\sigma_{trt}$.
We will discuss selecting a prior for the variance term shortly. Instead of modeling $\beta$ with a log-normal prior, we can sample from a normal distribution and take the exponential of it to produce a log-normal distribution:
\setstretch{1.0}
\begin{align*}
X &\sim \mathcal{N}(3, 1^2) \\
Y &= \exp\left(X\right) \Longleftrightarrow Y \sim \mathrm{Lognormal}(3, 1^2)
\end{align*}
\setstretch{2.0}
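A quick check of this construction compares simulated quantiles of $\exp(X)$ against the log-normal quantile function:

\setstretch{1.0}
```{r ch040-exp-normal-check, echo=TRUE}
set.seed(1)
y <- exp(rnorm(1e5, 3, 1))
round(cbind(simulated = quantile(y, c(0.25, 0.5, 0.75)),
            analytic  = qlnorm(c(0.25, 0.5, 0.75), 3, 1)), 2)
```
\setstretch{2.0}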
The construction above is the non-centered parameterization of the log-normal distribution; the motivation behind it is that new slope variables can now be included as additive effects. If both $\beta$ and $\beta_{trt}$ are specified with Gaussian priors, then the exponential of their sum will be a log-normal distribution. The model now becomes:
\setstretch{1.0}
\begin{equation}
\begin{split}
k_i &\sim \mathrm{Binomial}(n_i, p_i) \\
\mathrm{logit}(p_i) &= \exp(\mu_\beta) (x_i - \mu_\alpha) \\
\mu_\alpha &= \alpha + \alpha_{trt[i]} \\
\mu_\beta &= \beta + \beta_{trt[i]} \\
\alpha &\sim \mathcal{N}(0, 0.06^2) \\
\alpha_{trt} &\sim \mathcal{N}(0, \sigma_{trt}^2) \\
\beta &\sim \mathcal{N}(3, 1^2) \\
\beta_{trt} &\sim \mathcal{N}(0, \gamma_{trt}^2) \\
\sigma_{trt} &\sim \pi_{\sigma} \\
\gamma_{trt} &\sim \pi_{\gamma}
\end{split}
(\#eq:iter2-partial-model)
\end{equation}
\setstretch{2.0}
Deciding on priors for the variance terms requires some careful consideration. In one sense, the variance term is the within-group variance. @gelman2006prior recommends a half-Cauchy prior for multilevel models whose groups have fewer than, say, five levels. This weakly informative prior still has a regularizing effect and dissuades large variance estimates. Even though the treatment group has only two levels, there is still value in specifying an adaptive prior for them, and there is enough data for each treatment that partial pooling won't have as extreme a regularizing effect.
\setstretch{1.0}
\begin{align*}
\sigma_{trt} &\sim \mathrm{HalfCauchy}(0, 1) \\
\gamma_{trt} &\sim \mathrm{HalfCauchy}(0, 1)
\end{align*}
\setstretch{2.0}
We add in the age group level effects to \@ref(eq:iter2-partial-model) and specify their variance terms:
\setstretch{1.0}
\begin{equation}
\begin{split}
k_i &\sim \mathrm{Binomial}(n_i, p_i) \\
\mathrm{logit}(p_i) &= \exp(\mu_\beta) (x_i - \mu_\alpha) \\
\mu_\alpha &= \alpha + \alpha_{trt[i]} + \alpha_{G[i]} \\