forked from marjoleinF/GLMMtree_webinar
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Scripts.Rmd
286 lines (203 loc) · 15.2 KB
/
Scripts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
---
title: "Fitting GLMM trees in R"
subtitle: "R scripts accompanying the webinar for University of Montreal"
date: "06-05-2021"
author: "Marjolein Fokkema"
output: pdf_document
---
# Model formulas
The GLMM is given by:
$E[y_{ij} | x_{ij}] = \mu_{ij}$
$g(\mu_{ij}) = x_{ij}^{\top}\beta_{k} + z_{ij}^{\top}b_i$
Where $i$ is an identifier for the level-II unit, and $j$ is an identifier for the level-I unit.
If we have a continuous response variable with normally distributed residuals, then $g$ is the identity function and we have:
$y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top}b_i + \epsilon_{ij}$
We thus have a fixed-effects part ($x_{ij}^{\top}\beta$), a random-effects part ($z_{ij}^{\top}b_i$) and a residual error term ($\epsilon_{ij}$).
The GLMM tree model differs in that the fixed-effect part may differ between subgroups:
$y_{ij} = x_{ij}^{\top}\beta_{k} + z_{ij}^{\top}b_i + \epsilon_{ij}$
Thus, $k$ is an identifier for the subgroup. The GLMM tree algorithm finds these subgroups $k$, using additional covariates. These may be measured on the lowest level $i$ (which is commonly encountered in cross-sectional multilevel data), or on the higher level $j$ (which is commonly encountered in longitudinal data).
The subgroups $k$, and parameters $\beta$ and $b$ cannot be estimated in a single step. An iterative approach is taken, where the model-base recursive partitioning algorithm of Zeileis, Hothorn & Hornk (2008) is used to estimate the subgroups $k$, and the usual (restricted) maximum likelihood approach is used to estimate the fixed- and random-effects parameters, see Fokkema et al. (2018).
\newpage
# Example: Treatment subgroups
For this example, we will make use of artificial data modelled after the Stratified Medicine Approaches foR Treatment Selection (SMART) prediction tournament, from the Improving Access to Psychological Therapies (IAPT) project (Lucock et al., 2017). The SMART data contains data from patients receiving mental-health services in the Northern UK. Patients were (non-randomly) assigned to low intensity treatment (e.g., guided self-help, computerized cognitive behavior therapy) or high intensity treatment (e.g., face-to-face psychological therapies). The aim of the SMART tournament was to identify patients who would benefit most from HI vs. LI treatment. I do not own the data, so I generated an artificial dataset which mimics the original data, available in the file "SMART mimic data.txt".
```{r, message=FALSE, warning=FALSE}
library("glmertree")
SMART <- read.table("SMART mimic data.txt", stringsAsFactors = TRUE)
names(SMART)
trt_tree <- glmertree(recovered ~ Treatment | center | Age + PHQ9_pre +
GAD7_pre + WSAS_pre + Gender + Ethnicity +
Diagnosis + Employment + Disability + Medication,
data = SMART, family = binomial)
```
We can print and plot the results as follows:
```{r, fig.width=4.5, fig.height=4, message = FALSE, warning = FALSE}
trt_tree$tree
plot(trt_tree, which = "tree", gp = gpar(cex = .6))
```
```{r, fig.width=3, fig.height=3, message = FALSE, warning = FALSE}
plot(trt_tree, which = "ranef")
```
```{r}
VarCorr(trt_tree)
```
```{r}
fixef(trt_tree)
```
\newpage
# Example: Alcohol trajectories
Curran, Stice, and Chassin (1997) collected data on 82 adolescents at three time points starting at age 14 to assess factors that affect teen drinking behavior. Key variables in the data set "alcohol.csv" (accessed via Singer and Willett, 2003) are as follows:
* `id` = numerical identifier for subject
* `age` = 14, 15, or 16
* `coa` = 1 if the teen is a child of an alcoholic parent; 0 otherwise
* `male` = 1 if male; 0 if female
* `peer` = a measure of peer alcohol use, taken when each subject was 14. This is the square root of the sum of two 6-point items about the proportion of friends who drink occasionally or regularly.
* `alcuse` = the primary response. Four items—(a) drank beer or wine, (b) drank hard liquor, (c) 5 or more drinks in a row, and (d) got drunk—were each scored on an 8-point scale, from 0=“not at all” to 7=“every day”. Then alcuse is the square root of the sum of these four items.
Primary research questions included: Do trajectories of alcohol use differ by parental alcoholism? Do trajectories of alcohol use differ by peer alcohol use?
```{r}
alco <- read.table("alcohol.csv", header= TRUE, sep = ",",
stringsAsFactors = TRUE)[ , -1]
summary(alco)
alco$age <- alco$age - 14L
```
Fit and plot a fixed-effects linear model (note: I used only a subset of the data to keep plot readable) :
```{r, fig.width=4, fig.height=3}
set.seed(42)
ids <- sample(unique(alco$id), 10)
cols <- colorspace::rainbow_hcl(length(ids))
alc <- alco[alco$id %in% ids, ]
LM <- lm(alcuse ~ age, data = alc)
summary(LM)
beta <- coef(LM)
plot(jitter(alc$age, 0.5), jitter(alc$alcuse), col = cols, pch = 20,
xlab = "age - 14", ylab = "alcohol use")
abline((beta[1]), beta[2], cex = 3)
```
Fit and plot linear model tree (note: here I used the full dataset):
```{r}
library("partykit")
lt <- lmtree(alcuse ~ age | coa + male + peer, data = alco)
plot(lt, gp = gpar(cex = 0.7))
```
Fit and plot linear mixed-effects model (again, I used only a subset of the original data):
```{r, fig.width=4, fig.height=3}
LMM <- lmer(alcuse ~ age + (age|id), data = alc)
summary(LMM)
beta <- fixef(LMM)
b <- ranef(LMM)$id
plot(jitter(alc$age, 0.5), jitter(alc$alcuse), col = cols, pch = 20,
xlab = "age - 14", ylab = "alcohol use")
abline((beta[1]), beta[2], cex = 3)
for (j in 1:length(ids)) {
abline((b+beta)[j,1], (b+beta)[j,2], col = cols[j])
}
```
Fit and plot a linear mixed-effects model tree:
```{r}
lt <- lmertree(alcuse ~ age | (age|id) | coa + male + peer, data = alco,
cluster = id)
```
Note that because the possible partitioning variables are measured on level II, we use the `cluster` argument to specify this to the algorithm (otherwise, the parameter stability tests will assume partitioning variables are measured at level I and the tests are likely overpowered).
```{r, fig.width=5, fig.height=4}
plot(lt, fitted = "marginal", gp = gpar(cex = 0.7))
```
I specified `fitted = "marginal", so the lines in the terminal node represent the effect of time, while fixing all other predictors (fixed and random, but not the potential partitioning variables) at their means.
The plots of the random effects look quite messy, because of the large number of study participants. We can request the (co)variances of the random effects as follows:
```{r}
fixef(lt)
VarCorr(lt)
```
\newpage
# Example: Stage fright trajectories
Sadler and Miller (2010) studied the emotional state of musicians before performances and factors which may affect their emotional state. Data was collected among 37 undergraduate music majors from a competitive undergraduate music program. They filled out diaries prior to performances over the course of an academic year. Specifically, participants completed a Positive Affect Negative Affect Schedule (PANAS) before each performance, providing two key outcomes: negative affect (`na`, a state measure of anxiety) and positive affect (`pa`, a state measure of happiness).
Factors which were examined for their potential relationships with performance anxiety included: performance type (solo, large ensemble, or small ensemble); audience (instructor, public, students, or juried); if the piece was played from memory; age; gender; instrument (voice, orchestral, or keyboard); and years studying the instrument. In addition, the personalities of study participants were assessed at baseline through the Multidimensional Personality Questionnaire (MPQ). The MPQ provided scores for one lower-order factor (absorption) and three higher-order factors: positive emotionality (PEM); negative emotionality (NEM); and constraint.
Here, we look at trajectories of negative affect scores over the course of repeated assessments at solo performances.
```{r}
music <- read.table("musicdata.csv", header=T, sep=",",
stringsAsFactors = TRUE)[ , -1]
summary(music)
music <- music[music$perform_type == "Solo", ]
levels(music$instrument) <- c("keyboard", "orch_instr", "voice")
```
* `id` = unique musician identification number
* `diary` = cumulative total of diaries filled out by musician (level I; timing metric)
* `audience` = who attended performance (Instructor, Public, Students, or Juried) (level I)
* `na` = negative affect score from PANAS (level I)
* `gender` = musician gender (level II)
* `instrument` = Voice, Orchestral, or Piano (level II)
* `mpqab` = absorption subscale from MPQ (level II)
* `mpqpem` = positive emotionality (PEM) composite scale from MPQ (level II)
* `mpqnem` = negative emotionality (NEM) composite scale from MPQ (level II)
* `mpqcon` = constraint scale from MPQ (level II)
Here, we fit both random intercepts and slopes of time (`diary`). These random intercepts and slopes are deviations from the subgroup-specific intercept and slope:
```{r, warning=FALSE, message=FALSE}
lmmt1 <- lmertree(na ~ diary | (diary|id) | gender + instrument +
mpqab + mpqpem + mpqnem + mpqcon, data = music, cluster = id)
plot(lmmt1, fitted = "marginal", which = "tree", gp = gpar(cex = .7))
fixef(lmmt1, which = "tree")
VarCorr(lmmt1)
```
We can also account for time-varying covariates. For example, the variable `audience` has a strong effet on negative affect, and should be accounted for. Time-varying predictors can either be included in the subgroup-specific model, which we would do if we expect and are interested in possible between-subgroup differences, but note that this makes the resulting subgroups more difficult to interpret, because they can differ in terms of several parameters. So it might be better to globally correct/account for the effect of `audience`:
```{r, warning=FALSE, message=FALSE}
lmmt2 <- lmertree(na ~ diary | audience + (diary|id) | gender + instrument +
mpqab + mpqpem + mpqnem + mpqcon, data = music, cluster = id)
plot(lmmt2, fitted = "marginal", which = "tree", gp = gpar(cex = .7))
fixef(lmmt2, which = "tree")
fixef(lmmt2, which = "global")
VarCorr(lmmt2)
```
\newpage
# Further reading
The vignette of package **`glmertree`** provides further info on how the GLMM tree models can be further customized, and checks on model fit can be performed. You can access it on https://cran.r-project.org/web/packages/glmertree/vignettes/glmertree.pdf or in **`R`** by typing:
```{r, eval=FALSE}
vignette("glmertree")
```
Furthermore, Fokkema, Smits, Zeileis, Hothorn & Kelderman (2018) provides an in-depth technical discussion of GLMM trees, while Fokkema, Edbrooke-Childs & Wolpert (2020) provides a less technical introduction.
# From the future: GAM trees
The GAM tree package allows for detecting subgroups in non-linear trajectories. However, it is still experimental. Instead of fitting GLMMs, it fit GAMs, which allows for fitting non-linear smoothing splines, as well as random effects. For an in-depth discussion of GAMs, see Wood (2017). Or Chapter 7 of James et al. (2013) for a less formal, more introductory version (you can download that book freely via https://www.statlearning.com/).
The use of GAM tree assumes some familiarity with **`R`** package **`mgcv`**. You can install the current development version as follows:
```{r, eval=FALSE}
library("devtools")
install_github("marjoleinF/gamtree")
```
We now fit non-linear trajectories to the alcohol use data. To fit a smooth curve to a predictor, we use the `s()` function (type `?mgcv::gam` and `?mgcv::s` for more info). For fitting random effects, we use the same function, and specify that we `bs = "re"` to indicate we want to estimate a random-effect, not a non-linear curve:
```{r, fig.width=4, fig.height=3, warning=FALSE, message=FALSE}
library("gamtree")
alco$id <- factor(alco$id)
gt <- gamtree(alcuse ~ s(age, k = 3) | s(id, bs = "re") | coa + male + peer,
data = alco, cluster = alco$id, verbose = FALSE)
plot(gt, which = "tree", treeplot_ctrl = list(gp = gpar(cex = .7)))
```
```{r, fig.width=4.5, fig.height=3, warning=FALSE, message=FALSE}
plot(gt, which = "nodes")
```
The fitted (marginal) curves in the tree indicate no or little non-linearity in either of the resulting subgroups (nodes). The groups seem to mostly differ in terms of the intercept and linear slope, not in terms of the shape of the effect of age. Note also the similarity to the GLMM tree we fitted earlier. Only the conditional effects plotted later indicate slight non-linearity in the node-3 subgroup.
```{r}
summary(gt)
```
We can also fit non-linear curves to the stage-fright data:
```{r, fig.width=4, fig.height=3}
music$id <- factor(music$id)
gamt1 <- gamtree(na ~ s(diary) | audience + s(id, bs = "re") | gender +
instrument + mpqab + mpqpem + mpqnem + mpqcon, data = music,
verbose = FALSE, cluster = music$id)
plot(gamt1, which = "tree", treeplot_ctrl = list(gp = gpar(cex = .7)))
```
```{r, fig.width=4.5, fig.height=3}
plot(gamt1, which = "nodes")
```
```{r}
summary(gamt1)
```
We see some evidence of non-linearity: Those with lower negative emotionality at baseline (node 3) show a decrease and then an increase over the academic year. Those with higher levels of negative emotionality at baseline (node 4) show a stronger, more steady decrease of negative affect during performance, over the course of the academic year.
Note that the $p$-values should be taken with a (large) grain of salt, because they do not account for the searching of the subgroups.
\newpage
# References
Curran, P. J., Stice, E., & Chassin, L. (1997). The relation between adolescent alcohol use and peer alcohol use: a longitudinal random coefficients model. *Journal of Consulting and Clinical Psychology, 65*(1), 130.
Fokkema, M., Smits, N., Zeileis, A., Hothorn, T., & Kelderman, H. (2018). Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees. *Behavior Research Methods, 50*(5), 2016-2034. https://doi.org/10.3758/s13428-017-0971-x
Fokkema, M., Edbrooke-Childs, J., & Wolpert, M. (2020). Generalized linear mixed-model (GLMM) trees: A flexible decision-tree method for multilevel and longitudinal data. *Psychotherapy Research,31*(3), 329-341. https://doi.org/10.1080/10503307.2020.1785037
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). Introduction to Statistical Learning. New York: springer.
Lucock, M., Barkham, M., Donohoe, G., Kellett, S., McMillan, D., Mullaney, S., ... & Delgadillo, J. (2017). The role of Practice Research Networks (PRN) in the development and implementation of evidence: The Northern improving access to psychological therapies PRN case study. *Administration and Policy in Mental Health and Mental Health Services Research, 44*(6), 919-931.
Sadler, M. E., & Miller, C. J. (2010). Performance anxiety: A longitudinal study of the roles of personality and experience in musicians. *Social Psychological and Personality Science, 1*(3), 280-287.
Singer, J.D. & Willett, J.B. (2003). *Applied longitudinal data analysis: Modeling change and event occurrence.* Oxford University Press.
Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press.
Zeileis, A., Hothorn, T., & Hornik, K. (2008). Model-based recursive partitioning. *Journal of Computational and Graphical Statistics, 17*(2), 492-514. https://doi.org/10.1198/106186008X319331