Linear-models-overview.qmd

---
df-print: paged
---

# Linear (Gaussian) Models

---

{{< include shared-config.qmd >}}

---

:::{.callout-note}
This content is adapted from:

- @dobson4e, Chapters 2-6
- @dunn2018generalized, Chapters 2-3
- @vittinghoff2e, Chapter 4

There are numerous textbooks specifically for linear regression, including:

- @kutner2005applied: used for UCLA Biostatistics MS level linear models class
- @chatterjee2015regression: used for Stanford MS-level linear models class
- @seber2012linear: used for UCLA Biostatistics PhD level linear models class and UC Davis STA 108.
- @kleinbaum2014applied: same first author as @kleinbaum2010logistic and @kleinbaum2012survival
- @weisberg2005applied
- *Linear Models with R* [@Faraway2025-io]

:::

## Overview

### Why this course includes linear regression {.smaller}

:::{.fragment .fade-in-then-semi-out}
* This course is about *generalized linear models* (for non-Gaussian outcomes)
:::

:::{.fragment .fade-in-then-semi-out}
* UC Davis STA 108 ("Applied Statistical Methods: Regression Analysis") is a prerequisite for this course, so everyone here should have some understanding of linear regression already.
:::

:::{.fragment .fade-in}
* We will review linear regression to:
    - make sure everyone is caught up
    - to provide an epidemiological perspective on model interpretation.
:::

### Chapter overview

* @sec-understand-LMs: how to interpret linear regression models

* @sec-est-LMs: how to estimate linear regression models

* @sec-infer-LMs: how to quantify uncertainty about our estimates

* @sec-diagnose-LMs: how to tell if your model is insufficiently complex


## Understanding Gaussian Linear Regression Models {#sec-understand-LMs}

### Motivating example: birthweights and gestational age {.smaller}

Suppose we want to learn about the distributions of birthweights (*outcome* $Y$) for (human) babies born at different gestational ages (*covariate* $A$) and with different chromosomal sexes (*covariate* $S$) (@dobson4e Example 2.2.2).

::::: {.panel-tabset}

#### Data as table

```{r}
#| label: tbl-birthweight-data1
#| tbl-cap: "`birthweight` data (@dobson4e Example 2.2.2)"
library(dobson)
data("birthweight", package = "dobson")
birthweight |> knitr::kable()
```

#### Reshape data for graphing

```{r}
#| label: tbl-birthweight-data2
#| tbl-cap: "`birthweight` data reshaped"
bw = 
  birthweight |> 
  pivot_longer(
    cols = everything(),
    names_to = c("sex", ".value"),
    names_sep = "s "
  ) |> 
  rename(age = `gestational age`) |> 
  mutate(
    sex = sex |> 
      case_match(
        "boy" ~ "male",
        "girl" ~ "female") |> 
      factor(levels = c("female", "male")))

bw
```

#### Data as graph

```{r}
#| label: fig-plot-birthweight1
#| fig-cap: "`birthweight` data (@dobson4e Example 2.2.2)"
#| fig-height: 5
#| fig-width: 7
plot1 = bw |> 
  ggplot(aes(
    x = age, 
    y = weight,
    linetype = sex,
    shape = sex,
    col = sex))  +
  theme_bw() +
  xlab("Gestational age (weeks)") +
  ylab("Birthweight (grams)") +
  theme(legend.position = "bottom") +
  # expand_limits(y = 0, x = 0) +
  geom_point(alpha = .7)
print(plot1 + facet_wrap(~ sex))
```

:::::

---

#### Data notation

Let's define some notation to represent this data.

-   $Y$: birthweight (measured in grams)
-   $S$: chromosomal sex: "male" (XY) or "female" (XX)
-   $M$: indicator variable for $S$ = "male"^[$M$ is implicitly a deterministic function of $S$]
-   $M = 0$ if female (XX)
-   $M = 1$ if male (XY)
-   $F$: indicator variable for $S$ = "female"^[$F$ is implicitly a deterministic function of $S$]
-   $F = 1$ if female (XX)
-   $F = 0$ if male (XY)

-   $A$: estimated gestational age at birth (measured in weeks).

::: callout-note
Female is the **reference level** for the categorical variable $S$ 
(chromosomal sex) and corresponding indicator variable $M$ . 
The choice of a reference level is arbitrary and does not limit what 
we can do with the resulting model; 
it only makes it more computationally convenient to make inferences 
about comparisons involving that reference group.
:::

### Parallel lines regression

We don't have enough data to model the distribution of birth weight 
separately for each combination of gestational age and sex, 
so let's instead consider a (relatively) simple model for how that 
distribution varies with gestational age and sex:

$$p(Y=y|A=a,S=s) \siid N(\mu(a,s), \sigma^2)$$

$$
\ba
\mu(a,s)
&\eqdef \Exp{Y|A=a, S=s} \\
&= \beta_0 + \beta_A a+ \beta_M m
\ea
$$ {#eq-lm-parallel}

:::{.notes}

@tbl-lm-parallel shows the parameter estimates from R.
@fig-parallel-fit1 shows the estimated model, superimposed on the data.

:::

::: {.column width=40%}

```{r}
#| label: tbl-lm-parallel
#| tbl-cap: "Estimate of [Model @eq-lm-parallel] for `birthweight` data"

bw_lm1 = lm(
  formula = weight ~ sex + age, 
  data = bw)

bw_lm1 |> 
  parameters() |>
  print_md(
    include_reference = TRUE,
    # show_sigma = TRUE,
    select = "{estimate}")

```

:::

:::{.column width=10%}
:::

:::{.column width=50%}

```{r}
#| label: fig-parallel-fit1
#| fig-cap: "Parallel-slopes model of birthweight"
bw = 
  bw |> 
  mutate(`E[Y|X=x]` = fitted(bw_lm1)) |> 
  arrange(sex, age)

plot2 = 
  plot1 %+% bw +
  geom_line(aes(y = `E[Y|X=x]`))

print(plot2)

```

:::

---

#### Model assumptions and predictions

::: notes
To learn what this model is assuming, let's plug in a few values.
:::

::: {#exr-pred-fem-parallel}

According to this model, what's the mean birthweight for a female born at 36 weeks?

```{r}
#| tbl-cap: "Estimated coefficients for [model @eq-lm-parallel]"
#| label: tbl-coef-model1
coef(bw_lm1)
```

:::

---

:::{.solution}
\ 
```{r}
pred_female = coef(bw_lm1)["(Intercept)"] + coef(bw_lm1)["age"]*36
coef(bw_lm1)
# print(pred_female)
### built-in prediction: 
# predict(bw_lm1, newdata = tibble(sex = "female", age = 36))
```

$$
\ba
E[Y|A = 0, A = 36] 
&= \beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36 \\
&=  `r pred_female`
\ea
$$
:::

---

:::{#exr-pred-male-parallel}

What's the mean birthweight for a male born at 36 weeks?

```{r}
coef(bw_lm1)
```

:::

---

:::{.solution}
\ 

```{r}
pred_male = 
  coef(bw_lm1)["(Intercept)"] + 
  coef(bw_lm1)["sexmale"] + 
  coef(bw_lm1)["age"]*36
coef(bw_lm1)
```

$$
\ba
E[Y|M = 1, A = 36] 
&= \beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 \\
&=  `r pred_male`
\ea
$$

:::

---

:::{#exr-diff-sex-parallel-1}
What's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?
:::

```{r}
coef(bw_lm1)
```

---

:::{.solution}

$$
\begin{aligned}
& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\
&= 
`r pred_male` - `r pred_female`\\
&=
`r pred_male - pred_female`
\end{aligned}
$$

Shortcut:

$$
\begin{aligned}
& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\
&= (\beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36) - 
(\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36) \\
&= \beta_M \\ 
&=  `r coef(bw_lm1)["sexmale"]`
\end{aligned}
$$

:::

:::{.notes}

Note that age doesn't show up in this difference: in other words, according to this model, the difference between females and males with the same gestational age is the same for every age.

That's an assumption of the model; it's built-in to the parametric structure, even before we plug in the estimated values of those parameters.

That's why the lines are parallel.

:::

### Interactions {.smaller}

:::{.notes}
What if we don't like that parallel lines assumption?

Then we need to allow an "interaction" between  age $A$ and sex $S$:
:::

$$
E[Y|A=a, S=s] = \beta_0 + \beta_A a+ \beta_M m + \beta_{AM} (a \cdot m)
$$ {#eq-BW-lm-interact}

::: notes
Now, the slope of mean birthweight $E[Y|A,S]$ with respect to gestational age $A$ depends on the value of sex $S$.
:::

::: {.column width=40% .smaller}

```{r}
#| label: tbl-bw-model-coefs-interact
#| tbl-cap: "Birthweight model with interaction term"

bw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)
bw_lm2 |> 
  parameters() |>
  print_md(
    include_reference = TRUE,
    # show_sigma = TRUE,
    select = "{estimate}")

```

:::

:::{.column width=5%}
:::

:::{.column width=55%}

```{r}
#| label: fig-bw-interaction
#| fig-cap: "Birthweight model with interaction term"
bw = 
  bw |> 
  mutate(
    predlm2 = predict(bw_lm2)
  ) |> 
  arrange(sex, age)

plot1_interact = 
  plot1 %+% bw +
  geom_line(aes(y = predlm2))

print(plot1_interact)
```

:::

::: {.notes}
Now we can see that the lines aren't parallel.
:::

---

Here's another way we could rewrite this model (by collecting terms involving $S$):

$$
E[Y|A, M] = \beta_0 + \beta_M M+ (\beta_A + \beta_{AM} M) A
$$

::: callout-note
If you want to understand a coefficient in a model with interactions, collect terms for the corresponding variable, and you will see what other variables are interacting with the variable you are interested in.
:::

:::{.notes}
In this case, the coefficient $S$ is interacting with $A$. So the slope of $Y$ with respect to $A$ depends on the value of $M$.

According to this model, there is no such thing as "*the* slope of birthweight with respect to age". There are two slopes, one for each sex.^[using the definite article "the" would mean there is only one slope.] We can only talk about "the slope of birthweight with respect to age among males" and "the slope of birthweight with respect to age among females".

Then: that coefficient is the difference in means per unit change in its corresponding coefficient, when the other collected variables are set to 0.
:::

---

::: notes
To learn what this model is assuming, let's plug in a few values.
:::

:::{#exr-pred-fem-interact}
According to this model, what's the mean birthweight for a female born at 36 weeks?
:::

---

::: {.solution}
\ 

```{r}
pred_female = coef(bw_lm2)["(Intercept)"] + coef(bw_lm2)["age"]*36
```

$$
E[Y|A = 0, X_2 = 36] = 
\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36 + \beta_{AM} \cdot (0 * 36) 
=  `r pred_female`
$$ 

:::

---

:::{#exr-pred-interact-male_36}
What's the mean birthweight for a male born at 36 weeks?

:::

---

::: solution
\ 

```{r}
pred_male = 
  coef(bw_lm2)["(Intercept)"] + 
  coef(bw_lm2)["sexmale"] + 
  coef(bw_lm2)["age"]*36 + 
  coef(bw_lm2)["sexmale:age"] * 36
```

$$
\ba
E[Y|A = 0, X_2 = 36]
&= \beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 + \beta_{AM} \cdot 1 \cdot 36\\
&= `r pred_male`
\ea
$$

:::

---

:::{#exr-diff-gender-interact}
What's the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?
:::

---

:::{.solution}

$$
\begin{aligned}
& E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ 
&= (\beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 + \beta_{AM} \cdot 1 \cdot 36)\\ 
&\ \ \ \ \  -(\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36 + \beta_{AM} \cdot 0 \cdot 36) \\
&= \beta_{S} + \beta_{AM}\cdot 36\\
&=  `r pred_male - pred_female`
\end{aligned}
$$
:::

:::{.notes}
Note that age now does show up in the difference: in other words, according to this model, the difference in mean birthweights between females and males with the same gestational age can vary by gestational age.

That's how the lines in the graph ended up non-parallel.

:::

### Stratified regression {.smaller}

:::{.notes}
We could re-write the interaction model as a stratified model, with a slope and intercept for each sex:
:::

$$
\E{Y|A=a, S=s} = 
\beta_M m + \beta_{AM} (a \cdot m) + 
\beta_F f + \beta_{AF} (a \cdot f)
$$ {#eq-model-strat}

Compare this stratified model with our interaction model, @eq-BW-lm-interact:

$$
\E{Y|A=a, S=s} = 
\beta_0 + \beta_A a + \beta_M m + \beta_{AM} (a \cdot m)
$$

::: notes

In the stratified model, the intercept term $\beta_0$ has been relabeled as $\beta_F$.

:::

::: {.column width=45%}
```{r}
#| label: tbl-bw-model-coefs-interact2
#| tbl-cap: "Birthweight model with interaction term"

bw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)
bw_lm2 |> 
  parameters() |>
  print_md(
    include_reference = TRUE,
    # show_sigma = TRUE,
    select = "{estimate}")

```

:::

:::{.column width=10%}
:::

:::{.column width=45%}

```{r}
#| label: tbl-bw-model-coefs-strat
#| tbl-cap: "Birthweight model - stratified betas"

bw_lm_strat = 
  bw |> 
  lm(
    formula = weight ~ sex + sex:age - 1, 
    data = _)

bw_lm_strat |> 
  parameters() |>
  print_md(
    # show_sigma = TRUE,
    select = "{estimate}")
```

:::

### Curved-line regression

::: notes
If we transform some of our covariates ($X$s) and plot the resulting model on the original covariate scale, we end up with curved regression lines:
:::

```{r}
#| label: fig-penguins-log-x
#| fig-cap: "`palmerpenguins` model with `bill_length` entering on log scale"
bw_lm3 = lm(weight ~ sex:log(age) - 1, data = bw)
library(palmerpenguins)

ggpenguins <- 
  palmerpenguins::penguins |> 
  dplyr::filter(species == "Adelie") |> 
  ggplot(
    aes(x = bill_length_mm , y = body_mass_g)) +
  geom_point() + 
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")

ggpenguins2 = ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth") +
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")


ggpenguins2 |> print()
```

## Estimating Linear Models via Maximum Likelihood {#sec-est-LMs}

### Likelihood, log-likelihood, and score functions for linear regression {.smaller}

:::{.notes}

In EPI 203 and @sec-intro-MLEs, we learned how to fit outcome-only models of the form $p(X=x|\theta)$ to iid data $\vx = (x_1,…,x_n)$ using maximum likelihood estimation.

Now, we apply the same procedure to linear regression models:

:::

$$
\mathcal L(\vec y|\mat x,\beta, \sigma^2) = 
\prod_{i=1}^n (2\pi\sigma^2)^{-1/2} 
\exp{-\frac{1}{2\sigma^2}(y_i - \vec{x_i}'\beta)^2}
$$ {#eq-linreg-lik}

$$
\ell(\vec y|\mat x,\beta, \sigma^2) 
= -\frac{n}{2}\log{\sigma^2} - 
\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \vec{x_i}' \beta)^2
$$ {#eq-linreg-loglik}

$$
\ell'_{\beta}(\vec y|\mat x,\beta, \sigma^2) 
= - 
\frac{1}{2\sigma^2}\deriv{\beta}
\paren{\sum_{i=1}^n (y_i - \vec{x_i}\' \beta)^2}
$$ {#eq-linreg-score}

---

::: notes
Let's switch to matrix-vector notation:
:::

$$
\sum_{i=1}^n (y_i - \vx_i\' \vb)^2 
= (\vy - \mX\vb)'(\vy - \mX\vb)
$$

---

So

$$
\begin{aligned}
(\vy - \mX\vb)'(\vy - \mX\vb) 
&= (\vy' - \vb'X')(\vy - \mX\vb)
\\ &= y'y - \vb'X'y - y'\mX\vb +\vb'\mX'\mX\beta
\\ &= y'y - 2y'\mX\beta +\beta'\mX'\mX\beta
\end{aligned}
$$

### Deriving the linear regression score function

::: notes
We will use some results from [vector calculus](math-prereqs.qmd#sec-vector-calculus):
:::

$$
\begin{aligned}
\deriv{\beta}\paren{\sum_{i=1}^n (y_i - x_i' \beta)^2} 
   &= \deriv{\beta}(\vy - X\beta)'(\vy - X\beta)
\\ &= \deriv{\beta} (y'y - 2y'X\beta +\beta'X'X\beta)
\\ &= (- 2X'y +2X'X\beta)
\\ &= - 2X'(y - X\beta)
\\ &= - 2X'(y - \Expp[y])
\\ &= - 2X' \err(y)
\end{aligned}
$${#eq-scorefun-linreg}

---

So if $\ell(\beta,\sigma^2) =0$, then

$$
\begin{aligned}
0 &= (- 2X'y +2X'X\beta)\\
2X'y &= 2X'X\beta\\
X'y &= X'X\beta\\
(X'X)^{-1}X'y &= \beta
\end{aligned}
$$

---

The second derivative matrix $\ell_{\beta, \beta'} ''(\beta, \sigma^2;\mathbf X,\vy)$ is negative definite at $\beta = (X'X)^{-1}X'y$, so $\hat \beta_{ML} = (X'X)^{-1}X'y$ is the MLE for $\beta$.

---

Similarly (not shown):

$$
\hat\sigma^2_{ML} = \frac{1}{n} (Y-X\hat\beta)'(Y-X\hat\beta)
$$

And

$$
\begin{aligned}
\mathcal I_{\beta} &= E[-\ell_{\beta, \beta'} ''(Y|X,\beta, \sigma^2)]\\
&= \frac{1}{\sigma^2}X'X
\end{aligned}
$$

---

So:

$$
Var(\hat \beta) \approx (\mathcal I_{\beta})^{-1} = \sigma^2 (X'X)^{-1}
$$

and

$$
\hat\beta \dot \sim N(\beta, \mathcal I_{\beta}^{-1})
$$ 

:::{.notes}

These are all results you have hopefully seen before.

:::

---

In the Gaussian linear regression case, we also have exact results:

$$
\frac{\hat\beta_j}{\hse{\hat\beta_j}} \dist t_{n-p}
$$ 

---

In our model 2 above, $\heinf(\beta)$ is:

```{r}
bw_lm2 |> vcov()
```

If we take the square roots of the diagonals, we get the standard errors listed in the model output:

```{r}

bw_lm2 |> vcov() |> diag() |> sqrt()

```

```{r}
#| label: tbl-mod-intx
#| tbl-cap: "Estimated model for `birthweight` data with interaction term"
bw_lm2 |> parameters() |> print_md()
```

So we can do confidence intervals, hypothesis tests, and p-values exactly as in the one-variable case we looked at previously.

### Residual Standard Deviation

::: notes
$\hs$ represents an *estimate* of the *Residual Standard Deviation* parameter, $\s$. 
We can extract $\hs$ from the fitted model, using the `sigma()` function:
:::

```{r}
#| code-fold: show
sigma(bw_lm2)
```

---

#### $\s$ is NOT "Residual standard error"

::: notes
In the `summary.lm()` output, this estimate is labeled as `"Residual standard error"`:
:::

```{r}
#| code-fold: show
summary(bw_lm2)
```

---

::: notes
However, this is a misnomer:
:::

```{r, printr.help.sections = c("description", "note")}
#| code-fold: show
library(printr) # captures ? documentation
?stats::sigma
```

## Inference about Gaussian Linear Regression Models {#sec-infer-LMs}

### Motivating example: `birthweight` data

Research question: is there really an interaction between sex and age?

$H_0: \beta_{AM} = 0$

$H_A: \beta_{AM} \neq 0$

$P(|\hat\beta_{AM}| > |`r coef(bw_lm2)["sexmale:age"]`| \mid H_0)$ = ?

### Wald tests and CIs {.smaller}

R can give you Wald tests for single coefficients and corresponding CIs:

```{r "wald tests bw_lm2"}

bw_lm2 |> 
  parameters() |>
  print_md(
    include_reference = TRUE)
```

To understand what's happening, let's replicate these results by hand for the interaction term.

### P-values {.smaller}

```{r}
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)

```


```{r}
beta_hat = coef(summary(bw_lm2))["sexmale:age", "Estimate"]
se_hat = coef(summary(bw_lm2))["sexmale:age", "Std. Error"]
dfresid = bw_lm2$df.residual
t_stat = abs(beta_hat)/se_hat
pval_t = 
  pt(-t_stat, df = dfresid, lower.tail = TRUE) +
  pt(t_stat, df = dfresid, lower.tail = FALSE)
```

$$
\begin{aligned}
&P\paren{
| \hat \beta_{AM}  | > 
| `r coef(bw_lm2)["sexmale:age"]`| \middle| H_0
} 
\\
&= \Pr \paren{
\abs{ \frac{\hat\beta_{AM}}{\hat{SE}(\hat\beta_{AM})} } > 
\abs{ \frac{`r beta_hat`}{`r se_hat`} } \middle| H_0
}\\ 
&= \Pr \paren{
\abs{ T_{`r dfresid`} }  >  `r t_stat` | H_0
}\\
&= `r pval_t`
\end{aligned}
$$ 

::: notes
This matches the result in the table above.
:::

### Confidence intervals

```{r}
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)

```

```{r}
#| label: confint-t
q_t = qt(
  p = 0.975, 
  df = dfresid, 
  lower.tail = TRUE)

q_t = qt(
  p = 0.025, 
  df = dfresid, 
  lower.tail = TRUE)


confint_radius_t = 
  se_hat * q_t

confint_t = beta_hat + c(-1,1) * confint_radius_t

print(confint_t)
```

::: notes
This also matches.
:::

### Gaussian approximations

Here are the asymptotic (Gaussian approximation) equivalents:

### P-values {.smaller}

```{r}
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)

```

```{r}
pval_z = pnorm(abs(t_stat), lower = FALSE) * 2

print(pval_z)
```

### Confidence intervals  {.smaller}

```{r}
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)

```

```{r}
confint_radius_z = se_hat * qnorm(0.975, lower = TRUE)
confint_z = 
  beta_hat + c(-1,1) * confint_radius_z
print(confint_z)
```

### Likelihood ratio statistics

```{r}

logLik(bw_lm2)
logLik(bw_lm1)

lLR = (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()
delta_df = (bw_lm1$df.residual - df.residual(bw_lm2))


x_max = 1

```

---

```{r}
#| label: fig-chisq-plot
#| fig-cap: "Chi-square distribution"

d_lLR = function(x, df = delta_df) dchisq(x, df = df)

chisq_plot = 
  ggplot() + 
  geom_function(fun = d_lLR) +
  stat_function( fun = d_lLR, xlim = c(lLR, x_max), geom = "area", fill = "gray") +
  geom_segment(aes(x = lLR, xend = lLR, y = 0, yend = d_lLR(lLR)), col = "red") + 
  xlim(0.0001,x_max) + 
  ylim(0,4) + 
  ylab("p(X=x)") + 
  xlab("log(likelihood ratio) statistic [x]") +
  theme_classic()
chisq_plot |> print()
```

---

Now we can get the p-value:

```{r}
#| label: LRT-pval
pchisq(
  q = 2*lLR, 
  df = delta_df, 
  lower = FALSE) |> 
  print()
```


---

In practice you don't have to do this by hand; there are functions to do it for you:

```{r}

# built in
library(lmtest)
lrtest(bw_lm2, bw_lm1)

```

## Goodness of fit

### AIC and BIC

::: notes
When we use likelihood ratio tests, we are comparing how well different models fit the data.

Likelihood ratio tests require "nested" models: one must be a special case of the other.

If we have non-nested models, we can instead use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC):
:::

-   AIC = $-2 * \ell(\hat\theta) + 2 * p$

-   BIC = $-2 * \ell(\hat\theta) + p * \text{log}(n)$

where $\ell$ is the log-likelihood of the data evaluated using the parameter estimates $\hat\theta$, $p$ is the number of estimated parameters in the model (including $\hat\sigma^2$), and $n$ is the number of observations.

You can calculate these criteria using the `logLik()` function, or use the built-in R functions:

#### AIC in R

```{r}

-2 * logLik(bw_lm2) |> as.numeric() + 
  2*(length(coef(bw_lm2))+1) # sigma counts as a parameter here

AIC(bw_lm2)

```

#### BIC in R

```{r}

-2 * logLik(bw_lm2) |> as.numeric() + 
  (length(coef(bw_lm2))+1) * log(nobs(bw_lm2))

BIC(bw_lm2)

```

Large values of AIC and BIC are worse than small values. There are no hypothesis tests or p-values associated with these criteria.

### (Residual) Deviance

Let $q$ be the number of distinct covariate combinations in a data set.

```{r}
bw.X.unique = 
  bw |> 
  count(sex, age)

n_unique.bw  = nrow(bw.X.unique)
```

For example, in the `birthweight` data, there are $q = `r n_unique.bw`$ unique patterns (@tbl-bw-x-combos).

```{r}
#| tbl-cap: "Unique covariate combinations in the `birthweight` data, with replicate counts"
#| label: tbl-bw-x-combos
bw.X.unique
```

---

::: {#def-replicates}
#### Replicates
If a given covariate pattern has more than one observation in a dataset, those observations are called **replicates**.
:::

---

::: {#exm-replicate-bw}

#### Replicates in the `birthweight` data

In the `birthweight` dataset, there are 2 replicates of the combination "female, age 36" (@tbl-bw-x-combos).

:::

---

::: {#exr-replicate-bw}

#### Replicates in the `birthweight` data

Which covariate pattern(s) in the `birthweight` data has the most replicates?

:::

---

::: {#sol-replicate-bw}

#### Replicates in the `birthweight` data

Two covariate patterns are tied for most replicates: males at age 40 weeks 
and females at age 40 weeks.
40 weeks is the usual length for human pregnancy (@polin2011fetal), so this result makes sense.

```{r}
bw.X.unique |> dplyr::filter(n == max(n))
```

:::

---

#### Saturated models {.smaller}

The most complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:

```{r}
#| label: tbl-bw-model-sat
#| tbl-cap: "Saturated model for the `birthweight` data"
lm_max = 
  bw |> 
  mutate(age = factor(age)) |> 
  lm(
    formula = weight ~ sex:age - 1, 
    data = _)

lm_max |> 
  parameters() |> 
  print_md()

```

We call this model the **full**, **maximal**, or **saturated** model for this dataset.

We can calculate the log-likelihood of this model as usual:

```{r}
logLik(lm_max)
```

We can compare this model to our other models using chi-square tests, as usual:

```{r}
lrtest(lm_max, bw_lm2)
```

The likelihood ratio statistic for this test is $$\lambda = 2 * (\ell_{\text{full}} - \ell) = `r 2 * (logLik(lm_max) - logLik(bw_lm2))`$$ where:

-   $\ell_{\text{max}}$ is the log-likelihood of the full model: `r logLik(lm_max)`
-   $\ell$ is the log-likelihood of our comparison model (two slopes, two intercepts): `r logLik(bw_lm2)`

This statistic is called the **deviance** or **residual deviance** for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.

The corresponding p-value tells us whether there we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us if there's evidence that we missed any important patterns. (Remember, a nonsignificant p-value could mean that we didn't miss anything and a more complicated model is unnecessary, or it could mean we just don't have enough data to tell the difference between these models.)

### Null Deviance

Similarly, the *least* complicated model we could fit would have only one mean parameter, an intercept:

$$\text E[Y|X=x] = \beta_0$$ We can fit this model in R like so:

```{r}
lm0 = lm(weight ~ 1, data = bw)

lm0 |> parameters() |> print_md()

```

This model also has a likelihood:

```{r}
logLik(lm0)
```

And we can compare it to more complicated models using a likelihood ratio test:

```{r}

lrtest(bw_lm2, lm0)

```

The likelihood ratio statistic for the test comparing the null model to the maximal model is $$\lambda = 2 * (\ell_{\text{full}} - \ell_{0}) = `r 2 * (logLik(lm_max) - logLik(lm0))`$$ where:

-   $\ell_{\text{0}}$ is the log-likelihood of the null model: `r logLik(lm0)`
-   $\ell_{\text{full}}$ is the log-likelihood of the maximal model: `r logLik(lm_max)`

In R, this test is:

```{r}
lrtest(lm_max, lm0)
```

This log-likelihood ratio statistic is called the **null deviance**. It tells us whether we have enough data to detect a difference between the null and full models.

## Rescaling

### Rescale age {.smaller}

```{r}

bw = 
  bw |>
  mutate(
    `age - mean` = age - mean(age),
    `age - 36wks` = age - 36
  )

lm1c = lm(weight ~ sex + `age - 36wks`, data = bw)

lm2c = lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)

parameters(lm2c, ci_method = "wald") |> print_md()

```

Compare with what we got without rescaling:

```{r}
parameters(bw_lm2, ci_method = "wald") |> print_md()
```

## Prediction

### Prediction for linear models

:::{#def-predicted-value}
#### Predicted value

In a regression model $\p(y|x)$, the **predicted value** of $y$ given $x$ is the estimated mean of $Y$ given $X$:

$$\hat y \eqdef \hE{Y|X=x}$$
:::

---

For linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value $x_j$ by its corresponding coefficient $\beta_j$ and adding up the results:

$$
\begin{aligned}
\hat Y &= \hat E[Y|X=x] \\
&= x'\hat\beta \\
&= \hat\beta_0\cdot 1 + \hat\beta_1 x_1 + ... + \hat\beta_p x_p
\end{aligned}
$$

---

### Example: prediction for the `birthweight` data

```{r}

X = c(1,1,40)
sum(X * coef(bw_lm1))
```

R has built-in functions for prediction:
```{r}
x = tibble(age = 40, sex = "male")
bw_lm1 |> predict(newdata = x)
```

If you don't provide `newdata`, R will use the covariate values from the original dataset:

```{r}
predict(bw_lm1)
```

These special predictions are called the *fitted values* of the dataset:

:::{#def-fitted-value}

For a given dataset $(\vY, \mX)$ and corresponding fitted model $\p_{\hb}(\vy|\mx)$, the **fitted value** of $y_i$ is the predicted value of $y$ when $\vX=\vx_i$ using the estimate parameters $\hb$.

:::

R has an extra function to get these values:

```{r}
fitted(bw_lm1)
```

### Quantifying uncertainty in predictions

```{r}
bw_lm1 |> 
  predict(
    newdata = x,
    se.fit = TRUE)
```

This is a `list()`; you can extract the elements with `$` or `magrittr::use_series()`:

```{r}
bw_lm1 |> 
  predict(
    newdata = x,
    se.fit = TRUE) |> 
  use_series(se.fit)
```

You can get **confidence intervals** for $\E{Y|X=x}$:

```{R}
bw_lm1 |> predict(
  newdata = x,
  interval = "confidence")
```

You can also get **prediction intervals** for the value of an individual outcome $Y$:

```{r}
bw_lm1 |> 
  predict(newdata = x, interval = "predict")

```

The warning from the last command is: "predictions on current data refer to *future* responses" (since you already know what happened to the current data, and thus don't need to predict it).

See `?predict.lm` for more.

## Diagnostics {#sec-diagnose-LMs}

:::{.callout-tip}
This section is adapted from @dobson4e [§6.2-6.3] and 
@dunn2018generalized [Chapter 3](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3).
:::
### Assumptions in linear regression models {.smaller .scrollable}

$$Y|\vX \simind N(\vX'\b,\ss)$$

1. Normality: The distribution conditional on a given $X$ value is normal

2. Correct Functional Form: The conditional means have the structure 

$$E[Y|\vec X = \vec x] = \vec x'\beta$$
3. Homoskedasticity: The variance $\ss$ is constant (with respect to $\vx$)

4. Independence: The observations are statistically independent

### Direct visualization

::: notes
The most direct way to examine the fit of a model is to compare it to the raw observed data.
:::

```{r}
#| label: fig-bw-interaction2
#| fig-cap: "Birthweight model with interaction term"
bw = 
  bw |> 
  mutate(
    predlm2 = predict(bw_lm2)
  ) |> 
  arrange(sex, age)

plot1_interact = 
  plot1 %+% bw +
  geom_line(aes(y = predlm2))

print(plot1_interact)
```

::: notes
It's not easy to assess these assumptions from this model.
If there are multiple continuous covariates, it becomes even harder to visualize the raw data.
:::

### Residuals

::: notes
Maybe we can transform the data and model in some way to make it easier to inspect.
:::
:::{#def-resid-noise}
#### Residual noise

The **residual noise** in a probabilistic model $p(Y)$ is the difference between an observed value $y$ and its distributional mean:

$$\eps(y) \eqdef y - \Exp{Y}$$ {#eq-def-resid}
:::

:::{.notes}
We use the same notation for residual noise that we used for [errors](estimation.qmd#def-error). 
$\Exp{Y}$ can be viewed as an estimate of $Y$, before $y$ is observed.
Conversely, each observation $y$ can be viewed as an estimate of $\Exp{Y}$ (albeit an imprecise one, individually, since $n=1$). 

:::

We can rearrange @eq-def-resid to view $y$ as the sum of its mean plus the residual noise:

$$y = \Exp{Y} + \eps{y}$$

---

:::{#thm-gaussian-resid-noise}
#### Residuals in Gaussian models

If $Y$ has a Gaussian distribution, then $\err(Y)$ also has a Gaussian distribution, and vice versa.
:::

:::{.proof}
Left to the reader.
:::

---

:::{#def-resid-fitted}
#### Residual errors of a fitted model value

The **residual of a fitted value $\hat y$** (shorthand: "residual") is its [error](estimation.qmd#def-error):
$$
\ba
e(\hat y) &\eqdef \erf{\hat y}
\\&= y - \hat y
\ea
$$
:::

$e(\hat y)$ can be seen as the maximum likelihood estimate of the residual noise:

$$
\ba
e(\hy) &= y - \hat y
\\ &= \hat\eps_{ML}
\ea
$$

---

#### General characteristics of residuals

:::{#thm-resid-unbiased}
For [unbiased](estimation.qmd#sec-unbiased-estimators) estimators $\hth$:

$$\E{e(y)} = 0$$ {#eq-mean-resid-unbiased}
$$\Var{e(y)} \approx \ss$$ {#eq-var-resid-unbiased}

:::

:::{.proof}
\ 

@eq-mean-resid-unbiased:

$$
\ba
\Ef{e(y)} &= \Ef{y - \hat y}
\\ &= \Ef{y} - \Ef{\hat y}
\\ &= \Ef{y} - \Ef{y}
\\ &= 0
\ea
$$

@eq-var-resid-unbiased:

$$
\ba
\Var{e(y)} &= \Var{y - \hy}
\\ &= \Var{y} + \Var{\hy} - 2 \Cov{y, \hy}
\\ &{\dot{\approx}} \Var{y} + 0 - 2 \cdot 0
\\ &= \Var{y}
\\ &= \ss
\ea
$$
:::

---

#### Characteristics of residuals in Gaussian models

With enough data and a correct model, the residuals will be approximately Guassian distributed, with variance $\sigma^2$, which we can estimate using $\hat\sigma^2$: that is:

$$
e_i \siid N(0, \hat\sigma^2)
$$

---

:::{#exm-resid-bw}
#### residuals in `birthweight` data

R provides a function for residuals:

```{r}
resid(bw_lm2)
```

:::

:::{#exr-calc-resids}
Check R's output by computing the residuals directly.
:::

:::{.solution}
\ 
```{r}
bw$weight - fitted(bw_lm2)
```

This matches R's output!
:::

---

#### Graph the residuals

```{r}
#| label: fig-resids-intxn
#| fig-cap: "Residuals of interaction model for `birthweight` data"
bw = bw |> 
  mutate(resids_intxn = 
           weight - fitted(bw_lm2))

plot_bw_resid =
  bw |> 
  ggplot(aes(
    x = age, 
    y = resids_intxn,
    linetype = sex,
    shape = sex,
    col = sex))  +
  theme_bw() +
  xlab("Gestational age (weeks)") +
  ylab("residuals (grams)") +
  theme(legend.position = "bottom") +
  # expand_limits(y = 0, x = 0) +
  geom_point(alpha = .7)
print(plot_bw_resid + facet_wrap(~ sex))
```

---

:::{#def-stred}

#### Standardized residuals

$$r_i = \frac{e_i}{\widehat{SD}(e_i)}$$

:::

Hence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,

$$
r_i \siid N(0,1)
$$

### Marginal distributions of residuals

To look for problems with our model, we can check whether the residuals $e_i$ and standardized residuals $r_i$ look like they have the distributions that they are supposed to have, according to the model.

---

#### Standardized residuals in R

```{r}

rstandard(bw_lm2)
resid(bw_lm2)/sigma(bw_lm2)

```

::: notes
These are not quite the same, because R is doing something more complicated and precise to get the standard errors. Let's not worry about those details for now; the difference is pretty small in this case:

:::

---

```{r}

rstandard_compare_plot = 
  tibble(
    x = resid(bw_lm2)/sigma(bw_lm2), 
    y = rstandard(bw_lm2)) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point() + 
  theme_bw() +
  coord_equal() + 
  xlab("resid(bw_lm2)/sigma(bw_lm2)") +
  ylab("rstandard(bw_lm2)") +
  geom_abline(
    aes(
      intercept = 0,
      slope = 1, 
      col = "x=y")) +
  labs(colour="") +
  scale_colour_manual(values="red")

print(rstandard_compare_plot)
```

---

Let's add these residuals to the `tibble` of our dataset:

```{r}

bw = 
  bw |> 
  mutate(
    fitted_lm2 = fitted(bw_lm2),
    
    resid_lm2 = resid(bw_lm2),
    # resid_lm2 = weight - fitted_lm2,
    
    std_resid_lm2 = rstandard(bw_lm2),
    # std_resid_lm2 = resid_lm2 / sigma(bw_lm2)
  )

bw |> 
  select(
    sex,
    age,
    weight,
    fitted_lm2,
    resid_lm2,
    std_resid_lm2
  )
```

---

::: notes

Now let's build histograms:

:::

```{r}
#| fig-cap: "Marginal distribution of (nonstandardized) residuals"
#| label: fig-marg-dist-resid
resid_marginal_hist = 
  bw |> 
  ggplot(aes(x = resid_lm2)) +
  geom_histogram()

print(resid_marginal_hist)
```

::: notes
Hard to tell with this small amount of data, but I'm a bit concerned that the histogram doesn't show a bell-curve shape.

:::

---


```{r}
#| fig-cap: "Marginal distribution of standardized residuals"
#| label: fig-marg-stresd
std_resid_marginal_hist = 
  bw |> 
  ggplot(aes(x = std_resid_lm2)) +
  geom_histogram()

print(std_resid_marginal_hist)
```

::: notes
This looks similar, although the scale of the x-axis got narrower, because we divided by $\hat\sigma$ (roughly speaking).

Still hard to tell if the distribution is Gaussian.

:::

---

### QQ plot of standardized residuals

::: notes
Another way to assess normality is the QQ plot of the standardized residuals versus normal quantiles:

:::

```{r}

library(ggfortify) 
# needed to make ggplot2::autoplot() work for `lm` objects

qqplot_lm2_auto = 
  bw_lm2 |> 
  autoplot(
    which = 2, # options are 1:6; can do multiple at once
    ncol = 1) +
  theme_classic()

print(qqplot_lm2_auto)
```

::: notes
If the Gaussian model were correct, these points should follow the dotted line.

Fig 2.4 panel (c) in @dobson4e is a little different; they didn't specify how they produced it, but other statistical analysis systems do things differently from R.

See also @dunn2018generalized [§3.5.4](https://link.springer.com/chapter/10.1007/978-1-4419-0118-7_3#Sec14:~:text=3.5.4%20Q%E2%80%93Q%20Plots%20and%20Normality).

:::

---

#### QQ plot - how it's built

::: notes
Let's construct it by hand:
:::

```{r}

bw = bw |> 
  mutate(
    p = (rank(std_resid_lm2) - 1/2)/n(), # "Blom's method"
    expected_quantiles_lm2 = qnorm(p)
  )

qqplot_lm2 = 
  bw |> 
  ggplot(
    aes(
      x = expected_quantiles_lm2, 
      y = std_resid_lm2, 
      col = sex, 
      shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  theme(legend.position='none') + # removing the plot legend
  ggtitle("Normal Q-Q") +
  xlab("Theoretical Quantiles") + 
  ylab("Standardized residuals")

# find the expected line:

ps <- c(.25, .75)                  # reference probabilities
a <- quantile(rstandard(bw_lm2), ps)  # empirical quantiles
b <- qnorm(ps)                     # theoretical quantiles

qq_slope = diff(a)/diff(b)
qq_intcpt = a[1] - b[1] * qq_slope

qqplot_lm2 = 
  qqplot_lm2 +
  geom_abline(slope = qq_slope, intercept = qq_intcpt)

print(qqplot_lm2)
```

---

### Conditional distributions of residuals

If our Gaussian linear regression model is correct, the residuals $e_i$ and standardized residuals $r_i$ should have:

-   an approximately Gaussian distribution, with:
-   a mean of 0
-   a constant variance

This should be true **for every** value of $x$.

---

If we didn't correctly guess the functional form of the linear component of the mean, 
$$\text{E}[Y|X=x] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p$$

Then the the residuals might have nonzero mean.

Regardless of whether we guessed the mean function correctly, ther the variance of the residuals might differ between values of $x$.

---

#### Residuals versus fitted values

::: notes
To look for these issues, we can plot the residuals $e_i$ against the fitted values $\hat y_i$ (@fig-bw_lm2-resid-vs-fitted).
:::
```{r}
#| label: fig-bw_lm2-resid-vs-fitted
#| fig-cap: "`birthweight` model  (@eq-BW-lm-interact): residuals versus fitted values"
autoplot(bw_lm2, which = 1, ncol = 1) |> print()
```

::: notes
If the model is correct, the blue line should stay flat and close to 0, and the cloud of dots should have the same vertical spread regardless of the fitted value.

If not, we probably need to change the functional form of linear component of the mean, $$\text{E}[Y|X=x] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p$$

:::

---

{{< include exm-plot-fits.qmd >}}
---

#### Scale-location plot

::: notes
We can also plot the square roots of the absolute values of the standardized residuals against the fitted values (@fig-bw-scale-loc).
:::

```{r}
#| label: fig-bw-scale-loc
#| fig-cap: "Scale-location plot of `birthweight` data"
autoplot(bw_lm2, which = 3, ncol = 1) |> print()
```

::: notes
Here, the blue line doesn't need to be near 0, 
but it should be flat. 
If not, the residual variance $\sigma^2$ might not be constant, 
and we might need to transform our outcome $Y$ 
(or use a model that allows non-constant variance).
:::

---

#### Residuals versus leverage

::: notes

We can also plot our standardized residuals against "leverage", which roughly speaking is a measure of how unusual each $x_i$ value is. Very unusual $x_i$ values can have extreme effects on the model fit, so we might want to remove those observations as outliers, particularly if they have large residuals.

:::

```{r}
#| labels: fig-bw_lm2_resid-vs-leverage
#| fig-cap: "`birthweight` model with interactions (@eq-BW-lm-interact): residuals versus leverage"
autoplot(bw_lm2, which = 5, ncol = 1) |> print()
```

::: notes
The blue line should be relatively flat and close to 0 here.
:::

---

### Diagnostics constructed by hand

```{r}

bw = 
  bw |> 
  mutate(
    predlm2 = predict(bw_lm2),
    residlm2 = weight - predlm2,
    std_resid = residlm2 / sigma(bw_lm2),
    # std_resid_builtin = rstandard(bw_lm2), # uses leverage
    sqrt_abs_std_resid = std_resid |> abs() |> sqrt()
    
  )

```

##### Residuals vs fitted

```{r}

resid_vs_fit = bw |> 
  ggplot(
    aes(x = predlm2, y = residlm2, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

```

::: {.content-visible when-format="html"}
```{r}
print(resid_vs_fit)
```
:::

::: {.content-visible when-format="pdf"}
```{r}
print(resid_vs_fit)
```
:::

##### Standardized residuals vs fitted

```{r}
bw |> 
  ggplot(
    aes(x = predlm2, y = std_resid, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

```

##### Standardized residuals vs gestational age

```{r}

bw |> 
  ggplot(
    aes(x = age, y = std_resid, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

```

##### `sqrt(abs(rstandard()))` vs fitted

Compare with `autoplot(bw_lm2, 3)`

```{r}


bw |> 
  ggplot(
    aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)


```

## Model selection

(adapted from @dobson4e §6.3.3; for more information on prediction, see @james2013introduction and @rms2e).

::: notes
If we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.

There are a few possible metrics to consider for choosing a "best" model.
:::

### Mean squared error

We might want to minimize the **mean squared error**, $\text E[(y-\hat y)^2]$, for new observations that weren't in our data set when we fit the model.

Unfortunately, $$\frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2$$ gives a biased estimate of $\text E[(y-\hat y)^2]$ for new data. If we want an unbiased estimate, we will have to be clever.

---

#### Cross-validation

```{r}
data("carbohydrate", package = "dobson")
library(cvTools)
full_model <- lm(carbohydrate ~ ., data = carbohydrate)
cv_full = 
  full_model |> cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate)

reduced_model = update(full_model, 
                       formula = ~ . - age)

cv_reduced = 
  reduced_model |> cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate)
```

---

```{r}
results_reduced = 
  tibble(
      model = "wgt+protein",
      errs = cv_reduced$reps[])
results_full = 
  tibble(model = "wgt+age+protein",
           errs = cv_full$reps[])

cv_results = 
  bind_rows(results_reduced, results_full)

cv_results |> 
  ggplot(aes(y = model, x = errs)) +
  geom_boxplot()

```

---

##### comparing metrics

```{r}

compare_results = tribble(
  ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,
  "full", cv_full$cv, summary(full_model)$r.squared,  summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),
  "reduced", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())

compare_results
```

---

```{r}
anova(full_model, reduced_model)
```

---

#### stepwise regression

```{r}
library(olsrr)
olsrr:::ols_step_both_aic(full_model)
```

---

#### Lasso

$$\arg min_{\theta} \llik(\th) + \lambda \sum_{j=1}^p|\beta_j|$$

```{r}
library(glmnet)
y = carbohydrate$carbohydrate
x = carbohydrate |> 
  select(age, weight, protein) |> 
  as.matrix()
fit = glmnet(x,y)
```

---

```{r}
#| fig-cap: "Lasso selection"
#| label: fig-carbs-lasso
autoplot(fit, xvar = 'lambda')

```

---

```{r}
cvfit = cv.glmnet(x,y)
plot(cvfit)
```

---

```{r}
coef(cvfit, s = "lambda.1se")
```


## Categorical covariates with more than two levels

### Example: `birthweight`

In the birthweight example, the variable `sex` had only two observed values:

```{r}
unique(bw$sex)
```

If there are more than two observed values, we can't just use a single variable with 0s and 1s.

### 

:::{.notes}
For example, @tbl-iris-data shows the 
[(in)famous](https://www.meganstodel.com/posts/no-to-iris/) 
`iris` data (@anderson1935irises), 
and @tbl-iris-summary provides summary statistics. 
The data include three species: "setosa", "versicolor", and "virginica".
:::

```{r}
#| tbl-cap: "The `iris` data"
#| label: tbl-iris-data
head(iris)
```

```{r}
#| tbl-cap: "Summary statistics for the `iris` data"
#| label: tbl-iris-summary

library(table1)
table1(
  x = ~ . | Species,
  data = iris,
  overall = FALSE
)

```

---

If we want to model `Sepal.Length` by species, we could create a variable $X$ that represents "setosa" as $X=1$, "virginica" as $X=2$, and "versicolor" as $X=3$.

```{r}
#| label: tbl-numeric-coding
#| tbl-cap: "`iris` data with numeric coding of species"
data(iris) # this step is not always necessary, but ensures you're starting  
# from the original version of a dataset stored in a loaded package

iris = 
  iris |> 
  tibble() |>
  mutate(
    X = case_when(
      Species == "setosa" ~ 1,
      Species == "virginica" ~ 2,
      Species == "versicolor" ~ 3
    )
  )

iris |> 
  distinct(Species, X)

```

Then we could fit a model like:

```{r}
#| label: tbl-iris-numeric-species
#| tbl-cap: "Model of `iris` data with numeric coding of `Species`"
iris_lm1 = lm(Sepal.Length ~ X, data = iris)
iris_lm1 |> parameters() |> print_md()

```

### Let's see how that model looks:

```{r}
#| label: fig-iris-numeric-species-model
#| fig-cap: "Model of `iris` data with numeric coding of `Species`"
iris_plot1 = iris |> 
  ggplot(
    aes(
      x = X, 
      y = Sepal.Length)
  ) +
  geom_point(alpha = .1) +
  geom_abline(
    intercept = coef(iris_lm1)[1], 
    slope = coef(iris_lm1)[2]) +
  theme_bw(base_size = 18)
print(iris_plot1)

```

We have forced the model to use a straight line for the three estimated means. Maybe not a good idea?

### Let's see what R does with categorical variables by default:

```{r}
#| label: tbl-iris-model-factor1
#| tbl-cap: "Model of `iris` data with `Species` as a categorical variable"
iris_lm2 = lm(Sepal.Length ~ Species, data = iris)
iris_lm2 |> parameters() |> print_md()

```

### Re-parametrize with no intercept

If you don't want the default and offset option, you can use "-1" like we've seen previously:

```{r}
#| label: tbl-iris-no-intcpt
iris.lm2b = lm(Sepal.Length ~ Species - 1, data = iris)
iris.lm2b |> parameters() |> print_md()

```

### Let's see what these new models look like:

```{r}
#| label: fig-iris-no-intcpt
iris_plot2 = 
  iris |> 
  mutate(
    predlm2 = predict(iris_lm2)) |> 
  arrange(X) |> 
  ggplot(aes(x = X, y = Sepal.Length)) +
  geom_point(alpha = .1) +
  geom_line(aes(y = predlm2), col = "red") +
  geom_abline(
    intercept = coef(iris_lm1)[1], 
    slope = coef(iris_lm1)[2]) + 
  theme_bw(base_size = 18)

print(iris_plot2)

```

### Let's see how R did that:

```{r}
#| label: tbl-iris-model-matrix-factor
formula(iris_lm2)
model.matrix(iris_lm2) |> as_tibble() |> unique()

```

This is called a "corner point parametrization".

```{r}
#| label: tbl-iris-group-point-parameterization
formula(iris.lm2b)
model.matrix(iris.lm2b) |> as_tibble() |> unique()

```

This can be called a "group point parametrization".

There are more options; see @dobson4e §6.4.1 and the 
[`codingMatrices` package](https://CRAN.R-project.org/package=codingMatrices) 
[vignette](https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf) 
(@venablescodingMatrices).

## Ordinal covariates

(c.f. @dobson4e §2.4.4)

---

::: notes
We can create ordinal variables in R using the `ordered()` function^[or equivalently, `factor(ordered = TRUE)`].
:::

:::{#exm-ordinal-variable}

```{r}
#| eval: false
url = paste0(
  "https://regression.ucsf.edu/sites/g/files/tkssra6706/",
  "f/wysiwyg/home/data/hersdata.dta")
library(haven)
hers = read_dta(url)
```

```{r}
#| include: false
library(haven)
hers = read_stata("Data/hersdata.dta")
```

```{r}
#| tbl-cap: "HERS dataset"
#| label: tbl-HERS
hers |> head()
```

:::

---

```{r}

# C(contr = codingMatrices::contr.diff)


```