logistic-regression.qmd

---
subtitle: "Logistic regression and variations"
---

# Models for Binary Outcomes {#sec-Bernoulli-models}

---

{{< include shared-config.qmd >}}

### Acknowledgements {.unnumbered}

This content is adapted from:

-   @dobson4e, Chapter 7
-   @vittinghoff2e, Chapter 5
- [David Rocke](https://dmrocke.ucdavis.edu/)'s materials from the [2021 edition of Epi 204](https://dmrocke.ucdavis.edu/Class/EPI204-Spring-2021/EPI204-Spring-2021.html)
-   @NahhasIRMPHR [Chapter 6](https://www.bookdown.org/rwnahhas/RMPH/blr.html)

## Introduction

### What is logistic regression?

**Logistic regression** is a framework for modeling [binary](probability.qmd#def-binary) outcomes, conditional on one or more *predictors* (a.k.a. *covariates*).

---

::: {#exr-binary-examples}

#### Examples of binary outcomes

What are some examples of binary outcomes in the health sciences?
:::

----

:::: {.solution}

{{< include binary-outcome-examples.qmd >}}


::::

---

Logistic regression uses the [Bernoulli](probability.qmd#def-bernoulli) distribution to model the outcome variable, conditional on one or more covariates. 

---

::: {#exr-def-bernoulli}

Write down a mathematical definition of the Bernoulli distribution.
:::

---

::::{.solution}

{{< include def-bernoulli.qmd >}}

::::

---

### Logistic regression versus linear regression

Logistic regression differs from linear regression, which uses the Gaussian ("normal") distribution to model the outcome variable, conditional on the covariates.

---

:::: {#exr-linear}

Recall: what kinds of outcomes is linear regression used for?
::::

---

::: {.solution}

Linear regression is typically used for numerical outcomes that aren't event counts or waiting times for an event. Examples of outcomes that are often analyzed using linear regression include include weight, height, and income.

:::

## Risk Estimation and Prediction

::: notes
In Epi 203, you have already seen methods for modeling binary outcomes using one covariate that is also binary (such as exposure/non-exposure).
In this section, we review one-covariate analyses, with a special focus on risk ratios and odds ratios, which are important concepts for interpreting logistic regression.
:::

---

:::::{#exm-oc-mi}
### Oral Contraceptive Use and Heart Attack

* Research question: how does oral contraceptive (OC) use affect the risk of myocardial infarction (MI; a.k.a. heart attack)?

:::{.notes}
This was an issue when oral contraceptives were first developed, because the original formulations used higher concentrations of hormones. Modern OCs don't have this issue.

@tbl-oc-mi contains simulated data for an imaginary follow-up (a.k.a. *prospective*) study in which two groups are identified, one using OCs and another not using OCs, and both groups are tracked for three years to determine how many in each groups have MIs.

:::

```{r}
#| message: false
#| code-fold: true
#| label: tbl-oc-mi
#| tbl-cap: Simulated data from study of oral contraceptive use and heart attack risk
library(dplyr)
oc_mi = 
  tribble(
    ~OC, ~MI, ~Total,
    "OC use", 13, 5000,
    "No OC use", 7, 10000
  ) |> 
  mutate(`No MI` = Total - MI) |> 
  relocate(`No MI`, .after = MI)

totals = 
  oc_mi |> 
  summarize(across(c(MI, `No MI`, Total), sum)) |> 
  mutate(OC = "Total")

tbl_oc_mi = bind_rows(oc_mi, totals)

tbl_oc_mi

```

:::::

---

::::{#exr-probs}

Review: estimate the probabilities of MI for OC users and non-OC users in @exm-oc-mi.
::::

----

:::{.solution}

```{r, include = FALSE}
#| label: calc-prs
p_MI_OC = 13/5000
p_MI_nOC = 7/10000
```

$$\ph(MI|OC) = \frac{13}{5000} = `r p_MI_OC`$$

$$\ph(MI|\neg OC) = \frac{7}{10000} = `r p_MI_nOC`$$

:::

---

#### Controls

::::{.callout-note}
##### Two meanings of "controls"

Depending on context, "controls" can mean either individuals who don't experience an *exposure* of interest, or individuals who don't experience an *outcome* of interest.

::::

---

:::{#def-cases-retrospective}
##### cases and controls in retrospective studies
In *retrospective studies*, participants who experience the outcome of interest are called **cases**, while participants who don't experience that outcome are called **controls**. 
:::

---

:::{#def-cases-prospective}
##### treatment groups and control groups in prospective studies
In *prospective studies*, the group of participants who experience the treatment or exposure of interest is called the **treatment group**, while the participants who receive the baseline or comparison treatment  (for example, clinical trial participants who receive a placebo or a standard-of-care treatment rather than an experimental treatment) are called **controls**.

:::

## Comparing Probabilities

### Risk differences

::: notes
The simplest comparison of two probabilities, $\pi_1$, and $\pi_2$, is the difference of their values:
:::

:::{#def-RD}
#### Risk difference
The **risk difference** of two probabilities, $\pi_1$, and $\pi_2$, is the difference of their values: 
$$\delta(\pi_1,\pi_2) \eqdef \pi_1 - \pi_2$$
:::

---

:::{#exm-RD}

#### Difference in MI risk

In @exm-oc-mi, the maximum likelihood estimate of the difference in MI risk between OC users and OC non-users is:
```{r, include = FALSE}
#| label: compute-risk-diff
rd_OC = p_MI_OC - p_MI_nOC
```

$$
\begin{aligned}
\hat\delta(\pi(OC), \pi(\neg OC))
&= \delta(\hat\pi(OC), \hat\pi(\neg OC))\\
&= \hat\pi(OC) - \hat\pi(\neg OC)\\
&= `r p_MI_OC` - `r p_MI_nOC`\\
&= `r rd_OC`
\end{aligned}
$$

:::

---

### Risk ratios

:::{#def-RR}
### Relative risk ratios

:::: notes
The **relative risk** of probability $\pi_1$ compared to another probability $\pi_2$, also called the **risk ratio**, **relative risk ratio**, **probability ratio**, or **rate ratio**, is the ratio of those probabilities:
::::

$$\rho(\pi_1,\pi_2) = \frac{\pi_1}{\pi_2}$$
:::

---

:::{#exm-RR}

:::: notes
Above, we estimated that:
::::

$$\ph(MI|OC) = `r 13/5000`$$

$$\ph(MI|\neg OC) = `r 7/10000`$$

:::: notes
So we might estimate that the *relative risk* of MI for OC versus non-OC is:
::::

```{r}
#| label: compute-risk-ratio
rr = (13/5000)/(7/10000)
```

$$
\begin{aligned}
\hat\rho(OC, \neg OC)
&=\frac{\ph(MI|OC)}{\ph(MI|\neg OC)}\\
&= \frac{`r 13/5000`}{`r 7/10000`}\\
&= `r rr`
\end{aligned}
$$

:::: notes
We might summarize this result by saying that "the estimated probability of MI among OC users was `r rr` as high as the estimated probability among OC non-users. 
::::

:::

---

### Relative risk differences

:::{#def-RRD}

#### Relative risk difference

:::: notes
Sometimes, we divide the risk difference by the comparison probability; the result is called the **relative risk difference**:
::::

$$\xi(\pi_1,\pi_2) \eqdef \frac{\delta(\pi_1,\pi_2)}{\pi_2}$$

:::

---

:::{#thm-rrd-vs-rr}
#### Relative risk difference equals risk ratio minus 1

$$\xi(\pi_1,\pi_2) = \rho(\pi_1,\pi_2) - 1$$
:::

---

::: proof
$$
\begin{aligned}
\xi(\pi_1,\pi_2) 
&\eqdef \frac{\delta(\pi_1,\pi_2)}{\pi_2}
\\&= \frac{\pi_1-\pi_2}{\pi_2}
\\&= \frac{\pi_1}{\pi_2} - 1
\\&= \rho(\pi_1,\pi_2) - 1
\end{aligned}
$$
:::

---

### Changing reference groups in risk comparisons

:::: notes
Risk differences, risk ratios, and relative risk differences are defined by two probabilities, plus a choice of which probability is the **baseline** or **reference** probability (i.e., which probability is the subtrahend of the risk difference or the denominator of the risk ratio).
:::

$$\delta(\pi_2,\pi_1) = -\delta(\pi_1, \pi_2)$$

$$\rho(\pi_2,\pi_1) = \inv{\rho(\pi_1,\pi_2)}$$
$$\xi(\pi_2,\pi_1) = \inv{\xi(\pi_2,\pi_1) + 1} - 1$$

:::{#exr-change-ref-group}

Prove the relationships above.

:::

---

:::{#exm-ref}

#### Switching the reference group in a risk ratio

Above, we estimated that the risk ratio of OC versus non-OC is:

$$
\begin{aligned}
\rho(OC, \neg OC)
&= `r (13/5000)/(7/10000)`
\end{aligned}
$$

In comparison, the risk ratio for non-OC versus OC is:

$$
\begin{aligned}
\rho(\neg OC, OC)
&=\frac{\ph(MI|\neg OC)}{\ph(MI|OC)}\\
&= \frac{`r 7/10000`}{`r 13/5000`}\\
&= `r (7/10000)/(13/5000)`\\
&= \frac{1}{\rho(OC, \neg OC)}
\end{aligned}
$$

:::

## Odds and Odds Ratios

### Odds and probabilities

::: notes
In logistic regression, we will make use of a mathematically-convenient transformation of probability, called *odds*.
:::

:::{#def-odds}
#### Odds
The **odds** of an outcome $A$, which we will represent using $\odds$ ("omega"), is the probability that the outcome occurs, divided by the probability that it doesn't occur:

$$\odds(A) \eqdef \frac{\Pr(A)}{\Pr(\neg A)}$$
:::

---

:::{#thm-prob-to-odds}
If the probability of an outcome $A$ is $\Pr(A)=\pi$, 
then the corresponding odds of $A$ is:

$$\oddsf{\pi} = \frac{\pi}{1-\pi}$$ {#eq-odds-probs}

:::

---

:::{.proof}
$$
\ba
\Pr(\neg A) &= 1 - \Pr(A)
\\ &= 1 - \pi
\ea
$$

$$
\ba
\tf \odds(A) &\eqdef \frac{\Pr(A)}{\Pr(\neg A)} 
\\ &= \frac{\pi}{1-\pi}
\ea
$$

:::

---

::: notes
[Function @eq-odds-probs], which transforms probabilities into odds, can be called the **odds function**. @fig-odds-probs graphs the shape of this function.
:::

```{r}
#| code-fold: true
#| label: fig-odds-probs
#| fig-cap: "Odds versus probability"

odds = function(pi) pi / (1 - pi)
library(ggplot2)
ggplot() +
  geom_function(
    fun = odds,
    arrow = arrow(ends = "last"),
    mapping = aes(col = "odds function")
  ) +
  xlim(0, .99) +
  xlab("Probability") +
  ylab("Odds") +
  geom_abline(aes(
    intercept = 0,
    slope = 1,
    col = "y=x"
  )) +
  theme_bw() +
  labs(colour = "") +
  theme(legend.position = "bottom")

```

---

:::{#exm-odds}

#### Computing odds from probabilities

In @exr-probs, we estimated that the probability of MI, given OC use, is $\pi(OC) \eqdef \Pr(MI|OC) = `r 13/5000`$. If this estimate is correct, then the odds of MI, given OC use, is:

```{r}
#| include: false
#| label: compute-odds-from-prob
pi_OC = 13/5000
odds_OC = odds(pi_OC)
```


$$
\begin{aligned}
\odds(OC) 
&\eqdef \frac{\Pr(MI|OC)}{\Pr(\neg MI|OC)}\\
&=\frac{\Pr(MI|OC)}{1-\Pr(MI|OC)}\\
&=\frac{\pi(OC)}{1-\pi(OC)}\\
&=\frac{`r 13/5000`}{1-`r 13/5000`}\\
&\approx `r odds_OC |> round(digits = 6) |> as.character()`
\end{aligned}
$$

:::

---

:::{#exr-odds}

#### Computing odds from probabilities

Estimate the odds of MI, for non-OC users.

::::{.solution}
```{r, include = FALSE}
pi_nOC = 7 / 10000
odds_nOC = pi_nOC / (1 - pi_nOC)
```

$$
\odds(\neg OC) = `r odds_nOC`
$$

::::

:::

---

:::{#thm-est-odds}
#### One-sample MLE for odds

Let $X_1,...X_n$ be a set of $n$ $\iid$ Bernoulli trials, and let $X = \sumin X_i$ be their sum. 

Then the maximum likelihood estimate of the odds of $X=1$, $\odds$, is:

$$
\hat{\odds}= \frac{x}{n-x}
$$

:::

---

::: proof

$$
\begin{aligned}
1-\hat\pi 
&= 1-\frac{x}{n}\\
&= \frac{n}{n} - \frac{x}{n}\\
&= \frac{n - x}{n}
\end{aligned}
$$

Thus, the estimated odds is:

$$
\begin{aligned}
\frac{\hat\pi}{1-\hat\pi}
&= \frac{\left(\frac{x}{n}\right)}{\left(\frac{n-x}{n}\right)}\\
&= \frac{x}{n-x}
\end{aligned}
$$

:::: notes
That is, odds can be calculated directly as "# events" divided by "# nonevents" (without needing to calculate $\hat\pi$ and $1-\hat\pi$ first).
::::

:::

---

::::{#exm-odds-shortcut}

#### calculating odds using the shortcut
In @exm-odds, we calculated 
$$
\begin{aligned}
\odds(OC) 
&=`r (13/5000)/(1- (13/5000))`
\end{aligned}
$$

Let's recalculate this result using our shortcut.

::::

---

::::{#sol-odds-shortcut}
$$
\begin{aligned}
\odds(OC) 
&=\frac{13}{5000-13}\\
&=`r (13)/(5000- 13)`
\end{aligned}
$$

Same answer as in @exm-odds!

::::

---

:::{#thm-odds-simplified}

#### Simplified expression for odds function

::: notes
An equivalent expression for the odds function is 
:::

$$
\oddsf{\pi} = \invf{\invf{\pi}-1)}
$$ {#eq-odds-reduced}

:::

---

:::{#exr-odds2}
Prove that @eq-odds-reduced is equivalent to @def-odds.
:::

---

:::{#thm-deriv-odds}
#### Derivative of odds function

$$\doddsf{\pi} = \frac{1}{\sqf{1-\pi}}$$
:::

---

::: proof
$$
\ba
\doddsf{\pi}
   &= \deriv{\pi}\paren{\frac{\pi}{1-\pi}}
\\ &=  \frac {\deriv{\pi}\pi} {1-\pi} - 
       \paren{\frac{\pi}{\sqf{1-\pi}} \cd \deriv{\pi}\paren{1-\pi}}
\\ &=  \frac{1}{1-\pi} - \frac{\pi}{\sqf{1-\pi}} \cd (-1)
\\ &=  \frac{1}{1-\pi} + \frac{\pi}{\sqf{1-\pi}}
\\ &=  \frac{1-\pi}{\sqf{1-\pi}} + \frac{\pi}{\sqf{1-\pi}}
\\ &=  \frac{1-\pi + \pi}{\sqf{1-\pi}}
\\ &=  \frac{1}{\sqf{1-\pi}}
\ea
$$

:::

---

#### Odds of rare events

For rare events (small $\pi$), odds and probabilities are nearly equal, because $1-\pi \approx 1$ (see @fig-odds-probs).

For example, in @exm-odds, the probability and odds differ by $`r abs(pi_OC - odds_OC)`$.

---

:::{#exr-odds-probs}

What odds value corresponds to the probability $\pi = 0.2$, and what is the numerical difference between these two values?

:::

---

::::{.solution}
$$
\odds = \frac{\pi}{1-\pi} 
=\frac{.2}{.8}
= .25
$$
::::

---

:::{#thm-odds-minus-probs}

Let $\odds = \frac{\pi}{1-\pi}$. Then:

$$\odds - \pi =  \odds \cd \pi$$

:::

---

:::{#exr-odds-minus-probs}

Prove @thm-odds-minus-probs.

:::

---

::: solution

$$
\ba
\odds - \pi 
&= \frac{\pi}{1-\pi} - \pi
\\ &= \frac{\pi}{1-\pi} - \frac{\pi(1-\pi)}{1-\pi}
\\ &= \frac{\pi}{1-\pi} - \frac{\pi - \pi^2}{1-\pi}
\\ &= \frac{\pi - (\pi - \pi^2)}{1-\pi}
\\ &= \frac{\pi - \pi + \pi^2}{1-\pi}
\\ &= \frac{\pi^2}{1-\pi}
\\ &= \frac{\pi}{1-\pi} \pi
\\ &= \odds \pi
\ea
$$

:::

---

:::{#lem-odds-neg}
#### Odds of a non-event

If $\pi$ is the odds of event $A$ 
and $\odds$ is the corresponding odds of $A$,
then the odds of $\neg A$ are:

$$
\odds(\neg A) = \frac{1-\pi}{\pi}
$$

:::

::: proof
Left to the reader.
::: 

---

### The inverse odds function

:::{#def-inv-odds}

##### inverse odds function
The **inverse odds function**, 

$$\invoddsf{\odds} \eqdef \frac{\odds}{1 + \odds}$$
converts odds into their corresponding probabilities (@fig-inv-odds).
:::

::: notes
The inverse-odds function takes an odds as input and produces a probability as output. Its domain of inputs is $[0,\infty)$ and its range of outputs is $[0,1]$.

I haven't seen anyone give the inverse-odds function a concise name; maybe $\text{prob}()$?
:::

---

```{r}
#| label: fig-inv-odds
#| fig-cap: The inverse odds function, $\invoddsf{\odds}$
odds_inv = function(omega) (1 + omega^-1)^-1
ggplot() +
  geom_function(fun = odds_inv, aes(col = "inverse-odds")) +
  xlab("Odds") +
  ylab("Probability") +
  xlim(0,5) +
  ylim(0,1) +
  geom_abline(aes(intercept = 0, slope = 1, col = "x=y"))
```

---

:::{#exr-odds-probs}

What probability corresponds to an odds of $\odds = 1$, and what is the numerical difference between these two values?

:::

---

::::{.solution}
$$
\pi(1) = \frac{1}{1+1} 
=\frac{1}{2}
= .5
$$
$$
1 - \pi(1) = 1 - .5 = .5
$$

::::

---

:::{#lem-invodds-simplified}

##### Simplified expression for inverse odds function

::: notes
An equivalent expression for the inverse odds function is 
:::

$$
\pi(\odds) = (1+\odds^{-1})^{-1}
$$ {#eq-inv-odds-reduced}

:::

---

:::{#exr-inv-odds2}
Prove that @eq-inv-odds-reduced is equivalent to @def-inv-odds.
:::

---


:::{#lem-one-minus-odds-inv}
#### One minus inverse-odds


$$1 - \oddsinvf{\odds} = \frac{1}{1+\odds}$$
:::

---

::: {.proof}

$$
\ba
1 - \oddsinvf{\odds} &= 1 - \frac{\odds}{1 + \odds}
\\ &= \frac{\red{1+\odds}}{1 + \odds} - \frac{\blue{\odds}}{1 + \odds}
\\ &= \frac{\red{(1+\odds)} - \blue{\odds}}{1 + \odds}
\\ &= \frac{1 + \odds - \odds}{1 + \odds}
\\ &= \frac{1}{1 + \odds}
\ea
$$
:::

---

:::{#thm-inverse-odds-nonevent}

If $\odds$ is the odds of event $A$, 
then the probability that $A$ does not occur is:

$$\Pr(\neg A) = \frac{1}{1+\odds}$$
:::

---

::: proof

:::: notes
Use @lem-one-minus-odds-inv:
::::

$$
\ba
\Pr(\neg A) 
&= 1 - \Pr(A)
\\ &= 1 - \oddsinvf{\odds}
\\ &= \frac{1}{1 + \odds}
\ea
$$
:::

---

:::{#thm-deriv-invodds}

##### Derivative of inverse odds function
$$\doddsinvf{\odds} = \frac{1}{\sqf{1+\odds}}$$
:::

---

::: proof

:::: notes
Use the quotient rule:
::::

$$
\ba
{\oddsinv}'(\odds) 
&= \deriv{\odds} \oddsinvf{\odds}
\\ &= \deriv{\odds} \frac{\odds}{1+\odds}
\\ &= \frac{\deriv{\odds}\odds}{1+\odds} - \frac{\odds}{\sqf{1+\odds}} \cd \deriv{\odds}(1+\odds) 
\\ &= \frac{1}{1+\odds} - \frac{\odds}{\sqf{1+\odds}} \cd 1
\\ &= \frac{1}{1+\odds} - \frac{\odds}{\sqf{1+\odds}}
\\ &= \frac{1+\odds}{\sqf{1+\odds}} - \frac{\odds}{\sqf{1+\odds}}
\\ &= \frac{1+\odds - \odds}{\sqf{1+\odds}}
\\ &= \frac{1}{\sqf{1+\odds}}
\ea
$$

:::

---


:::{#cor-deriv-invodds}

$$\doddsinvf{\odds} = \sqf{1 - \oddsinvf{\odds}}$$
:::

---

### Odds ratios

::: notes

Now that we have defined odds, we can introduce another way of comparing event probabilities: odds ratios.

:::

:::{#def-OR}
##### Odds ratio
The **odds ratio** for two odds $\odds_1$, $\odds_2$ is their ratio:

$$\theta(\odds_1, \odds_2) = \frac{\odds_1}{\odds_2}$$

:::

---

:::{#exm-OR}
#### Calculating odds ratios
In @exm-oc-mi, the odds ratio for OC users versus OC-non-users is:

$$
\begin{aligned}
\theta(\odds(OC), \odds(\neg OC))
&= \frac{\odds(OC)}{\odds(\neg OC)}\\
&= \frac{`r pi_OC`}{`r pi_nOC`}\\
&= `r pi_OC / pi_nOC`\\
\end{aligned}
$$
:::

---

#### A shortcut for calculating odds ratio estimates {.smaller}

::: notes
The general form of a two-by-two table is shown in @tbl-2x2-generic. 


|               | Event | Non-Event | Total
|-------------- | ------ | ---------- | ----- 
|Exposed        | a      | b | a+b
|Non-exposed    | c      | d | c+d
|Total         | a+c   | b+d | a+b+c+d

: A generic 2x2 table {#tbl-2x2-generic}

:::

::: notes
From this table, we have:
:::

* $\hat\pi(Event|Exposed) = a/(a+b)$

* $\hat\pi(\neg Event|Exposed) = b/(a+b)$

* $\hat\odds(Event|Exposed) = \frac{\left(\frac{a}{a+b}\right)}{\left(\frac{b}{a+b}\right)}=\frac{a}{b}$

* $\hat\odds(Event|\neg Exposed) = \frac{c}{d}$
(see @exr-odds-generic)

* $\theta(Exposed,\neg Exposed) = \frac{\frac{a}{b}}{\frac{c}{d}} = \frac{ad}{bc}$

---

:::{#exr-odds-generic}
Given @tbl-2x2-generic, show that $\hat\odds(Event|\neg Exposed) = \frac{c}{d}$.
:::

---

#### Properties of odds ratios {#sec-OR-props}

:::: notes
Odds ratios have a special property: we can swap a covariate with the outcome, and the odds ratio remains the same.
::::

:::{#thm-or-swap}
##### Odds ratios are reversible


For any two events $A$, $B$:

$$\theta(A|B) = \theta(B|A)$$

:::

---

::: {.proof}

{{< include prf-OR-reversible.qmd >}}

:::

---

:::{#exm-or-inv-MI}

In @exm-oc-mi, we have:

$$
\begin{aligned}
\theta(MI; OC) 
&\eqdef
\frac{\odds(MI|OC)}{\odds(MI|\neg OC)}\\
&\eqdef \frac
{\left(\frac{\Pr(MI|OC)}{\Pr(\neg MI|OC)}\right)}
{\left(\frac{\Pr(MI|\neg OC)}{\Pr(\neg MI|\neg OC)}\right)}\\
&= \frac
{\left(\frac{\Pr(MI,OC)}{\Pr(\neg MI,OC)}\right)}
{\left(\frac{\Pr(MI,\neg OC)}{\Pr(\neg MI,\neg OC)}\right)}\\
&= \left(\frac{\Pr(MI,OC)}{\Pr(\neg MI,OC)}\right)
\left(\frac{\Pr(\neg MI,\neg OC)}{\Pr(MI,\neg OC)}\right)\\
&= \left(\frac{\Pr(MI,OC)}{\Pr(MI,\neg OC)}\right)
\left(\frac{\Pr(\neg MI,\neg OC)}{\Pr(\neg MI,OC)}\right)\\
&= \left(\frac{\Pr(OC,MI)}{\Pr(\neg OC,MI)}\right)
\left(\frac{\Pr(\neg OC,\neg MI)}{\Pr(OC,\neg MI)}\right)\\
&= \left(\frac{\Pr(OC|MI)}{\Pr(\neg OC|MI)}\right)
\left(\frac{\Pr(\neg OC|\neg MI)}{\Pr(OC|\neg MI)}\right)\\
&= \frac{\left(\frac{\Pr(OC|MI)}{\Pr(\neg OC|MI)}\right)}
{\left(\frac{\Pr(OC|\neg MI)}{\Pr(\neg OC|\neg MI)}\right)}\\
&\eqdef \frac{\odds(OC|MI)}
{\odds(OC|\neg MI)}\\
&\eqdef \theta(OC; MI)
\end{aligned}
$$
:::

---

:::{#exr-2x2-generic}
For @tbl-2x2-generic, show that $\hat\theta(Exposed, Unexposed) = \hat\theta(Event, \neg Event)$.
:::

---

:::: notes
Conditional odds ratios have the same reversibility property:
::::

:::{#thm-conditional-OR-swap}
##### Conditional odds ratios are reversible


For any three events $A$, $B$, $C$:

$$\theta(A|B,C) = \theta(B|A,C)$$
:::

---

:::{.proof}
Apply the same steps as for @thm-or-swap, 
inserting $C$ into the conditions (RHS of $|$) of every expression.
:::

---

#### Odds Ratios vs Probability (Risk) Ratios {#sec-OR-RR}

::: notes
When the outcome is rare (i.e., its probability is small) for both groups being compared in an odds ratio, the odds of the outcome will be similar to the probability of the outcome, and thus the risk ratio will be similar to the odds ratio.
:::

##### Case 1: rare events

For rare events, odds ratios and probability (a.k.a. risk, a.k.a.
prevalence) ratios will be close:

$\pi_1 = .01$ $\pi_2 = .02$

```{r "rare events or-rr"}
pi1 = .01
pi2 = .02
pi2/pi1
odds(pi2)/odds(pi1)

```

---

:::{#exm-or-rr-OC-MI}
In @exm-oc-mi, the outcome is rare for both OC and non-OC participants, so the odds for both groups are similar to the corresponding probabilities, and the odds ratio is similar the risk ratio.
:::

---

##### Case 2: frequent events

$\pi_1 = .4$ $\pi_2 = .5$

For more frequently-occurring outcomes, this won't be the case:

```{r}
pi1 = .4
pi2 = .5
pi2/pi1
odds(pi2)/odds(pi1)

```

---

#### Odds Ratios in Case-Control Studies

::: notes
@tbl-oc-mi simulates a follow-up study in which two populations were followed and the number of MI’s was observed. 
The risks are $P(MI|OC)$ and $P(MI|\neg OC)$ and we can estimate these risks from the data.

But suppose we had a case-control study in which we had 100 women with MI and selected a comparison group of 100 women without MI (matched as groups on age, etc.). 
Then MI is not random, and we cannot compute P(MI|OC) and we cannot compute the risk ratio. 
However, the odds ratio however can be computed.

The disease odds ratio is the odds for the disease in the exposed group divided by the odds for the disease in the unexposed group, and we cannot validly compute and use these separate parts.

We can still validly compute and use the exposure odds ratio, 
which is the odds for exposure in the disease group divided by the odds for exposure in the non-diseased group 
(because exposure can be treated as random):

:::

$$
\hth(OC|MI) = 
\frac{\hat{\odds}(OC|MI)}{\hat{\odds}(OC|\neg MI)}
$$

::: notes
And these two odds ratios, $\hth(MI|OC)$ and $\hth(OC|MI)$,
are mathematically equivalent, as we saw in @sec-OR-props:
:::

$$\hth(MI|OC) = \hth(OC|MI)$$

---

:::{#exr-or-rev}
Calculate the odds ratio of MI with respect to OC use, 
assuming that @tbl-oc-mi comes from a case-control study. 
Confirm that the result is the same as in @exm-OR.

---

::::{.solution}

::: {.content-visible when-format="revealjs"}

```{r}
#| label: tbl-oc-mi2
#| tbl-cap: "Simulated data from study of oral contraceptive use and heart attack risk"
tbl_oc_mi
```
:::

* $\odds(OC|MI) = P(OC|MI)/(1 – P(OC|MI) = \frac{13}{7} = `r 13/7`$

* $\odds(OC|\neg MI) = P(OC|\neg MI)/(1 – P(OC|\neg MI) = \frac{4987}{9993} = `r 4987/9993`$

* $\theta(OC,MI) = \frac{\odds(OC|MI)}{\odds(OC|\neg MI)} = \frac{13/7}{4987/9993} = `r (13/7)/(4987/9993)`$

::: notes
This is the same estimate we calculated in @exm-OR.
:::

::::

:::

---

#### Odds Ratios in Cross-Sectional Studies

* If a cross-sectional study is a probability sample of a population (which it rarely is) then we can estimate risks.

* If it is a sample, but not an unbiased probability sample, then we need to treat it in the same way as a case-control study.

* We can validly estimate odds ratios in either case.

* But we can usually not validly estimate risks and risk ratios.

## The logit and expit functions

### The logit function

:::{#def-logit}

#### logit function

::: notes
The **logit function**  of a probability $\pi$ is 
the natural logarithm of the odds function of $\pi$:
:::

$$\logit(\pi) \eqdef \log{\odds(\pi)}$$

:::

---

::: notes
@fig-logit shows the shape of the $\logit()$ function.
:::

```{r}
#| code-fold: true
#| fig-cap: "The logit function"
#| label: fig-logit
logit = function(p) log(odds(p))

logit_plot = 
  ggplot() + 
  geom_function(
    fun = logit,
    arrow = arrow(ends = "both")) + 
  xlim(.001, .999) + 
  ylab("logit(p)") +
  xlab("p") +
  theme_bw()
print(logit_plot)
```

---

:::{#thm-logit-function}

$$\logit(\pi) = \log{\frac{\pi}{1-\pi}}$${#eq-logit}

:::

---

:::{#exr-prove-eq-logit}

#### Compose the logit function

Prove @thm-logit-function.
:::

---

::: proof
Apply @def-logit and then @def-odds (details left to the reader).
:::

---

:::{#thm-deriv-logit}

#### Derivative of logit function

$$\logit'(\pi) = \frac{1}{(\pi)(1-\pi)}$$
:::

--- 

::: proof

$$
\ba
\logit'(\pi)
   &= \deriv{\pi}\logit(\pi)
\\ &= \deriv{\pi}\log{\odds(\pi)}
\\ &= \frac{\odds'(\pi)}{\odds(\pi)}
\\ &= \odds'(\pi) \frac{1}{\odds(\pi)}
\\ &= \frac{1}{\sqf{1-\pi}}\frac{1-\pi}{\pi}
\\ &= \frac{1}{(\pi)(1-\pi)}
\ea
$$

:::

---

### The expit function

:::{#def-expit}

#### expit, logistic, inverse-logit

The **expit function** (@fig-expit-plot) of a log-odds $\logodds$, also known as the **inverse-logit function** or **logistic function**, is the inverse-odds of the exponential of $\logodds$:

$$\expit(\logodds) \eqdef \oddsinvf{\expf{\logodds}}$$

:::

```{r}
#| label: fig-expit-plot
#| fig-cap: The expit function
#| code-fold: true
expit = function(eta)
  exp(eta) / (1 + exp(eta))
library(ggplot2)
expit_plot =
  ggplot() +
  geom_function(fun = expit,
                arrow = arrow(ends = "both")) +
  xlim(-8, 8) +
  ylim(0, 1) +
  ylab(expression(expit(eta))) +
  xlab(expression(eta)) +
  theme_bw()
print(expit_plot)
```

---

:::{#thm-logit-expit}
#### logit and expit are each others' inverses

$$\logitf{\expitf{\logodds}} = \logodds$$

$$\expitf{\logitf{\pi}} = \pi$$

:::

---

:::{.proof}
Left to the reader.
:::

---

:::{#thm-expit}

#### Expressions for expit function

$$
\ba
\expit(\logodds) 
   &= \frac{\exp{\logodds}}{1+\exp{\logodds}}
\\ &= (1 + \exp{-\logodds})^{-1}
\ea
$$
:::

---

::: proof
Apply definitions and @lem-invodds-simplified.
Details left to the reader.
:::

---


:::{#lem-one-minus-expit}
$$1-\expitf{\logodds} = \inv{1+\exp{\logodds}}$$
:::

---

::: {.proof}

Using @lem-one-minus-odds-inv:

$$
\ba
1 - \expitf{\logodds} &= 1 - \oddsinvf{\expf{\logodds}}
\\ &= \frac{1}{1+\expf{\logodds}}
\\ &= \inv{1 + \exp{\logodds}}
\ea
$$
:::

---

:::{#lem-deriv-expit}
$$\dexpitf{\logodds} = (\expitf{\logodds}) (1 - \expitf{\logodds})$$
:::

---

::: proof

Using @thm-deriv-invodds:

$$
\ba
\dexpitf{\logodds}
   &= \deriv{\logodds} \expitf{\logodds}   
\\ &= \deriv{\logodds} \invoddsf{\expf{\logodds}}
\\ &= \dinvoddsf{\expf{\logodds}}  \deriv{\logodds}\expf{\logodds}
\\ &=  \frac{1}{\sqf{1 + \expf{\logodds}}} \expf{\logodds}
\\ &=  \frac{\expf{\logodds}}{\sqf{1 + \expf{\logodds}}} 
\\ &=  \frac{\expf{\logodds}}{1 + \expf{\logodds}} \frac{1}{1 + \expf{\logodds}} 
\\ &=  \expitf{\logodds} (1-\expitf{\logodds})
\ea
$$

:::

---

::: proof

Alternatively, we can use @thm-expit:

$$
\ba
\dexpitf{\logodds}
   &= \deriv{\logodds} \expitf{\logodds}
\\ &= \deriv{\logodds} (1 + \exp{-\logodds})^{-1}
\\ &= -(1 + \exp{-\logodds})^{-2} \deriv{\logodds} (1 + \exp{-\logodds})
\\ &= -(1 + \exp{-\logodds})^{-2} (-\exp{-\logodds})
\\ &= (1 + \exp{-\logodds})^{-2} (\exp{-\logodds})
\\ &= (1 + \exp{-\logodds})^{-1} \frac{\exp{-\logodds}}{1 + \exp{-\logodds}}
\\ &= (1 + \exp{-\logodds})^{-1} \frac{1}{1 + \exp{\logodds}}
\\ &= \expitf{\logodds} (1-\expitf{\logodds})
\ea
$$
:::

### Diagram of expit and logit {#sec-expit-logit-diagram}

::: {.content-visible when-format="pdf"}
$$
\left[ \pi  \eqdef\Pr(Y=1)\right]
\underbrace{
\overbrace{
\underset{
\xleftarrow[ \frac{\odds}{1+\odds}]{}
}
{
\xrightarrow{\frac{\pi}{1-\pi}}
}
\left[\odds \eqdef \text{odds}(Y=1)\right]
\underset{
\xleftarrow[\exp{\logodds}]{}
}
{
\xrightarrow{\log{\odds}}
}
}^{\logit(\pi)}
}_{\expit(\logodds)}
\left[\logodds \eqdef \text{log-odds}(Y=1)\right]
$$
:::

::: {.content-hidden when-format="pdf"}

$$
\underbrace{\pi}_{\atop{\Pr(Y=1)} }
\overbrace{
\underbrace{
\underset{
\xleftarrow[\frac{\odds}{1+\odds}]{}
}
{
\xrightarrow{\frac{\pi}{1-\pi}}
}
\underbrace{\odds}_{\text{odds}(Y=1)}
\underset{
\xleftarrow[\exp{\logodds}]{}
}
{
\xrightarrow{\log{\odds}}
}
}_{\expit(\logodds)}
}^{\logit(\pi)}
\underbrace{\logodds}_{\atop{\text{log-odds}(Y=1)}}
$$

:::

## Introduction to logistic regression

::: notes

* In @exm-oc-mi, we estimated the risk and the odds of MI for two groups, 
defined by oral contraceptive use.

* If the predictor is quantitative (dose) or there is more than one predictor, the task becomes more difficult.

* In this case, we will use logistic regression, which is a generalization of the linear regression models you have been using that can account for a binary response instead of a continuous one.

:::

### Binary outcomes models - one group, no covariates {.smaller}

$$
\begin{aligned}
\P(Y=1) &= \pi\\
\P(Y=0) &= 1-\pi\\
\P(Y=y) &= \pi^y (1-\pi)^{1-y}\\
\mathbf y  &= (y_1, ..., y_n)\\
\mathcal L(\pi;\mathbf y) &= \pi^{\sum y_i} (1-\pi)^{n - \sum y_i}\\
\ell(\pi, \mathbf y) &= \left({\sum y_i}\right) \log{\pi} + \left(n - \sum y_i\right) \log{1-\pi}\\
&= \left({\sum y_i}\right) \left(\log{\pi} - \log{1-\pi}\right) + n \cdot \log{1-\pi}\\
&= \left({\sum y_i}\right) \log{\frac{\pi}{ 1-\pi}} + n \cdot \log{1-\pi}
\\ &= \left({\sum y_i}\right) \logit(\pi) + n \cdot \log{1-\pi}
\end{aligned}
$$

### Binary outcomes - general  {.smaller}

$$
\ba
\P(Y_i=1) &= \pi_i
\\ \P(Y_i=0) &= 1-\pi_i
\ea
$$

$$\P(Y_i=y_i) = (\pi_i)^y_i (1-\pi_i)^{1-y_i}$$

$$\Lik_i(\pi_i) = \P(Y_i=y_i)$$

$$
\ba
\ell_i(\pi_i) 
   &= \log{\Lik_i(\pi_i)}
\\ &= y_i \log{\pi_i} + (1-y_i) \log{1-\pi_i}
\ea
$$

---

For $\iid$ data $\vec y  = (y_1, ..., y_n)$:

$$
\ba
\Lik(\pi;\vec y) &= \P(\dsvn{Y}{y})
\\ &= \prod_{i=1}^n \P(Y_i=y_i)
\\ &= \prod_{i=1}^n \Lik_i(\pi_i) 
\ea
$$

### Modeling $\pi_i$ as a function of $X_i$

If there are only a few distinct $X_i$ values, we can model $\pi_i$
separately for each value of $X_i$.

Otherwise, we need regression.

$$
\begin{aligned}
\pi(x) &\equiv \text{E}(Y=1|X=x)\\
&= f(x^\top\beta)
\end{aligned}
$$

Typically, we use the $\expit$ inverse-link: 

$$\pi(\vec x) = \expit(\vx'\beta)$${#eq-logistic-link}


### Meet the beetles

```{r}
#| label: fig-beetles_1a
#| fig-cap: "Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)"
library(glmx)

data(BeetleMortality, package = "glmx")
beetles = BeetleMortality |>
  mutate(
    pct = died/n,
    survived = n - died
  )

plot1 = 
  beetles |> 
  ggplot(aes(x = dose, y = pct)) +
  geom_point(aes(size = n)) +
  xlab("Dose (log mg/L)") +
  ylab("Mortality rate (%)") +
  scale_y_continuous(labels = scales::percent) +
  scale_size(range = c(1,2)) +
  theme_bw(base_size = 18)

print(plot1)
```

### Why don't we use linear regression?

```{r}
#| label: fig-beetles_2
#| fig-cap: "Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)"

beetles_long = beetles  |> 
   reframe(
    .by = everything(),
    outcome = c(
      rep(1, times = died), 
      rep(0, times = survived))) |> 
  as_tibble()

lm1 = beetles_long |> lm(formula = outcome ~ dose)
f.linear = function(x) predict(lm1, newdata = data.frame(dose = x))

range1 = range(beetles$dose) + c(-.2, .2)

plot2 = 
  plot1 + 
  geom_function(
    fun = f.linear, 
    aes(col = "Straight line")) +
  labs(colour="Model", size = "")

plot2 |> print()

```

### Zoom out

```{r}
#| label: fig-beetles_3
#| fig-cap: "Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)"
(plot2 + expand_limits(x = c(1.6, 2))) |> print()
```

### log transformation of dose?

```{r}
#| label: fig-beetles_4
#| fig-cap: "Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)"

lm2 = beetles_long |> lm(formula = outcome ~ log(dose))
f.linearlog = function(x) predict(lm2, newdata = data.frame(dose = x))

plot3 = plot2 + 
  expand_limits(x = c(1.6, 2)) +
  geom_function(fun = f.linearlog, aes(col = "Log-transform dose"))
(plot3 + expand_limits(x = c(1.6, 2))) |> print()

```

### Logistic regression

```{r}

#| label: fig-beetles_5
#| fig-cap: "Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)"

beetles_glm_grouped = beetles |>
  glm(formula = cbind(died, survived) ~ dose, family = "binomial")
f = function(x)
  beetles_glm_grouped |>
  predict(newdata = data.frame(dose = x), type = "response")

plot4 = plot3 + geom_function(fun = f, aes(col = "Logistic regression"))
plot4 |> print()
```

---

### Three parts to regression models

-   What distribution does the outcome have for a specific subpopulation
defined by covariates? (outcome model)

-   How does the combination of covariates relate to the mean? (link
function)

-   How do the covariates combine? (linear predictor, interactions)

---

### Logistic regression in R

```{r}

beetles_glm_grouped = 
  beetles |> 
  glm(
    formula = cbind(died, survived) ~ dose, 
    family = "binomial")

library(parameters)
beetles_glm_grouped |> 
  parameters() |> 
  print_md()

```

---

Fitted values:

```{r}
fitted.values(beetles_glm_grouped)
predict(beetles_glm_grouped, type = "response")
predict(beetles_glm_grouped, type = "link")

fit_y = beetles$n * fitted.values(beetles_glm_grouped)
```

---

### Individual observations

```{r}
#| tbl-cap: "`beetles` data in long format"
#| label: tbl-beetles-long

beetles_long

```

---

```{r}
#| label: tbl-beetles-model-ungrouped
#| tbl-cap: "logistic regression model for beetles data with individual Bernoulli data"
beetles_glm_ungrouped = 
  beetles_long |> 
  glm(
    formula = outcome ~ dose, 
    family = "binomial")

beetles_glm_ungrouped |> parameters() |> print_md()
```

::: notes
Here's the previous version again:
:::

```{r}
#| label: tbl-beetles-model-grouped
#| tbl-cap: "logistic regression model for beetles data with grouped (binomial) data"
beetles_glm_grouped |> parameters() |> print_md()
```

---

::: notes
They seem the same! But not quite:
:::

```{r}

logLik(beetles_glm_grouped)
logLik(beetles_glm_ungrouped)

```

::: notes
The difference is due to the binomial coefficient
$\left(n\atop x \right)$ which isn't included in the
individual-observations (Bernoulli) version of the model.
:::

## Multiple logistic regression

### Coronary heart disease (WCGS) study data

::: notes
Let's use the data from the Western Collaborative Group Study (WCGS) 
(@rosenman1975coronary) to explore multiple logistic regression:
:::

*From @vittinghoff2e:*

"The **Western Collaborative Group Study (WCGS)** was a large
epidemiological study designed to investigate the association between
the "type A" behavior pattern and coronary heart disease (CHD)".

*From Wikipedia, "Type A and Type B personality theory":*

"The hypothesis describes Type A individuals as outgoing, ambitious,
rigidly organized, highly status-conscious, impatient, anxious,
proactive, and concerned with time management....

The hypothesis describes Type B individuals as a contrast to those of
Type A. Type B personalities, by definition, are noted to live at lower
stress levels. They typically work steadily and may enjoy achievement,
although they have a greater tendency to disregard physical or mental
stress when they do not achieve."

---

#### Study design

*from `?faraway::wcgs`:*

3154 healthy young men aged 39-59 from the San Francisco area were
assessed for their personality type. All were free from coronary heart
disease at the start of the research. Eight and a half years later
change in CHD status was recorded.

*Details (from `faraway::wcgs`)*

The WCGS began in 1960 with 3,524 male volunteers who were employed by
11 California companies. Subjects were 39 to 59 years old and free of
heart disease as determined by electrocardiogram. After the initial
screening, the study population dropped to 3,154 and the number of
companies to 10 because of various exclusions. The cohort comprised both
blue- and white-collar employees.

---

### Baseline data collection

socio-demographic characteristics:

-   age
-   education
-   marital status
-   income
-   occupation
-   physical and physiological including:
-   height
-   weight
-   blood pressure
-   electrocardiogram
-   corneal arcus;

---

biochemical measurements:

-   cholesterol and lipoprotein fractions;
-   medical and family history and use of medications;

---

behavioral data:

-   Type A interview,
-   smoking,
-   exercise
-   alcohol use.

---

Later surveys added data on: 

* anthropometry 
* triglycerides 
* Jenkins Activity Survey 
* caffeine use

Average follow-up continued for 8.5 years with repeat examinations.


### Load the data

Here, I load the data:

```{r}
#| code-fold: show
#| eval: false

### load the data directly from a UCSF website:
library(haven)
url = paste0(
    # I'm breaking up the url into two chunks for readability
    "https://regression.ucsf.edu/sites/g/files/",
    "tkssra6706/f/wysiwyg/home/data/wcgs.dta")
wcgs = haven::read_dta(url)

```

```{r}
#| include: false
# I presaved the data in my project's `data` folder
library(here) # provides the `here()` function
library(fs) # provides the `path()` function
here::here() |> 
  fs::path('Data/wcgs.rda') |> 
  load()

```

```{r}
#| label: tbl-wcgs
#| tbl-cap: "`wcgs` data"
wcgs |> head()
```


### Data cleaning

::: notes
Now let's do some data cleaning
:::

```{r}
#| code-fold: show
library(arsenal) # provides `set_labels()`
library(forcats) # provides `as_factor()`
library(haven)
library(plotly)
wcgs = wcgs |> 
  mutate(
    age = age |> 
      arsenal::set_labels("Age (years)"),
    
    arcus = 
      arcus |> 
      as.logical() |> 
      arsenal::set_labels("Arcus Senilis"),
    
    time169 = 
      time169 |> 
      as.numeric() |> 
      arsenal::set_labels("Observation (follow up) time (days)"),
    
    dibpat =
      dibpat |> 
      as_factor() |> 
      relevel(ref = "Type B") |> 
      arsenal::set_labels("Behavioral Pattern"),
    
    typchd69 = typchd69 |> 
      labelled(
        label = "Type of CHD Event",
        labels = 
          c(
            "None" = 0, 
            "infdeath" = 1,
            "silent" = 2,
            "angina" = 3)),
    
    # turn stata-style labelled variables in to R-style factors:
    across(
      where(is.labelled), 
      haven::as_factor)
  )

```

### What's in the data {.smaller}

Here's a table of the data:

```{r, results = "asis"}
wcgs |>
  select(-c(id, uni, t1)) |>
  tableby(chd69 ~ ., data = _) |>
  summary(
    pfootnote = TRUE,
    title =
      "Baseline characteristics by CHD status at end of follow-up")

```

### Data by age and personality type

For now, we will look at the interaction between age and personality
type (`dibpat`). To make it easier to visualize the data, we summarize
the event rates for each combination of age:

```{r}
#| label: "event rates by age"
chd_grouped_data = 
  wcgs |> 
  summarize(
    .by = c(age, dibpat),
    n = sum(chd69 %in% c("Yes", "No")),
    x = sum(chd69 == "Yes")) |> 
  mutate(
    `n - x` = n - x,
    `p(chd)` = (x / n) |> 
      labelled(label = "CHD Event by 1969"),
    `odds(chd)` = `p(chd)` / (1 - `p(chd)`), 
    `logit(chd)` = log(`odds(chd)`)
  )

chd_grouped_data
```

### Graphical exploration

```{r}
#| label: fig-chd-graph-prob-scale
#| fig-cap: "CHD rates by age group, probability scale"

library(ggplot2)
library(ggeasy)
library(scales)
chd_plot_probs = 
  chd_grouped_data |> 
  ggplot(
    aes(
      x = age, 
      y = `p(chd)`, 
      col = dibpat)
  ) +
  geom_point(aes(size = n), alpha = .7) + 
  scale_size(range = c(1,4)) +
  geom_line() +
  theme_bw() +
  ylab("P(CHD Event by 1969)") +
  scale_y_continuous(
    labels = scales::label_percent(),
    sec.axis = sec_axis(
      ~ odds(.),
      name = "odds(CHD Event by 1969)")) +
  ggeasy::easy_labs() +
  theme(legend.position = "bottom")

print(chd_plot_probs)
```

---

#### Odds scale

```{r}
#| label: fig-chd-odds-scale
#| fig-cap: "CHD rates by age group, odds spacing"

trans_odds = trans_new(
  name = "odds", 
  transform = odds, 
  inverse = odds_inv)

chd_plot_odds = chd_plot_probs + 
  scale_y_continuous(
    trans = trans_odds, # this line changes the vertical spacing
    name = chd_plot_probs$labels$y,
    sec.axis = sec_axis(
      ~ odds(.),
      name = "odds(CHD Event by 1969)"))

print(chd_plot_odds)
```

---

#### Log-odds (logit) scale

```{r}
#| label: fig-chd_plot_logit
#| fig-cap: "CHD data (logit-scale)"

trans_logit = trans_new(
  name = "logit", 
  transform = logit, 
  inverse = expit)

chd_plot_logit = 
  chd_plot_probs + 
  scale_y_continuous(
    trans = trans_logit, # this line changes the vertical spacing
    name = chd_plot_probs$labels$y,
    breaks = c(seq(.01, .1, by = .01), .15, .2),
    minor_breaks = NULL,
    sec.axis = sec_axis(
      ~ logit(.),
      name = "log(odds(CHD Event by 1969))"))

print(chd_plot_logit)
```

### Logistic regression models for CHD data {.smaller}

Here, we fit stratified models for CHD by personality type.

```{r}
#| tbl-cap: "CHD model, stratified parametrization"
#| label: tbl-chd-strat
chd_glm_strat = glm(
  "formula" = chd69 == "Yes" ~ dibpat + dibpat:age - 1, 
  "data" = wcgs,
  "family" = binomial(link = "logit")
)

chd_glm_strat |> parameters() |> print_md()

```

We can get the corresponding odds ratios ($e^{\beta}$s) by passing
`exponentiate = TRUE` to `parameters()`:

```{r}
#| tbl-cap: "Odds ratio estimates for CHD model"
#| label: tbl-chd-glm-strat-OR
chd_glm_strat |> 
  parameters(exponentiate = TRUE) |> 
  print_md()
```

### Models superimposed on data

We can graph our fitted models on each scale (probability, odds,
log-odds).

---

#### probability scale

```{r}

curve_type_A = function(x) 
{
  chd_glm_strat |> predict(
    type = "response",
    newdata = tibble(age = x, dibpat = "Type A"))
}

curve_type_B = function(x) 
{
  chd_glm_strat |> predict(
    type = "response",
    newdata = tibble(age = x, dibpat = "Type B"))
}

chd_plot_probs_2 =
  chd_plot_probs +
  geom_function(
    fun = curve_type_A,
    aes(col = "Type A")
  ) +
  geom_function(
    fun = curve_type_B,
    aes(col = "Type B")
  )
print(chd_plot_probs_2)
```

---

#### odds scale

```{r}
#| label: fig-fitted-odds-chd
# curve_type_A = function(x) 
# {
#   chd_glm_strat |> predict(
#     type = "link",
#     newdata = tibble(age = x, dibpat = "Type A")) |> exp()
# }
# curve_type_B = function(x) 
# {
#   chd_glm_strat |> predict(
#     type = "link",
#     newdata = tibble(age = x, dibpat = "Type B")) |> exp()
# }

chd_plot_odds_2 =
  chd_plot_odds +
  geom_function(
    fun = curve_type_A,
    aes(col = "Type A")
  ) +
  geom_function(
    fun = curve_type_B,
    aes(col = "Type B")
  )
print(chd_plot_odds_2)

```

---

#### log-odds (logit) scale

```{r}
#| label: fig-fitted-log-odds

chd_plot_logit_2 =
  chd_plot_logit +
  geom_function(
    fun = curve_type_A,
    aes(col = "Type A")
  ) +
  geom_function(
    fun = curve_type_B,
    aes(col = "Type B")
  )

print(chd_plot_logit_2)
```

### reference-group and contrast parametrization

We can also use the corner-point parametrization (with reference groups
and contrasts):

```{r}
#| label: tbl-model-corner-point
#| tbl-cap: "CHD model (corner-point parametrization)"
chd_glm_contrasts = 
  wcgs |> 
  glm(
    "data" = _,
    "formula" = chd69 == "Yes" ~ dibpat*I(age - 50), 
    "family" = binomial(link = "logit")
  )

chd_glm_contrasts |> 
  parameters() |> 
  print_md()
```

::: notes
Compare with @tbl-chd-glm-strat-OR.
:::

::: {.content-visible when-format="revealjs"}

```{r}
#| tbl-cap: "Odds ratio estimates for CHD model"
#| label: tbl-chd-glm-strat-OR-v2
chd_glm_strat |> 
  parameters() |> 
  print_md()
```

:::

---

:::{#exr-strat-to-contrast}
If I give you model 1, how would you get the coefficients of model 2?
:::

---

:::{#thm-logistic-OR}

For the logistic regression model:

- $Y_i|\vX_i \simind \Ber(\pi(\vX_i))$
- $\pi(\vx) = \expitf{\vx'\vb}$

Consider two covariate patterns, $\vx$ and $\vec{x^*}$.

The odds ratio comparing these covariate patterns is:

$$
\omega(\vx,\vec{x^*}) = \exp{(\vx-\vec{x^*})\' \vb}
$$

:::

---

::: proof
$$
\ba
\omega(\vx,\vec{x^*}) 
&= \frac {\odds(Y=1 | \vX = \vx)} {\odds(Y=1 | \vX = \vec{x^*})}
\\ &= \frac{\exp{\vx\'\vb}}{\exp{{\vec{x^*}}\' \vb}}
\\ &= \exp{\vx\'\vb - {\vec{x^*}}\' \vb}
\\ &= \exp{(\vx\' - {\vec{x^*}}\') \vb}
\\ &= \exp{{(\vx - \vec{x^*})}\' \vb}
\ea
$$
:::

## Fitting logistic regression models

### Maximum likelihood estimation for $\ciid$ data

Assume:

- $Y_i|\vX_i \simind \Ber(\pi(X_i))$
- $\pi(\vx) = \expitf{\vx'\vb}$

---

#### log-likelihood function

$$
\ba
\ell(\vb, \vy) 
   &= \log{\Lik(\vb, \vec y) }
\\ &= \sumin \red{\ell_i}(\pi(\vx_i))
\ea
$$ {#eq-loglik-bernoulli-iid}

---

$$
\ba
\red{\ell_i}(\pi)
   &= y_i \log{\pi} + (1 - y_i) \log{1-\pi}
\\ &= y_i \log{\pi} + (1 \cd \log{1-\pi} - y_i \cd \log{1-\pi})
\\ &= y_i \log{\pi} + (\log{1-\pi} - y_i \log{1-\pi})
\\ &= y_i \log{\pi} + \log{1-\pi} - y_i \log{\blue{1-\pi}}
\\ &= y_i \log{\pi} - y_i \log{\blue{1-\pi}} + \log{1-\pi}
\\ &= (y_i \log{\pi} - y_i \log{\blue{1-\pi}}) + \log{1-\pi}
\\ &= y_i (\log{\red{\pi}} - \log{\blue{1-\pi}}) + \log{1-\pi}
\\ &= y_i \paren{\log{\frac{\red{\pi}}{\blue{1-\pi}}}} + \log{1-\pi}
\\ &= y_i (\logit(\pi)) + \log{1-\pi}
\ea
$$


---

#### score function

$$
\ba
\ell'(\vb) 
   &\eqdef \deriv{\vb} \ell(\vb)
\\ &=      \deriv{\vb} \sumin \ell_i(\vb)
\\ &=      \sumin \deriv{\vb} \ell_i(\vb)
\\ &=      \sumin \ell'_i(\vb)
\ea
$$

---

$$
\ba
\ell_i'(\vb) 
   &= \deriv{\vb} y_i \paren{\logitf{\pi_i}} + \log{1-\pi_i}
\\ &= \deriv{\vb}\cb{y_i \paren{\vx_i'\vb} + \log{1-\pi_i}}
\\ &= \cb{y_i \deriv{\vb}\paren{\vx_i'\vb} + \deriv{\vb}\log{1-\pi_i}}
\\ &= \cb{\vx_i y_i + \deriv{\vb}\log{1-\expit(\vx_i'\vb)}}
\\ &= \cb{\vx_i y_i + \deriv{\vb}\log{\inv{1+\exp{\vx_i'\vb}}}}
\\ &= \cb{\vx_i y_i - \deriv{\vb}\log{1+\exp{\vx_i'\vb}}}
\ea
$$

---

Now we need to apply the [chain rule](math-prereqs.qmd#thm-chain-rule):

$$
\deriv{\beta}\log{1+\exp{\vx_i'\beta}} = 
\frac{1}{1+\exp{\vx_i'\beta}} \deriv{\beta}\cb{1+\exp{\vx_i'\beta}}
$$

$$
\ba
\deriv{\beta}\cb{1+\exp{\vx_i'\beta}}
   &= \exp{\vx_i'\beta} \deriv{\beta}\vx_i'\beta
\\ &= \vx_i \exp{\vx_i'\beta} 
\ea
$$

So:

$$
\ba
\deriv{\beta}\log{1+\exp{\vx_i'\beta}} 
   &= \frac{1}{1+\exp{\vx_i'\beta}} \exp{\vx_i'\beta} \vx_i
\\ &= \frac{\exp{\vx_i'\beta}}{1+\exp{\vx_i'\beta}}  \vx_i
\\ &= \vx_i \expitf{\vx_i'\beta}
\ea
$$

---

So:

$$
\ba
\llik_i'(\vb) 
&= \vx_i y_i - \vx_i \expitf{\vx_i'\beta} 
\\ &= \vx_i (y_i - \expitf{\vx_i'\beta})
\\ &= \vx_i (y_i - \pi_i)
\\ &= \vx_i (y_i - \Expp[Y_i|\vX_i=\vx_i])
\\ &= \vx_i \ \err(y_i|\vX_i=\vx_i)
\ea
$$

::: notes

This last expression is essentially the same as we found in [linear regression](Linear-models-overview.qmd#eq-scorefun-linreg).
:::

---

Putting the pieces of $\llik'(\vb)$ back together, we have:

$$
\llik'(\vb) = \sumin \cb{\vx_i(y_i - \expitf{\vx_i'\beta}) }
$$

Setting $\ell'(\vb; \vy) = 0$ gives us:


$$\sumin \cb{\vx_i(y_i - \expitf{\vx_i'\beta}) } = 0$$ {#eq-score-logistic}


---

::: notes

In general, the estimating equation $\ell'(\vb; \vy) = 0$ cannot
be solved analytically.

Instead, we can use the [Newton-Raphson method](intro-MLEs.qmd#sec-newton-raphson):

:::

$$
\esttmp{\theta} 
\leftarrow \esttmp{\theta} - \inv{\hessf{\vec y;\esttmp{\theta}}} 
\scoref{\vec y;\esttmp{\theta}}
$$

::: notes

We make an iterative series of
guesses, and each guess helps us make the next guess better (i.e., higher
log-likelihood). You can see some information about this process like so:

:::

```{r}
#| label: out-glm-fitting-process
#| code-fold: false
options(digits = 8)
temp = 
  wcgs |> 
  glm(
    control = glm.control(trace = TRUE),
    data = _,
    formula = chd69 == "Yes" ~ dibpat*age, 
    family = binomial(link = "logit")
  )

```

::: notes

After each iteration of the fitting procedure, the deviance
($2(\ell_{\text{full}} - \ell(\hat\beta))$ ) is printed. You can see
that the algorithm took six iterations to converge to a solution where
the likelihood wasn't changing much anymore.

:::

## Model comparisons for logistic models {#sec-gof}

### Deviance test

We can compare the maximized log-likelihood of our model,
$\ell(\hat\beta; \mathbf x)$, versus the log-likelihood of the full
model (aka saturated model aka maximal model), $\ell_{\text{full}}$,
which has one parameter per covariate pattern. With enough data,
$2(\ell_{\text{full}} - \ell(\hat\beta; \mathbf x)) \dot \sim \chi^2(N - p)$,
where $N$ is the number of distinct covariate patterns and $p$ is the
number of $\beta$ parameters in our model. A significant p-value for
this **deviance** statistic indicates that there's some detectable
pattern in the data that our model isn't flexible enough to catch.

::: callout-caution
The deviance statistic needs to have a large amount of data **for each
covariate pattern** for the $\chi^2$ approximation to hold. A guideline
from Dobson is that if there are $q$ distinct covariate patterns
$x_1...,x_q$, with $n_1,...,n_q$ observations per pattern, then the
expected frequencies $n_k \cdot \pi(x_k)$ should be at least 1 for every
pattern $k\in 1:q$.
:::

If you have covariates measured on a continuous scale, you may not be
able to use the deviance tests to assess goodness of fit.

### Hosmer-Lemeshow test

If our covariate patterns produce groups that are too small, a
reasonable solution is to make bigger groups by merging some of the
covariate-pattern groups together.

Hosmer and Lemeshow (1980) proposed that we group the patterns by their
predicted probabilities according to the model of interest. For example,
you could group all of the observations with predicted probabilities of
10% or less together, then group the observations with 11%-20%
probability together, and so on; $g=10$ categories in all.

Then we can construct a statistic
$$X^2 = \sum_{c=1}^g \frac{(o_c - e_c)^2}{e_c}$$ where $o_c$ is the
number of events *observed* in group $c$, and $e_c$ is the number of
events expected in group $c$ (based on the sum of the fitted values
$\hat\pi_i$ for observations in group $c$).

If each group has enough observations in it, you can compare $X^2$ to a
$\chi^2$ distribution; by simulation, the degrees of freedom has been
found to be approximately $g-2$.

For our CHD model, this procedure would be:

```{r}
wcgs = 
  wcgs |> 
  mutate(
    pred_probs_glm1 = chd_glm_strat |> fitted(),
    pred_prob_cats1 = 
      pred_probs_glm1 |> 
      cut(breaks = seq(0, 1, by = .1), 
          include.lowest = TRUE))

HL_table = 
  wcgs |> 
  summarize(
    .by = pred_prob_cats1,
    n = n(),
    o = sum(chd69 == "Yes"),
    e = sum(pred_probs_glm1)
  )

library(pander)
HL_table |> pander()

X2 = HL_table |> 
  summarize(
    `X^2` = sum((o-e)^2/e)
  ) |> 
  pull(`X^2`)
print(X2)

pval1 = pchisq(X2, lower = FALSE, df = nrow(HL_table) - 2)
```

Our statistic is $X^2 = `r X2`$; $p(\chi^2(1) > `r X2`) = `r pval1`$,
which is our p-value for detecting a lack of goodness of fit.

Unfortunately that grouping plan left us with just three categories with
any observations, so instead of grouping by 10% increments of predicted
probability, typically analysts use deciles of the predicted
probabilities:

```{r}
wcgs = 
  wcgs |> 
  mutate(
    pred_probs_glm1 = chd_glm_strat |> fitted(),
    pred_prob_cats1 = 
      pred_probs_glm1 |> 
      cut(breaks = quantile(pred_probs_glm1, seq(0, 1, by = .1)), 
          include.lowest = TRUE))

HL_table = 
  wcgs |> 
  summarize(
    .by = pred_prob_cats1,
    n = n(),
    o = sum(chd69 == "Yes"),
    e = sum(pred_probs_glm1)
  )

HL_table |> pander()

X2 = HL_table |> 
  summarize(
    `X^2` = sum((o-e)^2/e)
  ) |> 
  pull(`X^2`)

print(X2)

pval1 = pchisq(X2, lower = FALSE, df = nrow(HL_table) - 2)
```

Now we have more evenly split categories. The p-value is $`r pval1`$,
still not significant.

Graphically, we have compared:

```{r}

HL_plot = 
  HL_table |> 
  ggplot(aes(x = pred_prob_cats1)) + 
  geom_line(aes(y = e, x = pred_prob_cats1, group = "Expected", col = "Expected")) +
  geom_point(aes(y = e, size = n, col = "Expected")) +
  geom_point(aes(y = o, size = n, col = "Observed")) +
  geom_line(aes(y = o, col = "Observed", group = "Observed")) +
  scale_size(range = c(1,4)) +
  theme_bw() +
  ylab("number of CHD events") +
  theme(axis.text.x = element_text(angle = 45))
```

::: {.content-visible when-format="html"}
```{r}
ggplotly(HL_plot)
```
:::

::: {.content-visible when-format="pdf"}
```{r}
print(HL_plot)
```
:::

### Comparing models

-   AIC = $-2 * \ell(\hat\theta) + 2 * p$ \[lower is better\]
-   BIC = $-2 * \ell(\hat\theta) + p * \text{log}(n)$ \[lower is
better\]
-   likelihood ratio \[higher is better\]

## Residual-based diagnostics

### Logistic regression residuals only work for grouped data

Residuals only work if there is more than one observation for most
covariate patterns.

Here we will create the grouped-data version of our CHD model from the
WCGS study:

```{r}

wcgs_grouped = 
  wcgs |> 
  summarize(
    .by = c(dibpat, age),
    n = n(),
    chd = sum(chd69 == "Yes"),
    `!chd` = sum(chd69 == "No")
  )

chd_glm_strat_grouped = glm(
  "formula" = cbind(chd, `!chd`) ~ dibpat + dibpat:age - 1, 
  "data" = wcgs_grouped,
  "family" = binomial(link = "logit")
)

chd_glm_strat_grouped |> parameters() |> print_md()

```

### (Response) residuals

$$e_k \eqdef \bar y_k - \hat{\pi}(x_k)$$

($k$ indexes the covariate patterns)

We can graph these residuals $e_k$ against the fitted values
$\hat\pi(x_k)$:

```{r}
#| code-fold: show
wcgs_grouped = 
  wcgs_grouped |> 
  mutate(
    fitted = chd_glm_strat_grouped |> fitted(),
    fitted_logit = fitted |> logit(),
    response_resids = 
      chd_glm_strat_grouped |> resid(type = "response")
  )

wcgs_response_resid_plot = 
  wcgs_grouped |> 
  ggplot(
    mapping = aes(
      x = fitted,
      y = response_resids
    )
  ) + 
  geom_point(
    aes(col = dibpat)
  ) +
  geom_hline(yintercept = 0) + 
  geom_smooth( #<1>
    se = TRUE,  #<1>
    method.args = list( #<1>
      span=2/3, #<1>
      degree=1, #<1>
      family="symmetric", #<1>
      iterations=3), #<1>
    method = stats::loess) #<1>

```

1.  Don't worry about these options for now; I chose them to match
`autoplot()` as closely as I can. `plot.glm` and `autoplot` use
`stats::lowess` instead of `stats::loess`; `stats::lowess` is older,
hard to use with `geom_smooth`, and hard to match exactly with
`stats::loess`; see https://support.bioconductor.org/p/2323/.\]

::: {.content-visible when-format="pdf"}
```{r}
#| fig-height: 6
wcgs_response_resid_plot |> print()
```
:::

::: {.content-visible when-format="html"}
```{r}
wcgs_response_resid_plot |> ggplotly()
```
:::

We can see a slight fan-shape here: observations on the right have
larger variance (as expected since $var(\bar y) = \pi(1-\pi)/n$ is
maximized when $\pi = 0.5$).

### Pearson residuals

The fan-shape in the response residuals plot isn't necessarily a concern
here, since we haven't made an assumption of constant residual variance,
as we did for linear regression.

However, we might want to divide by the standard error in order to make
the graph easier to interpret. Here's one way to do that:

The Pearson (chi-squared) residual for covariate pattern $k$ is: $$
\begin{aligned}
X_k &= \frac{\bar y_k - \hat\pi_k}{\sqrt{\hat \pi_k (1-\hat\pi_k)/n_k}}
\end{aligned}
$$

where $$
\begin{aligned}
\hat\pi_k 
&\eqdef \hat\pi(x_k)\\
&\eqdef \hat P(Y=1|X=x_k)\\ 
&\eqdef \expit(x_i'\hat \beta)\\
&\eqdef \expit(\hat \beta_0 + \sum_{j=1}^p \hat \beta_j x_{ij})
\end{aligned}
$$

Let's take a look at the Pearson residuals for our CHD model from the
WCGS data (graphed against the fitted values on the logit scale):

```{r}
library(ggfortify)
```

::: {.content-visible when-format="pdf"}
```{r}
#| fig-height: 6
autoplot(chd_glm_strat_grouped, which = 1, ncol = 1) |> print()
```
:::

::: {.content-visible when-format="html"}
```{r}
autoplot(chd_glm_strat_grouped, which = 1, ncol = 1) |> print()
```
:::

The fan-shape is gone, and these residuals don't show any obvious signs
of model fit issues.

#### Pearson residuals plot for `beetles` data

If we create the same plot for the `beetles` model, we see some strong
evidence of a lack of fit:

::: {.content-visible when-format="html"}
```{r}
autoplot(beetles_glm_grouped, which = 1, ncol = 1) |> print()
```
:::

::: {.content-visible when-format="pdf"}
```{r}
#| fig-height: 6
autoplot(beetles_glm_grouped, which = 1, ncol = 1) |> print()
```
:::

#### Pearson residuals with individual (ungrouped) data

What happens if we try to compute residuals without grouping the data by
covariate pattern?

```{r}
library(ggfortify)
```

::: {.content-visible when-format="html"}
```{r}
autoplot(chd_glm_strat, which = 1, ncol = 1) |> print()
```
:::

::: {.content-visible when-format="pdf"}
```{r}
#| fig-height: 6
autoplot(chd_glm_strat, which = 1, ncol = 1) |> print()
```
:::

Meaningless.

#### Residuals plot by hand (*optional section*)

If you want to check your understanding of what these residual plots
are, try building them yourself:

```{r}

wcgs_grouped = 
  wcgs_grouped |> 
  mutate(
    fitted = chd_glm_strat_grouped |> fitted(),
    fitted_logit = fitted |> logit(),
    resids = chd_glm_strat_grouped |> resid(type = "pearson")
  )

wcgs_resid_plot1 = 
  wcgs_grouped |> 
  ggplot(
    mapping = aes(
      x = fitted_logit,
      y = resids
      
    ) 
    
  ) + 
  geom_point(
    aes(col = dibpat)
  ) +
  geom_hline(yintercept = 0) + 
  geom_smooth(se = FALSE, 
              method.args = list(
                span=2/3,
                degree=1,
                family="symmetric",
                iterations=3,
                surface="direct"
                # span = 2/3, 
                # iterations = 3
              ),
              method = stats::loess)
# plot.glm and autoplot use stats::lowess, which is hard to use with 
# geom_smooth and hard to match exactly; 
# see https://support.bioconductor.org/p/2323/

```

::: {.content-visible when-format="pdf"}
```{r}
#| fig-height: 6
wcgs_resid_plot1 |> print()
```
:::

::: {.content-visible when-format="html"}
```{r}
wcgs_resid_plot1 |> ggplotly()
```
:::

### Pearson chi-squared goodness of fit test

The Pearson chi-squared goodness of fit statistic is: $$
X^2 = \sum_{k=1}^m X_k^2 
$$ Under the null hypothesis that the model in question is correct
(i.e., sufficiently complex), $X^2\ \dot \sim\ \chi^2(N-p)$.

```{r}

X = chd_glm_strat_grouped |> 
  resid(type = "pearson")

chisq_stat = sum(X^2)

pval = pchisq(
  chisq_stat, 
  lower = FALSE, 
  df = length(X) - length(coef(chd_glm_strat_grouped)))


```

For our CHD model, the p-value for this test is `r pval`; no significant
evidence of a lack of fit at the 0.05 level.

#### Standardized Pearson residuals

Especially for small data sets, we might want to adjust our residuals
for leverage (since outliers in $X$ add extra variance to the
residuals):

$$r_{P_k} = \frac{X_k}{\sqrt{1-h_k}}$$

where $h_k$ is the leverage of $X_k$. The functions `autoplot()` and
`plot.lm()` use these for some of their graphs.

### Deviance residuals

For large sample sizes, the Pearson and deviance residuals will be
approximately the same. For small sample sizes, the deviance residuals
from covariate patterns with small sample sizes can be unreliable (high
variance).

$$d_k = \text{sign}(y_k - n_k \hat \pi_k)\left\{\sqrt{2[\ell_{\text{full}}(x_k) - \ell(\hat\beta; x_k)]}\right\}$$

#### Standardized deviance residuals

$$r_{D_k} = \frac{d_k}{\sqrt{1-h_k}}$$

### Diagnostic plots

Let's take a look at the full set of `autoplot()` diagnostics now for
our `CHD` model:

```{r}
#| fig-height: 6
#| fig-cap: "Diagnostics for CHD model"
#| label: fig-chd-model-diagnostics

chd_glm_strat_grouped |> autoplot(which = 1:6) |> print()
```

::: notes

Things look pretty good here. The QQ plot is still usable; with large
samples; the residuals should be approximately Gaussian.

:::

#### Beetles

Let's look at the beetles model diagnostic plots for comparison:

```{r}
#| fig-height: 6
#| label: fig-beetles-glm-diag
#| fig-cap: "Diagnostics for logistic model of `BeetleMortality` data"
beetles_glm_grouped |> autoplot(which = 1:6) |> print()
```


Hard to tell much from so little data, but there might be some issues
here.

## Other link functions for Bernoulli outcomes

If you want risk ratios, you can sometimes get them by changing the link
function:

```{r}

data(anthers, package = "dobson")
anthers.sum<-aggregate(
  anthers[c("n","y")], 
  by=anthers[c("storage")],FUN=sum) 

anthers_glm_log = glm(
  formula = cbind(y,n-y)~storage,
  data=anthers.sum, 
  family=binomial(link="log"))

anthers_glm_log |> parameters() |> print_md()

```

---

Now $\exp{\beta}$ gives us risk ratios instead of odds ratios:

```{r}
anthers_glm_log |> parameters(exponentiate = TRUE) |> print_md()

```

---

Let's compare this model with a logistic model:

```{r}

anthers_glm_logit = glm(
  formula = cbind(y, n - y) ~ storage,
  data = anthers.sum,
  family = binomial(link = "logit"))

anthers_glm_logit |> parameters(exponentiate = TRUE) |> print_md()

```

\[to add: fitted plots on each outcome scale\]

---

When I try to use `link ="log"` in practice, I often get errors about
not finding good starting values for the estimation procedure. 
This is likely because the model is producing fitted probabilities greater than
1.

When this happens, you can try to fit Poisson regression models instead
(we will see those soon!). 
But then the outcome distribution isn't quite
right, and you won't get warnings about fitted probabilities greater
than 1. 
In my opinion, the Poisson model for binary outcomes is
confusing and not very appealing.

### WCGS: link functions

```{r}
wcgs_glm_logit_link = 
  chd_grouped_data |> 
  mutate(type = dibpat |> relevel(ref = "Type B")) |> 
  glm(
    "formula" = cbind(x, `n - x`) ~ dibpat * age, 
    "data" = _,
    "family" = binomial(link = "logit")
  )

wcgs_glm_identity_link = 
  chd_grouped_data |> 
  mutate(type = dibpat |> relevel(ref = "Type B")) |> 
  glm(
    "formula" = cbind(x, `n - x`) ~ dibpat * age, 
    "data" = _,
    "family" = binomial(link = "identity")
  )
wcgs_glm_identity_link |> coef() |> pander()
```

```{r}
#| label: fig-diagnostics-glm1
#| fig-cap: "Residuals vs Fitted plot for `wcgs` models"
#| layout-ncol: 2
#| fig-subcap: 
#| - "Logistic link"
#| - "Identity link"

library(ggfortify)
wcgs_glm_logit_link |> autoplot(which = c(1), ncol = 1) + facet_wrap(~dibpat)
wcgs_glm_identity_link |> autoplot(which = c(1), ncol = 1) + facet_wrap(~dibpat)
```

```{r}
#| label: fig-diagnostics-beetles
#| fig-cap: "Residuals vs Fitted plot for `BeetleMortality` models"
#| layout-ncol: 2
#| fig-subcap: 
#| - "Logistic link"
#| - "Identity link"

beetles_lm = 
  beetles_long |> 
  lm(formula = died ~ dose)

beetles = 
  beetles |> mutate(
    resid_logit = beetles_glm_grouped |> resid(type = "response"))
beetles_glm_grouped |> autoplot(which = c(1), ncol = 1)
beetles_lm |> autoplot(which = c(1), ncol = 1)
```

## Quasibinomial

See [Hua Zhou](https://hua-zhou.github.io/)'s [lecture notes](https://ucla-biostat-200c-2020spring.github.io/slides/04-binomial/binomial.html#:~:text=0.05%20%27.%27%200.1%20%27%20%27%201-,Quasi%2Dbinomial,-Another%20way%20to)

## Further reading

- @hosmer2013applied is a classic textbook on logistic regression