More on model building
Including some needed material in model-basics
hadley committed Jul 26, 2016
1 parent 62824d9 commit 8865099
Showing 4 changed files with 198 additions and 126 deletions.
1 change: 1 addition & 0 deletions DESCRIPTION
@@ -10,6 +10,7 @@ URL: https://github.com/hadley/r4ds
Imports:
bookdown,
broom,
condvis,
dplyr,
gapminder,
ggplot2,
24 changes: 0 additions & 24 deletions extra/heights.Rmd
@@ -299,30 +299,6 @@ tibble(education = seq(5, 25)) %>%
```


Other useful arguments to `seq_range()`:

* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:

```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variable has a long-tailed distribution and you want to focus on generating
values near the center:
```{r}
x <- rcauchy(100)
seq_range(x, n = 5)
seq_range(x, n = 5, trim = 0.10)
seq_range(x, n = 5, trim = 0.25)
seq_range(x, n = 5, trim = 0.50)
```
### Additive models


127 changes: 126 additions & 1 deletion model-basics.Rmd
@@ -475,7 +475,39 @@ grid <- sim4 %>%
grid
```

Note my use of `seq_range()` inside `expand()`. Instead of using every unique value of `x`, I'm going to use a regularly spaced grid of five values between the minimum and maximum numbers. It's probably not super important here, but it's a useful technique in general. There are three other useful arguments to `seq_range()`:

* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:

```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variable has a long-tailed distribution and you want to focus on generating
values near the center:
```{r}
x1 <- rcauchy(100)
seq_range(x1, n = 5)
seq_range(x1, n = 5, trim = 0.10)
seq_range(x1, n = 5, trim = 0.25)
seq_range(x1, n = 5, trim = 0.50)
```
* `expand = 0.1` is in some sense the opposite of `trim()`: it expands the
range by 10%.
```{r}
x2 <- c(0, 1)
seq_range(x2, n = 5)
seq_range(x2, n = 5, expand = 0.10)
seq_range(x2, n = 5, expand = 0.25)
seq_range(x2, n = 5, expand = 0.50)
```
Next let's try and visualise that model. We have two continuous predictors, so you can imagine the model like a 3d surface. We could display that using `geom_tile()`:
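A minimal sketch of that kind of display, assuming `grid` already carries a `pred` column and a `model` column from `gather_predictions(mod1, mod2)`, might look like:

```{r}
# Sketch (assumes grid has model and pred columns from gather_predictions()):
# one tile per (x1, x2) combination, coloured by the predicted value
ggplot(grid, aes(x1, x2)) +
  geom_tile(aes(fill = pred)) +
  facet_wrap(~ model)
```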
@@ -514,6 +546,97 @@ model_matrix(df, y ~ x^2 + x)
model_matrix(df, y ~ I(x^2) + x)
```

Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem, which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a linear function to get arbitrarily close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function, `poly()`:

```{r}
model_matrix(df, y ~ poly(x, 2))
```

However, there's one major problem with using `poly()`: outside the range of the data, polynomials rapidly shoot off to positive or negative infinity. One safer alternative is to use the natural spline, `splines::ns()`.

```{r}
library(splines)
model_matrix(df, y ~ ns(x, 2))
```

Let's see what that looks like when we try and approximate a non-linear function:

```{r}
sim5 <- tibble(
x = seq(0, 3.5 * pi, length = 50),
y = 4 * sin(x) + rnorm(length(x))
)
ggplot(sim5, aes(x, y)) +
geom_point()
```

I'm going to fit five models to this data.

```{r}
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)
grid <- sim5 %>%
expand(x = seq_range(x, n = 50, expand = 0.1)) %>%
gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")
ggplot(sim5, aes(x, y)) +
geom_point() +
geom_line(data = grid, colour = "red") +
facet_wrap(~ model)
```

Notice that the extrapolation outside the range of the data is clearly bad. This is the downside of approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you whether the behaviour holds once you start extrapolating outside the range of the data that you have seen. You must rely on theory or science.
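One way to make this concrete is to generate predictions well beyond the observed range of `x` (0 to 3.5 * pi) and overlay them on the data; a minimal sketch, assuming the models and packages above are still loaded:

```{r}
# Sketch: predict far outside the observed range of x. Each spline fit keeps
# extrapolating its edge behaviour, which quickly departs from the sine pattern.
outside <- tibble(x = seq(-2 * pi, 6 * pi, length = 50)) %>%
  gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")

ggplot(sim5, aes(x, y)) +
  geom_point() +
  geom_line(data = outside, colour = "red") +
  facet_wrap(~ model)
```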

### Interpolation vs. extrapolation

So far, when we've visualised the predictions from a model, we've been careful to overlay them on the data. This is important because it helps make sure we're not extrapolating the model to data that is far away from what we've observed.

However, as you start working with large datasets, overlaying all the data will become increasingly challenging. Particularly as you'll often use a model to simplify away some of the complexities of the raw data! To make your life easier, you might be tempted to just display the predictions. This is dangerous because you might have accidentally generated predictions that are very far away from your data.

As a compromise, you can use the convenient `similarityweight()` function from the __condvis__ package by Mark O'Connell. You give it your prediction grid and your dataset, and it computes the similarity between each observation in your original dataset and the prediction grid. You can then display only the points that are close to your predictions. If no points are close, you know that your prediction grid is dangerous!

```{r}
sim6 <- tibble(
x1 = rbeta(1000, 2, 5),
x2 = rbeta(1000, 2, 5),
y = 2 * x1 * x2 + x1 + 2 + rnorm(length(x1))
)
mod <- lm(y ~ x1 * x2, data = sim6)
grid <- sim6 %>%
expand(
x1 = seq_range(x1, 10),
x2 = c(0, 0.5, 1, 1.5)
) %>%
add_predictions(mod)
add_similarity <- function(data, grid, ..., .similarity = "sim") {
  # Only compare the variables that appear in both the data and the grid
  common <- intersect(names(data), names(grid))
  message("Using ", paste(common, collapse = ", "))
  # similarityweight() returns one column per observation in `data`;
  # keep each observation's best match across the grid points
  sim_m <- condvis::similarityweight(grid, data[common], ...)
  sim <- apply(sim_m, 2, max)
  # Store the similarity score as a new column
  data[[.similarity]] <- sim
  data
}
sim6 %>%
add_similarity(grid) %>%
filter(sim > 0) %>%
ggplot(aes(x1, y, group = x2)) +
geom_point(aes(alpha = sim)) +
geom_line(data = grid, aes(y = pred)) +
scale_alpha(limits = c(0, NA))
```

### Exercises

1. What happens if you repeat the analysis of `sim2` using a model without
@@ -562,6 +685,8 @@ nrow(df)

## Other model families

Patterns can show up in the variance, not just the mean, or in other characteristics of the distribution of the residuals.

This chapter has focussed exclusively on the class of linear models, which assume a relationship of the form `y = a_1 * x1 + a_2 * x2 + ... + a_n * xn`. Linear models additionally assume that the residuals have a normal distribution, which we haven't talked about. There is a large set of model classes that extend the linear model in various interesting ways. Some of them are:

* __Generalised linear models__, e.g. `stats::glm()`. Linear models assume that