-
Notifications
You must be signed in to change notification settings - Fork 10
/
README.Rmd
260 lines (157 loc) · 10.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
output:
md_document:
variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#",
fig.path = "tools/"
)
```
# algstat
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/algstat)](https://cran.r-project.org/package=algstat)
[![Travis build status](https://travis-ci.org/dkahle/algstat.svg?branch=master)](https://travis-ci.org/dkahle/algstat)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/dkahle/algstat?branch=master&svg=true)](https://ci.appveyor.com/project/dkahle/algstat)
<!-- badges: end -->
__algstat__ is a collection of tools to help you use algebraic statistical methods in R. It is intended to be an end user package for that purpose, and consequently depends on other packages that make connections to math software to do key computations. Currently, [the __latte__ package](https://github.com/dkahle/latte.git) is used to connect R to [LattE](https://www.math.ucdavis.edu/~latte/) with [4ti2](http://www.4ti2.de) for lattice problems and the computation of Markov bases, [the __m2r__ package](https://github.com/coneill-math/m2r.git) connects R to [Macaulay2](http://www.math.uiuc.edu/Macaulay2/) for algebraic computations, and [the __bertini__ package](https://github.com/dkahle/bertini) connects R to [Bertini](https://bertini.nd.edu), which is used to numerically solve systems of polynomial equations. These are at varying stages of development. __m2r__ can be used for that purpose to some extent currently by using Macaulay2's connection to [PHCPack](http://homepages.math.uic.edu/~jan/download.html), so be sure to look there for now if that's what you're trying to do.
If you have something you're wanting implemented here, please file an issue!
# Exact inference with log-linear models
_Note: this section assumes you have [__latte__](https://github.com/dkahle/latte.git) installed and working on your machine._
One of the most well-developed parts of the package allows users to perform (conditional) exact tests for log-linear models. There are several great references on the math behind this, such as [Diaconis and Sturmfels' original paper](http://projecteuclid.org/euclid.aos/1030563990), the [Lectures on Algebraic Statistics](http://smile.amazon.com/Lectures-Algebraic-Statistics-Oberwolfach-Seminars/dp/3764389044/ref=sr_1_1?ie=UTF8&qid=1430536908&sr=8-1&keywords=lectures+on+algebraic+statistics), and [Markov Bases in Algebraic Statistics](http://smile.amazon.com/Markov-Bases-Algebraic-Statistics-Springer/dp/1461437180/ref=sr_1_fkmr0_1?ie=UTF8&qid=1430536933&sr=8-1-fkmr0&keywords=aoki%2C+hada%2C+and+takemura), so we'll keep the technical discussion to a minimum.
### Fisher's exact test
We'll begin by doing Fisher's exact test on a built-in dataset called politics.
```{r politics}
library("algstat")
data(politics)
politics
```
Here's how you typically do Fisher's exact test in R:
```{r fisher}
fisher.test(politics)
```
Since the independence model is log-linear, this exact same procedure can be done with __algstat__. The go-to function here is `loglinear()`:
```{r loglinear-intro-prep, echo=FALSE}
set.seed(1L)
```
```{r loglinear-intro}
loglinear(~ Personality + Party, data = politics)
```
Exact inference in algebraic statistics is done using MCMC to sample from the conditional distribution of the data given its sufficient statistics under the model. Consequently, the p-values estimated are only determined up to Monte Carlo error. The standard p-value is given under the column `p.value` in the row labeled `P(samp)`.
The asymptotic test of independence analogous to Fisher's exact test (which does not condition on the marginals being known) can be done in either of two ways.
The first way uses the `loglin()` function from the __stats__ package. It outputs the likelihood ratio statistic (`Likelihood G^2` in the output above) and Pearson's chi-squared statistic (`Pearson X^2` above), but you have to calculate the p-value yourself.
```{r loglin}
(loglinMod <- stats::loglin(politics, list(1, 2)))
pchisq(loglinMod$pearson, df = 1, lower.tail = FALSE)
```
The second way is the `loglm()` function in the __MASS__ package, which is a nice wrapper of `loglin()` (in fact, __algstat__'s `loglinear()` function uses the IPF implementation from `loglin()`, although it doesn't need to). It's syntax looks identical to `loglinear()`'s above:
```{r loglm}
MASS::loglm(~ Personality + Party, data = politics)
```
### Fisher's exact test on RxC tables
Doing Fisher's exact test on larger problems is a significantly more complicated problem. The documentation for `fisher.test()` illustrates how it can be used on RxC tables in general, not just on 2x2 tables. Here's an example from its documentation drawn from Agresti (2002, p.57):
```{r big-fisher}
Job <- matrix(
c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), nrow = 4, ncol = 4,
dimnames = list(
"income" = c("< 15k", "15-25k", "25-40k", "> 40k"),
"satisfaction" = c("VeryD", "LittleD", "ModerateS", "VeryS")
)
)
Job
fisher.test(Job)
```
Here's the __algstat__ counterpart:
```{r big-fisher-log-linear}
loglinear(~ income + satisfaction, data = Job)
```
The asymptotic test can be performed as well. The chi-square approximation is actually very good here:
```{r big-fisher-loglm}
MASS::loglm(~ income + satisfaction, data = Job)
```
### Fisher's exact test on multi-way tables
`fisher.test()` does not work with multi-way tables and is prone to crashing even in large-celled two-way tables (see `?loglinear` for an example). Thus, the only way to do exact inference in multi-way tables is to use `loglinear()`. We'll illustrate this using the drugs dataset from `loglinear()`'s documentation, taken from Agresti (2002, p.322), on which we'll test the no-three-way interaction model:
```{r log-linear-multi-way}
data(drugs)
ftable(drugs)
loglinear(subsets(1:3, 2), data = drugs)
```
Note that here we've used the more concise syntax of facet specification; if you want to understand the model specification better, read the documentation in `?loglinear`. You can perform the asymptotic test with `loglm()` like this:
```{r log-lm-multi-way}
MASS::loglm(~ 1*2 + 2*3 + 1*3, data = drugs)
```
# Other statistical applications of LattE and 4ti2
_Note: this section assumes you have [__latte__](https://github.com/dkahle/latte.git) installed and working on your machine._
Most [LattE](https://www.math.ucdavis.edu/~latte/) programs are available as functions in __latte__, which is imported by __algstat__. Checkout the readme for __latte__ [here](https://github.com/dkahle/latte).
There are many statistical applications and potential applications of LattE in R. One example is found in the `count` program, implemented in `latte::latte_count()`. `latte::latte_count()` counts the number of integer points in a [convex polytope](https://en.wikipedia.org/wiki/Convex_polytope). This can be useful for counting the number of contingency tables with fixed marginals. __algstat__ uses `latte::latte_count()` in the `count_tables()` function, which determines the number of contingency tables in the [fiber (isostatistical region)](http://en.wikipedia.org/wiki/Fiber_(mathematics)) of a table given an [exponential family model](http://en.wikipedia.org/wiki/Exponential_family).
```{r count-tables}
count_tables(politics) # the independence model is the default
```
For example, we can determine the number of tables with the same row sums of `politics` as follows:
```{r count-tables-2}
(A <- hmat(varlvls = c(2, 2), facets = 1:2)[1:2,])
count_tables(politics, A)
```
# Numerically solving systems of polynomial equations
_Note: this section assumes you have [Bertini](https://bertini.nd.edu) installed and algstat has registered it._
__algstat__ also provides back-end connections to [Bertini](https://bertini.nd.edu) to solve systems of polynomial equations. While this work is still being implemented, here's a peak at what it can currently do.
First, __algstat__ can run raw Bertini programs using `bertini()`. It also has a nice print method to display the results. For example, here's how you would find the intersection of the line f(x) = x and the unit circle using Bertini:
```{r raw-bertini}
code <- "
INPUT
variable_group x, y;
function f, g;
f = x^2 + y^2 - 1;
g = y - x;
END;
"
bertini(code)
```
Even better, __algstat__ can team up with [__mpoly__](http://github.com/dkahle/mpoly) (working under the hood) to solve systems of polynomial equations using `poly_solve()`:
```{r poly-solve, fig.height = 3}
library("ggplot2"); theme_set(theme_minimal())
ggvariety(mp("(y - x^2) (y - (2 - x^2))"), xlim = c(-2,2), ylim = c(0,2), n = 351)
poly_solve(c("y = x^2", "y = 2 - x^2"), varorder = c("x", "y"))
```
# Variety normal distribution
```{r rvnorm, message=FALSE}
options("mc.cores" = parallel::detectCores())
p <- mp("(x^2 + y^2 - 1)^3 - x^2 y^3")
(samps <- rvnorm(500, p, .025, "tibble", chains = 8))
ggplot(samps, aes(x, y)) + geom_point(size = .5) + coord_equal()
ggplot(samps, aes(x, y, color = iter)) +
geom_point(size = .5) + geom_path(alpha = .3) +
coord_equal() + facet_wrap(~ factor(chain), nrow = 2)
```
For semi-algebraic sets:
```{r rvnorm-semi-alg, message=FALSE}
p <- mp("(x^2 + y^2 - 1)^3 - x^2 y^3 + s^2")
samps <- rvnorm(500, p, .025, "tibble", chains = 8)
ggplot(samps, aes(x, y)) + geom_point(size = .5) + coord_equal()
```
After being migrated onto the variety or semi-algebraic set, these can be used
as a mesh on that geometry. Here's a cool image made using [__ggforce__](https://github.com/thomasp85/ggforce)'s `geom_voronoi_segment()`:
```{r mesh, message=FALSE}
library("ggforce")
ggplot(samps, aes(x, y)) + geom_voronoi_segment() + coord_equal()
```
# Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant Nos. [1622449](https://nsf.gov/awardsearch/showAward?AWD_ID=1622449) and [1622369](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1622369).
# Installation
## Installing algstat
There are currently two ways to get __algstat__. Here's the preferred way until we push a new release to CRAN:
* From Github:
```{r eval=FALSE}
if (!requireNamespace("devtools")) install.packages("devtools")
devtools::install_github("dkahle/mpoly")
devtools::install_github("coneill-math/m2r")
devtools::install_github("dkahle/latte")
devtools::install_github("dkahle/bertini")
devtools::install_github("dkahle/algstat")
```
* From CRAN: `install.packages("algstat")` (don't get this now; it's currently out of date)
## Installing supporting software
Coming soon! See the links above for direct information.