---
output: github_document
---
```{r setup, include=FALSE, message=FALSE, warning=TRUE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = 'center',
fig.path = "man/figures/")
```
# triplot <img src="man/figures/logo.png" align="right" width="150"/>
<!-- badges: start -->
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/triplot)](https://cran.r-project.org/package=triplot)
[![R build status](https://github.com/ModelOriented/triplot/workflows/R-CMD-check/badge.svg)](https://github.com/ModelOriented/triplot/actions?query=workflow%3AR-CMD-check)
[![Codecov test coverage](https://codecov.io/gh/ModelOriented/triplot/branch/master/graph/badge.svg)](https://codecov.io/gh/ModelOriented/triplot?branch=master)
[![DrWhy-eXtrAI](https://img.shields.io/badge/DrWhy-eXtrAI-4378bf)](http://drwhy.ai/#eXtraAI)
<!-- badges: end -->
## Introduction
The `triplot` package provides tools for exploring machine learning predictive models. It contains an instance-level explainer called `predict_aspects` (also known as `aspects_importance`) that explains the contribution of whole groups of explanatory variables. Furthermore, the package delivers the `triplot` functionality, which illustrates how the importance of aspects (groups of predictors) changes depending on the size of the aspects.
Key functions:
* `predict_triplot()` and `model_triplot()` for instance- and data-level summary of automatic aspect importance grouping,
* `predict_aspects()` for calculating the feature groups importance (called aspects importance) for a selected observation,
* `group_variables()` for grouping of correlated numeric features into aspects.
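The grouping behind `group_variables()` can be illustrated in base R alone: features are clustered hierarchically on the distance `1 - |correlation|`, and the tree is cut at a chosen correlation level. This is only a conceptual sketch of the idea, not the package's exact implementation:

```{r}
# Conceptual sketch of correlation-based grouping (base R only);
# triplot::group_variables() implements this idea, details may differ
set.seed(1)
x <- data.frame(a = rnorm(100))
x$b <- x$a + rnorm(100, sd = 0.1)  # strongly correlated with a
x$c <- rnorm(100)                  # roughly independent of a and b

d      <- as.dist(1 - abs(cor(x))) # distance: 1 - |Pearson correlation|
hc     <- hclust(d)                # hierarchical clustering of features
groups <- cutree(hc, h = 1 - 0.5)  # cut where |correlation| >= 0.5
split(names(groups), groups)       # a and b share a group, c stays alone
```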
The `triplot` package is part of the [DrWhy.AI](http://DrWhy.AI) universe. More information about the analysis of machine learning models can be found in
the [Explanatory Model Analysis. Explore, Explain and Examine Predictive Models](https://pbiecek.github.io/ema/) e-book.
<center>
![](https://raw.githubusercontent.com/ModelOriented/triplot/master/README_files/triplot_explained.gif)
</center>
## Installation
```{r eval = FALSE}
# from CRAN:
install.packages("triplot")
# from GitHub (development version):
# install.packages("devtools")
devtools::install_github("ModelOriented/triplot")
```
## Overview
`triplot` shows, in one place:
* the importance of every single feature,
* hierarchical aspects importance,
* order of grouping features into aspects.
We can use it to investigate the **instance level** importance of features (using the `predict_aspects()` function) or to illustrate the **model level** importance of features (using the `model_parts()` function from the DALEX package). `triplot` can only be used on numerical features. More information about this functionality can be found in the [triplot overview](https://modeloriented.github.io/triplot/articles/vignette_aspect_importance.html#hierarchical-aspects-importance-1).
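As a sketch of the model-level route: DALEX's `model_parts()` accepts a `variable_groups` argument, so hand-picked aspects can be scored without building the full triplot. The groups below are an arbitrary illustration, not a recommendation:

```{r eval = FALSE}
library("DALEX")
# build a small model on the numeric columns of apartments
apartments_num   <- apartments[, sapply(apartments, is.numeric)]
model_apartments <- lm(m2.price ~ ., data = apartments_num)
explain_apartments <- explain(model_apartments,
                              data = apartments_num[, -1],
                              y = apartments_num$m2.price,
                              verbose = FALSE)
# importance of hand-picked groups of variables
mp <- model_parts(explain_apartments,
                  variable_groups = list(size = c("surface", "no.rooms"),
                                         age  = "construction.year",
                                         loc  = "floor"))
plot(mp)
```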
### Basic triplot for a model
To showcase `triplot`, we will take the `apartments` dataset from DALEX, use its numeric features to build a model, create a DALEX [explainer](https://modeloriented.github.io/DALEX/reference/explain.html), use `model_triplot()` to compute the triplot object, and then plot it with the generic `plot()` function.
#### Import `apartments` and train a linear model
```{r}
library("DALEX")
apartments_num <- apartments[,unlist(lapply(apartments, is.numeric))]
model_apartments <- lm(m2.price ~ ., data = apartments_num)
```
#### Create an explainer
```{r}
explain_apartments <- DALEX::explain(model = model_apartments,
                                     data = apartments_num[, -1],
                                     y = apartments_num$m2.price,
                                     verbose = FALSE)
```
#### Create a triplot object
```{r model-triplot, fig.width = 8, fig.height = 2.5, fig.cap="The left panel shows the global importance of individual variables. Right panel shows global correlation structure visualized by hierarchical clustering. The middle panel shows the importance of groups of variables determined by the hierarchical clustering."}
set.seed(123)
library("triplot")
tri_apartments <- model_triplot(explain_apartments)
plot(tri_apartments) +
patchwork::plot_annotation(title = "Global triplot for four variables in the linear model")
```
At the model level, `surface` and `floor` have the biggest contributions. But we also know that `number of rooms` and `surface` are strongly correlated and together have a strong influence on the model prediction. `Construction year` has a small influence on the prediction and is correlated with neither `number of rooms` nor `surface`. Adding `construction year` to them only slightly increases the importance of this group.
### Basic triplot for an observation
Next, we build a triplot for a single instance and its prediction.
```{r predict-triplot, fig.width = 8, fig.height = 2.5, fig.cap="The left panel shows the local importance of individual variables (similar to LIME). Right panel shows global correlation structure visualized by hierarchical clustering. The middle panel shows the local importance of groups of variables (similar to LIME) determined by the hierarchical clustering."}
(new_apartment <- apartments_num[6, -1])
tri_apartments <- predict_triplot(explain_apartments,
                                  new_observation = new_apartment)
plot(tri_apartments) +
patchwork::plot_annotation(title = "Local triplot for four variables in the linear model")
```
We can observe that, for this apartment, `surface` also has a significant, positive influence on the prediction, and adding `number of rooms` increases its contribution. However, adding `construction year` to those two features decreases the group importance.
We can also notice that `floor` has a small influence on the prediction for this observation, unlike in the model-level analysis.
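Instead of reading the whole hierarchy, we can also fix the aspects at a single correlation level with `group_variables()` and pass them to `predict_aspects()`. A sketch reusing `explain_apartments` and `new_apartment` from the chunks above; the cut-off `p = 0.6` is an arbitrary choice:

```{r eval = FALSE}
library("triplot")
# group features that are correlated at a chosen level into aspects
aspects <- group_variables(apartments_num[, -1], p = 0.6)
ai <- predict_aspects(explain_apartments,
                      new_observation = new_apartment,
                      variable_groups = aspects)
print(ai, show_features = TRUE)
```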
## Aspect importance for single instance
For this example we use the `titanic_imputed` dataset from DALEX with a logistic regression model that predicts passenger survival. Features are combined into thematic aspects.
### Importing dataset and building a logistic regression model
```{r}
set.seed(123)
model_titanic_glm <- glm(survived ~ ., titanic_imputed, family = "binomial")
```
### Manual selection of aspects
```{r}
aspects_titanic <- list(
  wealth   = c("class", "fare"),
  family   = c("sibsp", "parch"),
  personal = c("age", "gender"),
  embarked = "embarked"
)
```
### Select an instance
We are interested in explaining the model prediction for the `johny_d` example.
```{r}
(johny_d <- titanic_imputed[2,])
predict(model_titanic_glm, johny_d, type = "response")
```
It turns out that the model prediction for this passenger's survival is very low. Let's see which aspects have the biggest influence on it.
We start with DALEX [explainer](https://modeloriented.github.io/DALEX/reference/explain.html).
```{r}
explain_titanic <- DALEX::explain(model_titanic_glm,
                                  data = titanic_imputed,
                                  y = titanic_imputed$survived,
                                  label = "Logistic Regression",
                                  verbose = FALSE)
```
And use it to call the `triplot::predict_aspects()` function.
Afterwards, we print and plot the results.
```{r aspect-importance, fig.width = 8, fig.height = 2.5}
library("triplot")
ai_titanic <- predict_aspects(x = explain_titanic,
                              new_observation = johny_d[, -8],
                              variable_groups = aspects_titanic)
print(ai_titanic, show_features = TRUE)
plot(ai_titanic)
```
We can observe that the `wealth` aspect (class, fare) has the biggest contribution to the prediction, and this contribution is negative. The `personal` (age, gender) and `family` (sibsp, parch) aspects have a positive, but much smaller, influence on the prediction. The `embarked` feature has a very small, negative contribution to the prediction.
## Learn more
- [triplot package overview](https://modeloriented.github.io/triplot/articles/vignette_aspect_importance.html)
- [usecase with FIFA 20 data set](https://modeloriented.github.io/triplot/articles/vignette_aspect_importance_fifa.html)
- [description of predict aspects method](https://modeloriented.github.io/triplot/articles/vignette_aspect_importance_indepth.html)
## Acknowledgments
Work on this package was financially supported by the NCBR Grant POIR.01.01.01-00-0328/17.