Merge pull request #275 from BCG-Gamma/dev/1.1.0
j-ittner authored Mar 29, 2021
2 parents ef1b12c + 19ec93b commit c7d87e7
Showing 24 changed files with 1,374 additions and 23,989 deletions.
88 changes: 43 additions & 45 deletions README.rst
@@ -127,15 +127,16 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
# create a (trivial) pipeline for a random forest regressor
rnd_forest_reg = RegressorPipelineDF(
regressor=RandomForestRegressorDF(random_state=42)
regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)
)
# define grid of models which are "competing" against each other
rnd_forest_grid = [
LearnerGrid(
pipeline=rnd_forest_reg,
learner_parameters={
"min_samples_leaf": [8, 11, 15]
"min_samples_leaf": [8, 11, 15],
"max_depth": [4, 5, 6],
}
),
]
@@ -155,9 +156,10 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
:width: 600

We can see based on this minimal workflow that a value of 11 for minimum
samples in the leaf was the best performing of the three considered values.
This approach easily extends to multiple hyperparameters for the learner
and multiple learners.
samples in the leaf and 5 for maximum tree depth was the best performing
of the three considered values.
This approach easily extends to additional hyperparameters for the learner,
and for multiple learners.
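As a rough analogue for readers without FACET installed, the same idea of models "competing" over a parameter grid can be sketched with plain scikit-learn. The parameter values mirror the `LearnerGrid` above; the synthetic data set, number of trees, and cross-validation settings are illustrative assumptions, not part of the FACET example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# illustrative synthetic data, standing in for the diabetes sample
X, y = make_regression(n_samples=200, n_features=5, random_state=42)

# same hyperparameter grid as in the LearnerGrid above
param_grid = {
    "min_samples_leaf": [8, 11, 15],
    "max_depth": [4, 5, 6],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid=param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

FACET's `LearnerRanker` plays the role of `GridSearchCV` here, while additionally supporting several competing learner pipelines in one ranking.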


Model Inspection
@@ -233,22 +235,20 @@ features in a model are:

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, looking across the row for `LTG` (Lamotrigine)
there is relatively minimal synergy (≤14%) with other features in the model.
there is relatively minimal synergy (≤1%) with other features in the model.
However, looking down the column for `LTG` (i.e., perspective of other features
in a pair with `LTG`) we find many features (the rows) are synergistic (12% to 34%)
in a pair with `LTG`) we find many features (the rows) are synergistic (up to 27%)
with `LTG`. We can conclude that:

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
- The contribution of other features to predicting disease progression after one
year is partly enabled by the presence of `LTG`.

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
- The contribution of other features to predicting disease progression after one
year is partly enabled by the strong contribution from `LTG`.


High synergy features must be considered carefully when investigating impact,
as they work together to predict the outcome. It would not make much sense to
consider `TC` (T-Cells) without `LTG` given the 34% synergy of `TC` with `LTG`
for predicting progression after one year.
High synergy between pairs of features must be considered carefully when investigating
impact, as the values of both features jointly determine the outcome. It would not make
much sense to consider `TC` (T-Cells) without the context provided by `LDL` given close
to 27% synergy of `LDL` with `LTG` for predicting progression after one year.
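The row/column reading convention for these pairwise matrices can be illustrated with a small, made-up example; the numbers below are invented for illustration and are not FACET output:

```python
import pandas as pd

# Invented synergy-style matrix: entry (A, B) is read from A's perspective,
# i.e. row = feature whose contribution we are decomposing, column = partner.
features = ["LTG", "TC", "BMI"]
synergy = pd.DataFrame(
    [
        [0.00, 0.01, 0.01],  # row LTG: minimal synergy with other features
        [0.27, 0.00, 0.05],  # row TC: much of TC's contribution needs LTG
        [0.12, 0.04, 0.00],  # row BMI
    ],
    index=features,
    columns=features,
)

# The matrix is deliberately asymmetric:
print(synergy.loc["LTG", "TC"])  # 0.01 -> LTG is largely autonomous
print(synergy.loc["TC", "LTG"])  # 0.27 -> TC relies on synergy with LTG
```

The asymmetry is the point: pair (A, B) viewed from A's perspective generally differs from pair (B, A) viewed from B's.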

**Redundancy**

@@ -261,20 +261,19 @@ for predicting progression after one year.
.. image:: sphinx/source/_static/redundancy_matrix.png
:width: 600


For any feature pair (A, B), the first feature (A) is the row, and the second feature
(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the
perspective of `LDL` (Low-Density Lipoproteins), then we look-up the row for `LDL`
and the column for `TC` and find 47% redundancy. This means that 47% of the
information in `LDL` is duplicated with `TC` to predict disease progression
after one year. This redundancy is similar when looking "from the perspective"
of `TC` for (`TC`, `LDL`) which is 50%.
and the column for `TC` and find 38% redundancy. This means that 38% of the information
in `LDL` is duplicated with `TC` to predict disease progression after one year. This
redundancy is the same when looking "from the perspective" of `TC` for (`TC`, `LDL`),
but need not be symmetrical in all cases (see `LTG` vs. `TSH`).

If we look at `TSH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
the same does not hold between `LTG` and `HDL` – meaning `TSH` shares different
information with each of the two features.

If we look across the columns for the `LTG` row we can see that apart from the
32% redundancy with `BMI`, `LTG` has minimal redundancy (<9%) with the other
features included in the model. Further, if we look cross the rows for the
`LTG` column we can see a number of the features have moderate redundancy
with `LTG`.

**Clustering redundancy**

@@ -302,10 +301,9 @@ Let's look at the example for redundancy.
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
and (`LTG`, `BMI`: body mass index) each represent a cluster in the
dendrogram and that `LTG` and `BMI` have high the highest importance.
As potential next actions we could remove `TC` and explore the impact of
removing one of `LTG` or `BMI` to further simplify the model and obtain a
and (`HDL`, `TSH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
have the highest importance. As potential next actions we could explore the impact of
removing `TSH`, and one of `TC` or `LDL` to further simplify the model and obtain a
reduced set of independent features.
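The clustering step behind such a dendrogram can be sketched with SciPy's hierarchical clustering, treating one minus redundancy as a distance. The redundancy matrix below is invented for illustration (it is not FACET output), and the 0.7 cut-off is an arbitrary choice for this sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Invented, symmetric redundancy matrix for illustration only.
features = ["LDL", "TC", "HDL", "TSH", "LTG", "BMI"]
redundancy = np.array(
    [
        [1.00, 0.38, 0.05, 0.04, 0.03, 0.02],
        [0.38, 1.00, 0.06, 0.05, 0.04, 0.03],
        [0.05, 0.06, 1.00, 0.32, 0.08, 0.04],
        [0.04, 0.05, 0.32, 1.00, 0.07, 0.05],
        [0.03, 0.04, 0.08, 0.07, 1.00, 0.10],
        [0.02, 0.03, 0.04, 0.05, 0.10, 1.00],
    ]
)

# Highly redundant features are "close"; fully redundant means distance 0.
distance = 1.0 - redundancy
np.fill_diagonal(distance, 0.0)

Z = linkage(squareform(distance), method="average")
clusters = fcluster(Z, t=0.7, criterion="distance")
print(dict(zip(features, clusters)))  # (LDL, TC) and (HDL, TSH) pair up
```

With these numbers, (`LDL`, `TC`) and (`HDL`, `TSH`) each form a cluster while `LTG` and `BMI` remain independent, mirroring the structure discussed above.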

Please see the
@@ -316,22 +314,23 @@ for more detail.
Model Simulation
~~~~~~~~~~~~~~~~

Taking the `BMI` feature as an example, we do the following for the simulation:

- We use FACET's `ContinuousRangePartitioner` to split the range of observed values of
`BMI` into intervals of equal size. Each partition is represented by the central
value of that partition.
- For each partition, the simulator creates an artificial copy of the original sample
assuming the variable to be simulated has the same value across all observations -
which is the value representing the partition. Using the best `LearnerCrossfit`
acquired from the ranker, the simulator now re-predicts all targets using the models
trained for all folds and determines the average uplift of the target variable
resulting from this.
- The FACET `SimulationDrawer` allows us to visualise the result; both in a
matplotlib and a plain-text style.
Taking the `BMI` feature as an example of an important and highly independent feature,
we do the following for the simulation:

- We use FACET's `ContinuousRangePartitioner` to split the range of observed values of
`BMI` into intervals of equal size. Each partition is represented by the central value
of that partition.
- For each partition, the simulator creates an artificial copy of the original sample
assuming the variable to be simulated has the same value across all observations –
which is the value representing the partition. Using the best `LearnerCrossfit`
acquired from the ranker, the simulator now re-predicts all targets using the models
trained for all folds and determines the average uplift of the target variable
resulting from this.
- The FACET `SimulationDrawer` allows us to visualise the result; both in a
*matplotlib* and a plain-text style.
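The steps above can be sketched without FACET in a few lines: partition the observed range of the simulated feature into equal-size bins, hold the feature fixed at each bin's central value across an artificial copy of the sample, re-predict, and compare against the baseline prediction. The synthetic data, model settings, and bin count below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300
X = rng.uniform(18, 40, size=(n, 3))           # column 0 plays the role of "BMI"
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=n)   # outcome driven by column 0

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
baseline = model.predict(X).mean()

# Equal-size partitions of the observed range, each represented by its midpoint
edges = np.linspace(X[:, 0].min(), X[:, 0].max(), num=6)
midpoints = (edges[:-1] + edges[1:]) / 2

uplift = {}
for value in midpoints:
    X_sim = X.copy()
    X_sim[:, 0] = value  # simulate: same value across all observations
    uplift[round(value, 1)] = model.predict(X_sim).mean() - baseline

print(uplift)  # average uplift of the target per simulated partition
```

FACET's simulator additionally repeats this across all crossfit models, which is what yields the confidence intervals discussed below.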

Finally, because FACET can use bootstrap cross validation, we can create a crossfit
from our previous `LearnerRanker` best model to perform the simulation so we can
from our previous `LearnerRanker` best model to perform the simulation, so we can
quantify the uncertainty by using bootstrap confidence intervals.

.. code-block:: Python
@@ -373,7 +372,6 @@ an increase in disease progression after one year, and that for a `BMI` of 29
and above, there is a significant increase in disease progression after one year
of at least 26 points.


Contributing
------------

9 changes: 6 additions & 3 deletions RELEASE_NOTES.rst
@@ -11,13 +11,16 @@ by the :class:`.LearnerInspector`.
~~~~~

- API: SHAP interaction vectors can (in part) also be influenced by redundancy among
features. FACET now corrects interaction vectors for redundancy prior to calculating
synergy. Technically we ensure that the interaction vector is orthogonal w.r.t the
main effect vectors of both associated features.
features. This can inflate quantifications of synergy, especially in cases where two
variables are highly redundant. FACET now corrects interaction vectors for redundancy
prior to calculating synergy. Technically we ensure that each interaction vector is
orthogonal w.r.t. the main effect vectors of both associated features.
- API: FACET now calculates synergy, redundancy, and association separately for each
model in a crossfit, then returns the mean of all resulting matrices. This leads to a
slight increase in accuracy, and also allows us to calculate the standard deviation
across matrices as an indication of confidence for each calculated value.
- API: Method :meth:`.LearnerInspector.shap_plot_data` now returns SHAP values for the
positive class of binary classifiers.
- API: Increase efficiency of :class:`.LearnerRanker` parallelization by adopting the
new :class:`pytools.parallelization.JobRunner` API provided by :mod:`pytools`.
- BUILD: add support for :mod:`shap` 0.38 and 0.39
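The orthogonalization described in the first note above can be sketched as a generic linear-algebra projection; this is a minimal illustration of the idea, not FACET's actual implementation:

```python
import numpy as np

def orthogonalize(interaction: np.ndarray, main_a: np.ndarray, main_b: np.ndarray) -> np.ndarray:
    """Remove from `interaction` its projections onto the two main effect vectors."""
    # QR decomposition yields an orthonormal basis of span(main_a, main_b)
    basis, _ = np.linalg.qr(np.column_stack([main_a, main_b]))
    # subtract the component of `interaction` lying in that span
    return interaction - basis @ (basis.T @ interaction)

rng = np.random.default_rng(0)
a, b, ab = rng.normal(size=(3, 100))  # stand-ins for main effect / interaction vectors
corrected = orthogonalize(ab, a, b)

# the corrected interaction vector is orthogonal to both main effect vectors
print(np.dot(corrected, a), np.dot(corrected, b))
```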
Expand Down
