Merge pull request #275 from BCG-Gamma/dev/1.1.0
j-ittner authored Mar 29, 2021
2 parents ef1b12c + 19ec93b commit c7d87e7
Showing 24 changed files with 1,374 additions and 23,989 deletions.
88 changes: 43 additions & 45 deletions README.rst
@@ -127,15 +127,16 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
# create a (trivial) pipeline for a random forest regressor
rnd_forest_reg = RegressorPipelineDF(
regressor=RandomForestRegressorDF(random_state=42)
regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)
)
# define grid of models which are "competing" against each other
rnd_forest_grid = [
LearnerGrid(
pipeline=rnd_forest_reg,
learner_parameters={
"min_samples_leaf": [8, 11, 15]
"min_samples_leaf": [8, 11, 15],
"max_depth": [4, 5, 6],
}
),
]
@@ -155,9 +156,10 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
:width: 600

We can see based on this minimal workflow that a value of 11 for minimum
samples in the leaf was the best performing of the three considered values.
This approach easily extends to multiple hyperparameters for the learner
and multiple learners.
samples in the leaf and 5 for maximum tree depth was the best performing
of the three considered values.
This approach easily extends to additional hyperparameters for the learner,
and for multiple learners.
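As a rough analogue for readers without FACET installed, the same idea of models "competing" over a parameter grid can be sketched with plain scikit-learn. The parameter values mirror the `LearnerGrid` above; the synthetic data set, number of trees, and cross-validation settings are illustrative assumptions, not part of the FACET example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# illustrative synthetic data, standing in for the diabetes sample
X, y = make_regression(n_samples=200, n_features=5, random_state=42)

# same hyperparameter grid as in the LearnerGrid above
param_grid = {
    "min_samples_leaf": [8, 11, 15],
    "max_depth": [4, 5, 6],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid=param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

FACET's `LearnerRanker` plays the role of `GridSearchCV` here, while additionally supporting several competing learner pipelines in one ranking.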


Model Inspection
@@ -233,22 +235,20 @@ features in a model are:

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, looking across the row for `LTG` (Lamotrigine)
there is relatively minimal synergy (≤14%) with other features in the model.
there is relatively minimal synergy (≤1%) with other features in the model.
However, looking down the column for `LTG` (i.e., perspective of other features
in a pair with `LTG`) we find many features (the rows) are synergistic (12% to 34%)
in a pair with `LTG`) we find many features (the rows) are synergistic (up to 27%)
with `LTG`. We can conclude that:

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
- The contribution of other features to predicting disease progression after one
year is partly enabled by the presence of `LTG`.

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
- The contribution of other features to predicting disease progression after one
year is partly enabled by the strong contribution from `LTG`.


High synergy features must be considered carefully when investigating impact,
as they work together to predict the outcome. It would not make much sense to
consider `TC` (T-Cells) without `LTG` given the 34% synergy of `TC` with `LTG`
for predicting progression after one year.
High synergy between pairs of features must be considered carefully when investigating
impact, as the values of both features jointly determine the outcome. It would not make
much sense to consider `TC` (T-Cells) without the context provided by `LDL` given close
to 27% synergy of `LDL` with `LTG` for predicting progression after one year.
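The row/column reading convention for these pairwise matrices can be illustrated with a small, made-up example; the numbers below are invented for illustration and are not FACET output:

```python
import pandas as pd

# Invented synergy-style matrix: entry (A, B) is read from A's perspective,
# i.e. row = feature whose contribution we are decomposing, column = partner.
features = ["LTG", "TC", "BMI"]
synergy = pd.DataFrame(
    [
        [0.00, 0.01, 0.01],  # row LTG: minimal synergy with other features
        [0.27, 0.00, 0.05],  # row TC: much of TC's contribution needs LTG
        [0.12, 0.04, 0.00],  # row BMI
    ],
    index=features,
    columns=features,
)

# The matrix is deliberately asymmetric:
print(synergy.loc["LTG", "TC"])  # 0.01 -> LTG is largely autonomous
print(synergy.loc["TC", "LTG"])  # 0.27 -> TC relies on synergy with LTG
```

The asymmetry is the point: pair (A, B) viewed from A's perspective generally differs from pair (B, A) viewed from B's.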

**Redundancy**

@@ -261,20 +261,19 @@ for predicting progression after one year.
.. image:: sphinx/source/_static/redundancy_matrix.png
:width: 600


For any feature pair (A, B), the first feature (A) is the row, and the second feature
(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the
perspective of `LDL` (Low-Density Lipoproteins), then we look-up the row for `LDL`
and the column for `TC` and find 47% redundancy. This means that 47% of the
information in `LDL` is duplicated with `TC` to predict disease progression
after one year. This redundancy is similar when looking "from the perspective"
of `TC` for (`TC`, `LDL`) which is 50%.
and the column for `TC` and find 38% redundancy. This means that 38% of the information
in `LDL` is duplicated with `TC` to predict disease progression after one year. This
redundancy is the same when looking "from the perspective" of `TC` for (`TC`, `LDL`),
but need not be symmetrical in all cases (see `LTG` vs. `TSH`).

If we look at `TSH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
the same does not hold between `LTG` and `HDL` – meaning `TSH` shares different
information with each of the two features.

If we look across the columns for the `LTG` row we can see that apart from the
32% redundancy with `BMI`, `LTG` has minimal redundancy (<9%) with the other
features included in the model. Further, if we look cross the rows for the
`LTG` column we can see a number of the features have moderate redundancy
with `LTG`.

**Clustering redundancy**

@@ -302,10 +301,9 @@ Let's look at the example for redundancy.
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
and (`LTG`, `BMI`: body mass index) each represent a cluster in the
dendrogram and that `LTG` and `BMI` have high the highest importance.
As potential next actions we could remove `TC` and explore the impact of
removing one of `LTG` or `BMI` to further simplify the model and obtain a
and (`HDL`, `TSH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
have the highest importance. As potential next actions we could explore the impact of
removing `TSH`, and one of `TC` or `LDL` to further simplify the model and obtain a
reduced set of independent features.
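The clustering step behind such a dendrogram can be sketched with SciPy's hierarchical clustering, treating one minus redundancy as a distance. The redundancy matrix below is invented for illustration (it is not FACET output), and the 0.7 cut-off is an arbitrary choice for this sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Invented, symmetric redundancy matrix for illustration only.
features = ["LDL", "TC", "HDL", "TSH", "LTG", "BMI"]
redundancy = np.array(
    [
        [1.00, 0.38, 0.05, 0.04, 0.03, 0.02],
        [0.38, 1.00, 0.06, 0.05, 0.04, 0.03],
        [0.05, 0.06, 1.00, 0.32, 0.08, 0.04],
        [0.04, 0.05, 0.32, 1.00, 0.07, 0.05],
        [0.03, 0.04, 0.08, 0.07, 1.00, 0.10],
        [0.02, 0.03, 0.04, 0.05, 0.10, 1.00],
    ]
)

# Highly redundant features are "close"; fully redundant means distance 0.
distance = 1.0 - redundancy
np.fill_diagonal(distance, 0.0)

Z = linkage(squareform(distance), method="average")
clusters = fcluster(Z, t=0.7, criterion="distance")
print(dict(zip(features, clusters)))  # (LDL, TC) and (HDL, TSH) pair up
```

With these numbers, (`LDL`, `TC`) and (`HDL`, `TSH`) each form a cluster while `LTG` and `BMI` remain independent, mirroring the structure discussed above.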

Please see the
@@ -316,22 +314,23 @@ for more detail.
Model Simulation
~~~~~~~~~~~~~~~~

Taking the `BMI` feature as an example, we do the following for the simulation:

- We use FACET's `ContinuousRangePartitioner` to split the range of observed values of
`BMI` into intervals of equal size. Each partition is represented by the central
value of that partition.
- For each partition, the simulator creates an artificial copy of the original sample
assuming the variable to be simulated has the same value across all observations -
which is the value representing the partition. Using the best `LearnerCrossfit`
acquired from the ranker, the simulator now re-predicts all targets using the models
trained for all folds and determines the average uplift of the target variable
resulting from this.
- The FACET `SimulationDrawer` allows us to visualise the result; both in a
matplotlib and a plain-text style.
Taking the `BMI` feature as an example of an important and highly independent feature,
we do the following for the simulation:

- We use FACET's `ContinuousRangePartitioner` to split the range of observed values of
`BMI` into intervals of equal size. Each partition is represented by the central value
of that partition.
- For each partition, the simulator creates an artificial copy of the original sample
assuming the variable to be simulated has the same value across all observations –
which is the value representing the partition. Using the best `LearnerCrossfit`
acquired from the ranker, the simulator now re-predicts all targets using the models
trained for all folds and determines the average uplift of the target variable
resulting from this.
- The FACET `SimulationDrawer` allows us to visualise the result; both in a
*matplotlib* and a plain-text style.
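The steps above can be sketched without FACET in a few lines: partition the observed range of the simulated feature into equal-size bins, hold the feature fixed at each bin's central value across an artificial copy of the sample, re-predict, and compare against the baseline prediction. The synthetic data, model settings, and bin count below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300
X = rng.uniform(18, 40, size=(n, 3))           # column 0 plays the role of "BMI"
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=n)   # outcome driven by column 0

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
baseline = model.predict(X).mean()

# Equal-size partitions of the observed range, each represented by its midpoint
edges = np.linspace(X[:, 0].min(), X[:, 0].max(), num=6)
midpoints = (edges[:-1] + edges[1:]) / 2

uplift = {}
for value in midpoints:
    X_sim = X.copy()
    X_sim[:, 0] = value  # simulate: same value across all observations
    uplift[round(value, 1)] = model.predict(X_sim).mean() - baseline

print(uplift)  # average uplift of the target per simulated partition
```

FACET's simulator additionally repeats this across all crossfit models, which is what yields the confidence intervals discussed below.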

Finally, because FACET can use bootstrap cross validation, we can create a crossfit
from our previous `LearnerRanker` best model to perform the simulation so we can
from our previous `LearnerRanker` best model to perform the simulation, so we can
quantify the uncertainty by using bootstrap confidence intervals.

.. code-block:: Python
@@ -373,7 +372,6 @@ an increase in disease progression after one year, and that for a `BMI` of 29
and above, there is a significant increase in disease progression after one year
of at least 26 points.


Contributing
------------

9 changes: 6 additions & 3 deletions RELEASE_NOTES.rst
@@ -11,13 +11,16 @@ by the :class:`.LearnerInspector`.
~~~~~

- API: SHAP interaction vectors can (in part) also be influenced by redundancy among
features. FACET now corrects interaction vectors for redundancy prior to calculating
synergy. Technically we ensure that the interaction vector is orthogonal w.r.t the
main effect vectors of both associated features.
features. This can inflate quantifications of synergy, especially in cases where two
variables are highly redundant. FACET now corrects interaction vectors for redundancy
prior to calculating synergy. Technically we ensure that each interaction vector is
orthogonal w.r.t. the main effect vectors of both associated features.
- API: FACET now calculates synergy, redundancy, and association separately for each
model in a crossfit, then returns the mean of all resulting matrices. This leads to a
slight increase in accuracy, and also allows us to calculate the standard deviation
across matrices as an indication of confidence for each calculated value.
- API: Method :meth:`.LearnerInspector.shap_plot_data` now returns SHAP values for the
positive class of binary classifiers.
- API: Increase efficiency of :class:`.LearnerRanker` parallelization by adopting the
new :class:`pytools.parallelization.JobRunner` API provided by :mod:`pytools`.
- BUILD: add support for :mod:`shap` 0.38 and 0.39
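The orthogonalization described in the first note above can be sketched as a generic linear-algebra projection; this is a minimal illustration of the idea, not FACET's actual implementation:

```python
import numpy as np

def orthogonalize(interaction: np.ndarray, main_a: np.ndarray, main_b: np.ndarray) -> np.ndarray:
    """Remove from `interaction` its projections onto the two main effect vectors."""
    # QR decomposition yields an orthonormal basis of span(main_a, main_b)
    basis, _ = np.linalg.qr(np.column_stack([main_a, main_b]))
    # subtract the component of `interaction` lying in that span
    return interaction - basis @ (basis.T @ interaction)

rng = np.random.default_rng(0)
a, b, ab = rng.normal(size=(3, 100))  # stand-ins for main effect / interaction vectors
corrected = orthogonalize(ab, a, b)

# the corrected interaction vector is orthogonal to both main effect vectors
print(np.dot(corrected, a), np.dot(corrected, b))
```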
Expand Down
