Merge pull request #336 from BCG-Gamma/dev/2.0.dev1

j-ittner authored Apr 22, 2022
2 parents 819f0ad + ede0d22 commit 5e9e8a6

Showing 45 changed files with 3,713 additions and 4,149 deletions.
12 changes: 11 additions & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
      - id: isort

  - repo: https://github.com/psf/black
-   rev: 20.8b1
+   rev: 22.3.0
    hooks:
      - id: black
        language_version: python3
@@ -26,3 +26,13 @@ repos:
      - id: check-added-large-files
      - id: check-json
      - id: check-yaml

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.931
    hooks:
      - id: mypy
        files: src/
        additional_dependencies:
          - numpy>=1.22
          - gamma-pytools>=2.0.dev8,<3a
          - sklearndf>=2.0.dev3,<3a
56 changes: 35 additions & 21 deletions README.rst
@@ -90,14 +90,14 @@ Enhanced Machine Learning Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To demonstrate the model inspection capability of FACET, we first create a
-pipeline to fit a learner. In this simple example we use the
-`diabetes dataset <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt>`__
+pipeline to fit a learner. In this simple example we will use the
+`diabetes dataset <https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data>`__
which contains age, sex, BMI and blood pressure along with 6 blood serum
-measurements as features. A transformed version of this dataset is also available
-on scikit-learn
+measurements as features. This dataset was used in this
+`publication <https://statweb.stanford.edu/~tibs/ftp/lars.pdf>`__.
+A transformed version of this dataset is also available on scikit-learn
`here <https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset>`__.


In this quickstart we will train a Random Forest regressor using 10 repeated
5-fold CV to predict disease progression after one year. With the use of
*sklearndf* we can create a *pandas* DataFrame compatible workflow. However,
@@ -119,8 +119,22 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid

-# load the diabetes dataset
-diabetes_df = pd.read_csv('diabetes_quickstart.csv')
+# declare the URL of the data
+data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'
+
+# import the data from the URL
+diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
+    # rename columns for better readability
+    columns={
+        'S1': 'TC',   # total serum cholesterol
+        'S2': 'LDL',  # low-density lipoproteins
+        'S3': 'HDL',  # high-density lipoproteins
+        'S4': 'TCH',  # total cholesterol / HDL ratio
+        'S5': 'LTG',  # lamotrigine level
+        'S6': 'GLU',  # blood sugar level
+        'Y': 'Disease_progression'  # disease progression one year after baseline
+    }
+)

# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")
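The diff truncates the quickstart at this point. As a rough sketch of the next step described in the text above — a Random Forest regressor evaluated with 10 repeated 5-fold CV on a DataFrame-compatible *sklearndf* pipeline — something along these lines would follow; the class names come from *sklearndf* and scikit-learn, but the exact arguments are illustrative assumptions, not part of this diff::

    from sklearn.model_selection import RepeatedKFold
    from sklearndf.pipeline import RegressorPipelineDF
    from sklearndf.regression import RandomForestRegressorDF

    # DataFrame-compatible pipeline wrapping a Random Forest regressor
    rf_pipeline = RegressorPipelineDF(
        regressor=RandomForestRegressorDF(random_state=42)
    )

    # 10 repeats of 5-fold cross-validation, as described in the quickstart text
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

    # rf_pipeline, cv, and diabetes_sample would then be passed to the
    # LearnerRanker imported above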
@@ -236,10 +250,10 @@ The key global metrics for each pair of features in a model are:

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, looking across the row for `LTG` (Lamotrigine)
-there is relatively minimal synergy (≤1%) with other features in the model.
-However, looking down the column for `LTG` (i.e., perspective of other features
-in a pair with `LTG`) we find many features (the rows) are synergistic (up to 27%)
-with `LTG`. We can conclude that:
+there is hardly any synergy with other features in the model (≤ 1%).
+However, looking down the column for `LTG` (i.e., from the perspective of other features
+relative to `LTG`) we find that many features (the rows) are aided by synergy
+with `LTG` (up to 27% in the case of LDL). We conclude that:

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
@@ -248,7 +262,7 @@

High synergy between pairs of features must be considered carefully when investigating
impact, as the values of both features jointly determine the outcome. It would not make
-much sense to consider `TC` (T-Cells) without the context provided by `LDL` given close
+much sense to consider `LDL` without the context provided by `LTG` given close
to 27% synergy of `LDL` with `LTG` for predicting progression after one year.
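For reference, a synergy matrix like the one discussed above can be produced roughly as follows — a minimal sketch assuming ``inspector`` is a :class:`.LearnerInspector` already fitted on the diabetes sample, and assuming *gamma-pytools*' ``MatrixDrawer`` with its percentage style::

    from pytools.viz.matrix import MatrixDrawer

    # pairwise feature synergy; rows are feature A, columns are feature B
    synergy_matrix = inspector.feature_synergy_matrix()
    MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Feature Synergy")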

**Redundancy**
@@ -267,12 +281,12 @@ For any feature pair (A, B), the first feature (A) is the row, and the second feature
(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the
perspective of `LDL` (Low-Density Lipoproteins), then we look-up the row for `LDL`
and the column for `TC` and find 38% redundancy. This means that 38% of the information
-in `LDL` is duplicated with `TC` to predict disease progression after one year. This
+in `LDL` to predict disease progression is duplicated in `TC`. This
redundancy is the same when looking "from the perspective" of `TC` for (`TC`, `LDL`),
-but need not be symmetrical in all cases (see `LTG` vs. `TSH`).
+but need not be symmetrical in all cases (see `LTG` vs. `TCH`).

-If we look at `TSH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
-the same does not hold between `LTG` and `HDL` – meaning `TSH` shares different
+If we look at `TCH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
+the same does not hold between `LTG` and `HDL` – meaning `TCH` shares different
information with each of the two features.
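To make the row/column look-up concrete, a minimal sketch, again assuming a fitted :class:`.LearnerInspector` named ``inspector``::

    # pairwise feature redundancy; rows are feature A, columns are feature B
    redundancy_matrix = inspector.feature_redundancy_matrix()
    print(redundancy_matrix)

    # with a pandas DataFrame result (as in earlier FACET versions) the (LDL, TC)
    # value could be read as redundancy_df.loc["LDL", "TC"], roughly 0.38 here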


@@ -302,9 +316,9 @@ Let's look at the example for redundancy.
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
-and (`HDL`, `TSH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
+and (`HDL`, `TCH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
have the highest importance. As potential next actions we could explore the impact of
-removing `TSH`, and one of `TC` or `LDL` to further simplify the model and obtain a
+removing `TCH`, and one of `TC` or `LDL` to further simplify the model and obtain a
reduced set of independent features.
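A dendrogram like this one could be drawn roughly as follows — a sketch assuming a fitted :class:`.LearnerInspector` named ``inspector`` and *gamma-pytools*' ``DendrogramDrawer``::

    from pytools.viz.dendrogram import DendrogramDrawer

    # hierarchical clustering of features by redundancy (returns a linkage tree)
    redundancy_linkage = inspector.feature_redundancy_linkage()
    DendrogramDrawer().draw(data=redundancy_linkage, title="Feature Redundancy")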

Please see the
@@ -369,7 +383,7 @@ quantify the uncertainty by using bootstrap confidence intervals.
.. image:: sphinx/source/_static/simulation_output.png

We would conclude from the figure that higher values of `BMI` are associated with
-an increase in disease progression after one year, and that for a `BMI` of 29
+an increase in disease progression after one year, and that for a `BMI` of 28
and above, there is a significant increase in disease progression after one year
of at least 26 points.
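Such a simulation would be set up roughly as follows; this is a sketch only — ``simulator`` and ``partitioner`` are assumed to be a fitted univariate uplift simulator and a fitted range partitioner for `BMI`, and the module path and argument names are assumptions rather than code taken from this diff::

    from facet.simulation.viz import SimulationDrawer

    # simulate disease progression across the partitioned range of BMI values
    simulation = simulator.simulate_feature("BMI", partitioner=partitioner)
    SimulationDrawer().draw(data=simulation, title="Simulation: BMI")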

@@ -447,10 +461,10 @@ or have a look at
.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-Gamma.facet?repoName=BCG-Gamma%2Ffacet&branchName=develop
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7&_a=summary

-.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/develop.svg
+.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/2.0.x
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7&_a=summary

-.. |python_versions| image:: https://img.shields.io/badge/python-3.6|3.7|3.8-blue.svg
+.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg
:target: https://www.python.org/downloads/release/python-380/

.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg
99 changes: 94 additions & 5 deletions RELEASE_NOTES.rst
@@ -1,6 +1,67 @@
Release Notes
=============

FACET 2.0
---------

2.0.0
~~~~~

``facet.data``
^^^^^^^^^^^^^^

- API: class :class:`.RangePartitioner` supports new optional arguments ``lower_bound``
and ``upper_bound`` in method :meth:`~.RangePartitioner.fit` and no longer accepts
them in the class initializer
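A minimal sketch of the new calling convention; only the ``lower_bound``/``upper_bound`` arguments to ``fit`` come from the note above — the concrete subclass, module path, and the values argument are assumptions::

    import numpy as np
    from facet.data.partition import ContinuousRangePartitioner  # module path assumed

    partitioner = ContinuousRangePartitioner()
    # bounds are now passed to fit() rather than to the class initializer
    partitioner.fit(
        np.random.uniform(15.0, 45.0, size=400), lower_bound=18.0, upper_bound=42.0
    )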

``facet.inspection``
^^^^^^^^^^^^^^^^^^^^

- API: :class:`.LearnerInspector` no longer uses learner crossfits and instead inspects
models using a single pass of SHAP calculations, usually leading to performance gains
of up to a factor of 50
- API: return :class:`.LearnerInspector` matrix outputs as :class:`.Matrix` instances
- API: diagonals of feature synergy, redundancy, and association matrices are now
``nan`` instead of 1.0
- API: the leaf order of :class:`.LinkageTree` objects generated by
``feature_…_linkage`` methods of :class:`.LearnerInspector` is now the same as the
row and column order of :class:`.Matrix` objects returned by the corresponding
``feature_…_matrix`` methods of :class:`.LearnerInspector`, minimizing the distance
between adjacent leaves.
The old sorting behaviour of FACET 1 can be restored using method
:meth:`.LinkageTree.sort_by_weight`.
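A sketch of restoring the old leaf order, assuming a fitted inspector; only the method name ``sort_by_weight`` is taken from the note above, the rest is illustrative::

    # linkage tree with FACET 2's new leaf order (matches the matrix row/column order)
    linkage = inspector.feature_redundancy_linkage()

    # restore the FACET 1 ordering; whether this sorts in place or returns a new
    # tree is not specified here, so capturing the return value is an assumption
    linkage_v1 = linkage.sort_by_weight()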

``facet.selection``
^^^^^^^^^^^^^^^^^^^

- API: :class:`.ModelSelector` replaces FACET 1 class ``LearnerRanker``, and now
supports any CV searcher conforming to `scikit-learn`'s CV search API, including
`scikit-learn`'s native searchers such as :class:`.GridSearchCV` or
:class:`.RandomizedSearchCV`
- API: new classes :class:`.ParameterSpace` and :class:`.MultiParameterSpace` offer
a more convenient and robust mechanism for declaring options or distributions for
hyperparameter tuning
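A heavily hedged sketch of what model selection might look like under the new API, reusing names from the quickstart sketch above; the attribute-style parameter assignment and the ``ModelSelector`` arguments are assumptions based on the description above, not signatures taken from this diff::

    from scipy import stats
    from sklearn.model_selection import RandomizedSearchCV
    from facet.selection import ModelSelector, ParameterSpace

    # declare a distribution for one hyperparameter of the pipeline's regressor
    ps = ParameterSpace(rf_pipeline)
    ps.regressor.n_estimators = stats.randint(100, 500)

    # any scikit-learn-compatible CV searcher can drive the search (per the note above)
    selector = ModelSelector(
        searcher_type=RandomizedSearchCV, parameter_space=ps, cv=cv
    ).fit(diabetes_sample)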

``facet.simulation``
^^^^^^^^^^^^^^^^^^^^

- API: simulations no longer depend on learner crossfits and instead are carried out
as a single pass on the full dataset, using the *standard error of mean predictions*
to obtain confidence intervals that are less conservative yet more realistic
- VIZ: minor tweaks to simulation plots and reports generated by
:class:`.SimulationDrawer`

``facet.validation``
^^^^^^^^^^^^^^^^^^^^

- API: remove class ``FullSampleValidator``

Other
^^^^^

- API: class ``LearnerCrossfit`` is no longer used in FACET 2 and has been removed


FACET 1.2
---------

@@ -10,11 +71,26 @@ fit the underlying crossfit.
One example where this can be useful is to use only a recent period of a time series as
the baseline of a simulation.


1.2.2
~~~~~

- catch up with FACET 1.1.2


1.2.1
~~~~~

- FIX: fix a bug in :class:`.UnivariateProbabilitySimulator` that was introduced in
FACET 1.2.0
- catch up with FACET 1.1.1


1.2.0
~~~~~

- BUILD: added support for *sklearndf* 1.2 and *scikit-learn* 0.24
-- API: new optional parameter `subsample` in method
+- API: new optional parameter ``subsample`` in method
:meth:`.BaseUnivariateSimulator.simulate_feature` can be used to specify a subsample
to be used in the simulation (but simulating using a crossfit based on the full
sample)
@@ -26,27 +102,39 @@ FACET 1.1
FACET 1.1 refines and enhances the association/synergy/redundancy calculations provided
by the :class:`.LearnerInspector`.


1.1.2
~~~~~

- DOC: use a downloadable dataset in the `getting started` notebook
- FIX: import :mod:`catboost` if present, else create a local module mockup
- FIX: correctly identify if ``sample_weights`` is undefined when re-fitting a model
on the full dataset in a :class:`.LearnerCrossfit`
- BUILD: relax package dependencies to support any `numpy` version 1.x from 1.16


1.1.1
~~~~~

- DOC: add reference to FACET research paper on the project landing page
- FIX: correctly count positive class frequency in UnivariateProbabilitySimulator


1.1.0
~~~~~

- API: SHAP interaction vectors can (in part) also be influenced by redundancy among
-features. This can inflate quantificatios of synergy, especially in cases where two
+features. This can inflate quantifications of synergy, especially in cases where two
variables are highly redundant. FACET now corrects interaction vectors for redundancy
prior to calculating synergy. Technically we ensure that each interaction vector is
orthogonal w.r.t. the main effect vectors of both associated features (see the sketch
after this list).
- API: FACET now calculates synergy, redundancy, and association separately for each
model in a crossfit, then returns the mean of all resulting matrices. This leads to a
slight increase in accuracy, and also allows us to calculate the standard deviation
across matrices as an indication of confidence for each calculated value.
-- API: Method :meth:`.LernerInspector.shap_plot_data` now returns SHAP values for the
+- API: Method :meth:`.LearnerInspector.shap_plot_data` now returns SHAP values for the
positive class of binary classifiers.
-- API: Increase efficiency of :class:`.LearnerRanker` parallelization by adopting the
+- API: Increase efficiency of :class:`.ModelSelector` parallelization by adopting the
new :class:`pytools.parallelization.JobRunner` API provided by :mod:`pytools`
- BUILD: add support for :mod:`shap` 0.38 and 0.39
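To make the orthogonalization step in the first item above concrete, a simplified numerical sketch — a plain least-squares projection with *numpy*, not FACET's actual implementation::

    import numpy as np

    def orthogonalize(interaction: np.ndarray, main_a: np.ndarray, main_b: np.ndarray) -> np.ndarray:
        # project the interaction vector onto the span of the two main-effect
        # vectors and subtract that projection, leaving a residual orthogonal to both
        basis = np.column_stack([main_a, main_b])
        coefficients, *_ = np.linalg.lstsq(basis, interaction, rcond=None)
        return interaction - basis @ coefficients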

@@ -57,7 +145,8 @@ FACET 1.0
1.0.3
~~~~~

-- FIX: restrict package requirements to *gamma-pytools* 1.0.* and *sklearndf* 1.0.x, since FACET 1.0 is not compatible with *gamma-pytools* 1.1.*
+- FIX: restrict package requirements to *gamma-pytools* 1.0.* and *sklearndf* 1.0.x,
+since FACET 1.0 is not compatible with *gamma-pytools* 1.1.*

1.0.2
~~~~~