Merge pull request #336 from BCG-Gamma/dev/2.0.dev1

j-ittner authored Apr 22, 2022
2 parents 819f0ad + ede0d22 commit 5e9e8a6

Showing 45 changed files with 3,713 additions and 4,149 deletions.
12 changes: 11 additions & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ repos:
      - id: isort

  - repo: https://github.com/psf/black
-   rev: 20.8b1
+   rev: 22.3.0
    hooks:
      - id: black
        language_version: python3
@@ -26,3 +26,13 @@ repos:
      - id: check-added-large-files
      - id: check-json
      - id: check-yaml

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.931
    hooks:
      - id: mypy
        files: src/
        additional_dependencies:
          - numpy>=1.22
          - gamma-pytools>=2.0.dev8,<3a
          - sklearndf>=2.0.dev3,<3a
56 changes: 35 additions & 21 deletions README.rst
@@ -90,14 +90,14 @@ Enhanced Machine Learning Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To demonstrate the model inspection capability of FACET, we first create a
-pipeline to fit a learner. In this simple example we use the
-`diabetes dataset <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt>`__
+pipeline to fit a learner. In this simple example we will use the
+`diabetes dataset <https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data>`__
which contains age, sex, BMI and blood pressure along with 6 blood serum
-measurements as features. A transformed version of this dataset is also available
-on scikit-learn
+measurements as features. This dataset was used in this
+`publication <https://statweb.stanford.edu/~tibs/ftp/lars.pdf>`__.
+A transformed version of this dataset is also available on scikit-learn
`here <https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset>`__.


In this quickstart we will train a Random Forest regressor using 10 repeated
5-fold CV to predict disease progression after one year. With the use of
*sklearndf* we can create a *pandas* DataFrame compatible workflow. However,
@@ -119,8 +119,22 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid

-# load the diabetes dataset
-diabetes_df = pd.read_csv('diabetes_quickstart.csv')
+# declare the URL of the data
+data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'
+
+# import the data from the URL
+diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
+    # rename columns for better readability
+    columns={
+        'S1': 'TC',   # total serum cholesterol
+        'S2': 'LDL',  # low-density lipoproteins
+        'S3': 'HDL',  # high-density lipoproteins
+        'S4': 'TCH',  # total cholesterol / HDL ratio
+        'S5': 'LTG',  # lamotrigine level
+        'S6': 'GLU',  # blood sugar level
+        'Y': 'Disease_progression'  # disease progression one year after baseline
+    }
+)

# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")
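The diff truncates the quickstart at this point. As a rough sketch of the next step described in the text above — a Random Forest regressor evaluated with 10 repeated 5-fold CV on a DataFrame-compatible *sklearndf* pipeline — something along these lines would follow; the class names come from *sklearndf* and scikit-learn, but the exact arguments are illustrative assumptions, not part of this diff::

    from sklearn.model_selection import RepeatedKFold
    from sklearndf.pipeline import RegressorPipelineDF
    from sklearndf.regression import RandomForestRegressorDF

    # DataFrame-compatible pipeline wrapping a Random Forest regressor
    rf_pipeline = RegressorPipelineDF(
        regressor=RandomForestRegressorDF(random_state=42)
    )

    # 10 repeats of 5-fold cross-validation, as described in the quickstart text
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

    # rf_pipeline, cv, and diabetes_sample would then be passed to the
    # LearnerRanker imported above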
@@ -236,10 +250,10 @@ The key global metrics for each pair of features in a model are:

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, looking across the row for `LTG` (Lamotrigine)
-there is relatively minimal synergy (≤1%) with other features in the model.
-However, looking down the column for `LTG` (i.e., perspective of other features
-in a pair with `LTG`) we find many features (the rows) are synergistic (up to 27%)
-with `LTG`. We can conclude that:
+there is hardly any synergy with other features in the model (≤ 1%).
+However, looking down the column for `LTG` (i.e., from the perspective of other features
+relative to `LTG`) we find that many features (the rows) are aided by synergy
+with `LTG` (up to 27% in the case of LDL). We conclude that:

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
@@ -248,7 +262,7 @@

High synergy between pairs of features must be considered carefully when investigating
impact, as the values of both features jointly determine the outcome. It would not make
-much sense to consider `TC` (T-Cells) without the context provided by `LDL` given close
+much sense to consider `LDL` without the context provided by `LTG` given close
to 27% synergy of `LDL` with `LTG` for predicting progression after one year.
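For reference, a synergy matrix like the one discussed above can be produced roughly as follows — a minimal sketch assuming ``inspector`` is a :class:`.LearnerInspector` already fitted on the diabetes sample, and assuming *gamma-pytools*' ``MatrixDrawer`` with its percentage style::

    from pytools.viz.matrix import MatrixDrawer

    # pairwise feature synergy; rows are feature A, columns are feature B
    synergy_matrix = inspector.feature_synergy_matrix()
    MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Feature Synergy")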

**Redundancy**
@@ -267,12 +281,12 @@ For any feature pair (A, B), the first feature (A) is the row, and the second feature
(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the
perspective of `LDL` (Low-Density Lipoproteins), then we look-up the row for `LDL`
and the column for `TC` and find 38% redundancy. This means that 38% of the information
-in `LDL` is duplicated with `TC` to predict disease progression after one year. This
+in `LDL` to predict disease progression is duplicated in `TC`. This
redundancy is the same when looking "from the perspective" of `TC` for (`TC`, `LDL`),
-but need not be symmetrical in all cases (see `LTG` vs. `TSH`).
+but need not be symmetrical in all cases (see `LTG` vs. `TCH`).

-If we look at `TSH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
-the same does not hold between `LTG` and `HDL` – meaning `TSH` shares different
+If we look at `TCH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
+the same does not hold between `LTG` and `HDL` – meaning `TCH` shares different
information with each of the two features.
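To make the row/column look-up concrete, a minimal sketch, again assuming a fitted :class:`.LearnerInspector` named ``inspector``::

    # pairwise feature redundancy; rows are feature A, columns are feature B
    redundancy_matrix = inspector.feature_redundancy_matrix()
    print(redundancy_matrix)

    # with a pandas DataFrame result (as in earlier FACET versions) the (LDL, TC)
    # value could be read as redundancy_df.loc["LDL", "TC"], roughly 0.38 here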


@@ -302,9 +316,9 @@ Let's look at the example for redundancy.
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
-and (`HDL`, `TSH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
+and (`HDL`, `TCH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
have the highest importance. As potential next actions we could explore the impact of
-removing `TSH`, and one of `TC` or `LDL` to further simplify the model and obtain a
+removing `TCH`, and one of `TC` or `LDL` to further simplify the model and obtain a
reduced set of independent features.
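A dendrogram like this one could be drawn roughly as follows — a sketch assuming a fitted :class:`.LearnerInspector` named ``inspector`` and *gamma-pytools*' ``DendrogramDrawer``::

    from pytools.viz.dendrogram import DendrogramDrawer

    # hierarchical clustering of features by redundancy (returns a linkage tree)
    redundancy_linkage = inspector.feature_redundancy_linkage()
    DendrogramDrawer().draw(data=redundancy_linkage, title="Feature Redundancy")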

Please see the
@@ -369,7 +383,7 @@ quantify the uncertainty by using bootstrap confidence intervals.
.. image:: sphinx/source/_static/simulation_output.png

We would conclude from the figure that higher values of `BMI` are associated with
-an increase in disease progression after one year, and that for a `BMI` of 29
+an increase in disease progression after one year, and that for a `BMI` of 28
and above, there is a significant increase in disease progression after one year
of at least 26 points.
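Such a simulation would be set up roughly as follows; this is a sketch only — ``simulator`` and ``partitioner`` are assumed to be a fitted univariate uplift simulator and a fitted range partitioner for `BMI`, and the module path and argument names are assumptions rather than code taken from this diff::

    from facet.simulation.viz import SimulationDrawer

    # simulate disease progression across the partitioned range of BMI values
    simulation = simulator.simulate_feature("BMI", partitioner=partitioner)
    SimulationDrawer().draw(data=simulation, title="Simulation: BMI")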

@@ -447,10 +461,10 @@ or have a look at
.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-Gamma.facet?repoName=BCG-Gamma%2Ffacet&branchName=develop
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7&_a=summary

-.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/develop.svg
+.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/2.0.x
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7&_a=summary

-.. |python_versions| image:: https://img.shields.io/badge/python-3.6|3.7|3.8-blue.svg
+.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg
:target: https://www.python.org/downloads/release/python-380/

.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg
99 changes: 94 additions & 5 deletions RELEASE_NOTES.rst
@@ -1,6 +1,67 @@
Release Notes
=============

FACET 2.0
---------

2.0.0
~~~~~

``facet.data``
^^^^^^^^^^^^^^

- API: class :class:`.RangePartitioner` supports new optional arguments ``lower_bound``
and ``upper_bound`` in method :meth:`~.RangePartitioner.fit` and no longer accepts
them in the class initializer
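A minimal sketch of the new calling convention; only the ``lower_bound``/``upper_bound`` arguments to ``fit`` come from the note above — the concrete subclass, module path, and the values argument are assumptions::

    import numpy as np
    from facet.data.partition import ContinuousRangePartitioner  # module path assumed

    partitioner = ContinuousRangePartitioner()
    # bounds are now passed to fit() rather than to the class initializer
    partitioner.fit(
        np.random.uniform(15.0, 45.0, size=400), lower_bound=18.0, upper_bound=42.0
    )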

``facet.inspection``
^^^^^^^^^^^^^^^^^^^^

- API: :class:`.LearnerInspector` no longer uses learner crossfits and instead inspects
models using a single pass of SHAP calculations, usually leading to performance gains
of up to a factor of 50
- API: return :class:`.LearnerInspector` matrix outputs as :class:`.Matrix` instances
- API: diagonals of feature synergy, redundancy, and association matrices are now
``nan`` instead of 1.0
- API: the leaf order of :class:`.LinkageTree` objects generated by
``feature_…_linkage`` methods of :class:`.LearnerInspector` is now the same as the
row and column order of :class:`.Matrix` objects returned by the corresponding
``feature_…_matrix`` methods of :class:`.LearnerInspector`, minimizing the distance
between adjacent leaves.
The old sorting behaviour of FACET 1 can be restored using method
:meth:`.LinkageTree.sort_by_weight`.
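A sketch of restoring the old leaf order, assuming a fitted inspector; only the method name ``sort_by_weight`` is taken from the note above, the rest is illustrative::

    # linkage tree with FACET 2's new leaf order (matches the matrix row/column order)
    linkage = inspector.feature_redundancy_linkage()

    # restore the FACET 1 ordering; whether this sorts in place or returns a new
    # tree is not specified here, so capturing the return value is an assumption
    linkage_v1 = linkage.sort_by_weight()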

``facet.selection``
^^^^^^^^^^^^^^^^^^^

- API: :class:`.ModelSelector` replaces FACET 1 class ``LearnerRanker``, and now
supports any CV searcher conforming to `scikit-learn`'s CV search API, including
`scikit-learn`'s native searchers such as :class:`.GridSearchCV` or
:class:`.RandomizedSearchCV`
- API: new classes :class:`.ParameterSpace` and :class:`.MultiParameterSpace` offer
a more convenient and robust mechanism for declaring options or distributions for
hyperparameter tuning
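A heavily hedged sketch of what model selection might look like under the new API, reusing names from the quickstart sketch above; the attribute-style parameter assignment and the ``ModelSelector`` arguments are assumptions based on the description above, not signatures taken from this diff::

    from scipy import stats
    from sklearn.model_selection import RandomizedSearchCV
    from facet.selection import ModelSelector, ParameterSpace

    # declare a distribution for one hyperparameter of the pipeline's regressor
    ps = ParameterSpace(rf_pipeline)
    ps.regressor.n_estimators = stats.randint(100, 500)

    # any scikit-learn-compatible CV searcher can drive the search (per the note above)
    selector = ModelSelector(
        searcher_type=RandomizedSearchCV, parameter_space=ps, cv=cv
    ).fit(diabetes_sample)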

``facet.simulation``
^^^^^^^^^^^^^^^^^^^^

- API: simulations no longer depend on learner crossfits and instead are carried out
as a single pass on the full dataset, using the *standard error of mean predictions*
to obtain confidence intervals that are less conservative yet more realistic
- VIZ: minor tweaks to simulation plots and reports generated by
:class:`.SimulationDrawer`

``facet.validation``
^^^^^^^^^^^^^^^^^^^^

- API: remove class ``FullSampleValidator``

Other
^^^^^

- API: class ``LearnerCrossfit`` is no longer used in FACET 2 and has been removed


FACET 1.2
---------

@@ -10,11 +71,26 @@ fit the underlying crossfit.
One example where this can be useful is to use only a recent period of a time series as
the baseline of a simulation.


1.2.2
~~~~~

- catch up with FACET 1.1.2


1.2.1
~~~~~

- FIX: fix a bug in :class:`.UnivariateProbabilitySimulator` that was introduced in
FACET 1.2.0
- catch up with FACET 1.1.1


1.2.0
~~~~~

- BUILD: added support for *sklearndf* 1.2 and *scikit-learn* 0.24
-- API: new optional parameter `subsample` in method
+- API: new optional parameter ``subsample`` in method
:meth:`.BaseUnivariateSimulator.simulate_feature` can be used to specify a subsample
to be used in the simulation (but simulating using a crossfit based on the full
sample)
@@ -26,27 +102,39 @@ FACET 1.1
FACET 1.1 refines and enhances the association/synergy/redundancy calculations provided
by the :class:`.LearnerInspector`.


1.1.2
~~~~~

- DOC: use a downloadable dataset in the `getting started` notebook
- FIX: import :mod:`catboost` if present, else create a local module mockup
- FIX: correctly identify if ``sample_weights`` is undefined when re-fitting a model
on the full dataset in a :class:`.LearnerCrossfit`
- BUILD: relax package dependencies to support any `numpy` version 1.x from 1.16


1.1.1
~~~~~

- DOC: add reference to FACET research paper on the project landing page
- FIX: correctly count positive class frequency in UnivariateProbabilitySimulator


1.1.0
~~~~~

- API: SHAP interaction vectors can (in part) also be influenced by redundancy among
-features. This can inflate quantificatios of synergy, especially in cases where two
+features. This can inflate quantifications of synergy, especially in cases where two
variables are highly redundant. FACET now corrects interaction vectors for redundancy
prior to calculating synergy. Technically we ensure that each interaction vector is
orthogonal w.r.t. the main effect vectors of both associated features (see the sketch
after this list).
- API: FACET now calculates synergy, redundancy, and association separately for each
model in a crossfit, then returns the mean of all resulting matrices. This leads to a
slight increase in accuracy, and also allows us to calculate the standard deviation
across matrices as an indication of confidence for each calculated value.
-- API: Method :meth:`.LernerInspector.shap_plot_data` now returns SHAP values for the
+- API: Method :meth:`.LearnerInspector.shap_plot_data` now returns SHAP values for the
positive class of binary classifiers.
-- API: Increase efficiency of :class:`.LearnerRanker` parallelization by adopting the
+- API: Increase efficiency of :class:`.ModelSelector` parallelization by adopting the
new :class:`pytools.parallelization.JobRunner` API provided by :mod:`pytools`
- BUILD: add support for :mod:`shap` 0.38 and 0.39
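To make the orthogonalization step in the first item above concrete, a simplified numerical sketch — a plain least-squares projection with *numpy*, not FACET's actual implementation::

    import numpy as np

    def orthogonalize(interaction: np.ndarray, main_a: np.ndarray, main_b: np.ndarray) -> np.ndarray:
        # project the interaction vector onto the span of the two main-effect
        # vectors and subtract that projection, leaving a residual orthogonal to both
        basis = np.column_stack([main_a, main_b])
        coefficients, *_ = np.linalg.lstsq(basis, interaction, rcond=None)
        return interaction - basis @ coefficients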

@@ -57,7 +145,8 @@ FACET 1.0
1.0.3
~~~~~

-- FIX: restrict package requirements to *gamma-pytools* 1.0.* and *sklearndf* 1.0.x, since FACET 1.0 is not compatible with *gamma-pytools* 1.1.*
+- FIX: restrict package requirements to *gamma-pytools* 1.0.* and *sklearndf* 1.0.x,
+since FACET 1.0 is not compatible with *gamma-pytools* 1.1.*

1.0.2
~~~~~