How to use tfdf.builder.CARTBuilder to build/train a decision tree by hand #184
Comments
I believe that there could be a fundamental misunderstanding. CART trees are not generated the same way neural networks are; in fact, the two algorithms are fundamentally different. For neural networks, the (very simplified) process is: initialize the weights, then repeatedly adjust them by gradient descent on the training data.
For CART trees, the process is a single greedy pass: starting at the root, the best split is chosen from the training examples that reach each node, recursively, until a stopping criterion is met (see Wikipedia for many more details).
This means that your assumption does not hold: a tree created through the CARTBuilder is static and is not trained or updated afterwards. If you want to fit a model to a dataset, just ignore the CARTBuilder, create a model and fit it directly as explained in the beginner tutorial.
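For reference, a minimal sketch of that direct-fit workflow (the dataframe loading is just an illustrative placeholder):
import pandas as pd
import tensorflow_decision_forests as tfdf
# Load a dataframe and convert it to a TensorFlow dataset (the CSV path is a placeholder).
train_df = pd.read_csv("penguins.csv")
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="species")
# Fit a CART model directly; the tree structure and leaf statistics are learned from the data.
model = tfdf.keras.CartModel()
model.fit(train_ds)
print(model.make_inspector().extract_tree(tree_idx=0))
The CARTBuilder is only needed when you want to specify the tree structure yourself. |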
Hi @rstz, thank you very much for the explanation. I should have put it this way: I am aligned with you that the builder is perfect for building a tree structure by hand. To this end, though, the leaf values (probability and number of examples) have to be provided when constructing the tree, and it is not clear to me why they are needed or how to set them. Am I missing something? |
Thank you for the clarification! Probability and number of examples are used by the decision tree to output confidence estimates along with the predictions, as well as by e.g. tree pruning algorithms. This can be useful for certain applications, but does not impact "raw" model inference. If the tree is fit using the CART algorithm, those are set automatically based on the statistics of the training dataset. If you're building the tree manually and have no way of computing these values, you may just set them to arbitrary values (e.g. always probability 1 for the class you want, and number_of_samples=10).
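For illustration, a minimal sketch of such a hand-set leaf (the values are arbitrary placeholders, following the suggestion above):
# A hand-built leaf: probability is the per-class distribution returned at this leaf;
# num_examples only feeds confidence estimates and pruning statistics.
leaf = tfdf.py_tree.node.LeafNode(
    value=tfdf.py_tree.value.ProbabilityValue(
        probability=[1.0, 0.0, 0.0],  # always predict the first class
        num_examples=10))
|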
Hi @rstz, thanks for the reply, I want to seek more clarity on the APIs. Given a dataset, mostly we do not know the probability and number of examples for each leaf in advance; is there an API to compute them and fill them in from the data? |
Unfortunately, TF-DF does not offer a specific API for this :( You're probably able to bootstrap this with a bit of code: say you have the hand-built model, you could run the training data through it, record which leaf each example ends up in, and recompute each leaf's probability and number of examples from those assignments.
If you're able to get it to work, we'd be very happy to include this in our examples - just submit a PR! |
I think what you proposed makes sense for real business use cases; I am working on this. On top of that, these past few days I have been thinking about the best practice for this workflow, e.g. starting from a tree learned by a model:
sample_tree = inspector.extract_tree(tree_idx=0)
# Create the model builder
model_trial_idx = 1
model_trial_idx += 1
model_path = f"/tmp/manual_model/{model_trial_idx}"
builder = tfdf.builder.CARTBuilder(
path=model_path,
objective=tfdf.py_tree.objective.ClassificationObjective(
label="species", classes=["Adelie", "Gentoo", "Chinstrap"])
)
builder.add_tree(sample_tree)
This makes more sense to me than building the tree from scratch in the builder, as it generally works better to start from a result produced by an ML model. Nonetheless, I went through the APIs, and it looks like some classes do not allow replacing parts of the structure; for instance, setting the root directly:
sample_tree.root = NonLeafNode(
    condition=NumericalHigherThanCondition(
        feature=SimpleColumnSpec(name="bill_length_mm", type=ColumnType.NUMERICAL),
        threshold=40.0,
        missing_evaluation=False),
    pos_child=NonLeafNode(
        condition=CategoricalIsInCondition(
            feature=SimpleColumnSpec(name="island", type=ColumnType.CATEGORICAL),
            mask=["Dream", "Torgersen"],
            missing_evaluation=False),
        pos_child=LeafNode(value=ProbabilityValue(probability=[0.8, 0.2], num_examples=10)),
        neg_child=LeafNode(value=ProbabilityValue(probability=[0.1, 0.9], num_examples=20))),
    neg_child=LeafNode(value=ProbabilityValue(probability=[0.2, 0.8], num_examples=30)))
raises an error, since the tree root does not expose a setter.
Also, the builder class does not have APIs to tweak individual nodes; something like the below (a hypothetical API, it does not exist today) at the builder stage would also be helpful:
# Hypothetical API for illustration only; CARTBuilder does not offer get_node.
builder.get_node[node_idx] = NonLeafNode(
    condition=CategoricalIsInCondition(
        feature=SimpleColumnSpec(name="island", type=ColumnType.CATEGORICAL),
        mask=["Dream", "Torgersen"],
        missing_evaluation=False),
    pos_child=LeafNode(value=ProbabilityValue(probability=[0.8, 0.2], num_examples=10)),
    neg_child=LeafNode(value=ProbabilityValue(probability=[0.1, 0.9], num_examples=20)))
Perhaps I have missed some APIs, as the Python API doc does not seem exactly aligned with the code. Please advise on my proposed workflow and the questions regarding those setter methods. Appreciate it! |
Hi, thank you for providing additional details. I looked into the tree implementation and it seems like it's just the tree root that does not have a setter method - other properties (e.g. left_child, right_child, value etc.) can be modified through Python:
We might make the root modifiable for the next TF-DF version, but I don't believe that this will be blocking, since each attribute of the root can be modified. TF-DF does not offer a dedicated tree traversal API, so walking the nodes has to be done by hand.
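For example, something along these lines should work (the traversal helper is just a sketch, not a TF-DF API):
# Modify attributes of an existing node in place instead of replacing the whole root.
root = sample_tree.root
root.condition = NumericalHigherThanCondition(
    feature=SimpleColumnSpec(name="bill_length_mm", type=ColumnType.NUMERICAL),
    threshold=40.0,
    missing_evaluation=False)
root.pos_child = LeafNode(
    value=ProbabilityValue(probability=[0.8, 0.2, 0.0], num_examples=10))
# A simple manual pre-order traversal over the py_tree nodes.
def traverse(node, depth=0):
    if isinstance(node, LeafNode):
        print("  " * depth, node.value)
    else:
        print("  " * depth, node.condition)
        traverse(node.pos_child, depth + 1)
        traverse(node.neg_child, depth + 1)
traverse(sample_tree.root)
|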
Hi @rstz, thanks for the comments; adding a setter method for the root would indeed be necessary in some cases. Also, I tried your suggested way of adjusting nodes and it worked, too. Nonetheless, I ran into an error that other people have also met when trying to call a SavedModel; I added the comments in issue #136 as well and copied them here. In my case, I am building a model the builder way: add an existing tree to the builder to build a new model, tweaking nodes if necessary. The sample code of my workflow is below:
model = tfdf.keras.CartModel()
model.fit(x=dataset_tf)
inspector = model.make_inspector()
sample_tree = inspector.extract_tree(tree_idx=0)
# Create some alias
Tree = tfdf.py_tree.tree.Tree
SimpleColumnSpec = tfdf.py_tree.dataspec.SimpleColumnSpec
ColumnType = tfdf.py_tree.dataspec.ColumnType
# Nodes
NonLeafNode = tfdf.py_tree.node.NonLeafNode
LeafNode = tfdf.py_tree.node.LeafNode
# Conditions
NumericalHigherThanCondition = tfdf.py_tree.condition.NumericalHigherThanCondition
CategoricalIsInCondition = tfdf.py_tree.condition.CategoricalIsInCondition
# Leaf values
ProbabilityValue = tfdf.py_tree.value.ProbabilityValue
sample_tree = inspector.extract_tree(tree_idx=0)
print(sample_tree)
# Build a model
model_trial_idx = 1
# Create the model builder
model_trial_idx += 1
model_path = f"/tmp/manual_model/{model_trial_idx}"
builder = tfdf.builder.CARTBuilder(
path=model_path,
objective=tfdf.py_tree.objective.ClassificationObjective(
label="species", classes=["Adelie", "Gentoo", "Chinstrap"])
)
builder.add_tree(sample_tree)
builder.close()
manual_model = tf.keras.models.load_model(model_path)
manual_model.predict(dataset_tf)
The error is as below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-47-d9c72aa7e6da>](https://localhost:8080/#) in <cell line: 1>()
----> 1 manual_model.predict(dataset_tf)
1 frames
[/usr/local/lib/python3.10/dist-packages/keras/engine/training.py](https://localhost:8080/#) in tf__predict_function(iterator)
13 try:
14 do_return = True
---> 15 retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
16 except:
17 do_return = False
ValueError: in user code:
File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2169, in predict_function *
return step_function(self, iterator)
File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2155, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2143, in run_step **
outputs = model.predict_step(data)
File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2111, in predict_step
return self(x, training=False)
File "/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
ValueError: Could not find matching concrete function to call loaded from the SavedModel. Got:
Positional arguments (2 total):
* {'bill_depth_mm': <tf.Tensor 'inputs_2:0' shape=(None,) dtype=float32>,
'bill_length_mm': <tf.Tensor 'inputs_1:0' shape=(None,) dtype=float32>,
'body_mass_g': <tf.Tensor 'inputs_4:0' shape=(None,) dtype=float32>,
'flipper_length_mm': <tf.Tensor 'inputs_3:0' shape=(None,) dtype=float32>,
'island': <tf.Tensor 'inputs:0' shape=(None,) dtype=string>,
'sex': <tf.Tensor 'inputs_5:0' shape=(None,) dtype=string>,
'year': <tf.Tensor 'inputs_6:0' shape=(None,) dtype=int64>}
* False
Keyword arguments: {}
Expected these arguments to match one of the following 4 option(s):
Option 1:
Positional arguments (2 total):
* {'bill_depth_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='bill_depth_mm'),
'bill_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='bill_length_mm'),
'flipper_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='flipper_length_mm'),
'island': TensorSpec(shape=(None,), dtype=tf.string, name='island')}
* True
Keyword arguments: {}
Option 2:
Positional arguments (2 total):
* {'bill_depth_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='bill_depth_mm'),
'bill_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='bill_length_mm'),
'flipper_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='flipper_length_mm'),
'island': TensorSpec(shape=(None,), dtype=tf.string, name='island')}
* False
Keyword arguments: {}
Option 3:
Positional arguments (2 total):
* {'bill_depth_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='inputs_bill_depth_mm'),
'bill_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='inputs_bill_length_mm'),
'flipper_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='inputs_flipper_length_mm'),
'island': TensorSpec(shape=(None,), dtype=tf.string, name='inputs_island')}
* True
Keyword arguments: {}
Option 4:
Positional arguments (2 total):
* {'bill_depth_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='inputs_bill_depth_mm'),
'bill_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='inputs_bill_length_mm'),
'flipper_length_mm': TensorSpec(shape=(None,), dtype=tf.float32, name='inputs_flipper_length_mm'),
'island': TensorSpec(shape=(None,), dtype=tf.string, name='inputs_island')}
* False
Keyword arguments: {}
Note: the dataset is identical to what was used to train the initial model, so it is very strange that it still runs into an error when using the manual model to predict on that dataset. This seems to be a TensorFlow issue; people using other TF packages have also met it, and the suggested solution was to install compatible package versions. If that is the case, my package versions are:
They are compatible, so why does this issue still remain? Appendix:
|
Tl;Dr: You need to manually fix the input signature:
def copy_model_sig(m):
"""
Copy the model signature to a new model.
"""
spec = m.save_spec()[0][0]
return lambda insp: spec
...
builder = tfdf.builder.CARTBuilder(
...
input_signature_example_fn=copy_model_sig(model)
)
Hi, for some background: a TF-DF model is actually a Yggdrasil Decision Forests (YDF) model that is wrapped in a Keras model. The builder and the inspector both operate on the YDF model, since Keras itself does not have a concept of trees, forests, etc. Unfortunately, some information gets lost between the two formats. We are actively working on improving this, but it requires a major rewrite of parts of TF-DF / YDF. Here, the TF-DF model does not know what the correct input signature is. Instead it just asks the YDF model for its signature and tries to guess a signature from there. Since YDF generally does not distinguish between different numeric representations (everything is a float32) and may choose to ignore features that the model does not use, the signature it guesses often does not match what you fed to it through TF-DF. Looking at this code, it seems useful to integrate it somewhere in the library or, at least, more explicitly in the documentation - I'll think about that. The full, fixed code is:
model = tfdf.keras.CartModel()
model.fit(x=dataset_tf)
inspector = model.make_inspector()
def copy_model_sig(m):
"""
Copy the model signature to a new model.
"""
spec = m.save_spec()[0][0]
return lambda inspector: spec
sample_tree = inspector.extract_tree(tree_idx=0)
# Create some alias
Tree = tfdf.py_tree.tree.Tree
SimpleColumnSpec = tfdf.py_tree.dataspec.SimpleColumnSpec
ColumnType = tfdf.py_tree.dataspec.ColumnType
# Nodes
NonLeafNode = tfdf.py_tree.node.NonLeafNode
LeafNode = tfdf.py_tree.node.LeafNode
# Conditions
NumericalHigherThanCondition = tfdf.py_tree.condition.NumericalHigherThanCondition
CategoricalIsInCondition = tfdf.py_tree.condition.CategoricalIsInCondition
# Leaf values
ProbabilityValue = tfdf.py_tree.value.ProbabilityValue
sample_tree = inspector.extract_tree(tree_idx=0)
print(sample_tree)
# Build a model
model_trial_idx = 1
# Create the model builder
model_trial_idx += 1
model_path = f"/tmp/manual_model/{model_trial_idx}"
builder = tfdf.builder.CARTBuilder(
path=model_path,
objective=tfdf.py_tree.objective.ClassificationObjective(
label="species", classes=["Adelie", "Gentoo", "Chinstrap"]),
input_signature_example_fn=copy_model_sig(model)
)
builder.add_tree(sample_tree)
builder.close()
Note that you can also manually modify the signature if necessary (copied from one of the tests):
def custom_model_input_signature(
inspector: inspector_lib.AbstractInspector,
) -> Any:
input_spec = keras.build_default_input_model_signature(inspector)
# Those features are stored as int64 in the dataset.
for feature_name in [
"age",
"capital_gain",
"capital_loss",
"education_num",
"fnlwgt",
"hours_per_week",
]:
input_spec[feature_name] = tf.TensorSpec(shape=[None], dtype=tf.int64)
return input_spec |
Hi @rstz, thanks for the explanation and the workaround, and I appreciate your prompt replies these past few days! I tested it and it worked! This thread is a great discussion. There is some work on my side to address my needs, e.g. converting the tree printout:
(flipper_length_mm >= 206.5; miss=False, score=0.5436033606529236)
├─(pos)─ (bill_depth_mm >= 17.649999618530273; miss=False, score=0.2061920464038849)
│ ├─(pos)─ ProbabilityValue([0.3333333333333333, 0.6666666666666666, 0.0],n=6.0) (idx=4)
│ └─(neg)─ ProbabilityValue([0.0, 0.0, 1.0],n=108.0) (idx=3)
└─(neg)─ (island in ['Biscoe', 'Torgersen']; miss=True, score=0.23399487137794495)
├─(pos)─ ProbabilityValue([0.9767441860465116, 0.0, 0.023255813953488372],n=86.0) (idx=2)
└─(neg)─ (bill_length_mm >= 42.349998474121094; miss=True, score=0.5646106004714966)
├─(pos)─ ProbabilityValue([0.03278688524590164, 0.9672131147540983, 0.0],n=61.0) (idx=1)
└─(neg)─ ProbabilityValue([0.9795918367346939, 0.02040816326530612, 0.0],n=49.0) (idx=0)
to the format below, along with some stats for the nodes:
flipper_length_mm >= 206.5 and bill_depth_mm >= 17.64999961853027;
flipper_length_mm >= 206.5 and bill_depth_mm < 17.64999961853027;
...
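A rough sketch of how I plan to derive these rule strings from the py_tree (the helper is my own code, not a TF-DF API, and it assumes the condition objects expose the feature, threshold and mask attributes used above):
def extract_rules(node, path=(), rules=None):
    """Collects one 'cond and cond and ...' string per leaf, plus the leaf stats."""
    if rules is None:
        rules = []
    if isinstance(node, LeafNode):
        rules.append((" and ".join(path), node.value))
    else:
        feature = node.condition.feature.name
        if isinstance(node.condition, NumericalHigherThanCondition):
            pos = f"{feature} >= {node.condition.threshold}"
            neg = f"{feature} < {node.condition.threshold}"
        else:  # e.g. CategoricalIsInCondition
            pos = f"{feature} in {node.condition.mask}"
            neg = f"{feature} not in {node.condition.mask}"
        extract_rules(node.pos_child, path + (pos,), rules)
        extract_rules(node.neg_child, path + (neg,), rules)
    return rules
for rule, value in extract_rules(sample_tree.root):
    print(rule, "->", value)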
Possible follow-ups on the TF-DF side, as you mentioned: making the tree root settable, and integrating the signature-copying helper into the library or the documentation.
Again, thanks for your help, great discussion! TFDF long live! 🫡🫡🫡 |
Update on your prior point, @rstz
After applying the input-signature fix, the prediction works. I looked through the API, and it seems that predict_get_leaves / call_get_leaves are what I need to get the leaf assignment for each example. Nonetheless, the manual model seems to not be able to call either of them:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
[<ipython-input-38-f1aae9bea4bf>](https://localhost:8080/#) in <cell line: 1>()
----> 1 manual_model.predict_get_leaves(dataset_tf)
AttributeError: 'InferenceCoreModel' object has no attribute 'predict_get_leaves'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-39-de7dfe55a161>](https://localhost:8080/#) in <cell line: 1>()
----> 1 manual_model.call_get_leaves(dataset_tf)
1 frames
[/usr/local/lib/python3.10/dist-packages/tensorflow/python/saved_model/function_deserialization.py](https://localhost:8080/#) in restored_function_body(*args, **kwargs)
259 """Calls a restored function or raises an error if no matching function."""
260 if not saved_function.concrete_functions:
--> 261 raise ValueError("Found zero restored functions for caller function.")
262 # This is the format of function.graph.structured_input_signature. At this
263 # point, the args and kwargs have already been canonicalized.
ValueError: Found zero restored functions for caller function.
Both errors seem to be due to the model building/saving and loading process? |
Hi, unfortunately TF-DF cannot serialize all of its methods within the Keras model format. This means that some functions, including predict_get_leaves and call_get_leaves, are not available on a model that has been saved and re-loaded. However, my initial proposal was to first create the model and compute the leaf statistics with the in-memory model, before it is saved and re-loaded.
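For illustration, a rough sketch of that idea (assuming the fitted model object from earlier is still available in memory):
# Use the original in-memory model, not the re-loaded SavedModel, to get leaf assignments.
leaves = model.predict_get_leaves(dataset_tf)  # one leaf index per example and per tree
leaf_per_example = leaves[:, 0]  # a CART model contains a single tree
print(leaf_per_example[:10])
|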
Hi @rstz, I managed to get the leaf assignments by using the in-memory model. One thing I noticed is that non-leaf nodes do not have statistics I can refresh in the same way, since predictions only go through the leaves. |
Hi @Realvincentyuan, looking forward to your PRs! |
Hi @rstz, I have created functions for tree traversal, probability & number-of-samples reset, and probability & number-of-samples refresh after getting the true results. I did not modify anything in the native classes that you built; these are mostly independent helper functions. I am not sure where to add the functions for a PR: is it contrib or tools? I will document my functions and examples later and create a PR.
├── configure: Project configuration.
├── documentation: User and developer documentation. Contains the colabs.
├── examples: Collection of usage examples.
├── tensorflow_decision_forests: The library
│ ├── component: Utilities.
│ │ ├── builder: Create models "by hand".
│ │ ├── inspector: Inspection of structure and meta-data of models.
│ │ ├── model_plotter: Plotting of model tree structure.
│ │ ├── py_tree: Representation of a decision tree as a python object.
│ │ └── tuner: TF-DF's own hyper-parameter tuner.
│ ├── contrib: Additional functionality outside the project's main scope.
│ ├── keras: Keras logic. Depends on tensorflow logic.
│ │ └── wrapper: Python code generator for Keras models.
│ │── tensorflow: TensorFlow logic.
│ │ └── ops: Custom C++ ops.
│ │ ├── inference: ... for inference.
│ │ └── training: ... for training.
│ └── test_data: Datasets for unit tests and benchmarks.
├── third_party: Bazel configuration for dependencies.
└── tools: Tools for the management of the project and code.
Please note that I did not get a chance to refresh the values of non-leaf nodes, as the prediction of a model only looks at the leaf nodes and I can only get the leaf assignment for each instance of the input dataset. So far I have not been able to update the non-leaf nodes of models built by hand. I feel that in order to update each non-leaf node, it would be necessary to simulate each split of the tree from the root down to every leaf; in my case that is not needed, as I only need the literal performance of each leaf, so I will not include this feature in the PR. Any thoughts on this item?
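For reference, a rough sketch of the leaf refresh I have in mind (the label handling and the collect_leaves helper are my own assumptions, not TF-DF APIs):
import numpy as np
# Leaf index per example for tree 0, taken from the in-memory model.
leaf_idx = model.predict_get_leaves(dataset_tf)[:, 0]
labels = np.array(label_values)  # integer class ids aligned with the dataset rows (assumed available)
# Recompute the per-leaf class distribution and example count.
num_classes = 3
stats = {}
for leaf, label in zip(leaf_idx, labels):
    counts = stats.setdefault(leaf, np.zeros(num_classes))
    counts[label] += 1
# Write the refreshed statistics back into the py_tree leaves.
# collect_leaves is a hypothetical helper that returns the leaves ordered by their idx.
for idx, leaf_node in enumerate(collect_leaves(sample_tree.root)):
    counts = stats.get(idx, np.ones(num_classes))
    leaf_node.value = ProbabilityValue(
        probability=list(counts / counts.sum()),
        num_examples=float(counts.sum()))
|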
Hi, a subdirectory of contrib would be the right place for these helper functions. |
Hi @rstz, I first added a post about those helper functions on my blog: https://vincentyuan.us/build-a-decision-tree-by-hand-with-tensorflow/ Please review it when you get a chance. As you can see, the helper functions and usage examples are documented there.
It looks like contributions are encouraged to include a test script, but I am not yet familiar with the project's test setup. |
Hi, sorry for not getting back to you for a while. The article looks good! I wonder if the best strategy would be to just create an "External examples" file that links directly to your (and other) articles? |
Hi, no worries, I have not had a chance to write an exhaustive unit test so far. I am also thinking that, if possible, I would rather contribute to the tutorial notebooks that you made, instead of adding an external link, as the URL might get outdated. What do you think? |
Expectation
Use the tfdf.builder.CARTBuilder to build a decision tree structure, train it with the literal dataset, and optimize the tree structure based on the performance. The process is like manually replicating the training process of tfdf.keras.CartModel, but the benefit is that I can adjust the tree structure to my needs rather than focusing only on model performance, which is helpful when intuitive rules are needed.
Sample code
I tried to use the tfdf.builder.CARTBuilder to build the structure and fit/predict, but the results are not as expected: the fitting process does not lead to any change in the predictions of the leaves. Below is some sample code with a sample dataset, running in Colab.
Questions
manual_model is a completely static model. I assumed that the tfdf.builder.CARTBuilder is used to build a shell and that the prediction of each node would be reflected after fitting/prediction. I am very confused why it requires me to define the value in each leaf and why that value remains the same after fitting/prediction; did I miss anything? What, then, is the best practice for using tfdf.builder.CARTBuilder to build a decision tree by hand?