Model Showing Extra Class Label #102

Open
laneciar opened this issue Apr 23, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@laneciar

Hello,

When training my model, it comes up with an additional class for some reason. For example, my classes are currently [1, 2, 3, 4, 5], but when I analyze the tree using plot_model, it shows the following:

image

What is class 0 and why does it show up? Obviously 0% of the dataset has it, but why is it here in the first place? I believe it also shows up in the model summary output. Even when I turn my y label array into a set, it gives the following:

image

Which comes from this snippet:

        temp = list(zip(self.x_train, self.y_train))
        random.shuffle(temp)
        x_train, y_train = zip(*temp)
        my_set = set(y_train)
        print(my_set)
        train_data = self.random_forest.make_tf_dataset(
            np.array(x_train), np.array(y_train)
        )

        # print(len(list(train_data.as_numpy_iterator())))
        self.model_6.fit(train_data, verbose=1)
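
One way to confirm that the labels themselves contain no 0 is to count each distinct value before building the dataset. The sketch below uses small made-up arrays in place of `self.x_train` and `self.y_train` (the array contents and the seed are assumptions, not values from this issue). It also shuffles both arrays with a single shared permutation, which keeps each row paired with its label.

```python
import numpy as np

# Hypothetical stand-ins for self.x_train / self.y_train.
x_train = np.arange(20, dtype=np.float32).reshape(10, 2)
y_train = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])

# One shared permutation shuffles features and labels together.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(y_train))
x_shuffled, y_shuffled = x_train[perm], y_train[perm]

# Count how often each label value occurs in the shuffled data.
values, counts = np.unique(y_shuffled, return_counts=True)
label_counts = {int(v): int(c) for v, c in zip(values, counts)}
print(label_counts)
```

If 0 is absent from `label_counts`, the extra class is not coming from the training labels.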

Any idea what's going on?

@Cheril311
Contributor

@laneciar Can you please tell me what your classes are and which loss function and metric you are using?

@laneciar
Author

@Cheril311 My classes are the ones in the picture above: {1, 2, 3, 4, 5}, each associated with an x row. As for the loss function and metric, I use the defaults of the Random Forest model; I don't specify any.

@laneciar
Author

@Cheril311 Here is some source code:

Random Forest: I use rf_model_1 for training and the second returned model for evaluating and predicting.

def create_single_model(self):
        input_features = tf.keras.Input(shape=(self.num_features,))

        # preprocessor = tf.keras.layers.Dense(self.num_features, activation=tf.nn.relu6)
        # preprocess_features = preprocessor(input_features)

        rf_model_1 = tfdf.keras.RandomForestModel(
            verbose=1,
            task=tfdf.keras.Task.CLASSIFICATION,
            num_trees=self.num_of_trees,
            max_depth=32,
            # hyperparameter_template="benchmark_rank1@v1",
            # bootstrap_size_ratio=1.0,  # Optimal at 1  0.6470000147819519
            categorical_algorithm="CART",  # CART and RANDOM provide same accuracy 0.6470000147819519
            growing_strategy="LOCAL",  # LOCAL significantly better 0.6470000147819519
            # honest=False,  # honest True is slightly better 0.6470000147819519
            # max_depth=5,  # Caps at 32 slightly better  0.6480000019073486
            # min_examples=5,  # Best at 5  0.6480000019073486
            # missing_value_policy="LOCAL_IMPUTATION",  # No change .6480000019073486
            sorting_strategy="PRESORT",  # No change .6480000019073486
            sparse_oblique_normalization="MIN_MAX",  # Significantly helps 0.6850000023841858
            # sparse_oblique_num_projections_exponent=2.0,  # Crashes when above 2
            # sparse_oblique_weights="BINARY",  # Slightly better
            split_axis="SPARSE_OBLIQUE",  # Slightly better
            # winner_take_all=True,  # Slightly better
        )
        out = rf_model_1(input_features)

        model = tf.keras.models.Model(input_features, out)

        return rf_model_1, model

Training: I shuffle the x and y data so the labels are mixed up and not in order

        tf.keras.utils.plot_model(
            self.single_model,
            to_file="./info/single_arch/model_test.png",
            show_shapes=True,
            show_layer_names=True,
        )

        temp = list(zip(self.x_train, self.y_train))
        random.shuffle(temp)
        x_train, y_train = zip(*temp)
        train_data = self.random_forest.make_tf_dataset(
            np.array(x_train), np.array(y_train)
        )

        # print(len(list(train_data.as_numpy_iterator())))
        self.model_6.fit(train_data, verbose=1)

        self.model_6.compile(["accuracy"])
        validation_data = self.random_forest.make_tf_dataset(self.x_test, self.y_test)
        evaluation_df6_only = self.model_6.evaluate(validation_data, return_dict=True)

        with open("./info/single_arch/model_6.html", "w") as f:
            f.write(
                tfdf.model_plotter.plot_model(self.model_6, tree_idx=0, max_depth=10)
            )
        print("Accuracy (D6 only): ", evaluation_df6_only["accuracy"])

Hope this helps, let me know if you want anything else.

@achoum
Collaborator

achoum commented Jun 22, 2022

Hi laneciar,

Class 0 is an artifact of the way classes are handled internally: it represents out-of-vocabulary values. However, since out-of-vocabulary values are not permitted for labels, its count is always 0.
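
As a rough illustration of that explanation, and not the library's actual code, the sketch below builds a label dictionary in which index 0 is reserved for out-of-vocabulary values and the observed labels are assigned indices from 1 upward. The helper `build_label_vocabulary` and the `"<OOV>"` name are hypothetical. Because every training label is in the dictionary by construction, nothing is ever encoded into the reserved slot.

```python
from collections import Counter

OOV_NAME = "<OOV>"  # hypothetical name for the reserved slot


def build_label_vocabulary(labels):
    """Map each distinct label to an index, keeping index 0 for OOV."""
    vocab = {OOV_NAME: 0}
    for label in sorted(set(labels)):
        vocab[label] = len(vocab)
    return vocab


labels = [1, 2, 3, 4, 5, 1, 2, 3]
vocab = build_label_vocabulary(labels)

# Encode labels; an unknown value would fall back to index 0.
encoded = [vocab.get(label, 0) for label in labels]
counts = Counter(encoded)

# Per-index counts, including the reserved slot that nothing maps to.
distribution = [counts.get(index, 0) for index in range(len(vocab))]
print(distribution)
```

Under these assumptions the distribution has one more entry than there are real classes, and its first entry stays at zero, which would match the empty class 0 in the plot.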

Thanks for the heads-up. We will resolve it :).

@rstz rstz added the bug Something isn't working label Sep 5, 2022