
implementation of cal_weighted_quantiles #62

Closed
felix7602 opened this issue Jul 2, 2024 · 4 comments

@felix7602

Hi @reidjohnson , thank you for patiently answering my questions multiple times.

For the model initialization, I set max_samples_leaf=None, and in the predict function I set weighted_quantile=True, weighted_leaves=True, and aggregate_leaves_first=True.

I looked up all the training samples in the corresponding leaf nodes (below):
[Screenshot: training samples in the corresponding leaf nodes]

and got the weight for each value (below):
[Screenshot: the weight computed for each value]

From the perspective of the model implementation, could you please tell me how to obtain the correct value, with the interpolation method set to 'linear', from the sorted data and weights above?

I have already reviewed your source code, but I did not fully understand it. I suspect the inputs and weights parameters received by the calc_weighted_quantile method in the source code differ from what I have in mind (as shown in the second screenshot).
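For context, here is a minimal sketch of how a weighted quantile with 'linear' interpolation is commonly computed from sorted values and per-value weights: each value is assigned a cumulative-weight position, and the quantile is linearly interpolated between the two bracketing positions. This is a generic illustration of the idea, not necessarily the exact variant used inside the package:

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Weighted quantile with 'linear' interpolation (one common formulation).

    `values` must be sorted ascending; `weights` are per-value sample weights.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Cumulative weight at each value.
    cum_weights = np.cumsum(weights)
    total = cum_weights[-1]

    # Map each value to a quantile position in [0, 1]; this variant centers
    # each weight on its value (cum_weight minus half its own weight).
    positions = (cum_weights - 0.5 * weights) / total

    # Linearly interpolate between the bracketing positions.
    return np.interp(q, positions, values)
```

With unit weights this reduces to the usual unweighted 'linear' quantile, e.g. `weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5)` gives 2.5, matching `np.quantile`.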

@felix7602
Author

This is a paragraph from your introduction documentation; the process I described above is my attempt to replicate this procedure.

[Screenshot: paragraph from the introduction documentation]

@reidjohnson
Member

@felix7602 No problem, thanks for your continued interest in and feedback on the package!

Here's an example implementation of a custom predict function that I believe accomplishes what you want. It produces output identical to the model predict method with the parameters you specified and uses the calc_weighted_quantile function with the expected inputs and weights. Feel free to follow up with any further questions.

import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from quantile_forest._quantile_forest_fast import calc_weighted_quantile
from sklearn import datasets

X, y = datasets.fetch_california_housing(return_X_y=True)
X = X[:100]
y = y[:100]

quantiles = [0.025, 0.5, 0.975]
interpolation = "linear"

model = RandomForestQuantileRegressor(max_samples_leaf=None, random_state=0)
model.fit(X, y)
y_pred = model.predict(
    X,
    quantiles=quantiles,
    interpolation=interpolation,
    weighted_quantile=True,
    weighted_leaves=True,
    aggregate_leaves_first=True,
)


def custom_predict(X, quantiles, model):
    y_train = np.asarray(model.forest_.y_train)
    y_train_leaves = np.asarray(model.forest_.y_train_leaves)

    X_leaves = model.apply(X)

    n_quantiles = len(quantiles)
    n_samples = X_leaves.shape[0]
    n_trees = X_leaves.shape[1]

    n_outputs = len(y_train)
    n_train = len(y_train[0])
    max_idx = y_train_leaves.shape[3]

    preds = np.full((n_samples, n_outputs, n_quantiles), np.nan, dtype=np.float64)

    for i in range(n_samples):
        n_leaf_samples = np.empty(n_trees)

        n_total_samples = 0
        n_total_trees = 0
        # Count the training samples in each tree's leaf for this test sample.
        for j in range(n_trees):
            n_leaf_samples[j] = 0
            for k in range(max_idx):
                if y_train_leaves[j, X_leaves[i, j], 0, k] != 0:
                    n_leaf_samples[j] += 1
            n_total_samples += n_leaf_samples[j]
            n_total_trees += 1

        for j in range(n_outputs):
            train_indices = []
            train_weights = []

            # Accumulate training indices across leaves for each tree.
            for k in range(n_trees):
                train_indices.extend(y_train_leaves[k, X_leaves[i, k], j, :])

            # Weight each tree's samples inversely to its leaf size, so every
            # tree contributes equally regardless of leaf population.
            for k in range(n_trees):
                train_weight = 0
                if n_leaf_samples[k] > 0:
                    train_weight = 1 / n_leaf_samples[k]
                    train_weight *= n_total_samples
                    train_weight /= n_total_trees
                train_weights.extend([train_weight] * max_idx)

            # Reset leaf weights for all training indices to 0.
            leaf_weights = np.zeros(n_train)

            # Sum the weights/counts for each training index.
            for l in range(len(train_indices)):
                train_idx = train_indices[l]
                train_wgt = train_weights[l]
                if train_idx != 0:
                    leaf_weights[train_idx - 1] += train_wgt

            # Calculate quantiles.
            pred = calc_weighted_quantile(
                y_train[j],
                leaf_weights,
                quantiles,
                interpolation.encode(),
                issorted=True,
            )

            preds[i, j, :] = pred

    if preds.shape[2] == 1:
        preds = np.squeeze(preds, axis=2)

    if preds.shape[1] == 1:
        preds = np.squeeze(preds, axis=1)

    return preds


print(np.all(y_pred == custom_predict(X, quantiles, model)))

@felix7602
Author

Dear @reidjohnson,

I am writing to express my heartfelt gratitude for your invaluable assistance in understanding the internal workings of the quantile regression forest model. As a student, your prompt and insightful responses have been instrumental in advancing my research.

Your willingness to share your expertise and the time you have dedicated to addressing my questions, often with remarkable promptness, have significantly contributed to my comprehension and progress. I am genuinely grateful for your generosity and support.

I want you to know that your help means a lot to me. It truly touches my heart. Thank you once again for your kindness and timely guidance.

Sincerely,
Felix

@reidjohnson
Member

Thank you for the note, very glad to be of help!
