
Auto assign_causal_mechanisms is taking so much time in gcm #1214

Closed
Abu-thahir opened this issue Jun 24, 2024 · 12 comments
Labels
question Further information is requested stale

Comments


Abu-thahir commented Jun 24, 2024

@bloebp @amit-sharma I tried to run the Online Sales Shop example, which is available here: https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_online_shop.html.

auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD); print(auto_assignment_summary)

This call has been running for hours with no output. Why is this? Is this the intended behaviour? Also, the gcm.evaluate_causal_model method isn't working for me.

I also have a question: if I have a causal graph, should I explicitly apply causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I should be able to set? Is there a reference for understanding causal mechanisms?

Version information:

  • DoWhy version 0.11.1
@Abu-thahir Abu-thahir added the question Further information is requested label Jun 24, 2024
@bloebp
Member

bloebp commented Jun 24, 2024

Hi, I think someone else reported a similar issue. It was due to using Python 3.12 (DoWhy only supports versions below 3.12, e.g., 3.11) and the installed scikit-learn version. Can you double-check that you have DoWhy 0.11.1 installed? (With Python 3.12, it will fall back to 0.8, I think.)

Generally auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD) in this example should be quite fast (probably under 20 seconds). Can you try to uninstall scikit-learn and re-install it again (or upgrade it)?
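A quick way to do the suggested sanity check is to print the interpreter and installed package versions before running anything else (a minimal sketch, using only the standard library):

```python
import sys
from importlib import metadata

# DoWhy 0.11.x requires Python < 3.12; on 3.12, pip may silently
# resolve to a much older DoWhy release instead.
print("Python:", sys.version.split()[0])
try:
    dowhy_version = metadata.version("dowhy")
except metadata.PackageNotFoundError:
    dowhy_version = "not installed"
print("DoWhy:", dowhy_version)
```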

Also, the gcm.evaluate_causal_model method isn't working for me.

Do you have an error message? If the method was not found, then it is most likely due to having an older DoWhy version installed.

I also have a question: if I have a causal graph, should I explicitly apply causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I should be able to set? Is there a reference for understanding causal mechanisms?

Normally, each node requires a causal mechanism to describe its data generation process. The assign_causal_mechanisms function aims to automate this with some "heuristics", so you don't have to do it manually. You can check the documentation for more information about customizing them if you want to assign them manually. Generally, you can either prepare your own model or use an existing wrapper to, e.g., assign any SciPy distribution to root nodes or regression/classification models for (additive noise models in) non-root nodes. The example notebook https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_rca_microservice_architecture.html shows some of the reasoning behind selecting the models manually.
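The manual assignment described above can be sketched as follows (a hedged example assuming the dowhy >= 0.11 gcm API; it falls back gracefully if dowhy is not installed):

```python
# Hedged sketch: assign a SciPy distribution to a root node and an additive
# noise model with a regression model to a non-root node.
try:
    import networkx as nx
    from scipy import stats
    from dowhy import gcm

    scm = gcm.StructuralCausalModel(nx.DiGraph([("X", "Y")]))
    # Root node: wrap any SciPy distribution as the stochastic model.
    scm.set_causal_mechanism("X", gcm.ScipyDistribution(stats.norm))
    # Non-root node: additive noise model, Y = f(X) + noise, with a linear regressor.
    scm.set_causal_mechanism("Y", gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
    status = "assigned"
except ImportError:
    status = "dowhy not installed"
print(status)
```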

@Abu-thahir
Author

@bloebp Thank you for helping me out. I resolved the issue by downgrading my Python version. However, there is another problem with OneHotEncoder in DoWhy's util package.

Issue 1:

45     if drop_first:
     46         drop = "first"
---> 47     encoder = OneHotEncoder(drop=drop, sparse=False)  # NB sparse renamed to sparse_output in sklearn 1.2+
     49     encoded_data = encoder.fit_transform(data_to_encode)
     51 else:  # Use existing encoder

TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

The sparse parameter of OneHotEncoder was renamed to sparse_output in scikit-learn 1.2. This must be updated; otherwise an encoding error occurs.

I attempted to downgrade scikit-learn below 1.2.0, but ran into other dependency/wheel issues, so I think this change is necessary!
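Until the rename is fixed upstream, a version-agnostic workaround is to pick whichever keyword the installed scikit-learn actually supports (a minimal sketch):

```python
import inspect

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; choose the
# keyword name accepted by the installed version.
params = inspect.signature(OneHotEncoder.__init__).parameters
dense_kwargs = {"sparse_output": False} if "sparse_output" in params else {"sparse": False}

encoder = OneHotEncoder(drop="first", **dense_kwargs)
encoded = encoder.fit_transform(np.array([["a"], ["b"], ["a"]]))
print(encoded.shape)  # two categories, first dropped -> one column
```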

@bloebp
Member

bloebp commented Jun 28, 2024

I think I found the code piece related to this. Since this is not part of the GCM package, I'm wondering: do you manually encode the categorical variables? If you use gcm.fit, it already encodes categorical data automatically, so there is no need to do this yourself.

@bloebp
Member

bloebp commented Jun 28, 2024

Opened a fix PR: #1219

@Abu-thahir
Author

I am wondering, since this is not part of the GCM package, do you manually encode the categorical variables? If you use gcm.fit it would already automatically encode categorical data accordingly, no need to do this manually.

@bloebp,
I haven't used gcm, but I've worked extensively with the DoWhy API. In my use case, I want to perform causal analysis on many treatments over the same outcome variables, and the system should be scalable enough to handle concurrent queries.
For example, I'm looking for the causal effect of a treatment variable named Campaign Name, which has 6 to 7 campaigns, on the outcome variable "Sales". So I one-hot encode the data manually and limit the number of unique values per variable to ten, because one-hot encoding brings the curse of dimensionality.

I am curious whether it is possible to serve requests concurrently if I pass the full categorical data with 50-100 unique values directly to fit. Would this be scalable, or will it cause large memory consumption as dimensionality increases?
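The cardinality cap described above can be sketched with pandas: keep the k most frequent categories and bucket the long tail into an "Other" level before one-hot encoding (column names and data here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Campaign Name": ["A", "B", "C", "A", "D", "A", "B"]})

# Keep only the k most frequent categories; bucket the rest as "Other"
# to bound the dimensionality of the subsequent one-hot encoding.
k = 3
top = df["Campaign Name"].value_counts().nlargest(k).index
df["Campaign Name"] = df["Campaign Name"].where(df["Campaign Name"].isin(top), "Other")

dummies = pd.get_dummies(df["Campaign Name"], prefix="Campaign")
print(sorted(dummies.columns))
```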

@bloebp
Member

bloebp commented Jun 28, 2024

Ah OK, got it. In this case, since you have a particular target variable in mind, maybe you can check alternative encoding methods, such as CatBoostEncoder. We have an implementation here: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/util/catboost_encoder.py#L37

Basically what you can try is:

# Import path assumed from the file linked above
from dowhy.gcm.util.catboost_encoder import CatBoostEncoder

my_encoder = CatBoostEncoder()
df['MyCategoricalColumn'] = my_encoder.fit_transform(
    X=df['MyCategoricalColumn'].to_numpy().reshape(-1),
    Y=df['MyTargetVariable'].to_numpy().reshape(-1))

@Abu-thahir
Author

However, CatBoostEncoder encodes the Campaign_Name column (a categorical treatment variable) within the same column, while what I need here is the ATE value of each campaign on the outcome variable "Sales".

With one-hot encoding, I can retrieve the ATE value for each campaign, because each campaign is treated as a separate treatment variable against the outcome "Sales".

Is it possible to obtain the ATE value for each campaign in the Campaign_Name variable using CatBoost in GCM?

@bloebp
Member

bloebp commented Jun 28, 2024

Ah OK, yeah, the intervention value becomes rather abstract with a CatBoost encoding, while you still have a clear interpretation with one-hot encodings.

So, in case of GCM, you can explicitly set the campaign value and see the effect, the transformation (e.g. catboost) would then happen internally. You can check this: https://www.pywhy.org/dowhy/v0.11.1/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_gcm.html

Basically, you need to compare two reference treatments, like

gcm.average_causal_effect(causal_model,
                         'Sales',
                         interventions_alternative={'Campaigns': lambda x: 'MyFirstCampaign'},
                         interventions_reference={'Campaigns': lambda x: 'MySecondCampaign'},
                         num_samples_to_draw=1000)

Generally, using DML for effect estimation might be more robust than a GCM, but you can give it a shot.

@Abu-thahir
Author

Abu-thahir commented Jul 1, 2024

@bloebp I also have another question: I'm using DML from the EconML package along with DoWhy for estimation, but I don't know how to access DML model attributes such as coef_ from the DoWhy wrapper.

Example code:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

dml_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.DML",
    control_value=0,
    treatment_value=1,
    target_units="ate",  # note: must be the string "ate", not a bare name
    confidence_intervals=False,
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),
            "model_t": GradientBoostingRegressor(),
            "model_final": LassoCV(fit_intercept=False),
            "featurizer": PolynomialFeatures(degree=1, include_bias=False),
        },
        "fit_params": {},
    })

@bloebp
Member

bloebp commented Jul 1, 2024

Maybe @amit-sharma or @kbattocchi can help here.


This issue is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale label Jul 16, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 24, 2024