
Auto assign_causal_mechanisms is taking so much time in gcm #1214

Closed
Abu-thahir opened this issue Jun 24, 2024 · 12 comments
Labels
question Further information is requested stale

Comments


Abu-thahir commented Jun 24, 2024

@bloebp @amit-sharma I tried to run the Online Sales Shop example, which is available here: https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_online_shop.html.

auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD); print(auto_assignment_summary)

This call has been running for hours with no output. Why is this? Is this the intended behaviour? Also, the gcm.evaluate_causal_model method isn't working for me.

I also have a question: if I have a causal graph, should I explicitly apply causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I should be able to set? Is there a reference for understanding causal mechanisms?

Version information:

  • DoWhy version 0.11.1
@Abu-thahir Abu-thahir added the question Further information is requested label Jun 24, 2024
@bloebp
Member

bloebp commented Jun 24, 2024

Hi, I think someone else reported a similar issue. It was due to using Python 3.12 (DoWhy only supports versions below 3.12, e.g., 3.11) and the installed scikit-learn version. Can you double-check that you have DoWhy 0.11.1 installed? (With Python 3.12, it will fall back to 0.8, I think.)

Generally auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD) in this example should be quite fast (probably under 20 seconds). Can you try to uninstall scikit-learn and re-install it again (or upgrade it)?
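A quick way to do the suggested sanity check is to print the interpreter and installed package versions before running anything else (a minimal sketch, using only the standard library):

```python
import sys
from importlib import metadata

# DoWhy 0.11.x requires Python < 3.12; on 3.12, pip may silently
# resolve to a much older DoWhy release instead.
print("Python:", sys.version.split()[0])
try:
    dowhy_version = metadata.version("dowhy")
except metadata.PackageNotFoundError:
    dowhy_version = "not installed"
print("DoWhy:", dowhy_version)
```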

Also, the gcm.evaluate_causal_model method isn't working for me.

Do you have an error message? If the method was not found, then it is most likely due to having an older DoWhy version installed.

I also have a question: if I have a causal graph, should I explicitly apply causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I should be able to set? Is there a reference for understanding causal mechanisms?

Normally, each node requires a causal mechanism to describe its data generation process. The assign_causal_mechanisms function aims to automate this with some "heuristics", so you don't have to do it manually. You can check the documentation for more information about customizing them if you want to assign them manually. Generally, you can either prepare your own model or use an existing wrapper to, e.g., assign any SciPy distribution to root nodes or regression/classification models for (additive noise models in) non-root nodes. The example notebook https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_rca_microservice_architecture.html shows some of the reasoning behind selecting the models manually.
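The manual assignment described above can be sketched as follows (a hedged example assuming the dowhy >= 0.11 gcm API; it falls back gracefully if dowhy is not installed):

```python
# Hedged sketch: assign a SciPy distribution to a root node and an additive
# noise model with a regression model to a non-root node.
try:
    import networkx as nx
    from scipy import stats
    from dowhy import gcm

    scm = gcm.StructuralCausalModel(nx.DiGraph([("X", "Y")]))
    # Root node: wrap any SciPy distribution as the stochastic model.
    scm.set_causal_mechanism("X", gcm.ScipyDistribution(stats.norm))
    # Non-root node: additive noise model, Y = f(X) + noise, with a linear regressor.
    scm.set_causal_mechanism("Y", gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
    status = "assigned"
except ImportError:
    status = "dowhy not installed"
print(status)
```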

@Abu-thahir
Author

@bloebp Thank you for helping me out. I resolved the issue by downgrading my Python version. However, there is another problem with OneHotEncoder in DoWhy's util package.

Issue 1:

45     if drop_first:
     46         drop = "first"
---> 47     encoder = OneHotEncoder(drop=drop, sparse=False)  # NB sparse renamed to sparse_output in sklearn 1.2+
     49     encoded_data = encoder.fit_transform(data_to_encode)
     51 else:  # Use existing encoder

TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

The sparse parameter of OneHotEncoder was renamed to sparse_output in scikit-learn 1.2. This must be updated; otherwise an encoding error occurs.

I attempted to downgrade scikit-learn below 1.2.0, but ran into other dependency/wheel issues, so I think this change is necessary!
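Until the rename is fixed upstream, a version-agnostic workaround is to pick whichever keyword the installed scikit-learn actually supports (a minimal sketch):

```python
import inspect

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; choose the
# keyword name accepted by the installed version.
params = inspect.signature(OneHotEncoder.__init__).parameters
dense_kwargs = {"sparse_output": False} if "sparse_output" in params else {"sparse": False}

encoder = OneHotEncoder(drop="first", **dense_kwargs)
encoded = encoder.fit_transform(np.array([["a"], ["b"], ["a"]]))
print(encoded.shape)  # two categories, first dropped -> one column
```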

@bloebp
Member

bloebp commented Jun 28, 2024

I think I found the code piece related to this. Since this is not part of the GCM package, I'm wondering: do you manually encode the categorical variables? If you use gcm.fit, it already encodes categorical data automatically, so there is no need to do this yourself.

@bloebp
Member

bloebp commented Jun 28, 2024

Opened a fix PR: #1219

@Abu-thahir
Author

I am wondering, since this is not part of the GCM package, do you manually encode the categorical variables? If you use gcm.fit it would already automatically encode categorical data accordingly, no need to do this manually.

@bloebp,
I haven't used gcm, but I've worked extensively with the DoWhy API. In my use case, I want to perform causal analysis on many treatments over the same outcome variables, and the system should be scalable enough to handle concurrent queries.
For example, I'm looking for the causal effect of a treatment variable named Campaign Name, which has 6 to 7 campaigns, on the outcome variable "Sales". So I one-hot encode the data manually and limit the number of unique values per variable to ten, because one-hot encoding brings the curse of dimensionality.

I am curious whether it is possible to serve requests concurrently if I pass the full categorical data with 50-100 unique values directly to fit. Would this be scalable, or will it cause large memory consumption as dimensionality increases?
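The cardinality cap described above can be sketched with pandas: keep the k most frequent categories and bucket the long tail into an "Other" level before one-hot encoding (column names and data here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Campaign Name": ["A", "B", "C", "A", "D", "A", "B"]})

# Keep only the k most frequent categories; bucket the rest as "Other"
# to bound the dimensionality of the subsequent one-hot encoding.
k = 3
top = df["Campaign Name"].value_counts().nlargest(k).index
df["Campaign Name"] = df["Campaign Name"].where(df["Campaign Name"].isin(top), "Other")

dummies = pd.get_dummies(df["Campaign Name"], prefix="Campaign")
print(sorted(dummies.columns))
```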

@bloebp
Member

bloebp commented Jun 28, 2024

Ah OK, got it. In this case, since you have a particular target variable in mind, maybe you can check alternative encoding methods, such as CatBoostEncoder. We have an implementation here: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/util/catboost_encoder.py#L37

Basically what you can try is:

# Import path assumed from the file linked above
from dowhy.gcm.util.catboost_encoder import CatBoostEncoder

my_encoder = CatBoostEncoder()
df['MyCategoricalColumn'] = my_encoder.fit_transform(
    X=df['MyCategoricalColumn'].to_numpy().reshape(-1),
    Y=df['MyTargetVariable'].to_numpy().reshape(-1))

@Abu-thahir
Author

However, CatBoostEncoder encodes the Campaign_Name column (a categorical treatment variable) within the same column, while what I need here is the ATE value of each campaign on the outcome variable "Sales".

With one-hot encoding, I can retrieve the ATE value for each campaign, because each campaign is treated as a separate treatment variable against the outcome "Sales".

Is it possible to obtain the ATE value for each campaign in the Campaign_Name variable using CatBoost in GCM?

@bloebp
Member

bloebp commented Jun 28, 2024

Ah OK, yeah, the intervention value becomes rather abstract with a CatBoost encoding, while you still have a clear interpretation with one-hot encodings.

So, in case of GCM, you can explicitly set the campaign value and see the effect, the transformation (e.g. catboost) would then happen internally. You can check this: https://www.pywhy.org/dowhy/v0.11.1/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_gcm.html

Basically, you need to compare two reference treatments, like

gcm.average_causal_effect(causal_model,
                         'Sales',
                         interventions_alternative={'Campaigns': lambda x: 'MyFirstCampaign'},
                         interventions_reference={'Campaigns': lambda x: 'MySecondCampaign'},
                         num_samples_to_draw=1000)

Generally, using DML for effect estimation might be more robust than a GCM, but you can give it a shot.

@Abu-thahir
Author

Abu-thahir commented Jul 1, 2024

@bloebp I also have another question: I'm using DML from the EconML package along with DoWhy for estimation, but I don't know how to access DML model attributes such as coef_ from the DoWhy wrapper.

Example code:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

dml_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.DML",
    control_value=0,
    treatment_value=1,
    target_units="ate",  # note: must be the string "ate", not a bare name
    confidence_intervals=False,
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),
            "model_t": GradientBoostingRegressor(),
            "model_final": LassoCV(fit_intercept=False),
            "featurizer": PolynomialFeatures(degree=1, include_bias=False),
        },
        "fit_params": {},
    })

@bloebp
Member

bloebp commented Jul 1, 2024

Maybe @amit-sharma or @kbattocchi can help here.


This issue is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale label Jul 16, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 24, 2024