
[BUG] Aggregations on Spark dataframes fail intermittently #392

Closed
jstammers opened this issue Dec 4, 2022 · 5 comments · Fixed by #396

@jstammers

Minimal Code To Reproduce

from fugue import DataFrame, FugueWorkflow
from fugue.column import lit, col
import pandas as pd

def aggregate_prices(
    df: DataFrame,
    rollup: DataFrame,
) -> DataFrame:
    # join one rollup level per iteration, from coarsest to finest grouping
    agg_order = [["a"], ["a", "b"], ["a", "b", "c"]]
    agg_levels = [1, 2, 3]
    for i, (cols, gid) in enumerate(zip(agg_order, agg_levels)):
        levels = ",".join(cols)
        price = rollup.filter(col("group_id") == gid).select(
            *(cols + [lit(levels).alias(f"level_{i}")])
        )
        df = df.join(price, how="left_outer", on=cols)
    return df

prices = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 1, 2], "price": [0.1, 0.2, 0.3], "group_id": [1, 2, 3]})
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": [1, 2, 3, 4], "c": [1, 2, 3, 4]})

dag = FugueWorkflow()
df_f = dag.df(df)
prices_f = dag.df(prices)
agg = aggregate_prices(df_f, prices_f)
dag.run("spark")

Describe the bug
When running the above code, I intermittently encounter the following error:

ERROR:root:_5 _State.RUNNING -> _State.FAILED  Table or view not found: _a2656; line 1 pos 14;
'Project [*]
+- 'Filter ('group_id = 2)
   +- 'UnresolvedRelation [_a2656], [], false

AnalysisException: Table or view not found: _a2656; line 1 pos 14;
'Project [*]
+- 'Filter ('group_id = 2)
   +- 'UnresolvedRelation [_a2656], [], false

I have observed this issue in a few of my pipelines. From what I have seen, the error seems to occur during inline transformations,
e.g.

agg = df.filter(...).partition_by(...).aggregate(...)
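
For example, a chain roughly of this shape (the columns and aggregation here are hypothetical, just to illustrate the pattern):

import fugue.column.functions as f

agg = (
    df_f.filter(col("price") > 0)
    .partition_by("a")
    .aggregate(max_price=f.max(col("price")))  # hypothetical aggregation
)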

Expected behavior
This transformation should execute successfully every time the workflow is run.

Environment (please complete the following information):

  • Backend: spark
  • Backend version: pyspark - 3.3.0
  • Python version: 3.9
  • OS: linux
@goodwanghan
Collaborator

Hi, thanks for reporting. I remember Spark had an issue with diamond joins: if B and C are simple selected results from A, joining B with C would throw an error saying C is not found when you use Spark SQL.

I think that bug was resolved in later versions of Spark (>2.4), but this feels very similar.
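
For context, the diamond shape looks roughly like this (a hypothetical PySpark sketch; the view name and columns are made up, and this snippet may not reproduce the bug on a fixed Spark version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A is a single source; B and C are both simple selections from A
spark.createDataFrame([(1, 10), (2, 20)], ["k", "v"]).createOrReplaceTempView("A")
B = spark.sql("SELECT k, v FROM A WHERE v > 5")
C = spark.sql("SELECT k, v * 2 AS v2 FROM A")

# joining the two branches of the diamond back together is where older
# Spark SQL versions could fail to resolve one side
B.join(C, on="k").show()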

There are two things you can try:

You can persist the rollup dataframe (breaking the lineage may help):

prices_f = dag.df(prices).persist()

You can also call join once, since join can take multiple dataframes together (see the link and sketch below). This will change the structure of the generated Spark SQL. I don't believe it should change the underlying execution plan, but since we are dealing with an unknown bug, who knows.

https://github.com/fugue-project/fugue/blob/master/fugue/workflow/workflow.py#L587
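
For example, a sketch of the loop rewritten as one call (this assumes the multi-dataframe form of join linked above, and that join keys are inferred from common column names when on is not given):

level_dfs = []
for i, (cols, gid) in enumerate(zip(agg_order, agg_levels)):
    levels = ",".join(cols)
    level_dfs.append(
        rollup.filter(col("group_id") == gid).select(
            *(cols + [lit(levels).alias(f"level_{i}")])
        )
    )
# one join call over all levels instead of one call per level
df = df.join(*level_dfs, how="left_outer")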

@jstammers
Author

Hi, thanks for the reply. I will try persisting the dataframes and see if that resolves the issue.
I have just encountered the problem in the following:

    449     df_f = dag.df(df)
    450     keys = df_f.select("CGAIdent", "SingleBuyProductItemId", "State").distinct()
--> 451     date_keys = df_f.select("USPeriod").distinct()
    452     keys = date_keys.join(dates_f, on=["USPeriod"], how="inner").join(keys, how="cross")

which I can't explain. If it helps, I am using DBR 11.4 (Databricks Runtime).

@goodwanghan
Collaborator

@jstammers I apologize for the delay. I have a theory about how it happened. I am in the middle of a very big code change (almost finished); I will include the fix in it, and if possible, please help us test.

@goodwanghan goodwanghan linked a pull request Dec 29, 2022 that will close this issue
@goodwanghan
Collaborator

@jstammers I think the problem is resolved in the latest pre-release. I was able to reproduce the issue once on 0.7.3 but was not able to reproduce it on 0.8.0.dev3.

Also, from 0.8.0 you no longer need to use FugueWorkflow; here is the modified version:

import fugue.api as fa
from fugue.column import lit, col
from fugue import AnyDataFrame
import pandas as pd

def aggregate_prices(
    df: AnyDataFrame,
    rollup: AnyDataFrame,
) -> AnyDataFrame:
    agg_order = [["a"], ["a", "b"], ["a", "b", "c"]]
    agg_levels = [1, 2, 3]
    for i, (cols, gid) in enumerate(zip(agg_order, agg_levels)):
        levels = ",".join(cols)
        price = fa.select(
            rollup,
            *(cols + [lit(levels).alias(f"level_{i}")]),
            where=col("group_id") == gid,
        )
        df = fa.left_outer_join(df, price)  # join keys are inferred from common columns
    return df

prices = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 1, 2], "price": [0.1, 0.2, 0.3], "group_id": [1, 2, 3]})
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": [1, 2, 3, 4], "c": [1, 2, 3, 4]})

# `spark` is a SparkSession, e.g. spark = SparkSession.builder.getOrCreate()
with fa.engine_context(spark):
    agg = aggregate_prices(df, prices)
    fa.show(agg)

You can pass None or "duckdb" to fa.engine_context to verify locally without Spark.
agg in this code is just a PySpark DataFrame.
AnyDataFrame is just a readable type annotation; you can use Any instead.
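
For example, to verify locally:

with fa.engine_context("duckdb"):
    fa.show(aggregate_prices(df, prices))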

@goodwanghan goodwanghan added this to the 0.8.0 milestone Dec 30, 2022
@goodwanghan
Collaborator

goodwanghan commented Dec 30, 2022

Also, if you want to iterate in a notebook, you can run

fa.set_global_engine(spark)

in a cell; after that you don't need to specify the engine again, and you won't need the with statement.
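
So a notebook session could look like this (a sketch; spark is again a SparkSession):

# cell 1: set the engine once
fa.set_global_engine(spark)

# later cells: no engine argument or with statement needed
agg = aggregate_prices(df, prices)
fa.show(agg)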
