Currently, `transform` and `out_transform` are the only two utility functions in Fugue, and they are used extensively. However, if users want to do other data operations such as load, join, and union, they have to use `FugueWorkflow` (or not use Fugue at all). So we should expand the collection of utility functions to make more operations scale agnostic and framework agnostic.
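For context, this is roughly how the existing `transform` utility is used today. A minimal sketch based on Fugue's documented API; the function body and schema string here are illustrative:

```python
import pandas as pd
from fugue import transform

def add_one(df: pd.DataFrame) -> pd.DataFrame:
    # plain pandas logic; Fugue runs it on whichever engine is chosen
    return df.assign(b=df["a"] + 1)

pdf = pd.DataFrame({"a": [1, 2, 3]})

# runs locally on pandas and returns a pandas DataFrame
res = transform(pdf, add_one, schema="*,b:long")

# the same call becomes distributed by passing an engine, e.g. engine="spark"
```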
Here are the design goals of these functions:
- Each function can be used independently and can operate directly on different dataframes with consistent behavior. For example, `inner_join` can take Spark DataFrames as input and output a Spark DataFrame.
- Each function can choose its own `ExecutionEngine`; by default, it should use the engine in the current context (the concept of "current context" is to be implemented; see the sketch after the example below).
- These functions should not prevent using framework-specific methods between them.
- Using only the utility functions to represent a data workflow should make it framework agnostic.
For example:
```python
import fugue.utils as fu

def my_logic(input1, input2):
    df1 = fu.load(input1)
    df2 = fu.load(input2)
    df3 = fu.inner_join(df1, df2)
    return fu.transform(df3, my_func)

# unit test
res = my_logic(pandas_df1, pandas_df2)
assert_pd_df_eq(res, ...)

# using different engines
with make_spark_engine():
    spark_res_df = my_logic(spark_df1, "s3://..parquet")
```
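To make the engine-selection and interoperability goals concrete, here is a hypothetical sketch. The `fugue.utils` module, the `engine` keyword, and the function signatures are proposed names from this issue, not an existing API:

```python
import fugue.utils as fu  # proposed module, not yet implemented

# design goal 2: hypothetical per-call engine selection
df1 = fu.load("s3://..parquet", engine="spark")
df2 = fu.load("s3://..parquet", engine="spark")

# design goal 3: framework-specific methods can be mixed in between,
# because each utility returns the native dataframe of the engine in use
df1 = df1.filter(df1["a"] > 0)  # plain PySpark on the returned DataFrame

# stays a Spark DataFrame, with behavior consistent across engines
df3 = fu.inner_join(df1, df2)
```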