flowrunner is a lightweight package for organizing and representing Data Engineering/Science workflows. It is designed to integrate with any existing framework, such as pandas or PySpark.
- Lazy evaluation of the DAG: flowrunner never executes your DAG until you explicitly run it
- Easy syntax to build new Flows
- Easy data sharing between methods in a Flow using attributes
- Data store to keep the output of a function (in case it returns a value) for later use
- Param store to easily pass reusable parameters to a Flow
- Visualizing your flow as a DAG
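
The features above can be illustrated with a small standalone sketch. It is not flowrunner's API: `SketchFlow`, the `step` decorator, its `then` argument, and `data_store` are hypothetical names chosen for illustration, and the ordering relies on `graphlib` from the Python standard library (3.9+). It shows the general idea of declaring steps up front, running nothing until `run()` is called, sharing data through attributes, and keeping return values for later.

```python
from graphlib import TopologicalSorter


def step(then=None):
    """Hypothetical decorator: tag a method with the steps that follow it."""
    def mark(func):
        func.next_steps = list(then or [])
        return func
    return mark


class SketchFlow:
    """Illustrative base class, not flowrunner's own implementation."""

    def __init__(self):
        self.data_store = {}  # return values, keyed by step name

    def graph(self):
        # Map each step name to the set of steps that must run before it.
        deps = {}
        for name in dir(type(self)):
            func = getattr(type(self), name)
            if callable(func) and hasattr(func, "next_steps"):
                deps.setdefault(name, set())
                for nxt in func.next_steps:
                    deps.setdefault(nxt, set()).add(name)
        return deps

    def run(self):
        # Steps execute only when run() is called explicitly.
        order = list(TopologicalSorter(self.graph()).static_order())
        for name in order:
            result = getattr(self, name)()
            if result is not None:
                self.data_store[name] = result
        return order


class ExampleFlow(SketchFlow):
    @step(then=["double"])
    def load(self):
        self.values = [1, 2, 3]  # shared with later steps as an attribute

    @step(then=["total"])
    def double(self):
        self.values = [v * 2 for v in self.values]

    @step()
    def total(self):
        return sum(self.values)  # a returned value lands in the data store


flow = ExampleFlow()  # building the flow runs no steps
order = flow.run()    # steps execute now, in dependency order
print(order, flow.data_store)
```

In this sketch a cyclic declaration would surface as a `graphlib.CycleError` raised inside `run()`; how flowrunner itself validates flows may differ.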
- Improved DAG visualization with descriptions, with an option to turn descriptions off
- Improved style of DAG visualization
- Improved documentation for readme
- Improved example usage for pandas
- Improved checks for cyclic flows
- Support for PySpark
pip install flowrunner[pyspark]
- Improved validation for stranded middle origin nodes
- Changed theme to sphinx_the_docs
- Added API reference documentation
- Improved documentation examples with Databricks and PySpark
- Added cookie cutter template
- Improved logging
- Fixed broken notebook example links