Extending the Great Expectations SparkDFDataset for better integration with Databricks

sjrusso8/databricks-great-expectations-extended

Extending Great Expectations to integrate better with Databricks

Great Expectations is an amazing Python library for data quality. It comes with integrations for Apache Spark and dozens of preconfigured data expectations. Databricks is a top-tier data platform built on Spark. So you'd expect the two to integrate seamlessly, but that is not quite the case…

This repo shows how you can extend the base Great Expectations dataset class, SparkDFDataset, to achieve a better integration with Databricks.

In this repo

  • 'databrick_great_expectations' : contains the extended class of SparkDFDataset with additional methods

    • .validate_and_save() : runs the validation and saves the result as JSON to a storage location
  • 'examples' : contains two example notebooks showing how to use the extended class and how to query the JSON data.

    • 'example_data_validation' : shows how to use the extended class inline with PySpark code
    • 'example_query_results' : shows how to use Databricks SQL to create a view of the JSON result data for additional querying
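
The idea behind .validate_and_save() can be sketched as below. This is a minimal illustration, not the code from this repo: the mixin name, the `output_dir` and `run_name` parameters, and the timestamped file name are all assumptions made for the example. It relies only on the dataset exposing a `validate()` method, and it uses a stub in place of a real Spark dataset so the snippet runs without a cluster.

```python
import json
import os
from datetime import datetime, timezone


class ValidateAndSaveMixin:
    """Hypothetical mixin (not the repo's class): adds validate_and_save()
    to any class that already exposes a validate() method."""

    def validate_and_save(self, output_dir, run_name="validation"):
        # Run the validation provided by the base dataset class.
        result = self.validate()

        # Great Expectations validation results expose to_json_dict();
        # fall back to dict() so plain mappings (like the stub below) work too.
        if hasattr(result, "to_json_dict"):
            payload = result.to_json_dict()
        else:
            payload = dict(result)

        # Assumed naming scheme: one timestamped file per validation run.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        os.makedirs(output_dir, exist_ok=True)
        path = os.path.join(output_dir, f"{run_name}_{stamp}.json")

        # Write the document on a single line, which line-oriented
        # JSON readers can consume as one record.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
        return path


class StubDataset:
    """Stand-in for a real dataset so the sketch runs without Spark."""

    def validate(self):
        return {"success": True, "results": []}


class StubDatasetWithSave(ValidateAndSaveMixin, StubDataset):
    pass


# With Great Expectations installed, the same mixin could presumably be
# combined with the real class (assumption, legacy dataset API):
#   from great_expectations.dataset import SparkDFDataset
#   class ExtendedSparkDFDataset(ValidateAndSaveMixin, SparkDFDataset): ...
```

On Databricks, `output_dir` would typically point at a mounted storage location, and the resulting files could then be queried as in the 'example_query_results' notebook.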

The accompanying Medium article on the motivation behind this repo can be read here
