Great Expectations tutorial

Step 1: Setup

Install Great Expectations and initialize a Data Context.

    $ pip install great_expectations
    $ great_expectations --version
    $ great_expectations init

About the great_expectations directory structure

After you run the init command, the great_expectations directory contains all the important components of a local Great Expectations deployment. The files and directories it creates are:

  • great_expectations.yml - contains the main configuration of your deployment.
  • expectations/ - will contain all the Expectations as JSON files (the location is configurable).
  • plugins/ - contains the code for custom plugins you develop as part of your deployment.
  • uncommitted/ - contains files that shouldn’t be in version control. It has a .gitignore configured to exclude all its contents from version control.

The main contents of uncommitted/ are:

  • uncommitted/config_variables.yml - contains sensitive information, such as database credentials and other secrets.
  • uncommitted/data_docs - contains Data Docs generated from Expectations, Validation Results, and other metadata.
  • uncommitted/validations - contains Validation Results generated by Great Expectations.
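
As a quick sanity check after init, the top-level layout described above can be verified with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `missing_entries` is hypothetical, and the list of entries simply mirrors the ones named in this section.

```python
from pathlib import Path

# Top-level entries named in this section; adjust if your deployment
# relocates any of them (the expectations location is configurable).
EXPECTED_ENTRIES = ("great_expectations.yml", "expectations", "plugins", "uncommitted")


def missing_entries(project_dir):
    """Return the expected entries that do not exist under project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains
    # great_expectations/.
    missing = missing_entries("great_expectations")
    print("missing:", missing if missing else "nothing")
```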

Terminology

  • Data Context: The folder structure that contains the entirety of your Great Expectations project. It is also the entry point for accessing all the primary methods for creating elements of your project, configuring those elements, and working with the metadata for your project.

  • CLI: The Command Line Interface for Great Expectations. The CLI provides helpful utilities for deploying and configuring Data Contexts, as well as a few other convenience methods.

Step 2: Connect to Data

Create and configure a Datasource.

    $ great_expectations datasource new

When prompted, choose the filesystem option and pandas. The root path is data.

Terminology

  • Datasource: An object that brings together a way of interacting with data (an Execution Engine) and a way of accessing that data (a Data Connector). Datasources are used to obtain Batches for Validators, Expectation Suites, and Profilers.

  • Jupyter Notebooks: These notebooks are launched by some processes in the CLI. They provide useful boilerplate code for everything from configuring a new Datasource to building an Expectation Suite to running a Checkpoint.

Step 3: Create Expectations

Use the automatic Profiler to build an Expectation Suite.

    $ great_expectations suite new
    $ great_expectations suite edit bike_theft_berlin.demo

Workflow

  • Let Great Expectations create a simple first draft of the suite by running great_expectations suite new.
  • View the suite in Data Docs.
  • Edit the suite in a Jupyter notebook by running great_expectations suite edit.
  • Repeat the previous two steps (view, then edit) until you are happy with your suite.
  • Commit this suite to your source control repository.
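
Before committing, it can help to glance at what the suite file contains. The sketch below is an illustration, not part of the Great Expectations API: it assumes the suite is stored as JSON with a top-level `expectations` list whose items carry an `expectation_type` key, and the helper name `summarize_suite` is hypothetical. The path to the suite file is passed in, since the location of expectations/ is configurable.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_suite(suite_path):
    """Count expectations per type in a suite JSON file.

    Assumes a top-level "expectations" list whose items have an
    "expectation_type" key; a file without that list yields an
    empty Counter.
    """
    suite = json.loads(Path(suite_path).read_text(encoding="utf-8"))
    return Counter(
        item.get("expectation_type", "<unknown>")
        for item in suite.get("expectations", [])
    )
```

Calling `summarize_suite(path).most_common()` lists the most frequent expectation types first, which gives a rough picture of what the Profiler generated.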

Terminology

  • Expectation Suite: A collection of Expectations.

  • Expectation: A verifiable assertion about data. Great Expectations is a framework for defining Expectations and running them against your data. The official Great Expectations tutorial, for example, asserts that NYC taxi rides should have a minimum of one passenger. When that Expectation is run against a second set of data, Great Expectations reports that some records in the new data indicate a ride with zero passengers, which fails to meet the Expectation.

  • Profiler: A tool that automatically generates Expectations from a Batch of data.
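
To make the idea of a verifiable assertion concrete, here is a plain-Python sketch of the passenger-count example. It does not use Great Expectations; the function `expect_min_value` and the sample rows are hypothetical and only illustrate how an assertion turns into a pass/fail result together with the offending values.

```python
def expect_min_value(rows, column, min_value):
    """Check that every row's value in `column` is at least `min_value`.

    Returns a small result dict: whether the check passed and which
    values violated it.
    """
    unexpected = [row[column] for row in rows if row[column] < min_value]
    return {"success": not unexpected, "unexpected_values": unexpected}


# Hypothetical sample data: the second batch contains a zero-passenger ride.
first_batch = [{"passenger_count": 1}, {"passenger_count": 3}]
second_batch = [{"passenger_count": 2}, {"passenger_count": 0}]

print(expect_min_value(first_batch, "passenger_count", 1))
# {'success': True, 'unexpected_values': []}
print(expect_min_value(second_batch, "passenger_count", 1))
# {'success': False, 'unexpected_values': [0]}
```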

Step 4: Validate Data

Create a Checkpoint which can be used to validate new data. The Validation Results can be viewed in Data Docs.

    $ great_expectations checkpoint new bike_theft_checkpoint
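
A Checkpoint can also be run from Python, for example inside a pipeline script. The sketch below assumes a v3-API release of Great Expectations contemporary with this CLI, in which `DataContext` and its `run_checkpoint` method are available; later releases changed these entry points, so treat the calls as an assumption to check against your installed version. The wrapper name `run_bike_theft_checkpoint` is hypothetical.

```python
def run_bike_theft_checkpoint(checkpoint_name="bike_theft_checkpoint"):
    """Run a named Checkpoint and return True if validation succeeded."""
    # Imported lazily so this module can be loaded even where
    # great_expectations is not installed.
    from great_expectations.data_context import DataContext

    # With no arguments, DataContext looks for a great_expectations/
    # directory starting from the current working directory.
    context = DataContext()
    result = context.run_checkpoint(checkpoint_name=checkpoint_name)
    return bool(result.success)
```

Returning a plain boolean keeps the wrapper easy to use as a gate, for instance to stop a pipeline step when validation fails.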

Terminology

  • Checkpoint: An object that uses a Validator to run an Expectation Suite against a batch of data. Running a Checkpoint produces Validation Results for the data it was run on.

  • Validation Results: A report generated from an Expectation Suite being run against a batch of data. The Validation Result itself is in JSON and is rendered as Data Docs.

  • Data Docs: Human-readable documentation that describes Expectations for data and its Validation Results. Data Docs can be generated both from Expectation Suites (describing our Expectations for the data) and from Validation Results (describing whether the data meets those Expectations).
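
Because a Validation Result is JSON, it can also be inspected without Data Docs. The sketch below is illustrative only: it assumes a result document with a `results` list whose items have their own `success` flag and an `expectation_config` containing an `expectation_type`. Those key names and the helper `failed_expectations` are assumptions; compare them with a file under uncommitted/validations before relying on this.

```python
import json


def failed_expectations(validation_result):
    """Return the expectation types that did not succeed."""
    return [
        item.get("expectation_config", {}).get("expectation_type", "<unknown>")
        for item in validation_result.get("results", [])
        if not item.get("success", False)
    ]


# Hypothetical, hand-written result in the assumed shape.
sample = json.loads("""
{
  "success": false,
  "results": [
    {"success": true,
     "expectation_config": {"expectation_type": "expect_column_to_exist"}},
    {"success": false,
     "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null"}}
  ]
}
""")

print(failed_expectations(sample))
# ['expect_column_values_to_not_be_null']
```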

Command line help

    $ great_expectations --help

    Usage: great_expectations [OPTIONS] COMMAND [ARGS]...

      Welcome to the great_expectations CLI!

      Most commands follow this format: great_expectations <NOUN> <VERB>

      The nouns are: checkpoint, datasource, docs, init, project, store, suite,
      validation-operator. Most nouns accept the following verbs: new, list, edit

    Options:
      --version                Show the version and exit.
      --v3-api / --v2-api      Default to v3 (Batch Request) API. Use --v2-api for
                               v2 (Batch Kwargs) API
      -v, --verbose            Set great_expectations to use verbose output.
      -c, --config TEXT        Path to great_expectations configuration file
                               location (great_expectations.yml). Inferred if not
                               provided.
      -y, --assume-yes, --yes  Assume "yes" for all prompts.
      --help                   Show this message and exit.

    Commands:
      checkpoint  Checkpoint operations
      datasource  Datasource operations
      docs        Data Docs operations
      init        Initialize a new Great Expectations project.
      project     Project operations
      store       Store operations
      suite       Expectation Suite operations