
Baseline Data Quality Report


A common task when we receive a raw dataset is to establish a quality baseline for it. Some of the things we may want to know are listed below (a minimal sketch of these checks follows the list):

  • How big is the dataset? How many rows and how many columns does it have?
  • What data type does the tool infer for each column, with no supplementary information?
  • Which columns have missing values, and how many values are missing in each?
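As a minimal sketch, these baseline facts can be gathered with pandas; the file path below is hypothetical and would be whatever source file the recipe reads.

```python
import pandas as pd

# Hypothetical source file; in the recipe this would come from the source bucket.
df = pd.read_csv("data/it_service_management.csv")

# How big is the dataset?
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# What data type does pandas infer for each column, with no supplementary information?
print(df.dtypes)

# Which columns have missing values, and how many values are missing in each?
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])
```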

This is the kind of task that can be automated. In this recipe, the Prefect workflow tool is used, but it should be straightforward to map it to another tool, say Kedro, Argo, or whatever tool you use in your environment. Note that the recipe has explicit connection steps for reading the source files and writing the knowledge base, because the source and destination can be different buckets. The example uses the IT service management dataset. Please review the kmds results to see what the basic pipeline produces.
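The sketch below shows one way such a pipeline could be organized with Prefect. The task names, file paths, and profiling logic are assumptions for illustration only; the actual recipe has explicit connection steps for the source and destination buckets, which are replaced here by local paths.

```python
import json

import pandas as pd
from prefect import flow, task


@task
def read_source(path: str) -> pd.DataFrame:
    # In the recipe this step connects to the source bucket; a local path is used here.
    return pd.read_csv(path)


@task
def profile_baseline(df: pd.DataFrame) -> dict:
    # Collect the baseline facts: size, inferred dtypes, and missing-value counts.
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }


@task
def write_report(report: dict, path: str) -> None:
    # In the recipe this step connects to the destination bucket (the knowledge base).
    with open(path, "w") as f:
        json.dump(report, f, indent=2)


@flow
def baseline_quality_report(source: str, destination: str) -> None:
    df = read_source(source)
    report = profile_baseline(df)
    write_report(report, destination)


if __name__ == "__main__":
    baseline_quality_report("data/it_service_management.csv", "baseline_report.json")
```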

In the course of analysis, analysts will develop data representations suitable for their application needs. As and when this happens, tools like Pandera or Great Expectations can be used to add the data quality checks relevant to the particular use case. See this notebook for an example.
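As an illustration, a Pandera schema for such use-case-specific checks might look like the sketch below; the column names and constraints are hypothetical and would come from the analyst's application needs.

```python
import pandas as pd
import pandera as pa

# Hypothetical schema for an IT service management ticket table.
schema = pa.DataFrameSchema(
    {
        "ticket_id": pa.Column(str, unique=True),
        "priority": pa.Column(str, pa.Check.isin(["low", "medium", "high"])),
        "resolution_hours": pa.Column(float, pa.Check.ge(0), nullable=True),
    }
)

df = pd.read_csv("data/it_service_management.csv")
validated = schema.validate(df)  # raises a SchemaError if any check fails
```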