Airflow & Python Example Pipeline Guide

This repository contains an example pipeline demonstrating how to create a data pipeline using Airflow, Python, Docker, and the MoJ's own Python modules. The pipeline loads data from an S3 bucket, casts columns to the correct types, adds metadata columns, writes curated tables to S3, moves completed files to a raw-hist folder, and applies slowly changing dimension type 2 (SCD2) transformations.


Fork Repository

  1. Navigate to the airflow-de-intro-project repository on GitHub.com.
  2. Click Fork in the top-right corner.
  3. Select the owner (moj-analytical-services) for the forked repository.
  4. Rename the forked repository as airflow-de-intro-project-{username}.
  5. Optionally, provide a description for your fork.
  6. Click Create fork.

Set Up and Install Docker

Docker is used to containerize the pipeline. Follow these steps to set up and install Docker:

  1. Install Docker Desktop: If you're using a MacBook, download and install Docker Desktop from the Docker website.

Test the Docker Image

Follow these steps to build and test your Docker image locally:

  1. Clone Repository: Clone the Airflow repository to your local machine.

  2. Navigate to Directory: Open a terminal session and navigate to the directory containing the Dockerfile using the cd command.

  3. Build Docker Image: Build the Docker image by running:

    docker build . -t IMAGE:TAG
    

    Replace IMAGE with a name for the image (e.g., my-docker-image) and TAG with the version number (e.g., v0.1).

  4. Run Docker Container: Run a Docker container created from the Docker image by running:

    docker run IMAGE:TAG
    

    This will execute the command specified in the Dockerfile's CMD line.

  5. Pass Environment Variables: If your command requires access to resources on the Analytical Platform, such as data stored in Amazon S3, pass the necessary environment variables to the Docker container. Example:

    docker run \
        --env AWS_REGION=$AWS_REGION \
        --env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION \
        --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
        --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
        --env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
        --env AWS_SECURITY_TOKEN=$AWS_SECURITY_TOKEN \
        IMAGE:TAG
    
  6. Debugging: For debugging and troubleshooting, start an interactive bash session in a new container created from the image by running:

    docker run -it IMAGE:TAG bash
    

Pipeline Tasks

Load Data from S3

  • Load dataset from an S3 bucket and return it as a Pandas DataFrame.
  • Three parquet files, representing extractions from a source database, can be found in data/example-data/ in this repository, with the corresponding mojap metadata in data/metadata/. These have been adapted from the sample-csv-files dataset "people-100000.csv".
  • Use the read() method of an arrow_pd_parser reader object; see the sketch after this list.
  • Arrow PD Parser.
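
A minimal sketch of this step, assuming the arrow_pd_parser package is installed and that reader.read can read directly from an S3 path (otherwise download the file locally first); the bucket and key shown are placeholders:

    import pandas as pd
    from arrow_pd_parser import reader

    def load_data_from_s3(s3_path: str) -> pd.DataFrame:
        """Read a parquet extraction from S3 into a Pandas DataFrame."""
        # reader.read infers the file format (parquet here) from the file extension
        return reader.read(s3_path)

    # Hypothetical example path:
    # df = load_data_from_s3("s3://my-land-bucket/example-data/people-part1.parquet")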

Cast Columns to Correct Types

  • Compare the data type of each column in the DataFrame with the expected type in the provided Mojap Metadata, and cast any columns that do not match; see the sketch after this list.
  • Mojap Metadata.
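
A sketch of the casting step, assuming Metadata.from_json from mojap_metadata and the cast_pandas_table_to_schema helper in arrow_pd_parser's caster module; if that helper is not available, the same comparison can be done column by column with a manual type map:

    import pandas as pd
    from mojap_metadata import Metadata
    from arrow_pd_parser import caster

    def cast_columns_to_correct_types(df: pd.DataFrame, metadata_path: str) -> pd.DataFrame:
        """Cast each column of df to the type declared in the Mojap metadata."""
        metadata = Metadata.from_json(metadata_path)  # e.g. "data/metadata/people.json" (placeholder)
        # Casts every column to match the metadata schema in one call
        return caster.cast_pandas_table_to_schema(df, metadata)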

Add Mojap Columns to DataFrame

  • Add a set of columns, derived from environment variables, to the DataFrame and its metadata for traceability; see the sketch after this list.
  • Columns to add:
    • "mojap_start_datetime"
    • "mojap_image_tag"
    • "mojap_raw_filename"
    • "mojap_task_timestamp"
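
A sketch of adding the traceability columns; the environment variable names (IMAGE_TAG, AIRFLOW_TASK_TIMESTAMP) are assumptions and should be replaced with whatever your Airflow image and DAG actually set:

    import os
    from datetime import datetime, timezone
    import pandas as pd

    def add_mojap_columns(df: pd.DataFrame, raw_filename: str) -> pd.DataFrame:
        """Add Mojap traceability columns derived from environment variables."""
        run_ts = datetime.now(timezone.utc).isoformat()
        df["mojap_start_datetime"] = run_ts
        df["mojap_image_tag"] = os.environ.get("IMAGE_TAG", "unknown")                 # assumed env var name
        df["mojap_raw_filename"] = raw_filename
        df["mojap_task_timestamp"] = os.environ.get("AIRFLOW_TASK_TIMESTAMP", run_ts)  # assumed env var name
        return df

Remember to add the same columns to the Mojap metadata so the curated table schema stays in sync.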

Write Curated Table to S3

  • Write the transformed data to an appropriate S3 bucket in .parquet format.
  • Register the table with the AWS Glue Data Catalog.
  • Use AWS Wrangler (awswrangler) for this task; see the sketch after this list.
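
A sketch using awswrangler's s3.to_parquet, which can write the parquet dataset and register it in the Glue Data Catalog in one call; the bucket, database and table names are placeholders:

    import awswrangler as wr
    import pandas as pd

    def write_curated_table_to_s3(df: pd.DataFrame, s3_path: str, database: str, table: str) -> None:
        """Write df to S3 as parquet and register the table in the Glue Data Catalog."""
        wr.s3.to_parquet(
            df=df,
            path=s3_path,        # e.g. "s3://my-curated-bucket/people/" (placeholder)
            dataset=True,        # write as a dataset so it can be catalogued
            database=database,   # Glue database name
            table=table,         # Glue table name
            mode="overwrite",
        )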

Move Completed Files to Raw Hist

  • Move processed files from the Land folder to the Raw Hist folder to maintain a history of data sources over time; see the sketch below.
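
A sketch of the move using boto3 (copy then delete, since S3 has no rename); the bucket and key names are placeholders:

    import boto3

    def move_completed_file_to_raw_hist(bucket: str, land_key: str, raw_hist_key: str) -> None:
        """Copy a processed file from the land prefix to the raw-hist prefix, then remove the original."""
        s3 = boto3.client("s3")
        s3.copy_object(
            Bucket=bucket,
            Key=raw_hist_key,
            CopySource={"Bucket": bucket, "Key": land_key},
        )
        s3.delete_object(Bucket=bucket, Key=land_key)

    # Hypothetical example:
    # move_completed_file_to_raw_hist("my-bucket", "land/people-part1.parquet",
    #                                 "raw-hist/2024-01-01/people-part1.parquet")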

Apply SCD2

  • Apply slowly changing dimension type 2 (SCD2) transformations based on the mojap_start_datetime column to handle updates to data entries.
  • Further instructions to be provided; a rough sketch of the general idea follows this list.
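
Until those instructions arrive, the sketch below only illustrates the SCD2 idea under some assumptions: each record has a business key column (here "id"), there is one update row per key, and a mojap_end_datetime column that is null for the currently-active version of a record:

    import pandas as pd

    def apply_scd2(existing: pd.DataFrame, updates: pd.DataFrame, key: str = "id") -> pd.DataFrame:
        """Close superseded rows and append the new versions (slowly changing dimension type 2)."""
        updated_keys = updates[key].unique()
        # Rows that are currently open (no end datetime) but have a newer version in `updates`
        superseded = existing[key].isin(updated_keys) & existing["mojap_end_datetime"].isna()
        # Close each superseded row at the start datetime of its replacement
        end_map = updates.set_index(key)["mojap_start_datetime"]
        existing.loc[superseded, "mojap_end_datetime"] = existing.loc[superseded, key].map(end_map)
        # New versions stay open-ended until a later update arrives
        updates = updates.assign(mojap_end_datetime=pd.NaT)
        return pd.concat([existing, updates], ignore_index=True)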

Follow these steps to complete the tasks outlined in the pipeline.

Example Python Functions

Use the provided Python functions as a guide to implement each stage of the pipeline in your environment.


The diagram below shows the order of the pipeline stages:

graph LR
    B[Fork Repository] --> C[Set Up and Install Docker]
    C --> D[Load Data from S3]
    D --> E[Cast Columns to Correct Types]
    E --> F[Add Mojap Columns to DataFrame]
    F --> G[Write Curated Table to S3]
    G --> H[Move Completed Files to Raw Hist]
    H --> I[Apply SCD2]
