Skip to content

Docker with Airflow + Postgres + Spark cluster + JDK (spark-submit support) + Jupyter Notebooks

Notifications You must be signed in to change notification settings

pyjaime/docker-airflow-spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

docker-airflow-spark

Docker with Airflow + Postgres + Spark cluster + JDK (spark-submit support) + Jupyter Notebooks

📦 The Containers

  • airflow-webserver: Airflow webserver and scheduler, with spark-submit support.

  • postgres: Postgres database, used by Airflow.

    • image: postgres:13.6
    • port: 5432
  • spark-master: Spark Master.

    • image: bitnami/spark:3.2.1
    • port: 8081
  • spark-worker[-N]: Spark workers (default number: 1). Modify docker-compose.yml file to add more.

    • image: bitnami/spark:3.2.1
  • jupyter-spark: Jupyter notebook with pyspark support.

    • image: jupyter/pyspark-notebook:spark-3.2.1
    • port: 8888

🛠 Setup

Clone project

$ git clone https://github.com/pyjaime/docker-airflow-spark

Build airflow Docker

$ cd docker-airflow-spark/airflow/
$ docker build --rm -t docker-airflow2:latest .

Setup the sandbox

The sandbox will contain the folders where data will be persisted from the containers, and some test files. We will create the folder easily:

$ cd docker-airflow-spark/
$ cp -R sandbox-test/. ../sandbox/

Launch containers

$ cd docker-airflow-spark/
$ docker-compose -f docker-compose.yml up -d

Check accesses

👣 Additional steps

Create a test user for Airflow

$ docker-compose run airflow-webserver airflow users create --role Admin --username admin \
  --email admin --firstname admin --lastname admin --password admin

Edit connection from Airflow to Spark

  • Go to Airflow UI > Admin > Edit connections
  • Edit spark_default entry:
    • Connection Type: Spark
    • Host: spark://spark
    • Port: 7077

Test spark-submit from Airflow

Go to the Airflow UI and run the test_spark_submit_operator DAG :)

🎉 A big thank you

THANK YOU Thiago Cordon

About

Docker with Airflow + Postgres + Spark cluster + JDK (spark-submit support) + Jupyter Notebooks

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published