The purpose of this project is to set up a minimal Hadoop (3.x.x) cluster to test the submission of Spark jobs via YARN.
Two containers, named `master` and `worker`.
Master has the following responsibilities:
- HDFS
  - Name Node
  - Data Node
- YARN
  - Resource Manager
  - Node Manager
  - Timeline History Server
- Spark
  - History Server
- Map Reduce
  - Map Reduce History Server
Worker has the following responsibilities:
- HDFS
  - Data Node
- YARN
  - Node Manager
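Once the cluster is up (see `docker-compose` below), this daemon layout can be verified directly. A minimal sketch, assuming the containers are named `master` and `worker` and that a JDK (and hence `jps`) is available inside them:

```sh
# Expect NameNode, DataNode, ResourceManager, NodeManager,
# plus the Spark/MapReduce/Timeline history server processes
docker exec master jps

# Expect only DataNode and NodeManager
docker exec worker jps
```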
In this architecture, when jobs are submitted in cluster mode, the driver can be located on either `master` or `worker`.
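For instance, once an application has been submitted in cluster mode (see the `spark-submit` example at the end of this README), the aggregated YARN logs reveal which node hosted the ApplicationMaster, and therefore the driver. The application ID below is a placeholder:

```sh
# Find the ID of the submitted application
yarn application -list -appStates ALL

# Each section of the aggregated logs is headed by "Container: ... on <host>";
# the host of the AM container is where the driver ran (placeholder ID shown)
yarn logs -applicationId application_1700000000000_0001 | grep "Container:"
```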
Run `docker-compose up -d` to start up the cluster locally.
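To confirm the containers came up, assuming the compose services are named `master` and `worker`:

```sh
# Both services should be listed with state "Up"
docker-compose ps

# Follow startup logs if a service fails to come up
docker-compose logs -f master
```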
Assuming the docker-compose services are running locally, the following URLs should be accessible:
- Namenode UI at localhost:9870
- Resource Manager UI at localhost:8088
- Spark History Server at localhost:18080
- YARN Timeline History Server at localhost:19888
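These UIs can also be smoke-tested from a terminal through their standard REST/JMX endpoints, assuming the port mappings above:

```sh
# NameNode status over JMX
curl -s "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"

# Resource Manager cluster info
curl -s "http://localhost:8088/ws/v1/cluster/info"

# Applications known to the Spark History Server
curl -s "http://localhost:18080/api/v1/applications"
```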
- Download and set up the corresponding version of Spark (pre-built for Hadoop) on your local machine.
- Set the environment variable `HADOOP_CONF_DIR` to `/path/to/local-hadoop-config`, where `local-hadoop-config` is the directory at the root of this repository. This ensures that any `hdfs` or `spark-submit` command will run with the options found in the relevant `.xml` files.
- Ensure the following entries are set in the hosts file:
```
127.0.0.1 host.docker.internal
127.0.0.1 master
127.0.0.1 worker
```
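The HDFS and YARN clients connect to the containers using these names, so they must resolve on the host machine; a quick check:

```sh
# Both names should resolve to 127.0.0.1
ping -c 1 master
ping -c 1 worker
```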
- Run `hdfs dfs -ls /` to confirm correct setup of the Hadoop client.
- Run the following command to confirm correct setup of the Spark client:
```sh
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 1 \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.eventLog.dir=hdfs:///spark-logs" \
  ${SPARK_HOME}/examples/jars/spark-examples*.jar 10
```
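Note that `spark.eventLog.dir` points at `hdfs:///spark-logs`; assuming the containers do not create this directory at startup, the submission will fail until it exists:

```sh
# One-time setup: create the event log directory read by the Spark History Server
hdfs dfs -mkdir -p /spark-logs
```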
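Because the job runs in cluster mode, the driver output ends up in the YARN container logs rather than the local terminal. After the application finishes, the result can be pulled with the application ID printed by `spark-submit` (a placeholder is shown here):

```sh
# SparkPi prints its estimate from the driver; in cluster mode that output
# lands in the AM container's stdout, retrievable via log aggregation
# (assuming log aggregation is enabled in the cluster config)
yarn logs -applicationId application_1700000000000_0001 | grep "Pi is roughly"
```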