```
.
├── project
│   └── build.properties
├── spark-cluster
│   ├── apps
│   │   └── spark-big-data.jar
│   ├── build-images.sh
│   ├── data
│   │   └── movies.json
│   ├── docker
│   │   ├── base
│   │   │   └── Dockerfile
│   │   ├── spark-master
│   │   │   ├── Dockerfile
│   │   │   └── start-master.sh
│   │   ├── spark-submit
│   │   │   ├── Dockerfile
│   │   │   └── spark-submit.sh
│   │   └── spark-worker
│   │       ├── Dockerfile
│   │       └── start-worker.sh
│   ├── docker-compose.yml
│   └── env
│       └── spark-worker.sh
├── sql
│   └── db.sql
├── src
│   ├── META-INF
│   │   └── MANIFEST.MF
│   └── main
│       ├── resources
│       │   ├── data
│       │   │   └── manyDataFilesHere
│       │   └── warehouse
│       │       └── rtjvm.db
│       │           └── moreDataFilesHere
│       └── scala
│           ├── sparkbigdataproject
│           │   ├── TaxiApplication.scala
│           │   └── TaxiEconomicImpact.scala
│           ├── sparkdataframes
│           │   ├── Aggregations.scala
│           │   ├── AggregationsExercises.scala
│           │   ├── ColsAndExprsExercises.scala
│           │   ├── ColumnsAndExpressions.scala
│           │   ├── DFExercises.scala
│           │   ├── DataFramesBasics.scala
│           │   ├── DataSources.scala
│           │   ├── DataSourcesExercises.scala
│           │   ├── Joins.scala
│           │   └── JoinsExercises.scala
│           ├── sparkjardeployment
│           │   └── TestDeployApp.scala
│           ├── sparklowlevel
│           │   ├── RDDExercises.scala
│           │   ├── RDDs.scala
│           │   └── WorkingWithRDDs.scala
│           ├── sparksql
│           │   └── SparkSql.scala
│           └── sparktypesanddatasets
│               ├── CommonTypes.scala
│               ├── CommonTypesExercises.scala
│               ├── ComplexTypes.scala
│               ├── ComplexTypesExercises.scala
│               ├── Datasets.scala
│               ├── DatasetsExercises.scala
│               └── ManagingNulls.scala
├── .gitignore
├── build.sbt
├── docker-clean.sh
├── docker-compose.yml
├── psql.sh
└── README.md
```
In the final chapter of this project, I processed 35 GB of NYC Taxi Rides data (2009 to 2016) in compressed Parquet format (~400 GB when deflated into CSV, according to RockTheJVM's Spark Essentials course). I downloaded the dataset from Academic Torrents and uploaded the files into an S3 bucket named `sparkproject-data`. I then created an EMR cluster with the default roles (including the ones for the associated EC2 instances), which allow it to read data from the S3 bucket and write the output and logs into the `sparkproject-data` and `sparkproject-logs` buckets, respectively. The project Jar file was uploaded to S3, copied onto the master node, and executed locally on the cluster after SSHing into the master node.
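The sketch below illustrates, roughly, the S3 read/write flow that the application performs on the cluster; the S3 prefixes and the column name are assumptions made for illustration, not necessarily the exact ones used in `TaxiApplication.scala`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TaxiS3Sketch {
  def main(args: Array[String]): Unit = {
    // On EMR the master URL and most of the configuration come from the
    // cluster itself, so only the application name is set here.
    val spark = SparkSession.builder()
      .appName("Taxi Big Data Application")
      .getOrCreate()

    // Read the compressed Parquet rides data straight from S3
    // (the path is an assumed example)
    val taxiDF = spark.read.parquet("s3://sparkproject-data/NYC_taxi_2009-2016.parquet")

    // Example aggregation: total rides per pickup zone (column name assumed)
    val ridesPerZone = taxiDF
      .groupBy(col("pickup_taxizone_id"))
      .agg(count("*").as("totalRides"))
      .orderBy(col("totalRides").desc)

    // Write the result back into the data bucket
    ridesPerZone.write
      .mode("overwrite")
      .parquet("s3://sparkproject-data/output/rides_per_zone")
  }
}
```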
To reduce costs, it should be possible to set up the EMR cluster inside a dedicated VPC with a Gateway Endpoint, so that traffic to S3 travels over AWS's own network rather than the public internet. However, the data transfer here is small enough that the saving would not be significant: most of the cost comes from EC2 and EMR compute, which accounted for roughly 92% of the $1.05 total for this project.
Two possible architectures for this project are shown below. Since this is a single batch job that can be resumed or restarted at any time, high availability is not a strict requirement for the architecture.
Apart from a few changes in the `docker-compose.yml` file for the local Postgres database, the Spark cluster setup is the same as on the Spark Essentials GitHub page. Also make sure that the Spark version (v3.2.1) used to package the Jar file matches the Spark version (v3.2.1) that the EMR cluster is bootstrapped with; otherwise `NoSuchMethodError`s may pop up, either locally or on the EMR cluster.
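For reference, a minimal `build.sbt` sketch that pins the Spark version is shown below; the actual file in this repository likely contains more settings, and the Scala version shown is an assumption (Spark 3.2.1 is published for both Scala 2.12 and 2.13).

```scala
// build.sbt (sketch)
name := "spark-big-data"            // assumed project name
scalaVersion := "2.12.15"           // assumption; must match the Spark build

val sparkVersion = "3.2.1"          // keep in sync with the EMR release

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)
```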