The objectives of this project are to gain hands-on coding experience with:
- Spark
- Spark SQL
- Spark Streaming
- Kafka
- Scala and functional programming
The data set is the same one you analyzed in Course 1: the STM GTFS data.
We receive the STM data every day and need to run an ETL pipeline that enriches it in real time for reporting and analysis. The data is split in two parts:
- A set of tables used to build dimensions (batch style)
- Stop times that need to be enriched for analysis and reporting (streaming); see the sketch below
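
Below is a minimal Scala sketch of the stream-static pattern this split implies: a GTFS dimension table (here `trips.txt`) is loaded as a static DataFrame, while stop-time records arrive as CSV lines on a Kafka topic and are joined against it with Spark Structured Streaming. The file path, Kafka topic name, broker address, and column layout are illustrative assumptions, not part of the project specification.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EnrichStopTimes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stm-gtfs-enrichment")
      .getOrCreate()
    import spark.implicits._

    // Batch side: load one dimension table (trips.txt) as a static DataFrame.
    // Path and header option are assumptions for illustration.
    val trips = spark.read
      .option("header", "true")
      .csv("data/trips.txt")
      .select("trip_id", "route_id", "service_id")

    // Streaming side: stop_times records arrive as CSV lines on a Kafka topic.
    // Topic name and broker address are hypothetical.
    val stopTimesRaw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "stop_times")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // Parse the CSV payload into named columns
    // (trip_id, arrival_time, departure_time, stop_id, stop_sequence).
    val fields = split($"line", ",")
    val stopTimes = stopTimesRaw.select(
      fields.getItem(0).as("trip_id"),
      fields.getItem(1).as("arrival_time"),
      fields.getItem(2).as("departure_time"),
      fields.getItem(3).as("stop_id"),
      fields.getItem(4).as("stop_sequence")
    )

    // Enrich the stream with the batch dimension via a stream-static left join.
    val enriched = stopTimes.join(trips, Seq("trip_id"), "left")

    // Write enriched records to the console for inspection; a real pipeline
    // would write to a sink such as Kafka, Parquet, or a database.
    val query = enriched.writeStream
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

In this layout the dimension tables are refreshed by a separate batch job each day, and the streaming query only performs the cheap stream-static join, which keeps the enrichment path low-latency.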