The objectives of this project are to gain hands-on coding experience with:
- Spark
- Spark SQL
- Spark Streaming
- Kafka
- Scala and functional programming
The data set is the same one you analyzed in Course 1: the STM GTFS data.
We receive the STM data every day and need to run an ETL pipeline that enriches it in real time for reporting and analysis. The data is split in two parts:
- A set of tables used to build dimensions (batch style)
- Stop times that need to be enriched for analysis and reporting (streaming); see the sketch below
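
Below is a minimal Scala sketch of the stream-static pattern this split implies: a GTFS dimension table (here `trips.txt`) is loaded as a static DataFrame, while stop-time records arrive as CSV lines on a Kafka topic and are joined against it with Spark Structured Streaming. The file path, Kafka topic name, broker address, and column layout are illustrative assumptions, not part of the project specification.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EnrichStopTimes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stm-gtfs-enrichment")
      .getOrCreate()
    import spark.implicits._

    // Batch side: load one dimension table (trips.txt) as a static DataFrame.
    // Path and header option are assumptions for illustration.
    val trips = spark.read
      .option("header", "true")
      .csv("data/trips.txt")
      .select("trip_id", "route_id", "service_id")

    // Streaming side: stop_times records arrive as CSV lines on a Kafka topic.
    // Topic name and broker address are hypothetical.
    val stopTimesRaw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "stop_times")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // Parse the CSV payload into named columns
    // (trip_id, arrival_time, departure_time, stop_id, stop_sequence).
    val fields = split($"line", ",")
    val stopTimes = stopTimesRaw.select(
      fields.getItem(0).as("trip_id"),
      fields.getItem(1).as("arrival_time"),
      fields.getItem(2).as("departure_time"),
      fields.getItem(3).as("stop_id"),
      fields.getItem(4).as("stop_sequence")
    )

    // Enrich the stream with the batch dimension via a stream-static left join.
    val enriched = stopTimes.join(trips, Seq("trip_id"), "left")

    // Write enriched records to the console for inspection; a real pipeline
    // would write to a sink such as Kafka, Parquet, or a database.
    val query = enriched.writeStream
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

In this layout the dimension tables are refreshed by a separate batch job each day, and the streaming query only performs the cheap stream-static join, which keeps the enrichment path low-latency.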