This repository contains the guides, code, and data required for the Spark Developer workshop conducted by Manaranjan Pradhan.
Participants can download the repository as a zip file and save it to their laptop or desktop. The VM image for running the programs will be provided separately during the training. The VM image has Hadoop and Spark installed on it, along with the data and code required during the workshop.
Once the zip file is downloaded, unzip its contents on your desktop or laptop.
Then go to the guides folder and open Spark Lab Guide Ver 1.0.pdf. This guide will take you through all the lab exercises during the workshop.
Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open source processing engine for data in the Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. This course will give you an excellent start in building your fundamentals for developing big data solutions on the Apache Spark platform. The course is well balanced between theory and hands-on labs (more than 10 lab exercises) based on real-world use cases.
Attendees will learn the following topics through lectures and hands-on exercises:
- Deep dive into the Apache Spark 1.5 architecture
- Understand Spark APIs, RDDs, DataFrames, and Spark SQL
- How to do parallel programming and develop Spark applications
- How Spark runs in standalone mode and on a cluster, including Hadoop
- Understand advanced features and Spark internals
- Develop Spark Streaming applications
- Write advanced algorithms using the Spark Machine Learning (ML) library
- Optimize and tune Spark applications
- End-to-end use case implementation
Duration: 3 days
Audience: Architects, developers, and data scientists who wish to write, build, and maintain Apache Spark jobs.
Prerequisites: All programming will be done in Python, so participants should have basic Python programming knowledge. It is advisable to refresh these skills to obtain maximum benefit from this workshop.
- Overview of Big data and its challenges
- Spark Architecture Overview
- Installing and Configuring Spark
- Using Spark Shell
- Understanding Resilient Distributed Datasets (RDDs), Types of RDDs
- Working with RDD Actions & Transformations
- Complete flow of a Spark program
- Deploying to Spark Standalone & Hadoop Cluster
- Using Web UI for monitoring & managing Spark Applications
- Hands On
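A central point in the RDD labs is that transformations are lazy: nothing executes until an action is called. As a rough plain-Python sketch of that model (Python generators are lazy in a similar way; this is an illustration, not PySpark code):

```python
# Plain-Python sketch of Spark's lazy evaluation model (not actual
# PySpark code): generators, like RDD transformations, do no work
# until a terminal operation -- the "action" -- consumes them.
lines = ["spark makes big data simple", "spark is fast"]

# "Transformations": build a lazy pipeline; nothing runs yet.
words = (word for line in lines for word in line.split())
long_words = (w for w in words if len(w) > 4)

# "Action": forces the whole pipeline to execute.
result = sorted(long_words)
print(result)  # ['makes', 'simple', 'spark', 'spark']
```

In PySpark the same shape appears as a chain of `map`/`filter` calls on an RDD followed by an action such as `collect()`.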
- Working with Key-value pairs using Spark APIs
- Overview of RDD lineage, Caching and Persistence
- Shared Variables: Accumulators and Broadcast Variables
- Integrating with different data sources including HDFS
- Logging & Unit Testing
- Tracking Spark job stages for investigation and troubleshooting
- Hands On
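Key-value aggregation in Spark (for example, `reduceByKey` in a word count) boils down to grouping values by key and folding them with a function. A minimal plain-Python version of that logic (a sketch of the idea, not PySpark itself) looks like:

```python
from collections import defaultdict

# Plain-Python sketch of what a reduceByKey-style word count does
# with (word, 1) pairs: group values by key, then fold them with
# the supplied function (here, addition).
pairs = [("spark", 1), ("hadoop", 1), ("spark", 1), ("spark", 1)]

counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value  # the fold step, like lambda a, b: a + b

print(dict(counts))  # {'spark': 3, 'hadoop': 1}
```

The difference in Spark is that the pairs are partitioned across the cluster, so the fold function must be associative and commutative.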
- Working with Spark SQL
- Working with DataFrames
- Hive & RDD Integrations
- Working with different data formats: Structured and Unstructured
- Hands On
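The Spark SQL workflow in this module amounts to loading structured rows and querying them with SQL, much as a DataFrame is registered as a temporary table and queried. The general shape can be previewed with Python's built-in sqlite3 module (a stand-in for illustration only; the lab uses Spark SQL itself):

```python
import sqlite3

# Stand-in for the Spark SQL workflow using stdlib sqlite3: load
# structured rows, then query them with plain SQL, much as the lab
# registers a DataFrame as a temp table and runs an SQL query on it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ana", 34), ("Raj", 28), ("Mei", 41)])

rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name").fetchall()
print(rows)  # [('Ana',), ('Mei',)]
```

The names and schema here are made up for the example; the point is the pattern of declarative SQL over structured data, which Spark SQL distributes across a cluster.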
- Spark Streaming Overview
- Understanding Streaming Operations
- Sliding Window Operations
- Developing Spark Streaming Applications
- Hands On
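Sliding-window operations compute an aggregate over the last N batches of a stream, with the window advancing as new batches arrive. The core mechanic can be sketched in plain Python with a bounded deque (an illustration of the idea, not the Spark Streaming API):

```python
from collections import deque

# Sketch of a sliding window of length 3 over a stream of per-batch
# counts: each new batch pushes the oldest one out, and the aggregate
# is recomputed over whatever the window currently holds.
window = deque(maxlen=3)
batch_counts = [5, 2, 7, 1, 4]

sums = []
for count in batch_counts:
    window.append(count)      # new batch arrives; oldest falls off
    sums.append(sum(window))  # windowed aggregate over the last 3

print(sums)  # [5, 7, 14, 10, 12]
```

In Spark Streaming the window is specified in time (window length and slide interval) rather than in batch counts, but the overlap-and-evict behavior is the same.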
- Understanding ML APIs
- Applying Regression, Classification and Clustering APIs to real world use cases
- Hands On
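Regression, the first of the ML topics, fits parameters that minimize squared error. For a single feature the closed-form least-squares solution fits in a few lines of plain Python (a toy illustration of the objective, not Spark ML's API):

```python
# Toy one-variable least-squares fit: the slope and intercept that
# minimize squared error -- the same objective Spark ML's linear
# regression optimizes, but at cluster scale. Data is made up.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))  # 1.96 0.15
```

Spark ML solves the same problem with distributed optimizers so the training data never has to fit on one machine.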
Manaranjan Pradhan is a big data and analytics enthusiast. He worked with TCS, HP, and iGate Patni for 15 years before deciding to quit and work as a freelancer. He now teaches and consults on big data platforms such as Hadoop and Spark, and on scalable machine learning. He is an alumnus of IIM Bangalore, where he currently also teaches and works on research projects.
mail: [email protected]
https://www.linkedin.com/in/manaranjanpradhan
He writes blogs at: