manaranjanp/spark-dev-training

Spark Developer Training

This repository contains the guides, code and data required for the Spark Developer workshop conducted by Manaranjan Pradhan.

Participants can download the repository as a zip file and save it to their laptop or desktop. The VM image for running the programs will be provided separately during the training. The VM image will have Hadoop and Spark installed on it, along with the data and code required during the workshop.

Once the zip file is downloaded, unzip its contents to your desktop or laptop.

Then go to the guides folder and open Spark Lab Guide Ver 1.0.pdf. This guide will take you through all the lab exercises during the workshop.

Overview

Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. This course will give you an excellent start in building your fundamentals for developing big data solutions on the Apache Spark platform. The course is well balanced between theory and hands-on labs (more than 10 lab exercises) based on real-world use cases.

What participants will learn

Attendees will learn the following topics through lectures and hands-on exercises:

  • Deep dive into the Apache Spark 1.5 architecture
  • Understanding Spark APIs, RDDs, DataFrames and Spark SQL
  • How to do parallel programming and develop Spark applications
  • How Spark runs in standalone mode and on a cluster, including Hadoop
  • Understanding advanced features and Spark internals
  • Developing Spark Streaming applications
  • Writing advanced algorithms using the Spark Machine Learning (ML) library
  • Optimizing and tuning Spark applications
  • End-to-end use case implementation

Duration

3 Days

Intended Audience

Architects, developers & data scientists who wish to write, build and maintain Apache Spark jobs.

Prerequisites

All programming will be done in Python, so participants should have basic programming knowledge of Python. It is advisable to refresh these skills beforehand to get the maximum benefit from this workshop.

Detailed Course Outline

Big Data & Spark Overview

  • Overview of Big data and its challenges
  • Spark Architecture Overview
  • Installing and Configuring Spark

Spark Architecture – Deep Dive

  • Using Spark Shell
  • Understanding Resilient Distributed Datasets (RDDs), Types of RDDs
  • Working with RDD Actions & Transformations
  • Complete Flow of a Spark program
  • Deploying to Spark Standalone & Hadoop Cluster
  • Using Web UI for monitoring & managing Spark Applications
  • Hands On
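
As a taste of the "Actions & Transformations" topic above, here is a minimal plain-Python sketch of the idea that transformations are lazy and only actions trigger computation. The `MiniRDD` class is a hypothetical, single-machine stand-in written for this README. It is not part of Spark and is not taken from the lab guide; it only mimics the names `parallelize`, `map`, `filter`, `collect` and `reduce`.

```python
from functools import reduce as _reduce


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical; runs on one machine, no Spark)."""

    def __init__(self, source):
        # `source` is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: build a new MiniRDD and compute nothing yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._source() if predicate(x)))

    # Actions: walk the whole chain and return a concrete result.
    def collect(self):
        return list(self._source())

    def reduce(self, fn):
        return _reduce(fn, self._source())


calls = []  # records every time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD.parallelize(range(1, 6))
odd_squares = numbers.map(square).filter(lambda x: x % 2 == 1)

evaluated_before_action = len(calls)   # still 0: nothing has run yet
result = odd_squares.collect()         # first action: runs the chain
total = odd_squares.reduce(lambda a, b: a + b)  # second action: runs it again
```

Because this toy class keeps no cached results, the second action recomputes the whole chain, which is the motivation for the caching and persistence topic later in the outline.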

Spark APIs & Usages

  • Working with Key-value pairs using Spark APIs
  • Overview of RDD lineage, Caching and Persistence
  • Shared Variables: Accumulators and Broadcast Variables
  • Integrating with different data sources including HDFS
  • Logging & Unit Testing
  • Tracking Spark job stages for Investigation and Troubleshooting
  • Hands On
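
To illustrate the key-value topic above, the following plain-Python sketch shows the idea behind a reduce-by-key word count: values are first combined inside each partition, and the partial results are then merged per key. The function `reduce_by_key` and the sample sentences are assumptions made up for this example. The function is a simplified single-machine analogy, not Spark's implementation.

```python
def reduce_by_key(partitions, fn):
    """Combine (key, value) pairs per key, partition by partition."""
    # Step 1: combine values locally inside each partition.
    partials = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = fn(local[key], value) if key in local else value
        partials.append(local)

    # Step 2: merge the per-partition results (stand-in for the shuffle).
    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Two hypothetical partitions of text lines.
lines_by_partition = [
    ["spark makes big data simple", "big data"],
    ["spark streaming and spark sql"],
]

# Turn every word into a (word, 1) pair, keeping the partition layout.
pairs = [
    [(word, 1) for line in partition for word in line.split()]
    for partition in lines_by_partition
]

counts = reduce_by_key(pairs, lambda a, b: a + b)
```

Combining inside each partition first means only one partial value per key has to be moved between partitions, which is why the combining function should be associative and commutative.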

Working with Advanced Spark Features

  • Working with Spark SQL
  • Working with DataFrames
  • Hive & RDD Integrations
  • Working with different data formats: Structured and Unstructured
  • Hands On

Writing Spark Streaming Applications

  • Spark Streaming Overview
  • Understanding Streaming Operations
  • Sliding Window Operations
  • Developing Spark Streaming Applications
  • Hands On
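
The sliding window topic above can be sketched in plain Python. The example treats a stream as a list of micro-batches and emits one aggregated result every `slide_interval` batches, each covering the last `window_length` batches. The function name, parameter names and sample events are hypothetical and chosen only for illustration; this is not the Spark Streaming API.

```python
from collections import Counter


def sliding_window_counts(batches, window_length, slide_interval):
    """Count events per window; lengths are measured in number of batches."""
    results = []
    # A window is emitted each time `slide_interval` new batches have arrived.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = Counter()
        for batch in batches[start:end]:
            window.update(batch)
        results.append(dict(window))
    return results


# Six hypothetical micro-batches of events.
batches = [["a", "b"], ["a"], ["c", "a"], ["b"], ["b", "c"], ["a"]]

# Window of 3 batches, sliding forward 2 batches at a time.
windows = sliding_window_counts(batches, window_length=3, slide_interval=2)
```

With a window longer than the slide interval, consecutive windows overlap, so the same batch can contribute to more than one result.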

Using Spark Machine Learning Algorithms

  • Understanding ML APIs
  • Applying Regression, Classification and Clustering APIs to real world use cases
  • Hands On

Optimizing and Tuning Spark Applications

Instructor Profile

Manaranjan Pradhan is a big data & analytics enthusiast. He worked with TCS, HP and iGate Patni for 15 years before deciding to become a freelancer. He now teaches and consults on big data platforms like Hadoop and Spark, and on scalable machine learning. He is an alumnus of IIM Bangalore, where he currently also teaches and works on research projects.

mail: [email protected]
https://www.linkedin.com/in/manaranjanpradhan

He writes blogs at:

http://blog.enablecloud.com/
http://www.awesomestats.in/
