Assignments code for the Big Data Analysis with Scala and Spark course (Coursera / EPFL)
Final grade: 100%
-
Week 1: Wikipedia
→ Your overall score for this assignment is 10.00 out of 10.00
-
Week 2-3: StackOverflow
→ Your overall score for this assignment is 10.00 out of 10.00
-
Week 4: Time usage
→ Your overall score for this assignment is 10.00 out of 10.00
Using the Spark web UI, we visualize the event timeline and the job DAGs.
-
Stages 1 and 2: load questions and answers.
-
Stage 3: groupedPostings, scoredPostings, vectorPostings
-
Stage 4: sampleVectors
The data set analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various life activities.
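A minimal sketch of how such a CSV dataset can be loaded with Spark (the file name atussum.csv and schema inference are assumptions, not the assignment's exact code):

```scala
import org.apache.spark.sql.SparkSession

object LoadAtus {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TimeUsage")
      .master("local[*]")
      .getOrCreate()

    // Read the ATUS summary CSV; the header row provides the column names
    val atus = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("atussum.csv") // assumed file name

    atus.printSchema()
    spark.stop()
  }
}
```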
We load the resulting dataset into Apache Zeppelin:
-
wget the archive from the Zeppelin download page
-
Untar it
-
Run:
SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start
-
Stop:
zeppelin-0.7.1/bin/zeppelin-daemon.sh stop
nb: SPARK_LOCAL_IP is set to work around a "port unable to bind" exception in 0.7.1
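The download/untar/start steps above can be sketched as follows (the Apache archive URL and the extracted directory name are assumptions; the directory is renamed to match the commands above):

```shell
# Download the Zeppelin 0.7.1 binary archive (URL assumed; pick a closer mirror if available)
wget https://archive.apache.org/dist/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
# Extract and rename to the shorter path used in this README
tar -xzf zeppelin-0.7.1-bin-all.tgz
mv zeppelin-0.7.1-bin-all zeppelin-0.7.1
# Start the daemon, binding Spark to localhost (see nb above)
SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start
```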
Export the resulting week 4 dataset as JSON.
1) From the Spark environment, export data to disk:
finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")
-
(1) Coalesce to a single partition to obtain only one output file (otherwise one file per partition)
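Note that .write.json produces a directory named dataset-week4.json containing a part file (plus a _SUCCESS marker), not a single file. A minimal shell sketch to extract the lone JSON file (the part-* name pattern is standard Spark output; the target file name is an assumption):

```shell
# Spark writes a directory; coalesce(1) guarantees a single part file inside it.
cp dataset-week4.json/part-* dataset-week4-single.json
```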
2) Upload the file to the host running Zeppelin, or fetch it from a Zeppelin %sh paragraph:
%sh
wget …
Connect to the Zeppelin web UI at http://localhost:8080 and create a new notebook with the following content.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// read.json replaces the deprecated jsonFile, which was removed in Spark 2.x
val sqlData = sqlContext.read.json("dataset-week4.json")
sqlData.registerTempTable("data")
%sql SELECT * FROM data ORDER BY work DESC
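For comparison, the same query can be expressed with the DataFrame API in a Scala paragraph (a sketch; it assumes the work column used in the %sql query above):

```scala
import org.apache.spark.sql.functions.desc

// Equivalent of: SELECT * FROM data ORDER BY work DESC
sqlData.orderBy(desc("work")).show()
```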
Display the result as a bar graph.
nb: the chart does not seem to respect the sort order, per open issue ZEPPELIN-87