Assignments code for the Big Data Analysis with Scala and Spark course (Coursera / EPFL)
Final grade: 100%
-
Week 1: Wikipedia
→ Your overall score for this assignment is 10.00 out of 10.00
-
Week 2-3: StackOverflow
→ Your overall score for this assignment is 10.00 out of 10.00
-
Week 4: Time usage
→ Your overall score for this assignment is 10.00 out of 10.00
Using the Spark web UI, we visualize the event timeline and the job DAGs.
-
Stages 1 and 2: load questions and answers.
-
Stage 3: groupedPostings, scoredPostings, vectorPostings
-
Stage 4: sampleVectors
The data set analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various life activities.
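A minimal sketch of how such a CSV dataset can be loaded with Spark (the file name atussum.csv and schema inference are assumptions, not the assignment's exact code):

```scala
import org.apache.spark.sql.SparkSession

object LoadAtus {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TimeUsage")
      .master("local[*]")
      .getOrCreate()

    // Read the ATUS summary CSV; the header row provides the column names
    val atus = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("atussum.csv") // assumed file name

    atus.printSchema()
    spark.stop()
  }
}
```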
We load the resulting dataset into Apache Zeppelin:
-
wget the archive from the Zeppelin download page
-
Untar it
-
Run:
SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start
-
Stop:
zeppelin-0.7.1/bin/zeppelin-daemon.sh stop
nb: SPARK_LOCAL_IP is set to work around a "port unable to bind" exception in 0.7.1
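The download/untar/start steps above can be sketched as follows (the Apache archive URL and the extracted directory name are assumptions; the directory is renamed to match the commands above):

```shell
# Download the Zeppelin 0.7.1 binary archive (URL assumed; pick a closer mirror if available)
wget https://archive.apache.org/dist/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
# Extract and rename to the shorter path used in this README
tar -xzf zeppelin-0.7.1-bin-all.tgz
mv zeppelin-0.7.1-bin-all zeppelin-0.7.1
# Start the daemon, binding Spark to localhost (see nb above)
SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start
```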
Export the resulting week 4 dataset as JSON.
1) From the Spark environment, export data to disk:
finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")
-
(1) Coalesce to a single partition to obtain only one output file (otherwise one file per partition)
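Note that .write.json produces a directory named dataset-week4.json containing a part file (plus a _SUCCESS marker), not a single file. A minimal shell sketch to extract the lone JSON file (the part-* name pattern is standard Spark output; the target file name is an assumption):

```shell
# Spark writes a directory; coalesce(1) guarantees a single part file inside it.
cp dataset-week4.json/part-* dataset-week4-single.json
```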
2) Upload the file to the host running Zeppelin, or fetch it from a Zeppelin %sh paragraph:
%sh
wget …
Connect to the Zeppelin web UI at http://localhost:8080 and create a new notebook with the following content.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// read.json replaces the deprecated jsonFile, which was removed in Spark 2.x
val sqlData = sqlContext.read.json("dataset-week4.json")
sqlData.registerTempTable("data")
%sql SELECT * FROM data ORDER BY work DESC
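For comparison, the same query can be expressed with the DataFrame API in a Scala paragraph (a sketch; it assumes the work column used in the %sql query above):

```scala
import org.apache.spark.sql.functions.desc

// Equivalent of: SELECT * FROM data ORDER BY work DESC
sqlData.orderBy(desc("work")).show()
```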
Display the result as a bar graph.
nb: the chart does not seem to respect the sort order, per open issue ZEPPELIN-87