CommunityDetection-Spark-AWS

A Spark application, written in Python 3, that finds strongly connected components using a bi-directional label propagation algorithm.

The project was run on a 1.3 GB Twitter network dataset using an AWS EMR cluster.

How to replicate the experiment

Upload labelp.py and the dataset to your bucket in AWS S3. (This assumes you already have an AWS account.)
Create a cluster in AWS EMR.
- Launch mode : Step execution (You can also choose Cluster and use SSH to connect to your cluster.)
- Step type : Spark application
  (configure)
  Name : labelp
  Deploy mode : cluster
  Spark-submit options : --master yarn --driver-memory 4g --executor-memory 2g
  Without these memory settings, the application may fail with a memory overhead error. (For more details : Running Spark on YARN)
  Application location : choose labelp.py in your S3 bucket
  Action on failure : Terminate cluster (Recommended)
- Vendor : Amazon, Release : emr-5.2.0 (If you chose Cluster mode earlier, choose Application : Spark: Spark 2.0.2....)
- Instance type : m1.large
  Number of instances : 4
  (You can use other instance types and counts, but make sure the total memory is larger than 13.91G, which was the maximum memory usage observed during the run.)
  (The whole computation took about 6 hours and 43 minutes.)
- Permission : Default
  (If you chose Cluster mode earlier, upload your public key to AWS and select it here.)
- Click Create cluster button.
  Done!
  Your cluster will be terminated automatically after the application is finished.
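
For readers who prefer scripting the setup, the step settings above can be sketched as a plain Python dictionary. This is only an illustration, not part of the project: the dictionary follows the shape that the boto3 EMR client accepts for entries in `Steps`, the helper name `build_spark_step` is hypothetical, and the bucket name `my-example-bucket` is a placeholder you would replace with your own. The sketch builds the definition only and makes no AWS calls.

```python
def build_spark_step(script_uri, name="labelp",
                     driver_memory="4g", executor_memory="2g"):
    """Return an EMR step definition that runs spark-submit on script_uri.

    Hypothetical helper: mirrors the console settings listed above.
    """
    args = [
        "spark-submit",
        "--deploy-mode", "cluster",          # Deploy mode : cluster
        "--master", "yarn",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        script_uri,                          # Application location
    ]
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",  # Action on failure
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Placeholder bucket name; replace with your own S3 location.
step = build_spark_step("s3://my-example-bucket/labelp.py")
print(" ".join(step["HadoopJarStep"]["Args"]))
```

The printed line is the spark-submit command the step would run, which makes it easy to compare against the options entered in the console.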

Where to find the dataset I used

Dataset : Twitter
R. Zafarani and H. Liu, (2009). Social Computing Data Repository at ASU [http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing, Informatics and Decision Systems Engineering

Results

Output format : ('Label',u'CommunitySize/Members')
See output-refined version.
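
If you want to post-process the results, a line in the format above can be read back with the standard library. The sketch below assumes each output line is a Python tuple literal of two strings, as the format suggests; the function name `parse_output_line` and the sample line are hypothetical, and the second field is returned as-is without interpreting it.

```python
import ast


def parse_output_line(line):
    """Parse one output line into (label, payload).

    Assumes the line is a two-element tuple literal such as
    ('Label', u'...'); the payload string is not interpreted further.
    """
    label, payload = ast.literal_eval(line.strip())
    return label, payload


# Hypothetical sample line, not taken from the real output files.
sample = "('42', u'example payload')"
print(parse_output_line(sample))
```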

Details of Algorithm

Please read Algorithm Instruction.pdf.
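
As rough orientation only, here is a small single-machine sketch of one well-known way to find strongly connected components with label propagation in two directions. It is not taken from labelp.py or the PDF and may differ from the method described there. The assumptions: labels are the node ids themselves, the smallest label is propagated forward along edges until nothing changes, and each node that kept its own label then collects its component by walking edges backward among nodes carrying the same label.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Map each node to the smallest node id in its strongly connected component."""
    succ = defaultdict(set)   # forward adjacency
    pred = defaultdict(set)   # backward adjacency
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    component = {}
    remaining = set(nodes)
    while remaining:
        # Forward pass: every node ends up with the smallest id that can reach it.
        label = {v: v for v in remaining}
        changed = True
        while changed:
            changed = False
            for u in remaining:
                for v in succ[u]:
                    if v in remaining and label[u] < label[v]:
                        label[v] = label[u]
                        changed = True

        # Backward pass: from each node that kept its own label, walk edges
        # in reverse, staying inside the set of nodes with that label.
        finished = set()
        for root in [v for v in remaining if label[v] == v]:
            members = {root}
            stack = [root]
            while stack:
                x = stack.pop()
                for p in pred[x]:
                    if p in remaining and label[p] == root and p not in members:
                        members.add(p)
                        stack.append(p)
            for v in members:
                component[v] = root
            finished |= members

        # Remove the components found in this round and repeat on the rest.
        remaining -= finished
    return component


edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]
print(strongly_connected_components(edges))
```

In this toy graph, nodes 1, 2, 3 form one cycle and nodes 4, 5 another, so the result groups them into two components. A Spark version would express both passes as joins or message exchanges over an edge RDD rather than in-memory loops.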

Want a Pseudo distributed version to test small datasets?

Please see pseudo mode.

Dataset in other format?

If your dataset separates follower and user with a space or a tab instead of a comma,
change the following lines in labelp.py:

(line 11 & 51) y=x.split(',') --> y=x.split()
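
The effect of that change can be checked in isolation. The sketch below is not code from labelp.py; `parse_edge` is a hypothetical helper, and it assumes each line holds exactly two fields. Calling `split()` with no separator splits on any run of whitespace, so it handles both spaces and tabs.

```python
def parse_edge(line, sep=","):
    """Split one dataset line into its two fields.

    Pass sep=None to split on any whitespace (spaces or tabs),
    which is what x.split() does.
    """
    parts = line.strip().split(sep)
    if len(parts) != 2:
        raise ValueError("expected two fields, got: %r" % (line,))
    return parts[0], parts[1]


print(parse_edge("12,34"))              # comma-separated line
print(parse_edge("12\t34", sep=None))   # tab-separated line
print(parse_edge("12 34", sep=None))    # space-separated line
```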

Others

If you have any questions or suggestions, please contact [email protected] or [email protected].
Thanks!

Linghao Li
12/19/2016
