A Spark application, written in Python, that finds strongly connected components using a bi-directional label propagation algorithm. The project processed a 1.3 GB Twitter network dataset on an AWS EMR cluster.

linghaol/CommunityDetection-Spark-AWS

CommunityDetection-Spark-AWS

A Spark application, written in Python 3, that finds strongly connected components using a bi-directional label propagation algorithm.

The project processed a 1.3 GB Twitter network dataset on an AWS EMR cluster.

How to replicate the experiment

  • Upload labelp.py and the dataset to a bucket in AWS S3. (This requires an AWS account.)

  • Create a cluster in AWS EMR.

    • Launch mode : Step execution (You can also choose Cluster and connect to the cluster over SSH.)

    • Step type : Spark application
      (configure)
      Name : labelp
      Deploy mode : cluster
      Spark-submit options : --master yarn --driver-memory 4g --executor-memory 2g
      Without explicit memory settings, the application may fail with a YARN memory overhead error. (For more details, see Running Spark on YARN.)
      Application location : choose labelp.py in your S3 bucket
      Action on failure : Terminate cluster (Recommended)

    • Vendor : Amazon, Release : emr-5.2.0 (If you chose Cluster mode earlier, choose Application : Spark: Spark 2.0.2....)

    • Instance type : m1.large
      Number of instances : 4
      (You can use a different instance type or number of instances, but make sure the total memory exceeds 13.91 GB, the maximum memory usage observed during the run.)
      (The whole computation took about 6 hours and 43 minutes.)

    • Permission : Default
      (If you chose Cluster mode earlier, upload your public key to AWS and select it here.)

    • Click the Create cluster button.
      Done!
      The cluster terminates automatically once the application finishes.
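
The console settings above can also be written as an EMR API request. The following is a minimal sketch, assuming the boto3 library and the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole). The bucket name your-bucket, the cluster name, the region, and the helper names are hypothetical and not part of the original setup.

```python
def build_spark_step(script_uri, name="labelp"):
    """Describe the Spark step with the same options as the console form."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run spark-submit directly.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                "--driver-memory", "4g",
                "--executor-memory", "2g",
                script_uri,
            ],
        },
    }


def build_cluster_request(script_uri):
    """Build the keyword arguments for the EMR run_job_flow call."""
    return {
        "Name": "labelp-cluster",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.2.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m1.large",
            "SlaveInstanceType": "m1.large",
            "InstanceCount": 4,
            # Step execution: shut the cluster down after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [build_spark_step(script_uri)],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role
        "ServiceRole": "EMR_DefaultRole",      # assumed default role
    }


def launch(request, region="us-east-1"):
    """Submit the request; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the builders work without boto3

    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]


# "your-bucket" is a placeholder for the bucket that holds labelp.py.
request = build_cluster_request("s3://your-bucket/labelp.py")
```

Calling launch(request) would start the cluster. Building the request separately makes it easy to inspect the step arguments before anything is billed.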

Where to find the dataset I used

Dataset : Twitter
R. Zafarani and H. Liu, (2009). Social Computing Data Repository at ASU [http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing, Informatics and Decision Systems Engineering

Results

Output format : ('Label',u'CommunitySize/Members')
See output-refined version.

Details of Algorithm

Please read Algorithm Instruction.pdf.
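
For readers who want a rough idea before opening the PDF, here is a single-machine sketch of one common way to find strongly connected components with forward and backward label propagation. It is an illustration only: it does not use Spark, and it is an assumption that it resembles the algorithm in the PDF. The function name and the sample edges are made up for this example.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (source, target) pairs."""
    remaining = {node for edge in edges for node in edge}
    components = []
    while remaining:
        # Forward pass: every node ends up with the smallest id among the
        # remaining nodes that can reach it (itself included).
        label = {node: node for node in remaining}
        changed = True
        while changed:
            changed = False
            for src, dst in edges:
                if src in remaining and dst in remaining and label[src] < label[dst]:
                    label[dst] = label[src]
                    changed = True

        # Reverse adjacency restricted to edges whose ends share a label.
        predecessors = defaultdict(list)
        for src, dst in edges:
            if src in remaining and dst in remaining and label[src] == label[dst]:
                predecessors[dst].append(src)

        # Backward pass: starting at each root (a node that kept its own id),
        # collect the same-label nodes that can reach the root.
        roots = [node for node in remaining if label[node] == node]
        for root in roots:
            component = {root}
            stack = [root]
            while stack:
                current = stack.pop()
                for pred in predecessors[current]:
                    if pred not in component:
                        component.add(pred)
                        stack.append(pred)
            components.append(component)
            remaining -= component
    return components


# Made-up example: a 3-cycle that feeds into a 2-cycle.
sample_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]
sample_components = strongly_connected_components(sample_edges)
```

A node shares the root's label only if the root can reach it, and the backward pass keeps only those nodes that can also reach the root, so each collected set is mutually reachable. Nodes left over are relabelled in the next round.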

Want a pseudo-distributed version to test small datasets?

Please see pseudo mode.

Dataset in another format?

If your dataset uses a space or a tab instead of a comma to separate follower and user,
change the following lines:

  • (lines 11 and 51) y=x.split(',') --> y=x.split()
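
The difference between the two calls can be seen in a small stand-alone sketch. The helper parse_edge and the sample lines are hypothetical and are not taken from labelp.py. Which column holds the follower is not assumed here.

```python
def parse_edge(line, sep=","):
    """Split one line of the edge list into its first two fields.

    With sep=None, str.split() splits on any run of whitespace, so the
    same call handles both spaces and tabs.
    """
    fields = line.strip().split(sep)
    return fields[0], fields[1]


# Comma-separated line, as the original code expects.
comma_edge = parse_edge("12,34\n")

# Tab- and space-separated lines, handled by a bare split().
tab_edge = parse_edge("12\t34\n", sep=None)
space_edge = parse_edge("12 34\n", sep=None)
```

Note that split(',') applied to a tab-separated line returns a single field, which is why the separator has to change along with the dataset format.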

Others

If you have any questions or suggestions, please contact [email protected] or [email protected] .
Thanks!

Linghao Li
12/19/2016
