A Spark application, written in Python 3, that finds strongly connected components using a bi-directional label propagation algorithm.
This project was run on a 1.3 GB Twitter network dataset using an AWS EMR cluster.
- Upload labelp.py and the dataset to your bucket in AWS S3. (This assumes you already have an AWS account.)
- Create a cluster in AWS EMR.
- Launch mode: Step execution
  (You can also choose Cluster and use SSH to connect to your cluster.)
- Step type: Spark application (configure)
  Name: labelp
  Deploy mode: cluster
  Spark-submit options: --master yarn --driver-memory 4g --executor-memory 2g
  Without these memory settings, the application may fail because of the memory overhead limit. (For more details: Running Spark on YARN)
  Application location: choose labelp.py in your S3 bucket
  Action on failure: Terminate cluster (recommended)
- Vendor: Amazon, Release: emr-5.2.0
  (If you chose Cluster mode earlier, choose Application: Spark: Spark 2.0.2...)
- Instance type: m1.large
  Number of instances: 4
  (You can use a different instance type and number of instances, but make sure the total memory is larger than 13.91 GB, which was the maximum memory usage observed during the run.)
  (The whole computation took about 6 hours and 43 minutes.)
- Permission: Default
  (If you chose Cluster mode earlier, upload your public key to AWS and select it here.)
- Click the Create cluster button.
  Done!
  Your cluster will be terminated automatically after the application finishes.
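
If you would rather script the step than click through the console, the Step settings listed above map onto a step definition roughly like the one below. This is only a sketch: the function name build_spark_step and the bucket name your-bucket are made up for illustration, and the dictionary layout is the one that boto3's EMR client is expected to accept in the Steps argument of run_job_flow (or in add_job_flow_steps). Please check the boto3 documentation before relying on it.

```python
def build_spark_step(script_uri, name="labelp"):
    """Build an EMR step definition that runs a PySpark script with spark-submit.

    script_uri is the S3 location of labelp.py. The memory options are the
    ones recommended in this README; everything else is an assumption.
    """
    return {
        "Name": name,
        # Same choice as "Action on failure: Terminate cluster" in the console.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run spark-submit directly.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                "--driver-memory", "4g",
                "--executor-memory", "2g",
                script_uri,
            ],
        },
    }


if __name__ == "__main__":
    # "your-bucket" is a placeholder; use the bucket you uploaded labelp.py to.
    step = build_spark_step("s3://your-bucket/labelp.py")
    print(" ".join(step["HadoopJarStep"]["Args"]))
```

The function only builds the dictionary, so you can inspect it without an AWS account or credentials.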
- Dataset: Twitter
  R. Zafarani and H. Liu (2009). Social Computing Data Repository at ASU [http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing, Informatics and Decision Systems Engineering.
Output format : ('Label',u'CommunitySize/Members')
See the output-refined version.
For details on the algorithm, please read Algorithm Instruction.pdf.
Please also see the pseudocode.
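
For readers who want a feel for the general idea before opening the PDF, here is a small single-machine sketch of finding strongly connected components with a forward label pass followed by a backward pass. It is NOT the code in labelp.py and is not taken from Algorithm Instruction.pdf. It is one common way to combine the two directions, written in plain Python without Spark, and the function name is made up for illustration.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (src, dst) pairs.

    Illustrative sketch only; vertex ids must be comparable (e.g. all ints).
    """
    remaining = set()
    for src, dst in edges:
        remaining.update((src, dst))
    components = []

    while remaining:
        # Keep only edges whose endpoints are both still unassigned.
        out_edges = defaultdict(set)
        in_edges = defaultdict(set)
        for src, dst in edges:
            if src in remaining and dst in remaining:
                out_edges[src].add(dst)
                in_edges[dst].add(src)

        # Forward pass: each vertex starts with its own id as its label and
        # the smallest label is pushed along out-edges until nothing changes.
        label = {v: v for v in remaining}
        changed = True
        while changed:
            changed = False
            for src in remaining:
                for dst in out_edges[src]:
                    if label[src] < label[dst]:
                        label[dst] = label[src]
                        changed = True

        # Backward pass: from each root (a vertex that kept its own id),
        # follow in-edges through vertices that carry the root's label.
        # Those vertices are reachable from the root and can reach it back.
        found = set()
        for root in [v for v in remaining if label[v] == v]:
            component = {root}
            stack = [root]
            while stack:
                node = stack.pop()
                for pred in in_edges[node]:
                    if label[pred] == root and pred not in component:
                        component.add(pred)
                        stack.append(pred)
            components.append(component)
            found |= component

        # Remove the finished components and repeat on what is left.
        remaining -= found

    return components


if __name__ == "__main__":
    sample = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]
    for scc in strongly_connected_components(sample):
        print(sorted(scc))
```

Each round settles at least one component (the one containing the smallest remaining id), so the loop always terminates.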
If your dataset separates follower and user with a space or a tab instead of a comma,
change the following lines in labelp.py:
- (lines 11 and 51)
  y=x.split(',') --> y=x.split()
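
The difference between the two calls can be seen in a small standalone example. The helper name parse_edge and its delimiter parameter are made up for illustration and do not appear in labelp.py. The example only relies on the standard behaviour of str.split: with no argument it splits on any run of whitespace (spaces or tabs), while split(',') splits on commas only.

```python
def parse_edge(line, delimiter=None):
    """Split one input line into a pair of ids.

    delimiter=None behaves like x.split(): it splits on spaces and tabs.
    delimiter=',' behaves like x.split(','): it splits on commas.
    """
    fields = line.strip().split(delimiter)
    if len(fields) != 2:
        raise ValueError("expected two fields, got %d: %r" % (len(fields), line))
    return fields[0], fields[1]


if __name__ == "__main__":
    print(parse_edge("12,34\n", ","))   # comma-separated line
    print(parse_edge("12\t34\n"))       # tab-separated line
    print(parse_edge("12 34\n"))        # space-separated line
```

Raising an error on malformed lines is a choice made for this example; a real job might prefer to skip such lines instead.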
If you have any questions or suggestions, please contact [email protected] or [email protected].
Thanks!
Linghao Li
12/19/2016