Distributed K-means algorithm for clustering questions and answers on Stack Overflow
The aim is to compare the distribution of Stack Overflow questions and answers across the following programming languages, using a K-means algorithm implemented in Scala 3 and run on Apache Spark for distributed computing.
val langs =
  List(
    "JavaScript",
    "Java",
    "PHP",
    "Python",
    "C#",
    "C++",
    "Ruby",
    "CSS",
    "Objective-C",
    "Perl",
    "Scala",
    "Haskell",
    "MATLAB",
    "Clojure",
    "Groovy"
  )
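For orientation, here is a minimal, non-distributed sketch of the K-means loop in plain Scala. It is not the project's Spark implementation. It assumes each question is encoded as a two-dimensional point (language index scaled by a spread factor, score). The names `KMeansSketch`, `vectorize`, and `langSpread`, and the value 50000, are hypothetical and chosen only for illustration.

```scala
object KMeansSketch {
  // Assumed encoding: (language coordinate, score).
  type Point = (Int, Int)

  // Hypothetical spacing between languages on the first axis, so that
  // points belonging to different languages stay far apart.
  val langSpread = 50000

  def vectorize(langIndex: Int, score: Int): Point =
    (langIndex * langSpread, score)

  def squaredDistance(a: Point, b: Point): Double = {
    val dx = (a._1 - b._1).toDouble
    val dy = (a._2 - b._2).toDouble
    dx * dx + dy * dy
  }

  // Index of the mean closest to point p.
  def closest(p: Point, means: Vector[Point]): Int =
    means.indices.minBy(i => squaredDistance(p, means(i)))

  // Component-wise integer average of a non-empty group of points.
  def average(ps: Seq[Point]): Point = {
    val n = ps.size
    ((ps.map(_._1.toLong).sum / n).toInt, (ps.map(_._2.toLong).sum / n).toInt)
  }

  // Assign every point to its closest mean, recompute the means, and repeat
  // until the means stop changing or the iteration budget runs out.
  @annotation.tailrec
  def kmeans(points: Seq[Point], means: Vector[Point], maxIter: Int = 20): Vector[Point] = {
    val grouped = points.groupBy(p => closest(p, means))
    // A mean that attracted no points keeps its previous position.
    val updated = means.indices
      .map(i => grouped.get(i).map(average).getOrElse(means(i)))
      .toVector
    if (maxIter <= 1 || updated == means) updated
    else kmeans(points, updated, maxIter - 1)
  }
}
```

In a distributed version, the grouping and averaging steps would typically run over a Spark RDD rather than a local `Seq`, while the overall loop keeps the same shape.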
You need JDK 11 or higher and the SBT build tool installed on your machine.
You can check your Java version as follows:
java --version
You should see something like this:
openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment 18.9 (build 11.0.12+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7, mixed mode)
To install sbt, see the sbt reference manual.
From the root of the project, run:
sbt run
After a few seconds or more, the result of each iteration and the final clustering are printed to the console:
[info] Resulting clusters:
[info] Score Dominant language (%percent) Questions
[info] ================================================
[info] 1546 Java (100.0%) 8
[info] 1432 JavaScript (100.0%) 27
[info] 722 Python (100.0%) 34
[info] 586 C++ (100.0%) 19
[info] 574 Ruby (100.0%) 14
[info] 548 Objective-C (100.0%) 30
[info] 491 CSS (100.0%) 28
[info] 485 C# (100.0%) 63
[info] 465 PHP (100.0%) 34
[info] 289 Perl (100.0%) 1
[info] 279 JavaScript (100.0%) 300
[info] 266 Scala (100.0%) 3
[info] 182 Haskell (100.0%) 7
[info] 180 Java (100.0%) 395
[info] 153 Python (100.0%) 326
[info] 141 CSS (100.0%) 226
[info] 122 C++ (100.0%) 272
[info] 120 Ruby (100.0%) 199
[info] 97 Objective-C (100.0%) 408
[info] 81 C# (100.0%) 1228
[info] 72 Clojure (100.0%) 26
[info] 71 PHP (100.0%) 606
[info] 60 Scala (100.0%) 104
[info] 47 MATLAB (100.0%) 27
[info] 46 Groovy (100.0%) 10
[info] 35 Haskell (100.0%) 175
[info] 32 Perl (100.0%) 179
[info] 18 Clojure (100.0%) 180
[info] 9 Groovy (100.0%) 190
[info] 7 MATLAB (100.0%) 888
[info] 4 Haskell (100.0%) 4903
[info] 3 Scala (100.0%) 6312
[info] 2 Perl (100.0%) 11532
[info] 2 Python (100.0%) 85751
[info] 2 Clojure (100.0%) 1789
[info] 2 C++ (100.0%) 88910
[info] 1 PHP (100.0%) 155254
[info] 1 C# (100.0%) 177686
[info] 1 Ruby (100.0%) 26769
[info] 1 CSS (100.0%) 55438
[info] 1 Java (100.0%) 188364
[info] 1 JavaScript (100.0%) 179390
[info] 1 MATLAB (100.0%) 6213
[info] 1 Groovy (100.0%) 1310
[info] 1 Objective-C (100.0%) 46504
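Each row of the table above describes one cluster: a score, the dominant language with its share of the cluster, and the number of questions. Every row reports 100.0%, so no cluster mixes languages. The sketch below shows one way such a row could be derived from a cluster's members. It assumes each member is a (language index, score) pair and that the Score column is the median score of the cluster. Both are assumptions, and `ClusterSummary`, `Row`, and `summarize` are hypothetical names.

```scala
object ClusterSummary {
  // Language list as given at the top of this README.
  val langs = List(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy"
  )

  // One line of the result table.
  final case class Row(score: Int, language: String, percent: Double, questions: Int)

  // Median of a non-empty sequence; averages the two middle values
  // (integer division) when the size is even.
  def median(xs: Seq[Int]): Int = {
    val sorted = xs.sorted
    val mid = sorted.size / 2
    if (sorted.size % 2 == 1) sorted(mid)
    else (sorted(mid - 1) + sorted(mid)) / 2
  }

  // Summarize a non-empty cluster of (language index, score) pairs.
  def summarize(members: Seq[(Int, Int)]): Row = {
    // Most frequent language index in the cluster and how often it occurs.
    val (langIndex, count) =
      members.groupBy(_._1).map { case (lang, ms) => (lang, ms.size) }.maxBy(_._2)
    Row(
      score = median(members.map(_._2)),
      language = langs(langIndex),
      percent = count * 100.0 / members.size,
      questions = members.size
    )
  }
}
```

For example, `ClusterSummary.summarize(Seq((1, 100), (1, 200), (1, 300)))` yields `Row(200, "Java", 100.0, 3)`.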