simple sentence clusterer
A tool to cluster lines of text in a streaming fashion.
E.g., if you have a file like this, with one sentence per line:
$ cat /tmp/foo
Hi there, how are you?
hi where how you are
i like to sing
I am going to sing
hi where how you are
hi there how...
do you sing???
You can pipe it to sclust to get cluster assignments:
$ cat /tmp/foo | sclust
0 Hi there, how are you? -
1 hi where how you are -
2 i like to sing -
2 I am going to sing 0.298455
1 hi where how you are 0.336248
0 hi there how... 0.206029
3 do you sing??? -
Here, the first column is the cluster assignment, the second is the input line, and the third is the cosine similarity between that line and the cluster it is assigned to. A - in the third column marks a line that started a new cluster. The algorithm is online (as opposed to batch), so the order of the input lines matters. As each line is read, it is assigned to its closest cluster according to cosine similarity.
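To make the online assignment step concrete, here is a minimal Python sketch of threshold-based streaming clustering. It is an illustration under stated assumptions, not sclust's actual implementation: the tokenizer, the use of raw term counts, and the way a cluster's vector is updated (summing the counts of its members) are all guesses, and the names tokenize, cosine and cluster_stream are hypothetical. Similarity values from this sketch will therefore not match the numbers shown above.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    if not a or not b:
        return 0.0
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def cluster_stream(lines, threshold=0.2):
    """Yield (cluster_id, line, similarity) for each line, in input order.

    similarity is None when the line starts a new cluster.
    """
    centroids = []  # one Counter of summed term counts per cluster (assumption)
    for line in lines:
        vec = Counter(tokenize(line))
        best_id, best_sim = None, 0.0
        # Find the closest existing cluster by cosine similarity.
        for cid, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= threshold:
            centroids[best_id].update(vec)  # fold the line into its cluster
            yield best_id, line, best_sim
        else:
            centroids.append(vec)  # too dissimilar: start a new cluster
            yield len(centroids) - 1, line, None
```

Because each line is compared only against the clusters that exist when it arrives, feeding the same lines in a different order can produce different clusters, which is the sense in which order matters.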
Other options:
$ sclust --help
A command-line tool to quickly cluster sentences.
usage:
sclust [--help --threshold <T> --update-norms <N> --prune-frequency <P>]
Options
-h, --help
-p, --prune-frequency <P> Delete small clusters every P lines [default: -1]
-t, --threshold <N> Similarity threshold in [0,1]. Higher means sentences must be more similar to be merged. [default: .2]
There is also a tool, sclust-summarize, to view the output.
$ sclust-summarize --help
A command-line tool to print the top clusters output by sclust in a streaming fashion.
E.g., cat data.txt | sclust | sclust-summarize
usage:
sclust-summarize [--help --frequency <F> --num-docs-to-print <N> --num-clusters-to-print <K>]
Options
-h, --help
-f, --frequency <F> Print clusters every F lines [default: 1000]
-n, --num-clusters-to-print <N> Number of top clusters to print [default: 10]
-k, --num-docs-to-print <K> Number of documents per cluster to print [default: 3]
E.g.,
$ cat /tmp/foo | sclust | sclust-summarize -k 3
---------7 documents, 4 clusters---------
2 0 Hi there, how are you? -
hi there how... 0.206029
2 1 hi where how you are -
hi where how you are 0.336248
2 2 i like to sing -
I am going to sing 0.298455
1 3 do you sing??? -
Here, the first column is the cluster frequency (the number of documents assigned to the cluster so far) and the second column is the cluster id. With -k, you can also print the k most recent documents added to each cluster.
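The bookkeeping behind such a summary can be sketched in a few lines of Python. This is only an illustration, not the code of sclust-summarize: the function names summarize and parse_line are hypothetical, and the sketch assumes each line of sclust output has three tab-separated fields (cluster id, text, similarity), which the examples above do not confirm.

```python
from collections import Counter, defaultdict, deque


def summarize(assignments, num_clusters=10, num_docs=3):
    """Return [(size, cluster_id, recent_docs)] for the largest clusters.

    assignments: iterable of (cluster_id, text) pairs in arrival order.
    """
    sizes = Counter()
    recent = defaultdict(lambda: deque(maxlen=num_docs))
    for cluster_id, text in assignments:
        sizes[cluster_id] += 1
        recent[cluster_id].append(text)  # deque drops the oldest beyond num_docs
    return [(size, cid, list(recent[cid]))
            for cid, size in sizes.most_common(num_clusters)]


def parse_line(line):
    """Split one line of clusterer output into (cluster_id, text).

    Assumes tab-separated fields: id, text, similarity (an assumption).
    """
    cluster_id, text, _similarity = line.rstrip("\n").split("\t")
    return cluster_id, text
```

To mimic the periodic printing, one could call summarize on the lines seen so far every F lines; a real streaming version would keep the counters alive between reports rather than rebuilding them.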
The nice thing is that you can pipe in a large file and watch the clusters change as documents are read. E.g.,
$ cat /tmp/a_bunch_of_tweets | sclust -t .2 -u 1000 | sclust-summarize -k 2 -n 5
---------1000 documents, 715 clusters---------
15 68 lol 0.73664
i need matching jewelry me lj lol 0.215373
9 93 if you love watching huge zits getting popped then you re gonna love this it like therapeutic for me webaddress 0.273418
i love jesus exclamationpoint 0.314281
9 324 no one going to pay for content what they used to pay doesn t matter what it is 0.29804
what 0.646988
9 14 studing exclamationpoint studing exclamationpoint oh sh 0.23397
oh yeah this woman knows what she talkin about exclamationpoint 0.232043
7 160 rt 0.615143
rt plz rt fast waystation needs help evacuating animals exclamationpoint now exclamationpoint little tujunga canyon rt 0.269342
---------2000 documents, 1209 clusters---------
21 98 comendo happyemoticon 0.281718
we are better now and for good this time happyemoticon i love her happyemoticon 0.378463
19 68 lol who da hell u talkin bout 0.202604
see lol 0.448003
17 93 i love it here 0.512499
i love target and my lovelies 0.254959
16 14 oh yeah exclamationpoint lol 0.566714
uh oh 0.391978
15 135 lol i love hey arnold 0.642059
hey diggy 0.357601
---------3000 documents, 1621 clusters---------
29 93 i love you baby sister 0.349325
i love you diva o 0.366829
28 98 and im luvin it happyemoticon 0.263552
heyy everyone happyemoticon how are ya 0.225831
27 68 lol well i txt her just put jessica hahaha 0.207772
my bad i was falling myself lol 0.262167
25 135 and ahh exclamationpoint i love hey arnold exclamationpoint 0.599761
hey happyemoticon its exclamationpoint 0.53119
21 14 oh psych how you make my day 0.335252
oh my god oh my god oh my god 0.556064