sclust

simple sentence clusterer

A tool to cluster lines of text in a streaming fashion.

For example, suppose you have a file like this, with one sentence per line:

$ cat /tmp/foo
Hi there, how are you?
hi where how you are
i like to sing
I am going to sing
hi where how you are
hi there how...
do you sing???

You can pipe it to sclust to get cluster assignments:

$ cat /tmp/foo | sclust
0	Hi there, how are you?	-
1	hi where how you are	-
2	i like to sing	-
2	I am going to sing	0.298455
1	hi where how you are	0.336248
0	hi there how...	0.206029
3	do you sing???	-

Here, the first column is the cluster assignment, the second column is the input line, and the third column is the cosine similarity between that line and the cluster it is assigned to. In this example, a line that starts a new cluster shows - in the third column. The algorithm is online (as opposed to batch), so the order of the input lines matters: each line is read once and assigned to its closest cluster, according to cosine similarity.
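
The assignment rule described above can be sketched in Python. This is a minimal illustration, not sclust's actual implementation: the tokenizer, the raw term-count vectors, the summed-count centroids, and the names tokenize, cosine, and online_cluster are all assumptions made for the example, so it will not reproduce the similarity values shown above.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase the line and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", line.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def online_cluster(lines, threshold=0.2):
    """Yield (cluster_id, line, similarity) for each line, in input order.

    similarity is None when the line starts a new cluster.
    """
    centroids = []  # hypothetical representation: summed token counts per cluster
    for line in lines:
        vec = Counter(tokenize(line))
        best_id, best_sim = None, 0.0
        # Find the existing cluster most similar to this line.
        for cid, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= threshold:
            # Similar enough: merge the line into that cluster.
            centroids[best_id].update(vec)
            yield best_id, line, best_sim
        else:
            # Nothing similar enough: the line starts a new cluster.
            centroids.append(vec)
            yield len(centroids) - 1, line, None


if __name__ == "__main__":
    for cid, text, sim in online_cluster(["a b c", "a b d", "x y z", "x y"], threshold=0.5):
        print(cid, text, "-" if sim is None else round(sim, 6), sep="\t")
```

Because the clusters are updated as each line arrives, feeding the same lines in a different order can produce different assignments.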

Other options:

$ sclust --help
A command-line tool to quickly cluster sentences.

usage:
    sclust [--help --threshold <T> --update-norms <N> --prune-frequency <P>]

Options
    -h, --help
    -p, --prune-frequency <P>   Delete small clusters every P lines [default: -1]
    -t, --threshold <N>         Similarity threshold in [0,1]. Higher means sentences must be more similar to be merged. [default: .2]

There is also a tool, sclust-summarize, for viewing the output.

$ sclust-summarize --help

A command-line tool to print the top clusters output by sclust in a streaming fashion.
E.g., cat data.txt | sclust | sclust-summarize

usage:
    sclust-summarize [--help --frequency <F> --num-docs-to-print <N> --num-clusters-to-print <K>]

Options
    -h, --help
    -f, --frequency <F>               Print clusters every F lines [default: 1000]
    -n, --num-clusters-to-print <N>   Number of top clusters to print [default: 10]
    -k, --num-docs-to-print <K>       Number of documents per cluster to print [default: 3]

E.g.,

$ cat /tmp/foo | sclust | sclust-summarize -k 3

---------7 documents, 4 clusters---------

2	0	Hi there, how are you?	-
 	 	hi there how...	0.206029
2	1	hi where how you are	-
 	 	hi where how you are	0.336248
2	2	i like to sing	-
 	 	I am going to sing	0.298455
1	3	do you sing???	-

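Here, the first column is the cluster frequency (the number of documents assigned to the cluster so far) and the second column is the cluster id. The remaining columns are the document text and its similarity score, as in the sclust output. With -k, you can optionally print the k most recent documents added to each cluster.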
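
The same kind of summary can be computed from sclust's tab-separated output with a few lines of Python. This is a sketch for illustration only, not how sclust-summarize is implemented; the function name summarize and its parameters are assumptions, and it processes the whole input at once rather than printing every F lines.

```python
from collections import Counter, defaultdict, deque


def summarize(lines, num_clusters=10, num_docs=3):
    """Return [(frequency, cluster_id, recent_docs)] for the largest clusters.

    Each input line is expected to look like "cluster_id<TAB>text<TAB>score".
    """
    sizes = Counter()
    # Keep only the num_docs most recent documents per cluster.
    recent = defaultdict(lambda: deque(maxlen=num_docs))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split off the first and last fields so tabs inside the text survive.
        cluster_id, rest = line.split("\t", 1)
        text, _score = rest.rsplit("\t", 1)
        sizes[cluster_id] += 1
        recent[cluster_id].append(text)
    return [
        (freq, cid, list(recent[cid]))
        for cid, freq in sizes.most_common(num_clusters)
    ]


if __name__ == "__main__":
    import sys

    for freq, cid, docs in summarize(sys.stdin, num_clusters=10, num_docs=3):
        print(freq, cid, docs[0], sep="\t")
        for doc in docs[1:]:
            print(" ", " ", doc, sep="\t")
```

Saved under a hypothetical name such as summarize_sketch.py, it could be run as: cat /tmp/foo | sclust | python summarize_sketch.py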

Because both tools work in a streaming fashion, you can pipe in a large file and watch the clusters change as documents are read. E.g.,

$ cat /tmp/a_bunch_of_tweets | sclust -t .2 -u 1000 | sclust-summarize -k 2 -n 5

---------1000 documents, 715 clusters---------

15	68	lol	0.73664
 	 	i need matching jewelry me lj lol	0.215373
9	93	if you love watching huge zits getting popped then you re gonna love this it like therapeutic for me webaddress	0.273418
 	 	i love jesus exclamationpoint	0.314281
9	324	no one going to pay for content what they used to pay doesn t matter what it is	0.29804
 	 	what	0.646988
9	14	studing exclamationpoint studing exclamationpoint oh sh	0.23397
 	 	oh yeah this woman knows what she talkin about exclamationpoint	0.232043
7	160	rt	0.615143
 	 	rt plz rt fast waystation needs help evacuating animals exclamationpoint now exclamationpoint little tujunga canyon rt	0.269342

---------2000 documents, 1209 clusters---------

21	98	comendo happyemoticon	0.281718
 	 	we are better now and for good this time happyemoticon i love her happyemoticon	0.378463
19	68	lol who da hell u talkin bout	0.202604
 	 	see lol	0.448003
17	93	i love it here	0.512499
 	 	i love target and my lovelies	0.254959
16	14	oh yeah exclamationpoint lol	0.566714
 	 	uh oh	0.391978
15	135	lol i love hey arnold	0.642059
 	 	hey diggy	0.357601

---------3000 documents, 1621 clusters---------

29	93	i love you baby sister	0.349325
 	 	i love you diva o	0.366829
28	98	and im luvin it happyemoticon	0.263552
 	 	heyy everyone happyemoticon how are ya	0.225831
27	68	lol well i txt her just put jessica hahaha	0.207772
 	 	my bad i was falling myself lol	0.262167
25	135	and ahh exclamationpoint i love hey arnold exclamationpoint	0.599761
 	 	hey happyemoticon its exclamationpoint	0.53119
21	14	oh psych how you make my day	0.335252
 	 	oh my god oh my god oh my god	0.556064
