Skip to content
/ simphon Public

Proof-of-concept for measuring similarity of phoneme sequences using locality sensitive hashing (LSH).

License

Notifications You must be signed in to change notification settings

zyocum/simphon

Repository files navigation

simphon

Proof-of-concept for measuring similarity of phoneme sequences using locality sensitive hashing (LSH).

Setup

Requires Python 3.10+

To install dependencies:

$ python3 -m venv simphon
$ source simphon/bin/activate
(simphon) $ pip3 install -U pip
(simphon) $ pip3 install -r requirements.txt

To use the virtual environment in the Jupyter notebook, run:

(simphon) $ ipython kernel install --user --name=simphon
(simphon) $ jupyter notebook simphon.ipynb

Then, choose the kernel with the name of the virtual environment:

Select the "simphon" kernel

Using simphon.py

There is a CLI you can run from simphon.py:

$ ./simphon.py -h
usage: simphon.py [-h] [-b BIT_SIZE] [-w WINDOW_SIZE] [-s SEED] [-q QUERIES [QUERIES ...]]

Script with a CLI to show a simple demonstration of phonemic token matching. The similarity of tokens is computed via simhashing of 2D matrices formed by stacking PHOIBLE phoneme feature vectors for sequences of phonemes.

options:
  -h, --help            show this help message and exit
  -b BIT_SIZE, --bit-size BIT_SIZE
                        LSH bit size (default: 256)
  -w WINDOW_SIZE, --window-size WINDOW_SIZE
                        sliding window size for comparing pairs in the list of pairs sorted by LSH bitwise difference (default: 2)
  -s SEED, --seed SEED  an integer seed value that will be added to the underlying hash seed (default: 0)
  -q QUERIES [QUERIES ...], --queries QUERIES [QUERIES ...]
                        query for similar tokens (default: [Token(language='eng', graphemes='Rafael Nadal', phonemes=('|', 'ɹ', 'ɑ', 'f', '', 'l', '|', 'n', 'ə', 'd', 'ə', 'l', '|'))])

Example output formatted with tabulate:

(simphon) $ ./simphon.py | tabulate -1s $'\t'                                                                                    
comparing pairs: 12088pair [02:44, 73.68pair/s]  
       a                                                                            b                                                                              simhash difference (in bits)    similarity score
-----  ---------------------------------------------------------------------------  ---------------------------------------------------------------------------  ------------------------------  ------------------
    0  (jpn) 〈かつら〉 /| k a tsː ɯ d̠ a |/                                         (jpn) 〈カツラ〉 /| k a tsː ɯ d̠ a |/                                                                      0               1
    1  (eng) 〈Zach〉 /| z æ k |/                                                   (eng) 〈Zack〉 /| z æ k |/                                                                                0               1
    2  (eng) 〈Kevin Rodney Sullivan〉 /| k ɛ v ɪ n | r ɑ d n i | s ʌ l v ə n |/    (eng) 〈Kevin Rodney Sullivan〉 /| k ɛ v ɪ n | r ɑ d n i | s ʌ l v ɪ n |/                               287               0.907
    3  (eng) 〈Don Scardino〉 /| d ɑ n | s k ɑ ɹ d i n oʊ |/                        (eng) 〈Don Scardino〉 /| d ɑ n | s k ɑ ɹ d ɪ n oʊ |/                                                   322               0.895
    4  (eng) 〈Adam Bernstein〉 /| æ d ə m | b ɛ r n s t aɪ n |/                    (eng) 〈Adam Bernstein〉 /| æ d ə m | b ɛ r n s t i n |/                                                328               0.893
    5  (eng) 〈Dominic Mitchell〉 /| d ɑ m ɪ n ɪ k | m ɪ t̠ʃ ə l |/                  (eng) 〈Dominique Mitchell〉 /| d ɑ m ɪ n i k | m ɪ t̠ʃ ə l |/                                           334               0.891
    6  (eng) 〈Denise Thé〉 /| d ə n ɪ s | t eɪ |/                                  (eng) 〈Denise Thé〉 /| d ə n ɪ z | t eɪ |/                                                             348               0.887
    7  (eng) 〈Denise Thé〉 /| d ə n ɪ s | θ eɪ |/                                  (eng) 〈Denise Thé〉 /| d ə n ɪ z | θ eɪ |/                                                             397               0.871
    8  (eng) 〈Denise Thé〉 /| d ə n ɪ z | t eɪ |/                                  (eng) 〈Denise Thé〉 /| d ə n ɪ z | θ eɪ |/                                                             431               0.86
    9  (eng) 〈Daniel T. Thomsen〉 /| d æ n j ə l | t i | t ɑ m s ə n |/            (eng) 〈Daniel Thomsen〉 /| d æ n j ə l | t ɑ m s ə n |/                                                432               0.859
   10  (eng) 〈Denise Thé〉 /| d ə n ɪ s | t eɪ |/                                  (eng) 〈Denise Thé〉 /| d ə n ɪ s | θ eɪ |/                                                             452               0.853
   11  (heb) 〈צוֹפִית〉 /| ts o f i t |/                                             (heb) 〈צוּפִית〉 /| ts u f i t |/                                                                        460               0.85
   12  (eng) 〈Jenny〉 /| d̠ʒ ɛ n i |/                                               (eng) 〈Johnny〉 /| d̠ʒ ɑ n i |/                                                                         500               0.837
   13  (eng) 〈Colleen McGuinness〉 /| k ɑ l ɪ n | m ə k ɪ n ə s |/                 (eng) 〈Colleen McGuinness〉 /| k ɑ l ɪ n | m ə ɡ w ɪ n ə s |/                                          504               0.836
   14  (eng) 〈Anna Foerster〉 /| æ n ə | f ɔ ɹ s t ɹ |/                            (eng) 〈Anna Foster〉 /| æ n ə | f ɔ s t ɹ |/                                                           528               0.828
   15  (eng) 〈Dave Finkel〉 /| d eɪ v | f ɪ ŋ k ə l |/                             (eng) 〈David Finkel〉 /| d eɪ v ɪ d | f ɪ ŋ k ə l |/                                                   540               0.824
   16  (eng) 〈Zachariah〉 /| z æ k ə ɹ aɪ ə |/                                     (eng) 〈Zachary〉 /| z æ k ə ɹ i |/                                                                     557               0.819
   17  (eng) 〈Amit〉 /| ɑ m i t |/                                                 (heb) 〈עמת〉 /| a m ɪ t |/                                                                             558               0.818
   18  (eng) 〈Zach〉 /| z æ k |/                                                   (heb) 〈זך〉 /| z a k |/                                                                                559               0.818
   19  (eng) 〈Zack〉 /| z æ k |/                                                   (heb) 〈זך〉 /| z a k |/                                                                                559               0.818
   20  (eng) 〈Andrew Seklir〉 /| æ n d ɹ u | s ɛ k l ə ɹ |/                        (eng) 〈Andrew Seklir〉 /| æ n d ɹ u | s ɛ k l ɪ ɹ |/                                                   565               0.816
   21  (eng) 〈Brett〉 /| b ɹ ɛ t |/                                                (eng) 〈Brett Baer〉 /| b ɹ ɛ t | b ɛ ɹ |/                                                              567               0.815
   22  (eng) 〈Adam Bernstein〉 /| æ d ə m | b ɛ r n s t i n |/                     (eng) 〈Bernstein, Adam〉 /| b ɛ r n s t i n | æ d ə m |/                                               568               0.815
   23  (eng) 〈Sophia〉 /| s oʊ f i ə |/                                            (eng) 〈Tsofit〉 /| s oʊ f i t |/                                                                       580               0.811
   24  (jpn) 〈ただよし〉 /| t a d a j o s i |/                                     (jpn) 〈まさよし〉 /| m a s a j o s i |/                                                                580               0.811
   25  (heb) 〈עמת〉 /| a m ɪ t |/                                                  (hin) 〈अमित〉 /| aː m ɪ t̪ |/                                                                           581               0.811
...

About

Proof-of-concept for measuring similarity of phoneme sequences using locality sensitive hashing (LSH).

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published