Command line scripts and callable Taggers

@polm polm released this 18 May 15:58
· 159 commits to master since this release

This isn't a drastic release, but since I've been dragging out the patch numbers it seemed like a good time to bump the minor version. This is v0.2.0! 🎉

The first feature in this release is the addition of command line scripts. Since it's possible to install fugashi without MeCab, you might not have a mecab command-line binary. This release fixes that, so you can use the fugashi script as a replacement for mecab. There's also the fugashi-info script, which is similar to mecab -D in that it prints dictionary information. I hope it will be useful when dealing with bugs and installation issues.
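If you're not sure whether the scripts ended up on your PATH after installing, a quick check like the one below might help. This is only a sketch using the standard library; the helper name find_scripts is made up for illustration, and it only reports where the commands are found, without running them.

```python
import shutil


def find_scripts(names=("fugashi", "fugashi-info")):
    """Map each command name to its full path, or None if it is not on PATH.

    find_scripts is a hypothetical helper, not part of fugashi.
    """
    return {name: shutil.which(name) for name in names}


for name, path in find_scripts().items():
    # A missing script usually means the package's scripts directory is not on PATH.
    print(f"{name}: {path or 'not found on PATH'}")
```

If both are reported as missing even though fugashi imports fine, the usual suspect is a virtual environment or user install whose scripts directory isn't on PATH.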

The other feature is that Tagger instances are now callable. One of the best features of fugashi is that it makes it much easier to work with MeCab nodes, but the function associated with that - parseToNodeList - had an unfortunately long name. I didn't want to call it parse since that already has a meaning in MeCab, but giving it a different name felt odd... so I realized the easiest thing is to make the Tagger instance itself callable. Here's an example of the change this makes possible:

from fugashi import Tagger
tagger = Tagger()
text = "麩菓子は、麩を主材料とした日本の菓子。"  # any Japanese text to tokenize

# before
for word in tagger.parseToNodeList(text):
    print(word.surface)

# after
for word in tagger(text):
    print(word.surface)

Feels better, doesn't it? I imagine this will be particularly helpful for compact expressions like list comprehensions. And parseToNodeList is still there, so existing code can be used unmodified.
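For anyone curious how a callable instance can sit alongside the old method, here's a minimal sketch of the general pattern. To be clear, this is not fugashi's actual implementation: ToyTagger is a hypothetical stand-in that splits on whitespace so the example runs without MeCab or a dictionary, and only the parseToNodeList name is borrowed from the real API.

```python
class ToyTagger:
    """Hypothetical stand-in for a tagger; splits on whitespace instead of using MeCab."""

    def parseToNodeList(self, text):
        # The original, longer-named method stays available for existing code.
        return text.split()

    def __call__(self, text):
        # Calling the instance simply delegates to the existing method.
        return self.parseToNodeList(text)


tagger = ToyTagger()
text = "callable taggers read nicely"

# Both spellings return the same result.
same = tagger(text) == tagger.parseToNodeList(text)

# The short form keeps a list comprehension compact.
lengths = [len(word) for word in tagger(text)]
print(same, lengths)
```

With the real Tagger the comprehension would look much the same, for example collecting word.surface for each word in tagger(text).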

Lately I've been working more on optimizing SudachiPy than fugashi, but there are still ease-of-use improvements to be made here, and anything that works here can be useful in other tokenizers too. If there's anything you'd like to see, let me know.