Releases: polm/fugashi
Version 0.1.10: Python 3.5+ and other features
This release includes a number of small fixes since 0.1.9 and two more significant changes.
Unidic 26 Field Format Support
Unidic has a surprising variety of formats, and the 26-field variety wasn't previously supported. This format includes kana accent information and is notably used in the binary distribution of Unidic 2.1.2.
Support for Python 3.5, 3.6
Support for these versions was initially removed due to their short remaining lifespan and the lack of a defaults option in the namedtuple constructor. @tamuhey made the necessary changes to get them working so they're supported for now; thanks!
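For context, the `defaults` argument to `namedtuple` only arrived in Python 3.7. Here's a minimal sketch of one common workaround for older interpreters: setting `__defaults__` on the generated class's `__new__`. The helper name `make_wrapper` is hypothetical, and this isn't necessarily the approach the actual patch took.

```python
from collections import namedtuple


def make_wrapper(name, fields, default=None):
    """Build a namedtuple whose fields all default to `default`.

    Works on Python 3.5+ without the `defaults` keyword (added in 3.7).
    """
    wrapper = namedtuple(name, fields)
    # __new__ takes one positional argument per field, so supplying one
    # default per field makes every field optional.
    wrapper.__new__.__defaults__ = (default,) * len(wrapper._fields)
    return wrapper


# Hypothetical field names, for illustration only.
Features = make_wrapper('Features', ['lemma', 'alpha', 'beta'])
partial = Features('taberu')
print(partial)  # Features(lemma='taberu', alpha=None, beta=None)
```

Fields that aren't supplied simply come back as None, which is handy when a dictionary entry has fewer fields than expected.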
Other Changes
- dummy mecabrc specification for bundled Unidic support (still a work in progress)
- test fixes and documentation
- handle comma-separated values inside fields
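To illustrate the last item: a naive `split(',')` breaks when a field itself contains commas. Below is a rough sketch using the standard `csv` module, assuming such fields are wrapped in double quotes. The feature string is made up for illustration, and this isn't necessarily how fugashi handles it internally.

```python
import csv


def split_features(raw):
    """Split a comma-separated feature string, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines; we pass a single line.
    return next(csv.reader([raw]))


# Hypothetical feature string whose second field contains a comma.
raw = 'noun,"a,b",lemma'

print(raw.split(','))        # naive split cuts the quoted field in two
print(split_features(raw))   # quote-aware split keeps it whole
```

The naive split yields four pieces where there are really three fields, which would shift every later field into the wrong slot.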
Upcoming Changes
I'm working on creating a bundled version of Unidic. Modern versions of Unidic are too large to distribute via PyPI, so I'm figuring out the best way to distribute the data.
Generic Dictionary Support in v0.1.8
v0.1.8 of fugashi adds support for generic dictionaries. You can now use IPADic or other dictionaries by using a GenericTagger the same way you would use the normal Tagger:
import fugashi
tagger = fugashi.GenericTagger('-d/usr/local/lib/mecab/dic/ipadic')
It's also possible to specify dictionary fields so you can get convenient access to features no matter what dictionary you use.
import fugashi
# the wrapper is just a namedtuple with a default value of None for all fields
MyDictFeatures = fugashi.create_dict_wrapper('MyDictFeatures', 'lemma alpha beta'.split())
tagger = fugashi.GenericTagger('-d/usr/local/lib/mecab/dic/customdic', MyDictFeatures)
nodes = tagger.parseToNodes('blah blah')
node = nodes[0]
print(node.lemma, node.alpha, node.beta)
Some other changes:
- the raw feature string is now available as .feature_raw on nodes
- packaging-related fixes
- initial mecab-ko-dic (Korean) support; needs more testing