These scripts build Lucene Kuromoji or Atilika Kuromoji with the mecab-ipadic-NEologd dictionary bundled.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene Kuromoji is the morphological analyzer included in Apache Lucene.
Atilika Kuromoji is an open source Japanese morphological analyzer written in Java.
mecab-ipadic-NEologd is a neologism dictionary for MeCab.
Note: These build scripts support only the IPA dictionary.
Lucene Kuromoji: 4.x, 5.x, 6.x, 7.x, 8.x
Atilika Kuromoji: 0.9.0
To use these scripts, you must install the following software.
- JDK (Oracle JDK / OpenJDK)
- Apache Ant (for Lucene Kuromoji)
- Apache Maven (for Atilika Kuromoji)
- Git
- make
- curl
- iconv
- xz
- Perl
- MeCab (mecab/mecab-config) (optional, auto install)
- C++ Compiler (when MeCab is installed automatically)
Note: A build uses a lot of CPU and memory. About 5-6 GB of JVM heap is currently needed.
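Before starting a long build, it can help to confirm that the prerequisites are on the PATH. The following is a minimal sketch, not part of the build scripts: the function name `missing_commands` is hypothetical, and the command names (`java`, `ant`, `mvn`, and so on) are assumptions about how the software listed above is typically installed. MeCab is left out because it is optional.

```shell
# Print every command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Command names are assumptions about a typical installation;
# ant is needed only for Lucene Kuromoji, mvn only for Atilika Kuromoji.
missing=$(missing_commands java ant mvn git make curl iconv xz perl)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names are joined by spaces.
    echo "Missing commands:" $missing
else
    echo "All prerequisite commands found."
fi
```

Drop `ant` or `mvn` from the list if you only build one of the two Kuromoji variants.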
$ git clone https://github.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript
or
$ wget https://raw.githubusercontent.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript/master/build-lucene-kuromoji-with-mecab-ipadic-neologd.sh
Grant execute permission to the script.
$ chmod a+x build-lucene-kuromoji-with-mecab-ipadic-neologd.sh
Run the script from any directory.
$ /path/to/build-lucene-kuromoji-with-mecab-ipadic-neologd.sh
When the script starts, it prints the settings in effect.
### [2016-12-18 17:57:02] [main] [INFO] START.
####################################################################
applied build options.
[Auto Install MeCab Version ] ... mecab-0.996
[mecab-ipadic-NEologd Tag (-N)] ... master
*** deprecated option ***
[install adjective ext (-T)] ... 0
*** deprecated option ***
[Max BaseForm Length ] ... 15
[Lucene Version Tag (-L)] ... releases/lucene-solr/6.3.0
[Kuromoji build Max Heapsize (-M)] ... 6g
[Kuromoji JAR File Output Directory Name (-o)] ... .
[Kuromoji Package Name (-p)] ... org.apache.lucene.analysis.ja
####################################################################
The built JAR file is created in the user-specified directory (default: the current directory, where you ran the script).
$ ls -l
total 51832
-rw-rw-r-- 1 xyz xyz 51655324 Dec 18 18:05 lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar
drwxrwxr-x 6 xyz xyz 4096 Dec 18 18:02 lucene-solr
drwxrwxr-x 8 xyz xyz 4096 Jul 23 00:32 mecab
drwxr-xr-x 8 xyz xyz 4096 Jul 23 00:31 mecab-0.996
-rw-rw-r-- 1 xyz xyz 1398663 Jul 23 00:31 mecab-0.996.tar.gz
drwxrwxr-x 9 xyz xyz 4096 Dec 18 17:59 mecab-ipadic-neologd
In this case, the JAR file that was built is "lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar".
The built JAR file is named as follows.
naming:
lucene-analyzers-kuromoji-ipadic-neologd-[Lucene Version]-[mecab-ipadic-NEologd dictionary date].jar
example:
lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar
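If another script needs to predict the file name, the naming rule above can be written as a small shell function. This is only an illustrative sketch; the function name `lucene_jar_name` is hypothetical and not part of the build script.

```shell
# Compose the JAR file name from a Lucene version and a dictionary date.
lucene_jar_name() {
    lucene_version=$1   # e.g. 6.3.0
    dict_date=$2        # e.g. 20161215
    printf 'lucene-analyzers-kuromoji-ipadic-neologd-%s-%s.jar\n' \
        "$lucene_version" "$dict_date"
}

lucene_jar_name 6.3.0 20161215
# prints: lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar
```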
- -N - branch or tag name of mecab-ipadic-NEologd to include in the build. default: master
- ***deprecated*** -T - install adjective ext. To enable it, specify 1. default: 0
- -L - branch or tag name of Apache Lucene to build. default: the current latest Apache Lucene release tag.
- -M - maximum heap size for the Kuromoji build.
- -o - output directory for the generated Kuromoji JAR file. default: . (current directory)
- -p - package name to use for the build. default: org.apache.lucene.analysis.ja (original package)
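As an illustration of how these options combine, the sketch below spells out the values that appear in the sample log above. The `SCRIPT` and `RUN` variables are conventions of this example only, and `/path/to/` is a placeholder for wherever you saved the script.

```shell
# Placeholder location of the downloaded script; adjust to your environment.
SCRIPT=/path/to/build-lucene-kuromoji-with-mecab-ipadic-neologd.sh

# With RUN=echo the command line is only printed.
# Set RUN= (empty) to actually start the build.
RUN=echo

$RUN "$SCRIPT" -N master -L releases/lucene-solr/6.3.0 -M 6g -o . \
    -p org.apache.lucene.analysis.ja
```

Printing the command first is a cheap way to double-check the tag name and heap size before committing to a build that takes several minutes.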
$ git clone https://github.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript
or
$ wget https://raw.githubusercontent.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript/master/build-atilika-kuromoji-with-mecab-ipadic-neologd.sh
Grant execute permission to the script.
$ chmod a+x build-atilika-kuromoji-with-mecab-ipadic-neologd.sh
Run the script from any directory.
$ /path/to/build-atilika-kuromoji-with-mecab-ipadic-neologd.sh
When the script starts, it prints the settings in effect.
### [2016-12-18 23:10:54] [main] [INFO] START.
####################################################################
applied build options.
[Auto Install MeCab Version ] ... mecab-0.996
[mecab-ipadic-NEologd Tag (-N)] ... master
*** deprecated option ***
[install adjective ext (-T)] ... 0
*** deprecated option ***
[Kuromoji Version Tag (-K)] ... 0.9.0
[Kuromoji build Max Heapsize (-M)] ... 7g
[Kuromoji JAR File Output Directory Name (-o)] ... .
[Kuromoji Package Name (-p)] ... com.atilika.kuromoji.ipadic
####################################################################
The built JAR file is created in the user-specified directory (default: the current directory, where you ran the script).
$ ls -l
total 133572
drwxrwxr-x 10 xyz xyz 4096 Dec 18 23:13 kuromoji
-rw-rw-r-- 1 xyz xyz 135352388 Dec 18 23:33 kuromoji-ipadic-neologd-0.9.0-20161215.jar
drwxrwxr-x 8 xyz xyz 4096 Dec 18 22:39 mecab
drwxr-xr-x 8 xyz xyz 4096 Dec 18 22:39 mecab-0.996
-rw-rw-r-- 1 xyz xyz 1398663 Jul 23 00:32 mecab-0.996.tar.gz
drwxrwxr-x 9 xyz xyz 4096 Dec 18 23:11 mecab-ipadic-neologd
In this case, the JAR file that was built is "kuromoji-ipadic-neologd-0.9.0-20161215.jar".
The built JAR file is named as follows.
naming:
kuromoji-ipadic-neologd-[Atilika Kuromoji Version]-[mecab-ipadic-NEologd dictionary date].jar
example:
kuromoji-ipadic-neologd-0.9.0-20161215.jar
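Going the other way, the version and dictionary date can be recovered from a file name with plain POSIX parameter expansion. This is a sketch under the assumption that neither the version nor the date contains a hyphen; the variable names are illustrative.

```shell
# Split a built JAR file name into version and dictionary date.
jar=kuromoji-ipadic-neologd-0.9.0-20161215.jar
base=${jar%.jar}          # strip the .jar extension
dict_date=${base##*-}     # text after the last hyphen
rest=${base%-*}           # drop the trailing "-<date>"
version=${rest##*-}       # text after the last remaining hyphen
echo "version=$version dict_date=$dict_date"
# prints: version=0.9.0 dict_date=20161215
```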
- -N - branch or tag name of mecab-ipadic-NEologd to include in the build. default: master
- ***deprecated*** -T - install adjective ext. To enable it, specify 1. default: 0
- -K - branch or tag name of Atilika Kuromoji to build. default: the current latest Atilika Kuromoji release tag.
- -M - maximum heap size for the Kuromoji build.
- -o - output directory for the generated Kuromoji JAR file. default: . (current directory)
- -p - package name to use for the build. default: com.atilika.kuromoji.ipadic (original package)
The scripts perform the following steps.
- Check whether MeCab is installed, and install it in the current directory if it is not
- Clone mecab-ipadic-NEologd
- Generate a dictionary CSV (using libexec/make-mecab-ipadic-neologd.sh -L)
- Clone Apache Lucene or Atilika Kuromoji source code
- (Lucene Kuromoji only) Edit Apache Lucene Kuromoji's build.xml
- Rename the package, if necessary
- Build Kuromoji and dictionary with mecab-ipadic-NEologd
- Copy JAR file to specified directory (default: current directory)
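The first step, deciding whether MeCab has to be installed, might look roughly like the sketch below. It is not the code used by the build scripts; the function name `needs_mecab_install` is hypothetical, and the check simply looks for the `mecab` and `mecab-config` commands named in the requirements.

```shell
# Return success (0) when MeCab would have to be installed locally,
# i.e. when mecab or mecab-config cannot be found on PATH.
needs_mecab_install() {
    if command -v mecab >/dev/null 2>&1 &&
       command -v mecab-config >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

if needs_mecab_install; then
    echo "MeCab not found: it would be installed in the current directory"
else
    echo "Using installed MeCab: $(command -v mecab)"
fi
```

When the automatic installation is needed, remember that a C++ compiler must also be available.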
Copyright © 2015, 2016, 2017, 2018, 2019 kazuhira-r
Licensed under the Apache License, Version 2.0