xc-jp/sphinx-tsegsearch

sphinx-tsegsearch

A Sphinx extension that tokenizes Japanese query words with TinySegmenter.js

This extension tweaks the searchtools.js of Sphinx-generated HTML documents so that Japanese compound words in a search query are tokenized.

Since Japanese is an agglutinative language, a search query often takes a compound form such as 'システム標準' (system standard). This makes it difficult to find pages containing phrases such as 'システムの標準' or '標準システム', because TinySegmenter.py (Sphinx's default Japanese index tokenizer) indexes 'システム' and '標準' as separate terms, so the unsplit compound query matches neither.

sphinx-tsegsearch patches searchtools.js to override the query tokenization step so that the query input is re-tokenized by TinySegmenter.js (the original JavaScript implementation of TinySegmenter). Roughly speaking, this tiny hack improves the recall of Japanese document search at the expense of precision.
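
The effect can be illustrated with a small Python sketch. It is a toy model only: the vocabulary, the greedy splitter `toy_tokenize`, the page names, and the AND-style `search` function are all hypothetical stand-ins chosen for illustration, and they do not reproduce TinySegmenter's algorithm or the actual logic in searchtools.js.

```python
# Toy illustration only: VOCAB and toy_tokenize are hypothetical stand-ins,
# not TinySegmenter's actual segmentation algorithm.
VOCAB = ["システム", "標準", "の"]


def toy_tokenize(text):
    """Greedy longest-match split against VOCAB; unknown characters become single tokens."""
    candidates = sorted(VOCAB, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        match = next((w for w in candidates if text.startswith(w, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical pages and their text.
PAGES = {
    "page_a": "システムの標準",
    "page_b": "標準システム",
    "page_c": "システム",
}

# Build a term -> pages index, tokenizing each page at "build time".
INDEX = {}
for name, body in PAGES.items():
    for term in toy_tokenize(body):
        INDEX.setdefault(term, set()).add(name)


def search(terms):
    """Return the pages that contain every query term (simplified AND search)."""
    results = None
    for term in terms:
        hits = INDEX.get(term, set())
        results = hits if results is None else results & hits
    return results or set()


query = "システム標準"
without_split = search([query])             # compound query used as one term
with_split = search(toy_tokenize(query))    # query re-tokenized before lookup

print(sorted(without_split))
print(sorted(with_split))
```

In this toy model the unsplit query finds nothing, while the re-tokenized query finds both pages that contain the two terms, regardless of their order or of the particle between them. That looser matching is the recall gain and the precision cost described above.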

Usage:

  1. Add 'sphinx_tsegsearch' to the extensions list in conf.py.
  2. Rebuild the documentation.
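
As a minimal sketch of step 1, a conf.py might contain the lines below. The `language = 'ja'` setting is an assumption (this README does not state which language settings the project expects), and the extension module is assumed to be importable from the build environment.

```python
# conf.py (fragment)

# Assumption: the project is built as a Japanese-language document so that
# Sphinx uses its Japanese search index; this README does not specify it.
language = 'ja'

# Enable the extension; 'sphinx_tsegsearch' must be importable by Sphinx.
extensions = [
    'sphinx_tsegsearch',
]
```

After editing conf.py, rebuild the HTML output so that the patched searchtools.js is regenerated.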
