Query Rewriter

Authors: Shiladitya Sen, Kishore Narendran

Reviewer: Chen Li (DONE)

Synopsis

The purpose of the "QueryRewriter" operator is to correct missing-space errors in a query that can lead to incorrect tokenization. For instance, the query "newyork" can be rewritten by this operator to "new york". The operator can be used to return either:

The most likely rewritten query found using a word-frequency dictionary; or
A set of valid rewritten queries.

Status

As of 6/3/2016: COMPLETED

Modules

edu.uci.ics.texera.dataflow.queryrewriter

Related Issues

Design: Query Rewriter Issue - https://github.com/Texera/texera/issues/29

Description

The operator inserts spaces into a query string to split it into likely words, and thereby rewrites the query. It has two implementations:

A dynamic programming algorithm that uses a word-frequency dictionary to find the most likely tokenization. This algorithm was adapted from the Chinese-character tokenization performed in the Srch2 Chinese Tokenization module (https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197). The word-frequency dictionary was derived from Google unigrams and the NLTK English dictionary. The score the algorithm assigns to each word is the reciprocal of its frequency, so more frequent words receive lower scores.
A recursive algorithm that uses an English dictionary (possibly without word frequencies) to find all valid tokenizations of a search string. This algorithm can be found here
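
The two approaches above can be illustrated with a minimal sketch. It is not the Texera implementation: the class name QueryRewriterSketch, the method names, and the toy dictionary with made-up frequencies are all hypothetical. It also assumes that the dynamic programming variant picks the segmentation with the lowest sum of reciprocal-frequency scores, which the description above does not state explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryRewriterSketch {

    // Hypothetical toy dictionary: word -> frequency (made-up numbers, for illustration only).
    static final Map<String, Integer> FREQ = new HashMap<>();
    static {
        FREQ.put("new", 500);
        FREQ.put("york", 200);
        FREQ.put("news", 100);
        FREQ.put("paper", 150);
        FREQ.put("newspaper", 300);
    }

    // Score of a word: reciprocal of its frequency, so frequent words cost less.
    static double score(String word) {
        return 1.0 / FREQ.get(word);
    }

    // Dynamic programming: assumed here to minimize the total score over all segmentations.
    // Returns the query unchanged when no segmentation into dictionary words exists.
    static String mostLikely(String query) {
        int n = query.length();
        double[] best = new double[n + 1]; // best[i] = lowest score for the prefix query[0, i)
        int[] prev = new int[n + 1];       // prev[i] = start index of the last word in that prefix
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                String word = query.substring(start, end);
                if (best[start] == Double.POSITIVE_INFINITY || !FREQ.containsKey(word)) {
                    continue;
                }
                double candidate = best[start] + score(word);
                if (candidate < best[end]) {
                    best[end] = candidate;
                    prev[end] = start;
                }
            }
        }
        if (best[n] == Double.POSITIVE_INFINITY) {
            return query;
        }
        // Walk the back-pointers from the end to recover the words in order.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(0, query.substring(prev[end], end));
        }
        return String.join(" ", words);
    }

    // Recursion: enumerates every segmentation into dictionary words.
    // Only key membership is used, so a dictionary without frequencies would work too.
    static List<String> allSegmentations(String query) {
        List<String> results = new ArrayList<>();
        if (query.isEmpty()) {
            results.add(""); // one way to segment the empty string: no words
            return results;
        }
        for (int end = 1; end <= query.length(); end++) {
            String head = query.substring(0, end);
            if (!FREQ.containsKey(head)) {
                continue;
            }
            for (String tail : allSegmentations(query.substring(end))) {
                results.add(tail.isEmpty() ? head : head + " " + tail);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(mostLikely("newyork"));         // new york
        System.out.println(mostLikely("newspaper"));       // newspaper
        System.out.println(allSegmentations("newspaper")); // [news paper, newspaper]
    }
}
```

With this toy dictionary, "newspaper" has two valid segmentations. The recursive method returns both, while the dynamic programming method keeps only "newspaper", because its single score (1/300) is lower than the combined score of "news" and "paper" (1/100 + 1/150). The plain recursion recomputes suffixes and can take exponential time on long queries; memoizing results per suffix would be one way to reduce that.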