nbayesfilter

Naive Bayes filter for Korean bad characters. Written by Hyun Joon Seol ([email protected]).

The Problem

Typographical errors are among the most common problems in data-driven text processing in any language, and Korean is no exception. In a large corpus, even a small fraction of erroneous entries becomes a real problem when their counts are high enough to survive pruning and enter the lexicon. Those entries take up slots that could go to less frequently searched queries, and they can keep a query search from returning correct results. This project aims to detect these bad characters with a data-driven Bernoulli Naive Bayes approach. It uses the scikit-learn package and models each query as character-based bigrams.

Training Corpus

The training corpus consists of two files, correct.txt and error.txt. They contain, respectively, manually checked queries that seem to be wrong but are correct, and queries that seem to be wrong and actually are wrong. The labeling was done with the help of two part-time employees. The initial training set, before labeling, comes from a quick script that detects uncommon characters in Korean (double final consonants, uncommon diphthongs, etc.). These characters are defined in the file bad.py.
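
One way such a candidate-detection script could work is sketched below. It is not the contents of bad.py: the names decompose and is_suspicious and the sets DOUBLE_FINALS and RARE_MEDIALS are hypothetical, and the choice of which jamo count as "uncommon" is an assumption. The sketch relies only on the standard arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 to U+D7A3).

```python
# Jamo tables in Unicode order for precomposed Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = ["", *"ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"]

# Assumed definitions of "uncommon"; the real lists live in bad.py.
DOUBLE_FINALS = set("ㄳㄵㄶㄺㄻㄼㄽㄾㄿㅀㅄ")  # compound final consonants
RARE_MEDIALS = set("ㅒㅞ")                      # hypothetical rare diphthongs

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3


def decompose(syllable):
    """Return (initial, medial, final) jamo for one Hangul syllable, else None."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return None
    index = code - HANGUL_BASE
    # Each initial spans 21 * 28 = 588 code points; each medial spans 28.
    return (INITIALS[index // 588], MEDIALS[(index % 588) // 28], FINALS[index % 28])


def is_suspicious(query):
    """Flag a query containing any syllable with an assumed-uncommon jamo."""
    for char in query:
        jamo = decompose(char)
        if jamo is None:
            continue  # skip spaces, Latin letters, digits, and so on
        _, medial, final = jamo
        if final in DOUBLE_FINALS or medial in RARE_MEDIALS:
            return True
    return False


candidates = [q for q in ["날씨", "닭 요리", "값 비교", "abc"] if is_suspicious(q)]
```

A filter like this deliberately over-flags: valid words such as 닭 and 값 contain compound finals, which is why the flagged queries still need manual labeling into correct.txt and error.txt.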

