-
Notifications
You must be signed in to change notification settings - Fork 75
Aho Corasick String Matching Algorithm
Aho-Corasick Algorithm is a dictionary-matching algorithm to output all matched dictionary entries with a single pass on the input text.
Wiki Page: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
The linear time complexity can help to speed up the process of dictionary matcher on substring matching type.
Given all the dictionary entries as the input, Dictionary Matcher firstly preprocess them to construct an automaton once and save for later data stream to match.
The automaton starts with a prefix trie of all the words to be matched.
Then three main functions including success transactions, failure transactions and output matching words for each trie node will be set up using bread first search traversal on the trie:
The success transactions follows the edges in the trie to find the children of current trie node.
The failure transactions set up links between failed string matches and the nodes on other branches which share the longest common suffix.
The output list stores all the words ending at current node and its failure node.
When executing on the input text, the algorithm traverse the graph by firstly following the success transactions to the child node, if it doesn't exist, then following the failure transactions to its proper suffix node.
When the algorithm reaches a node with a non-empty output keyword list, it outputs all the matched dictionary entires that ends at current character position in the input text.