Hive UDF utility for fuzzy matching of two strings using the Jaro-Winkler (JW), Levenshtein (LV) or N-gram (NG) distance.
The fuzzy_match UDF is a wrapper around the string distance implementations available in the Lucene spell checker package.
This project also provides an example implementation of a Hive GenericUDF.
Param 1 : First string to match.
Param 2 : Second string to match with the first one.
Param 3 : Algorithm to use for matching : JW, LV or NG.
Return : Double, the distance between the two strings.
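To make the parameter contract above more concrete, here is a minimal plain-Java sketch of how a call such as fuzzy_match(a, b, "LV") could be evaluated. It is not the project's FuzzyMatch class and does not use Lucene: the class name FuzzyMatchSketch and its methods are hypothetical, only the LV branch is implemented, and the normalization (1 minus the edit distance divided by the longer length, so that 1.0 means identical strings) is an assumption modeled on how Lucene's StringDistance implementations report scores between 0 and 1.

```java
import java.util.Locale;

// Illustrative sketch only: a hypothetical stand-in, not the project's FuzzyMatch UDF.
class FuzzyMatchSketch {

    // Classic dynamic-programming edit distance (insert, delete, substitute),
    // keeping only two rows of the table in memory.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: score in [0, 1], where 1.0 means identical strings.
    static double levenshteinScore(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / longest;
    }

    // Hypothetical dispatcher mirroring the third parameter (JW, LV or NG).
    static double fuzzyMatch(String first, String second, String algo) {
        switch (algo.toUpperCase(Locale.ROOT)) {
            case "LV":
                return levenshteinScore(first, second);
            case "JW":
            case "NG":
                throw new UnsupportedOperationException(algo + " is not implemented in this sketch");
            default:
                throw new IllegalArgumentException("Unknown algorithm: " + algo);
        }
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatch("kitten", "sitting", "LV"));
    }
}
```

Under this assumed normalization, higher values mean closer strings, so it is worth checking on a few known pairs which direction the real UDF's return value runs before filtering on a threshold.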
fuzzy_match is a Maven project, so building and installing it is straightforward: run mvn clean install.
The build produces a fat jar that includes all the dependencies of the fuzzy_match UDF.
- Put the jar fuzzy_text-1.0-SNAPSHOT.jar in your home directory, in my case /home/ych/fuzzy_match
- In your Hive script or shell, add the following two lines:
add jar /home/ych/fuzzy_match/fuzzytext-1.0-SNAPSHOT-fat.jar;
CREATE TEMPORARY FUNCTION fuzzy_match as 'com.ych.fuzzytext.hive.udf.FuzzyMatch';
Start using fuzzy_match:
select a, b, fuzzy_match(a, b, "JW") from mytable;