A Hive UDF for fuzzy string matching using Jaro-Winkler, Levenshtein, or N-gram distance

ychantit/fuzzymatch_hiveUDF

fuzzy_match Hive UDF function

A Hive UDF utility method for fuzzy matching of two strings using Jaro-Winkler (JW), Levenshtein (LV), or N-gram (NG) distance.

The fuzzy_match UDF is a wrapper around the string distance calculations available in the Lucene spell checker package:

JaroWinklerDistance

LevensteinDistance

NGramDistance

This project also serves as an example implementation of a Hive GenericUDF.
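For intuition about the kind of score such a distance class produces, here is a minimal plain-Java sketch of a normalized Levenshtein similarity. It is a hypothetical stand-in, not the Lucene code and not this project's code: the class name `LevenshteinSketch` and its methods are invented for illustration. It assumes the convention described in the Lucene `StringDistance` documentation, where the score lies between 0 and 1 and 1 means the two strings are identical.

```java
// Hypothetical illustration only: a plain-Java normalized Levenshtein score.
// This is NOT the Lucene implementation wrapped by fuzzy_match.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b (two-row dynamic programming).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Score in [0, 1]: 1.0 for identical strings, 0.0 when every
    // character position of the longer string has to change.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        // "kitten" -> "sitting" needs 3 edits over a max length of 7
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

With this convention, a higher value means a closer match, which is worth keeping in mind when choosing a threshold on the value returned by a similar scorer.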

fuzzy_match Hive UDF input & output

Param 1 : First string to match.

Param 2 : Second string to match with the first one.

Param 3 : Algo to be used in matching : JW, LV or NG.

Return : Double, the distance separating the two strings.
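The third parameter selects the algorithm by a short code. One way to model that selection in Java is sketched below. This is a hypothetical illustration, not the project's actual code: the enum `MatchAlgo`, the method `fromCode`, the case-insensitive handling, and the choice to reject unknown codes with an exception are all assumptions. Only the three codes JW, LV and NG come from the description above.

```java
import java.util.Locale;

// Hypothetical sketch: mapping the third fuzzy_match argument to an algorithm.
// The real UDF may handle unknown or lower-case codes differently.
enum MatchAlgo {
    JW("Jaro-Winkler"),
    LV("Levenshtein"),
    NG("N-gram");

    private final String label;

    MatchAlgo(String label) {
        this.label = label;
    }

    String label() {
        return label;
    }

    // Parses a code such as "JW"; trimming and case-insensitivity are assumptions.
    static MatchAlgo fromCode(String code) {
        if (code == null) {
            throw new IllegalArgumentException("Algorithm code must not be null");
        }
        try {
            return valueOf(code.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                    "Unknown algorithm code '" + code + "', expected JW, LV or NG");
        }
    }

    public static void main(String[] args) {
        System.out.println(fromCode("JW").label());
    }
}
```

Validating the code once, before any rows are processed, keeps a typo in the third argument from surfacing as a confusing per-row failure.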

How to build the fuzzy_match project

fuzzy_match is a Maven project, so building and installing it only requires running mvn clean install. The build produces a fat jar that includes all the dependencies of the fuzzy_match UDF.

How to use the fuzzy_match method in a Hive script

  1. Put the jar fuzzytext-1.0-SNAPSHOT-fat.jar in a directory of your choice, in my case /home/ych/fuzzy_match

  2. In your Hive script or shell, add the following two lines:

     add jar /home/ych/fuzzy_match/fuzzytext-1.0-SNAPSHOT-fat.jar;
     CREATE TEMPORARY FUNCTION fuzzy_match as 'com.ych.fuzzytext.hive.udf.FuzzyMatch';
    

That's it, you are good to go!

Start using fuzzy_match:

    select a, b, fuzzy_match(a, b, "JW") from mytable;
