ACL Tips & Scripts – Fuzzy Joins

2012-09-01

ACL Tips & Scripts: Fuzzy Joins

Version 9.3 of ACL Analytics introduced us to the Fuzzy Duplicates command and two new functions that make use of the Levenshtein Distance. This month we will have a look at identifying fuzzy duplicates in different tables by performing a fuzzy join.

Let’s have a look at the data set below. We have a set of employees and a set of vendors and we would like to identify matches between the two based on their names.

Employees Vendors set
A simple join between the two tables on the “Name” and “Contact” fields will return no results. Prior to version 9.3, ACL users would perform a many-to-many Join with the FIND() or the SOUNDEX() functions to get a better match. However this would return quite a few false positives (for example “Alice Smith” and “Alice Jones”) and would still miss out obvious matches that have been misspelled (for example “Cooper” and “Kooper”). The reason why we do a many-to-many join instead of a simpler join is because it is the only type of join that allows us to filter on the secondary table. The filter we will create will compare the “Name” and the “Contact” fields using the LEVDIST() function.

For those who are not familiar with the LEVDIST() function, it returns the Levenshtein Distance between two strings. For example “Alice Cooper” and “Alice Kooper” have a Levenshtein Distance of 1 (i.e. one change needs to be made to match the two), “Axl Rose” and “Axel Roses” have a distance of 2 and “Alice Smith” and “Alice Jones” a distance of 5.

Employees Vendors set 2
Before we perform the many-to-many join between the two tables we need to take an additional step. We need a key field in both tables to join on that will always have the same value. For example, a computed field called “c_JoinKeyField” with a value of “X”. When joining the two tables on this new field, ACL will join EVERY record in the primary table to EVERY record in the secondary table. In this example, we have 18 records in one table and 17 in the other. An unfiltered many-to-many join on the “c_JoinKeyField” field will return 17 * 18 = 306 records in the new table.

mtm

However before we execute the join, we filter using the LEVDIST() function. In the expression builder of the Join command we are now given the option to select fields from the secondary table. If we want to see only exact matches and all records where the Levenshtein Distance is 1 then our filter should look like the following:

LEVDIST(ALLTRIM(Contact), ALLTRIM(Employees.Name)) <= 1
 expression

Adjust your filter as you see fit and execute the Join command. For a Levenshtein Distance of 1, our join will return:

  • “Denzel Washington” and “Densel Washington”
  • “Alice Kooper” and “Alice Cooper”
  • “Meet Loaf” and “Meat Loaf”
  • “Sean Penn” and “Sean Pen”
  • “Jamie Foxx” and “Jamie Fox”

By changing the filter so it includes all records with distance of 2 or less, our new join also output:

  • “Stephen Tyler” and “Steven Tyler”
  • “Axel Roses” and “Axl Rose”
The Fuzzy Join script

In the following script, we are going to use the example above with a few tweaks so it is easier to run as a script. This example has been written for ACL users that have attended an ACL 302 course or an ACL 303 course and that some experience in writting scripts.

COMMENT

*************************************************************

*

* Fuzzy Join example – Identify fuzzy joins

*

* Description: This script identifies possible fuzzy duplicate by performing

* a join using the LEVDIST() function

*

* Created By: Alex Psarras

* Created On: 13/09/2012

*

* version 1.0

*

*************************************************************

SET SAFETY OFF

SET EXACT ON

CLOSE PRI SEC

COM ** Step 1

COM ** Prompt the user for the Levenshtein Distance limit

COM ** otherwise, you could hardcode this step

ACCEPT “Enter the Levenshtein Distance limit:” TO v_LevDist

COM ** Step 2

COM ** Create the key field we are going to join on in both tables

OPEN Employees

DELETE FIELD c_JoinKeyField OK

DEFINE FIELD c_JoinKeyField COMPUTED “X”

OPEN Vendors

DELETE FIELD c_JoinKeyField OK

DEFINE FIELD c_JoinKeyField COMPUTED “X”

COM ** Step 3

COM ** Join Vendors to Employees using the many-to-many option

COM ** and filter using the LEVDIST() function

OPEN Vendors

OPEN Employees SECONDARY

JOIN PKEY c_JoinKeyField FIELDS VendorID Contact SKEY c_JoinKeyField WITH EmpID Name IF LEVDIST( ALLTRIM(Contact), ALLTRIM(Employees.Name)) <= %v_LevDist% TO “Fuzzy_Join” PRESORT MANY SECSORT

COM ** Step 4

COM ** Add the Levenshtein Distance to the resulting table

OPEN Fuzzy_Join

DELETE FIELD c_LevDist OK

DEFINE FIELD c_LevDist COMPUTED LEVDIST( ALLTRIM(Contact), ALLTRIM(Name))

SET SAFETY ON

 

Saturday, September 1, 2012 In: Hot Topics Comments (None)

Contact

Pricing

Demo