Version 9.3 of ACL Analytics introduced us to the Fuzzy Duplicates command and two new functions that make use of the Levenshtein Distance. This month we will have a look at identifying fuzzy duplicates in different tables by performing a fuzzy join.
Let’s have a look at the data set below. We have a set of employees and a set of vendors and we would like to identify matches between the two based on their names.
For those who are not familiar with the LEVDIST() function, it returns the Levenshtein Distance between two strings. For example “Alice Cooper” and “Alice Kooper” have a Levenshtein Distance of 1 (i.e. one change needs to be made to match the two), “Axl Rose” and “Axel Roses” have a distance of 2 and “Alice Smith” and “Alice Jones” a distance of 5.
By changing the filter so it includes all records with distance of 2 or less, our new join also output:
In the following script, we are going to use the example above with a few tweaks so it is easier to run as a script. This example has been written for ACL users that have attended an ACL 302 course or an ACL 303 course and that some experience in writting scripts.
COMMENT
*************************************************************
*
* Fuzzy Join example – Identify fuzzy joins
*
* Description: This script identifies possible fuzzy duplicate by performing
* a join using the LEVDIST() function
*
* Created By: Alex Psarras
* Created On: 13/09/2012
*
* version 1.0
*
*************************************************************
SET SAFETY OFF
SET EXACT ON
CLOSE PRI SEC
COM ** Step 1
COM ** Prompt the user for the Levenshtein Distance limit
COM ** otherwise, you could hardcode this step
ACCEPT “Enter the Levenshtein Distance limit:” TO v_LevDist
COM ** Step 2
COM ** Create the key field we are going to join on in both tables
OPEN Employees
DELETE FIELD c_JoinKeyField OK
DEFINE FIELD c_JoinKeyField COMPUTED “X”
OPEN Vendors
DELETE FIELD c_JoinKeyField OK
DEFINE FIELD c_JoinKeyField COMPUTED “X”
COM ** Step 3
COM ** Join Vendors to Employees using the many-to-many option
COM ** and filter using the LEVDIST() function
OPEN Vendors
OPEN Employees SECONDARY
JOIN PKEY c_JoinKeyField FIELDS VendorID Contact SKEY c_JoinKeyField WITH EmpID Name IF LEVDIST( ALLTRIM(Contact), ALLTRIM(Employees.Name)) <= %v_LevDist% TO “Fuzzy_Join” PRESORT MANY SECSORT
COM ** Step 4
COM ** Add the Levenshtein Distance to the resulting table
OPEN Fuzzy_Join
DELETE FIELD c_LevDist OK
DEFINE FIELD c_LevDist COMPUTED LEVDIST( ALLTRIM(Contact), ALLTRIM(Name))
SET SAFETY ON
DISCLAIMER:
DataConsulting Ltd. provide the script “as is” and free of charge. DataConsulting: (a) Do not provide support for these scripts; (b) Make no warranties or representations, expressed or implied, with respect to the script, including its fitness for a particular purpose, merchantability, durability, quality or its non-infringement; (c) Do not warrant that the script is free from errors; and (d) Will not be liable for any damages (including, but not limited to indirect damages such as lost profits and lost data) arising out of the use of, or the inability to use the script. You agree to assume all risk of loss or damage arising from the use of the script.