2012-08-01
Desktop version 9.3 builds on the ACL excellence, incorporating fuzzy match features which are extremely powerful. The latest version of ACL Analytics introduced us to a new command – FUZZYDUP – and two new functions – LEVDIST() and ISFUZZYDUP() – that rely on the Levenshtein Distance.
For those who are not familiar with it, the Levenshtein distance or, the edit distance as it is sometimes known, is a metric for calculating the difference between two strings. It was Vladimir Levenshtein who in 1965 first considered the metric as a means of quantifying error – a clever man! Through a series of substitutions, deletions and additions (each adding 1 to the score), the total transformation is minimised. By way of an illustration, take two words ‘Smith’ and ‘Smythe’.
The words ‘Smith’ and ‘Smythe’ have a Levenshtein Distance of 2 and we calculate it as follows:
Word 1 | S | m | i | t | h | |
Word 2 | S | m | y | t | h | e |
Score | 0 | 0 | 1 | 0 | 0 | 1 |
Used to detect nearly identical values (fuzzy duplicates) in a character field.
Syntax
FUZZYDUP ON {key_field} <OTHER fields> {LEVDISTANCE value} <DIFFPCT value> <RESULTSIZE value> <EXACT> <IF test> TO table_name <LOCAL> <OPEN>
Parameters
Note: Omitting the RESULTSIZE parameter can produce unduly large result sets that take a very long time to process, or can cause available memory to be exceeded, which terminates processing. Omit the parameter only if you are confident that the result set will be of a manageable size.
Syntax
GAPDUPn: Stores the total number of fuzzy duplicate groups identified by the command
Remarks
You can use the FUZZYDUP command to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. Unlike the ISFUZZYDUP( ) function, which identifies an exhaustive list of fuzzy duplicates for a single character value, the FUZZYDUP command identifies all fuzzy duplicates in a field, organises them in groups, and outputs a non-exhaustive result set.
Non-exhaustive means that individual fuzzy duplicate groups in the output results may not contain all the fuzzy duplicates in a test field that are within the specified degree of difference of the group owner. However, if a group owner is a fuzzy duplicate of another value in the test field, the two values will appear together in a group somewhere in the output results. If producing an exhaustive list of fuzzy duplicates for a specific value in the test field is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose.
The FUZZYDUP command has a number of parameters that you can specify to control the degree of difference between fuzzy duplicates, and the size of the output results. You may need to try different combinations of parameters and settings to find out what works best for a particular data set.
When processing data, the FUZZYDUP command calculates the Levenshtein Distance between each evaluated pair of strings in the test field, and calculates the difference percentage. The Levenshtein Distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function. The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal ACL calculation, which uses the Levenshtein Distance between the two strings:
Levenshtein Distance / number of characters in the shorter string × 100 = difference percentage
For detailed information about the fuzzy duplicate difference settings, controlling result size, and fuzzy duplicate groups, and see the ACL User Guide.
The command is not case-sensitive, so “SMITH” is equivalent to “smith.” The command also automatically trims trailing blanks in fields, so there is no need to use the TRIM( ) function when specifying a field as a parameter.
Concatenating two or more test fields can improve the effectiveness of the FUZZYDUP command by increase the degree of uniqueness of the test values. The OMIT( ) function can also improve the effectiveness of the command by removing generic elements such as “Corporation” or “Inc.” from field values. Removal of generic elements focuses the FUZZYDUP string comparison on just the portion of the strings where a meaningful difference may occur.
The SOUNDSLIKE( ) and SOUNDEX( ) functions provide a method for comparing strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).
The following example tests a full name field (this is computed field that combines a first name and a last name field) for fuzzy duplicates and outputs the results to a new ACL table.
Returns the Levenshtein Distance between two specified strings, which is a measurement of how much the two strings differ.
Syntax
LEVDIST(string1, string2 )
Parameters
Output
Numeric. The value is the Levenshtein Distance between two strings.
Remarks
You can use this function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. The function also identifies exact duplicates.
The function returns the Levenshtein Distance between the two evaluated strings, which is a value representing the minimum number of single character edits required to make one string identical to the other string. The edits can be of three types: insertion, deletion, and substitution. The greater the Levenshtein Distance, the greater the difference between two strings. A distance of zero (0) means the strings are identical. Punctuation marks and special characters are treated as single characters, just like letters. Changing the case of a character counts as one substitution, unless you turn off case sensitivity using the case_sensitive parameter.
Levenshtein Distance takes the position of characters into account. The same characters ordered differently may result in a different Levenshtein Distance. For example, LEVDIST(“abc”, “dec”) = 2, whereas LEVDIST(“abc”, “cde”) = 3.
To ensure accurate results when using LEVDIST( ) to compare a literal string such as “Smith” with a character field, you must use the TRIM( ) function to remove trailing blanks from the field. The Levenshtein algorithm counts blanks as characters, so any trailing blanks are included in the calculation of the number of edits required to make two strings identical.
Examples
Example |
Return value |
LEVDIST(“smith”,”Smythe”) | 3 Two substitutions and one insertion are required to transform “smith” into “Smythe”. |
LEVDIST(“smith”,”Smythe”, F) | 2 Because case is ignored, only two substitutions are required to transform “smith” into “Smythe”. |
LEVDIST(ALLTRIM(Surname),”Smith”) | The Levenshtein Distance between each value in the “Surname” field and the string “Smith”. |
You can use the expression LEVDIST(ALLTRIM(Surname),”Smith”) to create a computed field, and then sort the computed field in ascending order to rank all values in the “Surname” field by their amount of difference from “Smith”.
To create a filter that isolates all values in the “Surname” field that are within a specified Levenshtein Distance of “Smith”, specify:
LEVDIST(TRIM(Surname),”Smith”)<3
Changing the number in the expression allows you to adjust the amount of Levenshtein Distance in the filtered values.
Returns a logical value indicating whether a string is a fuzzy duplicate of a comparison string.
Syntax
ISFUZZYDUP(string1, string2, levdist <,diffpct>)
Parameters
Output
Logical. Returns T (true) if the string parameter values are fuzzy duplicates, and F (false) otherwise.
Remarks
You can use the ISFUZZYDUP( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. Unlike the FUZZYDUP command, which identifies all fuzzy duplicates in a field, organises them in groups, and outputs a non-exhaustive result set, the ISFUZZYDUP( ) function can identify an exhaustive list of fuzzy duplicates for a single character value.
Exhaustive means that all values within the specified degree of difference of the test value are returned, regardless of their position in the test field relative to the test value. The function is useful if the non-exhaustive results produced by FUZZYDUP are not sufficient for the purposes of your analysis, and you need to directly scrutinise every fuzzy duplicate for a specific character value.
The function also identifies exact duplicates. Unlike the FUZZYDUP command, you cannot exclude exact duplicates when using the function.
When processing data, the ISFUZZYDUP( ) function calculates the Levenshtein Distance between the two evaluated strings, and calculates the difference percentage. The Levenshtein Distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function. The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal ACL calculation, which uses the Levenshtein Distance between the two strings:
Levenshtein Distance / number of characters in the shorter string × 100 = difference percentage
For the ISFUZZYDUP() function to evaluate to T (true), the Levenshtein Distance must be less than or equal to the levdist parameter value, and the difference percentage must be less than or equal to the diffpct parameter value (if specified).
The function is not case-sensitive, so “SMITH” is equivalent to “smith.” The function also automatically trims trailing blanks in fields, so there is no need to use the TRIM() function when specifying a field as a parameter.
Examples
Example |
Return value |
ISFUZZYDUP(“Smith”,”Smythe”,1,99) | F Two edits are required to transform “Smith” into “Smythe”, but the levdist parameter value is only 1. |
ISFUZZYDUP(“Smith”,”Smythe”,2,99) | T Two edits are required to transform “Smith” into “Smythe”, and the levdist parameter value is 2. |
ISFUZZYDUP(“Smith”,”Smythe”,2,35) | F Two edits are required to transform “Smith” into “Smythe”, the shorter string is 5 characters, yielding a difference percentage of 40 (2/5 * 100), which is not less than or equal to the diffpct parameter value of 35. |
ISFUZZYDUP(Surname,”Smith”,3,99) | A logical value (T or F) indicating whether individual values in the “Surname” field are fuzzy duplicates for the string “Smith”. |
To create a filter that isolates all values in the “Surname” field that are fuzzy duplicates for “Smith”, specify:
ISFUZZYDUP(Surname,”Smith”,3,99)
Changing the levdist or diffpct parameter values allows you to adjust the amount of difference in the filtered values.