The Levenshtein Distance

2012-08-01

The Levenshtein Distance

Desktop version 9.3 builds on the ACL excellence, incorporating fuzzy match features which are extremely powerful. The latest version of ACL Analytics introduced us to a new command – FUZZYDUP – and two new functions – LEVDIST() and ISFUZZYDUP() – that rely on the Levenshtein Distance.

For those who are not familiar with it, the Levenshtein distance or, the edit distance as it is sometimes known, is a metric for calculating the difference between two strings. It was Vladimir Levenshtein who in 1965 first considered the metric as a means of quantifying error – a clever man! Through a series of substitutions, deletions and additions (each adding 1 to the score), the total transformation is minimised. By way of an illustration, take two words ‘Smith’ and ‘Smythe’.

The words ‘Smith’ and ‘Smythe’ have a Levenshtein Distance of 2 and we calculate it as follows:

Word 1 S m i t h
Word 2 S m y t h e
Score 0 0 1 0 0 1
The FUZZYDUP command

Used to detect nearly identical values (fuzzy duplicates) in a character field.

Syntax

FUZZYDUP ON {key_field} <OTHER fields> {LEVDISTANCE value} <DIFFPCT value> <RESULTSIZE value> <EXACT> <IF test> TO table_name <LOCAL> <OPEN>

Parameters

  • ON key_field: Specifies the character field or expression to test for fuzzy duplicates.
  • OTHER fields: Specifies a list of fields or expressions to include in the output.
  • LEVDISTANCE value: Specifies the maximum allowable Levenshtein Distance between two strings for them to be identified as fuzzy duplicates and included in the output results. The LEVDISTANCE value cannot be less than 1 or greater than 10.
  • DIFFPCT value: Optional. Specifies a threshold that limits the ‘difference percentage’ or the proportion of a string that can be different. The percentage that results from an internal ACL calculation performed on potential fuzzy duplicate pairs must be less than or equal to the DIFFPCT value for the pair to be included in the output results. The DIFFPCT value cannot be less than 1 or greater than 99. If the parameter is omitted the threshold is turned off and difference percentage is not considered during processing of the FUZZYDUP command.
  • RESULTSIZE value: Optional. Specifies the maximum size of the result set when calculated as a percentage of the test field size. The RESULTSIZE value cannot be less than 1 or greater than 1000 (one thousand) percent. If the parameter is omitted the threshold is turned off and result size is not considered during processing of the FUZZYDUP command.
  • EXACT: Optional. If specified, includes exact duplicates as well as fuzzy duplicates in the output results.
  • IF test: Optional. Specifies a condition that must be met. The command is only executed on records that pass the test.
  • TO table_name: Specify TO table_name to write the results to an ACL table. You must specify the table_name value as a quoted string with a .FIL file extension to create an ACL table. For example: TO “Output.FIL”. You can also specify an absolute or relative file path to an existing folder to write the .FIL file to. For example: TO “C:Output.FIL” or TO “ResultsOutput.FIL”
  • LOCAL: Optional. Specifies that the output file should be saved in the same location as the ACL project. This parameter only applies when the command is run on an ACL server table and the output file is an ACL table.
  • OPEN: Optional. Specifies that the table created by the command should be opened after the command executes. This parameter is only valid if the command creates an output table.

Note: Omitting the RESULTSIZE parameter can produce unduly large result sets that take a very long time to process, or can cause available memory to be exceeded, which terminates processing. Omit the parameter only if you are confident that the result set will be of a manageable size.

Syntax

GAPDUPn: Stores the total number of fuzzy duplicate groups identified by the command

Remarks

You can use the FUZZYDUP command to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. Unlike the ISFUZZYDUP( ) function, which identifies an exhaustive list of fuzzy duplicates for a single character value, the FUZZYDUP command identifies all fuzzy duplicates in a field, organises them in groups, and outputs a non-exhaustive result set.

Non-exhaustive means that individual fuzzy duplicate groups in the output results may not contain all the fuzzy duplicates in a test field that are within the specified degree of difference of the group owner. However, if a group owner is a fuzzy duplicate of another value in the test field, the two values will appear together in a group somewhere in the output results. If producing an exhaustive list of fuzzy duplicates for a specific value in the test field is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose.

The FUZZYDUP command has a number of parameters that you can specify to control the degree of difference between fuzzy duplicates, and the size of the output results. You may need to try different combinations of parameters and settings to find out what works best for a particular data set.

When processing data, the FUZZYDUP command calculates the Levenshtein Distance between each evaluated pair of strings in the test field, and calculates the difference percentage. The Levenshtein Distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function. The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal ACL calculation, which uses the Levenshtein Distance between the two strings:

Levenshtein Distance / number of characters in the shorter string × 100 = difference percentage

For detailed information about the fuzzy duplicate difference settings, controlling result size, and fuzzy duplicate groups, and see the ACL User Guide.

The command is not case-sensitive, so “SMITH” is equivalent to “smith.” The command also automatically trims trailing blanks in fields, so there is no need to use the TRIM( ) function when specifying a field as a parameter.

Concatenating two or more test fields can improve the effectiveness of the FUZZYDUP command by increase the degree of uniqueness of the test values. The OMIT( ) function can also improve the effectiveness of the command by removing generic elements such as “Corporation” or “Inc.” from field values. Removal of generic elements focuses the FUZZYDUP string comparison on just the portion of the strings where a meaningful difference may occur.

The SOUNDSLIKE( ) and SOUNDEX( ) functions provide a method for comparing strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).

FUZZYDUP command example

The following example tests a full name field (this is computed field that combines a first name and a last name field) for fuzzy duplicates and outputs the results to a new ACL table.

  • In addition to the test field, other fields are included in the output results.
  • The maximum allowable Levenshtein Distance is 2.
  • The proportion of a string that can be different is limited to 50%.
  • The size of the result set is limited to 20% of the test field size.

Levenshtein Distance results

The LEVDIST() function

Returns the Levenshtein Distance between two specified strings, which is a measurement of how much the two strings differ.

Syntax

LEVDIST(string1, string2 )

Parameters

  • string1: Character. The first string in the comparison.
  • string2: Character. The second string in the comparison.
  • case_sensitive: Optional. Logical. Specify T for a case-sensitive comparison of strings, or F to ignore case. If the case_sensitive parameter is omitted the default value of T is used.

Output

Numeric. The value is the Levenshtein Distance between two strings.

Remarks

You can use this function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. The function also identifies exact duplicates.

The function returns the Levenshtein Distance between the two evaluated strings, which is a value representing the minimum number of single character edits required to make one string identical to the other string. The edits can be of three types: insertion, deletion, and substitution. The greater the Levenshtein Distance, the greater the difference between two strings. A distance of zero (0) means the strings are identical. Punctuation marks and special characters are treated as single characters, just like letters. Changing the case of a character counts as one substitution, unless you turn off case sensitivity using the case_sensitive parameter.

Levenshtein Distance takes the position of characters into account. The same characters ordered differently may result in a different Levenshtein Distance. For example, LEVDIST(“abc”, “dec”) = 2, whereas LEVDIST(“abc”, “cde”) = 3.

To ensure accurate results when using LEVDIST( ) to compare a literal string such as “Smith” with a character field, you must use the TRIM( ) function to remove trailing blanks from the field. The Levenshtein algorithm counts blanks as characters, so any trailing blanks are included in the calculation of the number of edits required to make two strings identical.

Examples

Example

Return value

LEVDIST(“smith”,”Smythe”) 3
Two substitutions and one insertion are required to transform “smith” into “Smythe”.
LEVDIST(“smith”,”Smythe”, F) 2
Because case is ignored, only two substitutions are required to transform “smith” into “Smythe”.
LEVDIST(ALLTRIM(Surname),”Smith”) The Levenshtein Distance between each value in the “Surname” field and the string “Smith”.

You can use the expression LEVDIST(ALLTRIM(Surname),”Smith”) to create a computed field, and then sort the computed field in ascending order to rank all values in the “Surname” field by their amount of difference from “Smith”.

To create a filter that isolates all values in the “Surname” field that are within a specified Levenshtein Distance of “Smith”, specify:

LEVDIST(TRIM(Surname),”Smith”)<3

Changing the number in the expression allows you to adjust the amount of Levenshtein Distance in the filtered values.

The ISFUZZYDUP() function

Returns a logical value indicating whether a string is a fuzzy duplicate of a comparison string.

Syntax

ISFUZZYDUP(string1, string2, levdist <,diffpct>)

Parameters

  • string1: Character. The first string in the comparison.
  • string2: Character. The second string in the comparison.
  • levdist: Numeric. The maximum allowable Levenshtein Distance between the two strings for them to be identified as fuzzy duplicates. The levdist value cannot be less than 1 or greater than 99.
  • diffpct: Optional. Numeric. A threshold that limits the ‘difference percentage’ or the proportion of a string that can be different. The percentage that results from an internal ACL calculation performed on potential fuzzy duplicate pairs must be less than or equal to the diffpct value for the ISFUZZYDUP( ) function to evaluate to T (true). The diffpct value cannot be less than 1 or greater than 99. If the parameter is omitted the threshold is turned off and difference percentage is not considered during processing of the ISFUZZYDUP() function.

Output

Logical. Returns T (true) if the string parameter values are fuzzy duplicates, and F (false) otherwise.

Remarks

You can use the ISFUZZYDUP( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. Unlike the FUZZYDUP command, which identifies all fuzzy duplicates in a field, organises them in groups, and outputs a non-exhaustive result set, the ISFUZZYDUP( ) function can identify an exhaustive list of fuzzy duplicates for a single character value.

Exhaustive means that all values within the specified degree of difference of the test value are returned, regardless of their position in the test field relative to the test value. The function is useful if the non-exhaustive results produced by FUZZYDUP are not sufficient for the purposes of your analysis, and you need to directly scrutinise every fuzzy duplicate for a specific character value.

The function also identifies exact duplicates. Unlike the FUZZYDUP command, you cannot exclude exact duplicates when using the function.

When processing data, the ISFUZZYDUP( ) function calculates the Levenshtein Distance between the two evaluated strings, and calculates the difference percentage. The Levenshtein Distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function. The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal ACL calculation, which uses the Levenshtein Distance between the two strings:

Levenshtein Distance / number of characters in the shorter string × 100 = difference percentage

For the ISFUZZYDUP() function to evaluate to T (true), the Levenshtein Distance must be less than or equal to the levdist parameter value, and the difference percentage must be less than or equal to the diffpct parameter value (if specified).

The function is not case-sensitive, so “SMITH” is equivalent to “smith.” The function also automatically trims trailing blanks in fields, so there is no need to use the TRIM() function when specifying a field as a parameter.

Examples

 

Example

Return value

ISFUZZYDUP(“Smith”,”Smythe”,1,99) F
Two edits are required to transform “Smith” into “Smythe”, but the levdist parameter value is only 1.
ISFUZZYDUP(“Smith”,”Smythe”,2,99) T
Two edits are required to transform “Smith” into “Smythe”, and the levdist parameter value is 2.
ISFUZZYDUP(“Smith”,”Smythe”,2,35) F
Two edits are required to transform “Smith” into “Smythe”, the shorter string is 5 characters, yielding a difference percentage of 40 (2/5 * 100), which is not less than or equal to the diffpct parameter value of 35.
ISFUZZYDUP(Surname,”Smith”,3,99) A logical value (T or F) indicating whether individual values in the “Surname” field are fuzzy duplicates for the string “Smith”.

 

To create a filter that isolates all values in the “Surname” field that are fuzzy duplicates for “Smith”, specify:

ISFUZZYDUP(Surname,”Smith”,3,99)

Changing the levdist or diffpct parameter values allows you to adjust the amount of difference in the filtered values.

 

Wednesday, August 1, 2012 In: Hot Topics Comments (None)

Contact us

3 Appleton Court, Calder Park
Wakefield, WF2 7AR

+44 (0) 1924 254 101

enquiries@dataconsulting.co.uk

Mailing List

Subscribe to our newsletter.