SOUNDEX() and Manual input fields

When analysing transactional data an insight analyst will often come across manual input fields, a name field for example. In large datasets, with record numbers in the millions, there could potentially be dozens of variations for each “true” name in the data.  Not ideal.  Luckily, ACL offers some quick solutions to deal with the problem.

The SOUNDEX() function is one of them.  According to the ACL language reference[1] SOUNDEX():

…returns the American Soundex code for the evaluated string.  All codes are one letter followed by three numbers – for example, ‘F634’. The first character in the code represents the first letter of the string. Each number in the code represents one of the six American Soundex groups, which are composed of phonetically similar consonants. Based on these groups, the soundex process encodes the first three consonants in the string after the first letter. The process ignores capitalisation, vowels, the consonants ‘H’, ‘W’, and ‘Y’, and in long strings any consonants that appear after the three encoded consonants. One or more trailing zeros (0) in the returned code indicate an evaluated string with fewer than three consonants after the first letter.

In order to get the best use out of the function some preparatory work might be beneficial before calling on it.  For instance, with the description above kept in mind, we can come up with the following:

  • SOUNDEX( OMIT( REGEXREPLACE( Name , “[^A-Z\d\s:]”, ” ” ), v_StopWords, F ) )

The snippet is a little crowded at first sight, so let’s break it down into its integral parts.  The first character returned in a SOUNDEX() function is whatever the first character in the string is, so inconsistency in the data here will be a problem. For example, some records may begin with unexpected characters.  Names may appear such as “.Paul” or “(Paul” when the true name is “Paul”.  Therefore, a possible first step is to clean the field of all non-alpha numeric characters:

  • REGEXREPLACE( UPPER( Name ), “[^A-Z\d\s:]”, ” ” )

Similarly, some records may begin with some common words: “from” or “to”, titles such as: “MISS” or “DR”, even months of the year, “Feb …”, while others may not. In our example the data records may be written “Mr Paul”, “From Paul”, or “January Paul Payment”, all of which will cause their SOUNDEX() results to be different; so it is good practice to create a list of stop words to be omitted from the input to the SOUNDEX (). Stop words are some of the most commonly used words which are not significant in our queries but will skew our results. Creating a variable listing the stop words, and using the OMIT() function will do the trick:

  • v_StopWords = ” ‘JANUARY’, … ‘FROM’, … ‘MISS’, … ‘FEB’, … ‘THE’, … ‘DR’, … ‘A’ …”
  • OMIT ( Name, v_StopWords, F )

Combining everything in the above gives us the snippet in (1).  However, there are other things to consider. The function is designed to work with words pronounced in English, and produces varied results when used with other languages. Additionally, there are some alternatives to the SOUNDEX() function such as the FUZZYDUP command, or the functions ISFUZZYDUP(), LEVDIST(), or DICECOEFFICIENT(), any of which may be more relevant for a particular analytic job.

In any case, the SOUNDEX() function offers speedy results without the need to use more complicated harmonisation scripts.

If you have any questions about the SOUNDEX() function or want to learn more, please feel free to send Greg a message at greg.olenski@dataconsulting.co.uk

[1] http://docs.acl.com/acl/11/index.jsp?topic=%2Fcom.acl.language.help%2Flang_ref_functions%2Fr_function_reference.html Accessed 23 June, 2016.

Monday, July 4, 2016 In: Uncategorized Comments (None)

Contact

Pricing

Demo