In the first two articles, we were talking about Soundex and the wide net that it casts which leads to many false positives. The problem has been around for some time, and as Soundex has shown that it comes with compromises, engineers have been looking for ways to tweak the algorithm to encode for different uses. Let’s see about implementing one of them, NYSIIS!

NYSIIS

The New York State Identification and Intelligence System was developed in the 1970’s, and has since become a part of the New York State Division of Criminal Justice Services (the algorithm is still commonly referred to as NYSIIS). It has expanded phonetic encoding for European and Hispanic surnames, and thus typically has a higher accuracy match than standard Soundex typically does. The algorithm logic is well documented online.

Here it is for reference:

1 — Transcode the first characters of the name: MAC → MCC

KN → NN

K → C

PH → FF

PF → FF

SCH → SSS. 2 — Transcode the last characters of the name: EE, IE → Y

DT, RT, RD, NT, ND → D. 3 — First character of the key = first character of name. 4 — In the following rules, a scan is performed on the characters of the name, starting at the second character, and incrementing by one character each time: EV → AF, else A,E,I,O,U → A

Q → G

Z → S

M → N

KN → N, else K → C

SCH → SSS

PH → FF

H → If previous or next character is not a vowel, set to previous

W → If previous character is a vowel, set to previous. 5 — If last character is S, remove it. 6 — If last characters are AY, replace with Y. 7 — If last character is A, remove it. 8 — Collapse all strings of repeated characters 9 — Add original first character of name as first character of key

Okay, this one has a few more steps than the original Soundex did. To make it a little easier, let’s break it out into different parts.

Steps 1 and 2 will be the easiest to combine into a SQL UDF. The series of helper UDF functions specific to NYSIIS we will prefix with dq_hf_nysiis_ so we can find them easily later. The general helper functions will continue to be prefixed with dq_hf_gh_.

CREATE OR REPLACE FUNCTION

dq.dq_hf_nysiis_step1_transcodefirstlast (instring string) AS (

/*

* (Helper) Data Quality Function - NYSIIS

* input: String to transcode.

* returns: String with the first characters and last

* characters transcoded for the first step in NYSIIS.

*/

CONCAT(

CASE

WHEN STARTS_WITH(instring, "MAC")

THEN CONCAT("MCC", SUBSTR(instring, 4, LENGTH(instring)-5))

WHEN STARTS_WITH(instring, "KN")

THEN CONCAT("NN", SUBSTR(instring, 3, LENGTH(instring) - 4))

WHEN STARTS_WITH(instring, "K") THEN

CONCAT("C", SUBSTR(instring, 2, LENGTH(instring) - 3))

WHEN STARTS_WITH(instring, "PH")

THEN CONCAT("FF", SUBSTR(instring, 3, LENGTH(instring) - 4))

WHEN STARTS_WITH(instring, "PF")

THEN CONCAT("FF", SUBSTR(instring,3, LENGTH(instring) - 4))

WHEN STARTS_WITH(instring, "SCH")

THEN CONCAT("SSS", SUBSTR(instring, 4, LENGTH(instring) - 5))

ELSE

SUBSTR(instring, 1, LENGTH(instring) - 2)

END

,

CASE

WHEN (SUBSTR(instring, -2) = "EE")

OR (SUBSTR(instring, -2) = "IE" )

THEN "Y"

WHEN (SUBSTR(instring, -2) = "DT")

OR (SUBSTR(instring, -2) = "RT")

OR (SUBSTR(instring, -2) = "RD")

OR (SUBSTR(instring, -2) = "NT")

OR (SUBSTR(instring, -2) = "ND")

THEN "D"

ELSE

SUBSTR(instring, -2)

END

)

);

From here, we want to operate on the second character onward, but still want to retain the first character of the key. Since we don’t have a procedural scripting language available yet that can store variables, we need to figure out how to maintain this through expressions. Let’s split them up, store them in an array, and pass that around as needed. The following is a general helper function that will create such an array, checking to handle a single character string as well.

CREATE OR REPLACE FUNCTION

dq.dq_hf_gh_SplitFirstCharacter(instring STRING) AS(

/*

* (Helper) Data Quality Function

* dq_hf_gh_SplitFirstCharacter

* input: Any string to split.

* returns: Array with the first character, followed by

* the remaining characters.

*/

IF

(LENGTH(instring) <= 1,

[instring, ''],

[SUBSTR(instring, 1, 1),

SUBSTR(instring, 2, LENGTH(instring))]) );

Let’s get into step 4, which is the meat of it. While I would like to have had this in a SQL UDF for performance reasons, there was just too much conditional logic to make that possible, so JavaScript it is! Let’s create a helper function that encapsulates that logic.

CREATE OR REPLACE FUNCTION

dq.dq_hf_nysiis_step4_transcodemain (inarray ARRAY<STRING>)

RETURNS ARRAY<STRING>

LANGUAGE js AS """

/*

* (Helper) Data Quality Function - NYSIIS

* dq_hf_nsiis_step4_transcodemain

* input: ARRAY[0] is the first character of the string.

* ARRAY[1] is the rest of the string.

* output: ARRAY[0] is the first character of the original string.

* output: ARRAY[1] has the main transcoding of the string remainder.

*/

instring = inarray[1];

var i;

var toReturn = "";

for(i = 0; i < instring.length; i++){

switch(instring.charAt(i)){

case 'E':

if(i + 1 < instring.length && instring.charAt(i + 1) == 'V'){

toReturn += 'AF';

i++;

}

else{

toReturn += 'A';

}

break;

case 'A':

toReturn += 'A';

break;

case 'I':

toReturn += 'A';

break;

case 'O':

toReturn += 'A';

break;

case 'U':

toReturn += 'A';

break;

case 'Q':

toReturn += 'G';

break;

case 'Z':

toReturn += 'S';

break;

case 'M':

toReturn += 'N';

break;

case 'K':

if (i + 1 < instring.length && instring.charAt(i + 1) == 'N'){

toReturn += 'N';

i++;

}

else{

toReturn += 'C';

}

break;

case 'S':

if(i + 2 < instring.length && instring.charAt(i + 1) == 'C' &&

instring.charAt(i + 2) == 'H'){

toReturn += 'SSS';

i++;

i++;

}

else{

toReturn += 'S';

}

break;

case 'P':

if(i + 1 < instring.length && instring.charAt(i + 1) == 'H'){

toReturn += 'FF';

i++;

}

else{

toReturn += 'P';

}

break;

case 'H':

if((i - 1 > -1 && ['a', 'e', 'i', 'o', 'u']

.indexOf(instring.charAt(i - 1).toLowerCase()) == -1) ||

(i + 1 < instring.length && ['a', 'e', 'i', 'o', 'u']

.indexOf(instring.charAt(i + 1).toLowerCase()) == -1)){

toReturn += toReturn.charAt(toReturn.length - 1);

}

else{

toReturn += 'H';

}

break;

case 'W':

if(i - 1 > -1 && ['a', 'e', 'i', 'o', 'u'].

indexOf(instring.charAt(i - 1).toLowerCase()) !== -1){

toReturn += toReturn.charAt(toReturn.length - 1);

}

else{

toReturn += 'W';

}

break;

default:

toReturn += instring.charAt(i);

break;

}

}

outArray = [inarray[0], toReturn];

return outArray;

"""

We’re now in the home stretch. Let’s remove the concurrently repeating characters (borrowing the helper function from implementing Soundex before), and reconstruct the array as a string.

CREATE OR REPLACE FUNCTION

dq.dq_hf_nysiis_DedupCombine (inarray ARRAY<STRING>) AS(

/*

* (Helper) Data Quality Function

* dq_hf_nysiis_DedupCombine

* input: Array with starting character, and converted string.

* returns: String with the first character combined with

the string removed of repeating characters.

*/

CONCAT(

inarray[OFFSET(0)],

dq.dq_hf_gh_RemoveDuplicateChars(

inarray[OFFSET(1)]))

)

Steps 5, 6, and 7 are the last bits of transcoding we need to do, so let’s encapsulate that in its own UDF. To do that, we’re first going to create a general helper function that will look for a character pattern at the end of a string, and replace it with another character pattern.

CREATE OR REPLACE FUNCTION

dq.dq_hf_gh_ReplaceEndWith(

instring STRING,

tofind STRING,

toreplace STRING) AS(

/*

* (Helper) Data Quality Function

* dq_hf_gh_ReplaceEndWith

* input: Any string to clean up.

Any string to look for.

Any string to replace at the end.

* returns: Looks for the tofind string at the end and

then replaces it with the toreplace string.

*/

IF

(ENDS_WITH(instring, tofind),

CONCAT(SUBSTR(instring, 1, LENGTH(instring) - LENGTH(tofind)), toreplace),

instring) );

This next bit is going to leverage the new function to replace characters according to steps 5, 6, and 7.

CREATE OR REPLACE FUNCTION

dq.dq_hf_nysiis_step8_clean_up_ends(instring STRING) AS(

/*

* (Helper) Data Quality Function

* dq_hf_nysiis_step8_clean_up_ends

* input: String to check the ends of.

* returns: String after removing the S at the end,

* replacing AY with Y,

* and removing A.

*/

dq.dq_hf_gh_ReplaceEndWith(

dq.dq_hf_gh_ReplaceEndWith(

dq.dq_hf_gh_ReplaceEndWith(instring,'S',''),

'AY','Y'),

'A','')

);

Let’s wrap that all up in order, and create our shiny new NYSIIS function. The one additional case we are adding in logic for here are for strings of length 1. In these cases, the only transcoding that’s done is K → C in step 1, and just return the original letter if not. It makes sense to put it here, so it short circuits the entire process and doesn’t waste processing time.

CREATE OR REPLACE FUNCTION

dq.dq_fm_NYSIIS(instring STRING) AS (

/*

* Data Quality Function - Fuzzy Matching

* dq_fm_NYSIIS

* input: String to encode.

* returns: NYSIIS code of the string.

*/

IF (LENGTH(instring) = 1,

IF (instring = "K", "C", instring), (

dq.dq_hf_nysiis_step8_clean_up_ends(

dq.dq_hf_nysiis_DedupCombine(

dq.dq_hf_nysiis_step4_transcodemain(

dq.dq_hf_gh_SplitFirstCharacter(

dq.dq_hf_nysiis_step1_transcodefirstlast(

UPPER(instring))

)

)

)

)

)

)

);

This should lead to a green checkmark, and should also spark joy!

So let’s give this a whirl. First, let’s test to see if the algorithm works. I’ve taken a series of names, and run them manually through an implementation of NYSIIS on the web to get the results. I’ve compared them to see if our BigQuery implementation here matches up with the expected NYSIIS code. I’ve tried to build a series of names that would test as many of the transcoding steps as I could.

WITH

input_data AS (

SELECT

["MACKENZIE,MCANSY","MCKENZIE,MCANSY",

"MORRISON,MARASA[N]","KNOLL,NNAL",

"KNIGHT,NNAGT","KIERNAN,CARNAN",

"SCHMIDT,SSNAD","SCOTT,SCAT",

"PFEIFFER,FFAFAR","PHAN,FFAN",

"PROCTER,PRACTA[R]","SVEE,SVY",

"LAMBERT,LANBAD","MINARD,MANAD",

"BRYANT,BRYAD","LUND,LAD",

"EVANS,EVAN","DEVOE,DAF",

"TOKAREV,TACARA[F]","SHUE,SH",

"MCKNIGHT,MCNAGT","GWYTHER,GWYTAR",

"BORENS,BARAN","LURAY,LARY",

"HUA,H","DOUGLAS,DAGL",

"LOUIS,L","BANANA,BANAN",

"CATHERINE,CATARA[N]","Katherine,CATARA[N]",

"Katarina,CATARA[N]","Jessica,JASAC",

"joshua,JAS","Wojszynski,WAJSYN[SC]"] AS names),

main_data AS (

SELECT

UPPER(SPLIT(names)[OFFSET(0)]) AS name,

SPLIT(names)[OFFSET(1)] AS expected_nysiis_code,

dq.dq_fm_NYSIIS(SPLIT(names)[OFFSET(0)])

AS output_nysiis_code

FROM

input_data,

UNNEST(names) AS names)

SELECT

*,

REGEXP_REPLACE(expected_nysiis_code, r'[\[\]]', '') =

output_nysiis_code AS nysiis_code_match

FROM

main_data

We get the following results, and they are matching!

Much success, again!

So on the whole, the implementation seems like it’s working. If you’re wondering what the brackets are on the expected_nysiis_code column, it’s there because the original NYSIIS algorithm had a character limit of 6, which is now generally the default length in most tools. There’s no reason why you can’t use the whole thing, though, so the letters in brackets are just the additional letters after the first six characters.

At the end of the first part of this series, when we were implementing Soundex, we noticed that sometimes it wouldn’t be all that accurate. For example, the names Wozniak, Wiggins, Wegonge, and Weiscmowsky all had the same Soundex code of W252. So what does NYSIIS give us?

Ah, this is more like it!

It looks like we are getting more precise answers than Soundex returned. This is good, as it means that we are getting better precision with the names and thus should get better groupings.

Unfortunately the script is a bit longer and more complex than I would have liked. In general, I try to stay with SQL UDFs and Standard SQL as much as I possibly can. It makes for much cleaner queries, allows for better debugging and introspection, but it also runs in the core data processing engine, whereas JavaScript UDFs require running it on Google V8. While these are heavily optimized engines, I would prefer to minimize external processes required for performance reasons. We have to use JavaScript here since we do need things like variables and for loops, and BigQuery does not yet have its own procedural scripting logic. One day if it does, though, it may be worth revisiting all this to see if we can refactor and optimize all this.

From here I’d like to start adding in other attributes, such as zip codes, for multi-criteria match groups, so stay tuned!