Blog

Non-Relational Data

Every once in a great while, the enterprising SQL analyst is confronted with data that is not relational in nature. Often this data has been imported from an event-tracking system or a NoSQL database. Such data commonly takes the form of comma-separated values.

For example, we may have data on our top-purchasing users per product that looks like this:

Dealing with this data can be a pain in the SQL. Let’s say we want to count the number of purchasers for each product in this table. Depending on your DB system, there are a number of ways to do it.

Postgres: Using regexp_split Functions

As always, Postgres’s solution is straightforward. Given a regular expression, functions are available that split the string to a table or to a Postgres array.

The table-based solution is more intuitive:

select product_name, regexp_split_to_table(top_purchasing_users, ',') from top_purchasers_per_product

This breaks each user out onto her own line:

From here it’s a simple matter of grouping and counting:

select product_name, count(1) from ( select product_name, regexp_split_to_table(top_purchasing_users, ',') from top_purchasers_per_product ) purchasers group by 1

To avoid creating a table and aggregating it again, we can also aggregate to an array and then get the array’s length:

select product_name, array_length( regexp_split_to_array(top_purchasing_users, ',') , 1 ) from top_purchasers_per_product

Both strategies give us our hoped-for final result: number of top purchasers per product!

MySQL and Redshift: Remove Commas and Compare String Lengths

MySQL and Redshift lack special functions to save the day, so we’ll fall back on a hack: Remove all the commas from the string, and see how much shorter it is!

Remarkably, the syntax on both these databases is exactly the same:

select product_name, length(top_purchasing_users) - length(replace(top_purchasing_users, ',', '')) + 1 from top_purchasers_per_product

length gives us the length of a string, and replace(top_purchasing_users, ‘,’, ”) replaces commas with empty strings, effectively removing them from the string! If there are 3 commas in the string then there are 4 purchasers, so we add 1 to the result.

Bonus Round: Recursive CTEs in SQL Server

The comma-replacement trick works in SQL Server as well. But if, like us, you always enjoy an excuse to try a recursive solution, then read on!

In SQL Server, we can use a Recursive CTE (also known as a with clause) to log the positions of the commas in the string. Let’s take a look:

with comma_positions as ( select product_name, top_purchasing_users, charindex(',', top_purchasing_users) as comma_pos from top_purchasers_per_product union all select product_name, top_purchasing_users, comma_pos + charindex(',', substring(product_name, comma_pos + 1, len(s)) ) as comma_pos from comma_positions where charindex(',', substring(product_name, comma_pos + 1, len(s)) ) > 0 )

The line above union all is our base case: We start with the product name, the list of top purchasing users, and the position of the first comma.

The clause immediately after the union all is where the magic happens. Each time we recurse, we keep the product_name and top_purchasing_users intact. But when finding the comma position, we start our search at the previous recursion’s comma position!

substring(product_name, comma_pos + 1, len(s)) is the part of the string that starts right after the previous comma position, and the charindex surrounding it finds the first comma position in that substring. We reassign that comma position to comma_pos to set up the next recusion.

Finally, our where clause terminates the recursion if there are no more commas.

The resulting table looks like this:

From here, it’s a simple group-and-count to get the total number of commas:

with comma_positions as ( select product_name, top_purchasing_users, charindex(',', top_purchasing_users) as comma_pos from top_purchasers_per_product union all select product_name, top_purchasing_users, comma_pos + charindex(',', substring(product_name, comma_pos + 1, len(s)) ) as comma_pos from comma_positions where charindex(',', substring(product_name, comma_pos + 1, len(s)) ) > 0 ) select product_name, count(1) + 1 from comma_positions group by product_name

As before, there’s one more purchaser than comma, so we add one to the result! Our resulting table is exactly as we expect:

As always, Captain Picard wins the day.