Arrays and Lists in SQL Server



The Long Version

An SQL text by Erland Sommarskog, SQL Server MVP. Latest revision: 2018-08-26.

Copyright applies to this text. See here for font conventions used in this article.

Introduction

If you were referred to this article in answer to your question in an SQL Server forum, you should read the short version of this article first and chances are very good that the short version is all you need for the moment. This article covers the topic in depth, and you would read it if you have an "advanced" scenario or you are more the geeky sort of person. Even if you intend to read this article in full, read the short version as an introduction first, because I will not repeat it all here.

A problem that has been popular over the years with SQL Server is how to handle a list of values. In the majority of the cases, people have a comma-separated list, because this format is produced by commonly used tools like multi-choice controls in .NET, Reporting Services and other places.

When I say that the problem is popular, I don't only mean that the questions are commonplace – so are the solutions. You can find no end of blog posts etc that present string-splitting functions, including performance tests of such functions, and there are functions that are reputed to be the fastest, etc.

The aim of this article is two-fold: 1) Give a general discussion of how to design string-splitting functions. 2) Present and discuss each method from the angles I bring up in the general discussion. This includes performance, but other aspects as well.

To back up my discussion on performance, I have run extensive performance tests, although it was a while ago now, and they don't cover the new possibilities that were added in SQL 2016. I have published the results from these tests in two appendixes. The first is from 2006 when I originally published the article for SQL 2005, and the second is from 2009 (and updated with some new findings for XML in 2010).

As you might sense, this article is of some age – I published the first version for SQL 2000 already in 2003 and I updated it several times for several years with the last major update in 2010. What I find surprising as I write this new introduction in 2016 is that the topic still is current. After all, it was already SQL 2008 that gave us what I thought would be the ultimate solution: table-valued parameters. I still think that table-valued parameters is the best way to do it, and I have written an article Using Table-Valued Parameters in SQL Server and .NET to demonstrate how to use them.

Nevertheless, questions on how to handle CSVs keep coming up in the forums, and partly this is to blame on the sad fact that not all environments support table-valued parameters.

So what you have here is the 2016 version of this article. (With some minor later amendments.) I have moved the material for beginners to the short version. I've rewritten this new introduction and added material on string_split and OPENJSON, new features in SQL 2016. I have also given the contents a general overhaul to modernise them a bit. (For instance, it was time to stop talking about features added in SQL 2005 as new features!)

Another topic I touch upon in this article is how to insert many rows efficiently. While this is something that is best done with table-valued parameters, some form of bulk load or SSIS, some of the methods in this text are interesting from this perspective as well.

Note: all samples in this article refer to the Northwind database. You can download the script to install it from Microsoft's web site.

Overview of the Methods

As I've already hinted, there are quite a few methods to unpack a list into a table. Here I will just give a quick overview of the methods, before I move on to the general considerations.

General Considerations

Interface

Most of the methods I present are packaged into functions that take an input parameter which is a list of values and returns a table, like this:

CREATE FUNCTION list_to_table (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS

The reason the methods are packaged in functions is obvious: this permits you to easily reuse the function in many queries. Here I will discuss some considerations about the interface of such functions.
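To make the shape concrete, here is a simplified sketch of what the body of such a function can look like, using the iterative method that is discussed later in the article. The real versions are more elaborate; this sketch hard-codes the comma as delimiter and would fail on non-numeric elements.

```sql
CREATE FUNCTION list_to_table (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos    int = 1,
           @nextp  int,
           @valstr nvarchar(4000)
   WHILE @pos <= len(@list)
   BEGIN
      -- Find the next delimiter; if there is none, pretend there is
      -- one right after the end of the string.
      SELECT @nextp = charindex(',', @list, @pos)
      IF @nextp = 0 SELECT @nextp = len(@list) + 1
      -- Extract the element, trim it and convert it.
      SELECT @valstr = substring(@list, @pos, @nextp - @pos)
      INSERT @tbl (number) VALUES (convert(int, ltrim(rtrim(@valstr))))
      SELECT @pos = @nextp + 1
   END
   RETURN
END
```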

The Input Parameters

In this article, as far as the input parameter is a string, it is always of the data type nvarchar(MAX) which can fit up to 2 GB of data.

I made this choice, because I wanted to make the functions as generally applicable as possible. By using nvarchar, the functions can handle Unicode input, and with MAX the functions permit unlimited input. With nvarchar(4000), they would silently yield incorrect results with longer input, which is very bad in my book.

Nevertheless, there is a performance cost for these choices. If you use an SQL collation, you should know that varchar gives you better performance (more on that in a minute). And generally, there is an overhead for using the MAX types over strings that have an upper limit in size. The latter live in memory only, while the MAX strings can spill to disk and that incurs an overhead, even if there is no actual spill.

Thus, the fastest methods you can find typically have varchar(8000) as a parameter, but in my opinion that is an accident waiting to happen. And it can be questioned how much the performance of a string-split function matters for strings of that relatively short size – very likely the cost of splitting the string is dwarfed by the cost of running the rest of the query. If you have lists that are so long that the performance of the splitting matters, quite likely they will exceed 8000 characters in size every now and then. See however the section Unpacking Lists in a Table, Take Two at the end of the article.

Some of the functions take a second parameter to permit you to specify the delimiter, or, in case of fixed-length, the element length. In some functions I have opted to hard-code the delimiter, but for all methods that use a delimiter (save dynamic SQL), you can always add such a parameter. To keep things simple, I have consistently used one-character delimiters. If you need multi-character delimiters, most methods can be extended to handle this.

The Output Table

The output from all functions is a table. For all methods, I've included one function for returning a table of strings, and for some methods also a function that returns a table of integers. (Which is likely to be the most common data type in this context.) If you have a list of integers and a function that returns strings, you can use it like this:

SELECT ... FROM tbl t JOIN list_to_table(@list) l ON t.id = convert(int, l.str)

The same applies to other data types. You can easily clone a version of the function that has the conversion built-in. (However, check the Robustness section for some things to look out for.)

The data type that requires most consideration is strings. Return nvarchar or varchar? Obviously when you work with Unicode data, you need to get nvarchar strings back. It may be tempting to always return nvarchar, but for reasons that I will return to in the performance section, you should make sure that you have a varchar string when you join with a varchar column. For some of the functions in this article, I have added both a varchar and an nvarchar column to the return table.

In some functions, I also return the position in the list for the elements. This can be handy when you have two or more lists that are horizontal slices of the same source, so you can say things like:

INSERT tbl(id, col2, col3)
   SELECT a.number, b.str, c.str
   FROM   intlist_to_table(@list1) a
   JOIN   charlist_to_table(@list2) b ON a.listpos = b.listpos
   JOIN   charlist_to_table(@list3) c ON a.listpos = c.listpos

That is, this is a way to insert many rows in one go, although it's not really the best one. Sometimes this can be OK if you only have two or three columns per row to insert, but as the number of parallel lists grows, it gets out of hand, and you should consider using a table-valued parameter instead or at least XML. The particular danger with the approach above is that if the lists get out of sync with each other, you will insert inconsistent data.

Knowing the list position is essential in situations when the list also implies the sorting for the output. Say that you get a list of US states like this: CA, TX, NY, FL, IL and the user wants the data to be sorted in that order. Without information about the list position, this is impossible.

For some methods, the list position can easily be derived from the method itself, for others (but not all) you can use the row_number() function.
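Assuming a charlist_to_table function that returns a listpos column alongside the values, the US-states example could be handled like this (the states table and its columns are hypothetical):

```sql
DECLARE @states nvarchar(MAX) = N'CA, TX, NY, FL, IL'
SELECT s.state_code, s.state_name
FROM   states s
JOIN   charlist_to_table(@states) l ON s.state_code = l.str
ORDER  BY l.listpos   -- sort by position in the list, not alphabetically
```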

Robustness

It can't be denied that parsing strings is a bit risky. As long as the input plays by the rules everything goes fine, but what happens if it doesn't? A good list-to-table function can protect you from some accidents, but not all. Here are a couple of situations to watch out for.

Delimiter in the Input

Say that you have a couple of city names like this: Berlin, Barcelona, Frankfurt (Main), Birmingham, København. From this you want to compose a comma-separated list that you pass to list-to-table function. With the names listed above, that works fine, but then some joker enters Dallas, TX. Oh-oh.

There are several ways to deal with this problem. One is to use a delimiter that is unlikely to appear in the input data, for instance a control character. Many programs put strings into quotes, so the above list would read "Berlin","Barcelona" etc. This latter format is not supported by any of the functions I present, but you could tweak some of them a bit to get there.

Sometimes you cannot really make any assumptions about the delimiter at all, for instance if the source is user input or data on the wire. In such a case you will need to use a method with a general escape mechanism, of which I present two, to wit XML and JSON. Or you can avoid the delimiter business entirely by using fixed-length strings.

When you work with lists of integers, it is unlikely that the delimiter would appear in the data itself.

Extra Spacing

If you have an input list that goes:

ALFKI, VINET, BERGS,FRANK

Do you want those extra spaces to be included in the data returned by the list-to-table function? Probably not. All functions in this article (but one!) strip trailing and leading spaces from list elements. However, there are some methods where this is not possible. (Or more precisely, where they are not able to handle inconsistent spacing.)
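The trimming itself is a simple matter in T‑SQL. A small illustration:

```sql
DECLARE @element nvarchar(4000) = N'   VINET  '
SELECT ltrim(rtrim(@element))   -- yields N'VINET'
-- On SQL 2017 and later there is also trim(), which does both in one call:
-- SELECT trim(@element)
```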

Illegal Input

Say that you have a function that accepts a list of integers, and the input is 9, 12, a, 23, 12. What should happen?

With no particular coding, SQL Server will give you a conversion error, and the batch will be aborted. If you prefer, you can add checks to your function so that the illegal value is ignored or replaced with NULL. This is also very easy to achieve in SQL 2012 and later thanks to try_convert.
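A small illustration of the idea, assuming SQL 2012 or later:

```sql
SELECT try_convert(int, N'23')   -- returns 23
SELECT try_convert(int, N'a')    -- returns NULL instead of raising an error
-- To leave out the bad elements entirely, add a filter such as:
-- WHERE try_convert(int, str) IS NOT NULL
```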

To focus on the main theme, I have not added such checks to the functions in this article.

Empty Elements

What if a function that accepts a list of integers is fed the input: 9, 12,, 23, 12. How should that double comma be interpreted? If you just do a simple-minded convert, you will get a 0 back, which is not really good. It would be better to return NULL or just leave out the element. (Raise an error? You cannot raise errors in functions.)
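One way to get NULL rather than 0 back is to combine convert with nullif, as in this small sketch:

```sql
DECLARE @element nvarchar(4000) = N''
SELECT convert(int, @element)              -- yields 0, probably not what you want
SELECT convert(int, nullif(@element, ''))  -- yields NULL instead
```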

One approach I have taken in some functions in this article is to avoid the problem altogether by using space as the delimiter. But since T‑SQL does not provide a function to collapse internal spacing, the approach is not without problems. For methods that build on the logic of traditional programming, you can easily handle multiple spaces, but for methods that use a combination of charindex and set-based logic, you would still have to filter out empty elements in the WHERE clause. (Something I have not done in this article.)

Then again, if your function accepts a list of strings, you may simply consider the empty string to be like any other value and want your function to retain it.

Performance Considerations

While I have conducted performance tests and devoted long appendixes to them, a very important performance aspect is not with the methods themselves, but how you use them and how they are packaged. In this section I will look into some important issues.

varchar vs. nvarchar

As I discussed in the Interface section, it appears to be a good choice for a function that unpacks a list of strings to have an nvarchar column in its return table, so it can work with both Unicode and 8-bit data. Functionally, it's sound. Performance-wise it can be a disaster. Say that you have:

SELECT ... FROM tbl t JOIN list_to_table(@list) l ON t.indexedvarcharcol = l.nvarcharcol

Why is this bad? Recall that SQL Server has a strict data-type precedence, which says that if two values of different data types meet, the one with lower precedence is converted to the higher type. varchar has lower precedence than nvarchar. (Quite naturally, since the set of possible values for varchar is a subset of the possible values for nvarchar.) Thus, in the query above indexedvarcharcol will be converted to nvarchar. The cost for this depends on the collation of the column.

If the column has a Windows collation, SQL Server is still able to use the index, because in a Windows collation the 255 possible varchar characters make up a true subset of all possible values for nvarchar. This permits SQL Server to do something known as a range seek. My tests indicate a doubling or tripling of the execution time for the seek, which of course is reason enough to avoid the conversion. But to be fair: in many cases you would not notice.

On the other hand, with an SQL collation it is very likely that you would, because in this case the index is completely useless: in an SQL collation the rules for varchar and nvarchar are different. You will get a scan of some sort, and for a large table it may take minutes to run a query that would have taken milliseconds with an index seek.

Thus, it is essential that you have separate functions to return varchar and nvarchar data. Or do as I have done in some functions in this article: have two return columns, one for varchar and one for nvarchar. You can also simply remember to always convert the output column to the appropriate data type – but it's human to forget.
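If your function only returns nvarchar, such an explicit conversion could look like this (the column name and the length 50 are only examples). Note that the convert goes on the function side of the condition, so that the index on the column remains usable:

```sql
SELECT t.col1, t.col2
FROM   tbl t
JOIN   list_to_table(@list) l
   ON  t.indexedvarcharcol = convert(varchar(50), l.nvarcharcol)
```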

You can find out the collation of a column with sp_help.

There is one more thing to say on this theme, which I will come back to in a minute.

Inline, Multi-Statement and Temp Tables

While for longer lists the performance of the method as such has some significance, a much more important question is the overall performance of the full query where the list appears. Consider just this fairly simple query:

SELECT ... FROM tbl JOIN list_to_tbl(@list) l ON tbl.indexedcol = l.value

The index on indexedcol is a non-clustered index. If @list hits just a few rows in the table, you want an Index Seek + Key Lookup. But if @list hits many rows, a Clustered Index Scan (or a Table Scan in case of a heap) is much more efficient. There are two reasons that @list can hit many rows: 1) there are many values in the list. 2) there is a skew, with one or two of the values in @list accounting for 40 % of the values in indexedcol. Depending on the method you use, the optimizer may or may not be able to do a good job in this regard.

Another aspect is how much @list varies from call to call. A plan that is put into cache is based on how @list looked at the first call – so-called parameter sniffing. In many cases, this works fine, but if the lists vary widely in length from each other or if there is a data skew, you may not want a cached plan to be reused. (For a much longer discussion on parameter sniffing, see my article Slow in the Application, Fast in SSMS?.)

To understand this a little more, let us look at the kind of "functions" we can use. ("Functions" in quotes, because not all methods we will look at use a function per se.) We can identify four main models:

Direct SQL. Methods without any function at all, but you compose a complete query out of the data. Either because you already have the data in a table, as in the case of table-valued parameters, or you skip the table entirely as you do with dynamic SQL or when you use many parameters.

Intermediate Storage. For some methods you need to materialise the table shaped from the list somewhere, and for most other methods this is an option. Obviously this incurs an overhead, but as we shall see there may be other advantages. This alternative falls into three sub-categories:

– Multi-statement function. A multi-statement function has a body that is compiled and executed on its own. Thus, the execution plan of the function is not part of the plan for the query where the function is called. The result is returned in a table variable.

– Bouncing data over a table variable. Rather than joining with the function directly, you insert the data into a table variable which you then join with your target table.

– Bouncing data over a temp table. Instead of a table variable, you insert the data into a temp table which you use in your main query.

T‑SQL Inline Function. An inline table-function in T‑SQL is only a function by name; in reality it is a parameterised view. SQL Server expands the function into the query as if it was a macro, and the optimizer works with the expanded query text. Thus, there is no overhead for intermediate storage in this case.

Opaque Inline. Non-SQL methods such as the CLR or XML fall in between the two T‑SQL function types, and I would also count string_split to this group. Like a multi-statement function, the function call is just an operator in the plan, but there is no intermediate storage; the results of the operations are streamed into the query, so they are inline in that sense.

Let's now look at how these models work with the simple query above. I will jump back and forth a little in the list above.

A quick glance may give you the idea that Direct SQL is to be preferred. And indeed, with both dynamic SQL and using many parameters you can get a plan that fits the query very well, both with regards to list length and skew. Unfortunately, these methods have other shortcomings, which rarely, if ever, make them good choices.

Table-valued parameters on the other hand are a very good pick. Table-valued parameters do not have distribution statistics, which means that the optimizer will not be able to handle skew correctly. However, the optimizer knows the cardinality, that is, how many rows there are in the table, and this is the most important fact about the list. As long as the data has a reasonably even distribution, this piece of information is enough to determine whether to seek the non-clustered index in the example query or to scan the table. But keep in mind that the note on parameter sniffing applies. If you pass a lot more or fewer values in the TVP next time, this will not trigger recompilation, and the plan may not be optimal for this call. You can force recompilation yourself, though, with the hint OPTION (RECOMPILE) that you put at the end of the query.
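With the hint in place, the query could look like this (@tvp being a table-valued parameter and the other names hypothetical):

```sql
SELECT t.col1, t.col2
FROM   tbl t
JOIN   @tvp l ON t.indexedcol = l.value
OPTION (RECOMPILE)   -- compile a plan for the actual number of rows in @tvp
```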

If you bounce the data over a local table variable, it is a different matter. When compiling the procedure in whole, the optimizer will assume one row. This is what I here refer to as a blind assumption. As long as your lists are short, this will not matter much, but the more elements your lists have the less likely it would be that you get a good plan. However, as with a TVP you can trigger recompilation with OPTION (RECOMPILE), and in this case, the optimizer will use the actual cardinality.

Since a multi-statement function returns data in a table variable, it may seem that it would be the same as in the previous paragraph, but it is not really so. The exact behaviour depends on the version of SQL Server, or more precisely, the compatibility level:

Up to compatibility level 110 (which corresponds to SQL 2012): a blind assumption of one row.

Compatibility levels 120 and 130 (SQL 2014 and SQL 2016): a blind assumption of 100 rows.

Compatibility level 140 and up (SQL 2017 and later): the optimizer applies something known as interleaved execution. That is, the optimizer calls the function before compiling the rest of the query to see how many rows it returns, and it then uses that value when compiling the query. This means that the presumptions are the same as with a table-valued parameter. That is, the plan is accurate with regards to the list length, but not with regards to data skew, and parameter sniffing may work against you.

If we look at the group Opaque Inline, the optimizer makes blind assumptions for all of them, and the blind assumption is different depending on the method. What is important to observe is that interleaved execution is not employed for any of them, not even the built-in string_split. That is, you may or may not get the plan that fits you best, but it will not depend on your actual input. OPTION (RECOMPILE) is not going to help you, but on SQL 2017 and later, it may actually make sense to wrap the method in a multi-statement function to benefit from interleaved execution.

With inline SQL, the optimizer has the full story and not just a table with an unknown number of rows coming back, so you may expect a more accurate plan, but it is not really so. The logic for cracking the list is nothing the optimizer is able to look through, and in practice what you get is another blind assumption. But there are more choices for the plan, and I have seen cases where inline functions gave unexpectedly good performance, because the optimizer came up with a parallel plan. However, it cuts both ways. Here is an extract from a mail that I received in response to my old article for SQL 2000:

After reading your article 'Arrays and list in SQL server' I tried to use the Fixed-Length Array Elements method in my application. Everything worked fine until I moved the code from a client batch to a stored procedure. When I looked at the query execution plan, I saw that the number of rows retrieved from the Numbers table was over 61 millions ! Instead of starting by joining the Numbers table with the source table to filter out the 500 rows included in the array, it processes the GROUP BY clause on the entire table (121 000 rows) and then it uses a nested loop to match each entry with the Number table.

And this is not unique. In my performance tests I saw several examples where the optimizer outsmarted itself, and I got completely crazy execution times. I will return to this theme in the section JOIN vs. EXISTS below. I suggested to my correspondent that he should try a multi-statement function instead, and that resolved his issue.

Let's finally look at temp tables, which addresses all the problems we have discussed so far:

– The optimizer knows the cardinality.
– A temp table has distribution statistics, so the optimizer also has a chance to produce the best plan if there is a skew in the data.
– And since changes in statistics trigger recompilation, you can get different plans depending on the input.

There is however a caveat with the last point. To be sure that the optimizer has correct information for the current execution, you need to add UPDATE STATISTICS for your temp table. You may think that autostats will kick in, but that does not always happen, and if you trigger a recompile by some other method, the plan may be compiled with the statistics from the previous execution! This puzzling behaviour is something that SQL Server MVP Paul White discusses in a lot more detail in his blog post Temporary Tables in Stored Procedures.
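A sketch of the temp-table pattern with the statistics update included (the names are illustrative, and list_to_table stands for any of the splitter functions in this article):

```sql
CREATE TABLE #list (number int NOT NULL)
INSERT  #list (number)
   SELECT number FROM list_to_table(@list)
UPDATE STATISTICS #list   -- make sure the plan reflects this execution's data

SELECT t.col1, t.col2
FROM   tbl t
JOIN   #list l ON t.indexedcol = l.number
```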

Here is an overview of which strategies that are possible with which methods:

Method                   Direct SQL   T‑SQL Inline   Opaque Inline   Multi-Statement   Table Variable   Temp Table
Table-valued Parameters  Yes          No             No              No                Yes              Yes
string_split             No           No             Yes             Wrapper           Yes              Yes
Iterative Method         No           No             No              Yes               Yes              Yes
CLR                      No           No             Yes             Wrapper           Yes              Yes
XML and JSON             No           No             Yes             Wrapper           Yes              Yes
Table of Numbers         No           Yes            No              Yes               Yes              Yes
Fixed-Length             No           Yes            No              Yes               Yes              Yes
Function of Numbers      No           Yes            No              Yes               Yes              Yes
Recursive CTE            No           Yes            No              Yes               Yes              Yes
Dynamic SQL              Yes          No             No              No                No               No
Many Parameters          Yes          No             No              No                Yes              Yes
List to SELECT           No           No             No              No                Yes              Yes
Real Slow                Yes          No             No              No                No               No

MAX Types vs. Regular (n)varchar

In a previous section I discussed the problems with joining nvarchar and varchar. When I ran my performance tests and investigated some unexpected results, I discovered a second problem of a similar kind. Consider this:

SELECT ... FROM tbl t JOIN list_to_table(@list) l ON t.indexednvarcharcol = l.nvarcharmaxcol

The list-to-table function is here written in such a way that its return type is nvarchar(MAX). This too leads to an implicit conversion of the indexed column. It may not be apparent at first sight that it has to be that way, but when SQL Server evaluates an expression, it always works with the same data type for all operands, and nvarchar(n) is a different data type from nvarchar(MAX). The result of the implicit conversion is not fatal: the optimizer applies a range-seek operator and is still able to use the index, but nevertheless there is an overhead. When I initially ran my tests, I had not observed this issue, and my inline functions returned nvarchar(MAX) (for the simple reason that the input string was nvarchar(MAX)). As a consequence, my tests in some cases seemed to indicate that inline functions performed worse than the corresponding multi-statement solutions.

Presumably, most of the time when you use list-to-table functions for a list of strings, the list elements are short, just a few characters long. Therefore, there is every reason to make sure that your list-to-table function returns a regular varchar or nvarchar. In particular, this means that for inline functions you should make sure that the return value is explicitly converted to regular (n)varchar. You will see this in all inline functions in this article. But if you use the built-in string_split function, this is something you need to keep in mind yourself.
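For instance, when joining the output of string_split against a varchar column, you would wrap the value column in an explicit convert (the length 50 here is just an example):

```sql
SELECT t.col1, t.col2
FROM   tbl t
JOIN   (SELECT convert(varchar(50), s.value) AS value
        FROM   string_split(@list, ',') s) l
   ON  t.indexedvarcharcol = l.value
```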

Collations

All functions in this article use nvarchar for parameters, output and internal variables alike. If you never work with Unicode data, you may think that you should rewrite the functions to use varchar instead, assuming that 8-bit characters are faster for SQL Server to work with than the 16-bit Unicode characters.

This may or may not be the case, depending on which collation you are using. As I discussed above under varchar vs. nvarchar, there are two sorts of collations: Windows collations and SQL collations. If you use a Windows collation, you get a slight reduction in performance if you use varchar rather than nvarchar. This is because with a Windows collation the Unicode rules and routines are always employed internally, so all that varchar buys you is extra conversion to and from Unicode.

On the other hand, with an SQL collation you can get some 30 % improvement in execution time by using varchar instead. This is because for SQL collations there are separate internal routines for string handling that work only on 8-bit character data, and these 8-bit routines are far simpler than those for Unicode. (If you have an SQL collation and use nvarchar, you are in fact using a Windows collation under the cover.)

Note here that the exact gain depends on the type of operation. 30 % is what you can expect from a plain equality test. There are situations where the difference between varchar and nvarchar in an SQL collation can be as much as a factor of 7. We will look at such a case in the section on really slow methods.

If you have to use nvarchar, there is still a way to evade the overhead of the Unicode routines: use a binary collation. Now, if you opt to use a binary collation throughout your database, you will have to accept that all comparisons are case-sensitive and that sorting is funky, not least for languages other than English. So for most applications, using a binary collation is not a viable option. However, there exists a second possibility: force the collation for a certain expression. I have employed this throughout this article where it makes sense. You will see a lot of things like:

charindex(@delimiter COLLATE Slovenian_BIN2, @list, @pos + 1)

Since @delimiter is cast to an explicit collation, this also happens with @list. (This is discussed in Books Online in the topic Collation Precedence.) When using charindex to find a delimiter, odds are good that you are looking for the exact delimiter and you have no need for case- or accent-insensitive searches. Thus, using a binary collation in this situation does not lead to any loss in functionality. When I tested this for the iterative method, I got some 10 % improvement in execution time.

(Why Slovenian? And why BIN2? Well, I had to pick something, but which binary collation you choose does not matter. I used Slovenian because my test data happens to be a list of Slovenian words from a spelling dictionary. And I selected BIN2, since these newer binary collations resolve some oddities of the older BIN collations. But as I said, all that matters in this case is that you select a binary collation.)

Scalability in a Multithread Environment

When I discuss performance in this article, I'm mainly concerned with the method as such, since this is mainly what I have tested. But if you believe that you will have many users concurrently submitting comma-separated lists or similar, you will also be interested in how these methods scale. In my suite of performance tests in 2009, I did a feeble attempt to test this, but the tests were not very conclusive. You can read about my observations in the second performance appendix, but I only comment on it in a few places in this article.

How to Use These Functions – JOIN vs. EXISTS

Once you have such a function, how would you use it? One way is to use a join:

SELECT P.ProductName, P.ProductID FROM Northwind..Products P JOIN intlist_to_tbl(@ids) i ON P.ProductID = i.number

But you could also write the query as:

SELECT P.ProductName, P.ProductID FROM Northwind..Products P WHERE P.ProductID IN (SELECT i.number FROM intlist_to_tbl(@ids) i)

Or:

SELECT P.ProductName, P.ProductID FROM Northwind..Products P WHERE EXISTS (SELECT * FROM intlist_to_tbl(@ids) i WHERE P.ProductID = i.number)

In this article I almost exclusively use the first model, using joins. Does this mean that this is the best way to write such queries? It turns out that this is a more delicate question than it may seem at first glance.

Before I start, let me first say that there is no practical difference between IN and EXISTS; they will always generate the same plan. So it is only a question of whether to use a join on the one hand or IN/EXISTS on the other.

Next thing to note is that if you want to get the list position, you need to write a join, because then you are using the table function for more than just checking existence. But if we overlook that part, we can see that the join query above has a potential problem that the other two do not have. To wit, what if there are duplicates in the input list? Now, obviously, it could be the case that the business rules are such that if 27 appears multiple times in the list, Schoggi Schokolade should appear multiple times in the output. And in such case, the join is the only reasonable option.

But in most cases, you don't want duplicates in the output, and this means that if you use a join query, you need to eliminate duplicates with the DISTINCT keyword. And this means that SQL Server needs to put in extra work to sort (or hash) the data to get rid of the duplicates. With IN or EXISTS, you don't need DISTINCT, because you are only selecting rows from one table. Thus, EXISTS is leaner? Nah, not really. If you look at the IN/EXISTS queries above, they are in practice very similar to a join + DISTINCT, because that is how SQL Server will implement them.
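To make this concrete, here is what the join query from the beginning of this section looks like with DISTINCT added to eliminate duplicates (a sketch based on the same Northwind query as above):

```sql
-- Duplicates in @ids no longer produce duplicate rows in the output, but
-- SQL Server must add a sort or hash operator to weed them out.
SELECT DISTINCT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   intlist_to_tbl(@ids) i ON P.ProductID = i.number
```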

So it doesn't matter which we use then? Maybe in some cases. The more complicated the query gets, the more likely there will be differences. If you use DISTINCT + JOIN, DISTINCT applies to the entire result set, which may yield a different result than using IN or EXISTS. In the latter case, SQL Server knows that it only has to eliminate duplicates from the table function. And the query plan may for other reasons include a hash or sort operator that can perform elimination of duplicates, so that there is no real extra cost.

Beside all the performance-ponderings there is the purist view: you should express your intent clearly, and if you mean to include the table function only to check for existence, you should state so in the query.

After all this rambling, it seems that I have arrived at the conclusion that IN/EXISTS are better than joins. So why do I use joins then in the article? There are two reasons. One is kind of shameful: when I first wrote the article for SQL 2000 many years ago my brain was on auto-pilot, and I just used what first came into my mind. The other reason is more heavy-duty: when I ran my performance tests in 2009 I added tests with EXISTS queries, and for all T‑SQL inline functions without exception this resulted in disastrous query plans. The execution time for an EXISTS query with a list of 20 elements ranged from 25 to 40 seconds, while the corresponding execution time for the join queries was a few milliseconds in the vast majority of cases; this abysmal performance was far less common with join queries.

As I discussed above in the section Inline, Multi-Statement and Temp Tables, inline T‑SQL functions give the optimizer a degree of freedom which is not really good for its health. I would like to stress that my tests do not prove that SQL Server always freaks out on T‑SQL inline functions when EXISTS is involved. But nevertheless, it happened eight times out of eight in my tests, and while some functions are similar to each other, there are also some that are entirely different in nature.

So for this reason, I have retained the queries in the article as join queries. I have not added DISTINCT to them. After all, in many cases, you probably have control over how the lists are constructed, and you can be reasonably sure that they don't have duplicates.

At the end of the article there is a section of concluding performance remarks, where I further discuss this topic.

The Iterative Method

I will now describe the various methods to unpack lists into a table, one by one in detail.

I've opted to start with the iterative method. Together with using a recursive CTE, the iterative method needs the least amount of preparation. Just create the function and you are on the air. Most other methods require some extra step. Another advantage is that the code for the iterative method is very easy to understand, not least if you have a background in traditional programming languages. This makes things simple if you need to adapt the code to a special input format.

The method is far from the fastest in the bunch, but as long as you mainly work with lists of reasonable length, you will find performance acceptable. If you often work with lists of several thousand elements, you should probably investigate some of the faster methods.

List-of-integers

You have already seen an example of the iterative method in the introductory short version, but I repeat it here (with a different name, as all functions in this article have a name which includes the method used).

CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
   DECLARE @pos      int,
           @nextpos  int,
           @valuelen int

   SELECT @pos = 0, @nextpos = 1

   WHILE @nextpos > 0
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos + 1)
      SELECT @valuelen = CASE WHEN @nextpos > 0
                              THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos - 1
      INSERT @tbl (number)
         VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
      SELECT @pos = @nextpos
   END
   RETURN
END

The idea is simple. We iterate over the string, look for commas, and then extract the values between the commas. Note the use of the third parameter to charindex, this specifies the position where to start searching for the next comma. The computation of @valuelen includes the only complexity: we must cater for the fact that charindex will return 0 when there are no more commas in the list.

However, this function is slower than it has to be. Recall what I said in the interface section above about there being an overhead for MAX types, because they need to be able to spill to disk. Originally, when I implemented the iterative method for SQL 2000, writing it like above was not possible, because there was no nvarchar(MAX), only the old ntext which was not fully supported by charindex. Therefore I had to implement a solution where I broke up the string into chunks of nvarchar(4000). When I ran my performance tests, I found that chunking was also a meaningful thing to do on SQL 2005 and later, as it improves performance by 20-30 %.

There is a second problem with iter$simple_intlist_to_tbl: if you for some reason feed it two consecutive commas, this will result in a 0 in the output, which isn't really good. While you can easily address this by adding some extra logic to the function, my preference is to avoid the problem by using space as a separator. The comma does not really fill any purpose for a list of integers. Or so was my thought when I wrote the function many years ago. Little did I know that there are controls in .NET and SSRS that generate integer lists with comma as separator. So obviously a method that can handle comma is desirable. (And quite obviously it is reasonable to expect a list generated programmatically to obey the format and not have multiple consecutive commas.)
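If you prefer to keep the comma and guard against empty elements instead, one possible tweak (my suggestion, not part of the functions I tested) is to skip zero-length elements before the conversion:

```sql
-- Inside the WHILE loop of iter$simple_intlist_to_tbl: only insert when
-- there is actually something between the delimiters.
IF @valuelen > 0
   INSERT @tbl (number)
      VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
```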

Nevertheless, I present the function as I originally wrote it, as that is what I tested in my performance tests. Tweaking it to handle comma or any other separator is left as an exercise to the reader. So here it is:

CREATE FUNCTION iter_intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                       number  int NOT NULL) AS
BEGIN
   DECLARE @startpos int,
           @endpos   int,
           @textpos  int,
           @chunklen smallint,
           @str      nvarchar(4000),
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = ltrim(@leftover + substring(@list, @textpos, @chunklen))
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(' ' COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @str = substring(@tmpstr, @startpos + 1, @endpos - @startpos - 1)
         IF @str <> ''
            INSERT @tbl (number) VALUES(convert(int, @str))
         SET @startpos = @endpos
         SET @endpos = charindex(' ' COLLATE Slovenian_BIN2, @tmpstr,
                                 @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   IF ltrim(rtrim(@leftover)) <> ''
      INSERT @tbl (number) VALUES(convert(int, @leftover))
   RETURN
END

Here is an example on how you would use this function:

CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9 12 27 37'

This function has two loops. One which creates the chunks, and one that iterates over the chunks. The first chunk is always 4 000 characters (provided that the input is that long, that is). As we come to end of a chunk, we are likely to be in the middle of an element, which we save in @leftover. We bring @leftover with us to the next chunk, and for this reason we may grab fewer than 4 000 characters from @list this time. When we have come to the last chunk, @leftover is simply the last list element.

Multiple spaces are handled by simply ignoring @str if it's blank.

There are two things from the general considerations that I have added to this function that were not in iter$simple_intlist_to_tbl:

1) The output table includes the list position. 2) I use the COLLATE clause to force a binary collation to gain some further performance.

Before I close this section, I need to credit Sam Saffron who pointed out a performance flaw in my original implementation where I kept reallocating the string rather than using the third parameter of charindex.

List-of-strings

Here is a similar function, but that returns a table of strings.

CREATE FUNCTION iter_charlist_to_tbl
                (@list      nvarchar(MAX),
                 @delimiter nchar(1) = N',')
   RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                       str     varchar(4000)      NOT NULL,
                       nstr    nvarchar(2000)     NOT NULL) AS
BEGIN
   DECLARE @endpos   int,
           @startpos int,
           @textpos  int,
           @chunklen smallint,
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000),
           @tmpval   nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
      SET @textpos = @textpos + @chunklen

      SET @startpos = 0
      SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2, @tmpstr)

      WHILE @endpos > 0
      BEGIN
         SET @tmpval = ltrim(rtrim(substring(@tmpstr, @startpos + 1,
                                             @endpos - @startpos - 1)))
         INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval)
         SET @startpos = @endpos
         SET @endpos = charindex(@delimiter COLLATE Slovenian_BIN2, @tmpstr,
                                 @startpos + 1)
      END

      SET @leftover = right(@tmpstr, datalength(@tmpstr) / 2 - @startpos)
   END

   INSERT @tbl(str, nstr)
      VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))
   RETURN
END

An example on how you would use this function:

CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   iter_charlist_to_tbl(@customers, DEFAULT) s ON C.CustomerID = s.nstr
go
EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'

There are a few differences to iter_intlist_to_tbl. I've added a parameter to specify the delimiter, since you may need to use different delimiters depending on the data. Note that if you want to use the default of a table-valued function, you cannot leave out the parameter, but you have to specify the keyword DEFAULT to use the default value. Another addition is that I trim off space that appears directly adjacent to the delimiters, but space within the list elements is retained.

I would like to call attention to the use of datalength. There are two system functions in T‑SQL to return the length of a string: len returns the number of characters and does not count trailing spaces; datalength returns the number of bytes (whence all these / 2) and includes trailing spaces. I'm using datalength here, since there is no reason to ignore trailing spaces in the chunks – they could be in the middle of a list element.
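A quick illustration of the difference between the two functions (the values in the comments assume a Unicode string, i.e. two bytes per character):

```sql
DECLARE @s nvarchar(20) = N'abc   '
SELECT len(@s)        AS charcount,   -- 3: trailing spaces are not counted
       datalength(@s) AS bytecount    -- 12: six characters à two bytes each
```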

Using the CLR

Introducing the CLR

We will now look at what is the fastest method if you are dead set on using a delimited list and don't want to use a table-valued parameter or change to a fixed-length format. This means that you write a function in C# or any other language that is supported by .NET and the Common Language Runtime.

CLR has never caught on as one of those widely used features in SQL Server, but it is a very handy tool for doing things in SQL Server for which T‑SQL is not well equipped, and a list-to-table function is a prime example of this. There is no data access; it is all a matter of parsing a string, for which .NET offers a much richer set of operations. Furthermore, the CLR code is compiled in contrast to the interpreted T‑SQL, which is the main reason why the CLR gives such good performance. In the realm of table-valued functions there is another factor that improves performance: the output from a table-valued CLR function is not written into any intermediate storage, but the rows are fed into the rest of the query as soon as they are produced. So in that sense they are inline functions, but in difference to T‑SQL's own inline functions, the optimizer has no idea what they will produce, which is why I refer to them as opaque inline.

Unfortunately, this method is the one that is the most difficult to get started with. There was always a small hurdle, but not more than I felt that I could explain all steps in an article like this. However, with the introduction of CLR Strict Security in SQL 2017, the hurdle has become considerably higher to the extent that it is difficult to recommend this method to someone who has never used the CLR before. Not the least since the builtin string_split is likely to offer at least equally good performance, if less flexibility.
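For comparison, here is how the join query from earlier in the article could be written with the builtin string_split (available from SQL 2016 when the database is at compatibility level 130 or higher; note that it returns a single string column named value and no list position):

```sql
SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   string_split(@ids, ',') s ON P.ProductID = convert(int, s.value)
```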

Thus, I see two scenarios you would go for this method:

1) You are already using the CLR, and you already know how to make an assembly trusted on server level, and thus you are past the hurdle. 2) You have a real high-end scenario with long lists where you find that string_split does not meet your requirements.

Despite this, I have written the chapter, so that you can try the examples even if you are new to the CLR. However, I will be fairly brief on the proper ways to work with assemblies with CLR Strict Security enabled, and I have deferred that material to the last section in this chapter.

To be able to run the examples on your server, you need to enable the CLR, as user-written CLR is disabled by default. You do this by running this command from a query window:

EXEC sp_configure 'clr enabled', 1
RECONFIGURE

Furthermore, if you are on SQL 2017, you will have to deal with CLR Strict Security. If you use your own private development server, there is an easy way out, to wit, turn it off:

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'clr strict security', 0
RECONFIGURE

However, this is certainly not best practice on a shared server, even less a production server. But to keep the main body of this chapter simple, I will tacitly assume that you have disabled CLR Strict Security.

CLR Functions Using Split

We will look at two ways of implementing a list-to-table function in the CLR. The first one, with very little of our own code, serves as an introduction to CLR table functions rather than as an example of a function you actually should use. In the second alternative, we will roll our own, which opens for a higher degree of flexibility and also better scalability.

The Code

A complete C# file that implements two list-to-table functions, one for strings and one for integers, need be no longer than this:

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_split {

   [SqlFunction(FillRowMethodName="CharlistFillRow")]
   public static IEnumerable CLR_charlist_split(SqlString str,
                                                SqlString delimiter) {
      return str.Value.Split(delimiter.Value.ToCharArray(0, 1));
   }

   public static void CharlistFillRow(object row, out string str) {
      str = (string)row;
      str = str.Trim();
   }

   [SqlFunction(FillRowMethodName="IntlistFillRow")]
   public static IEnumerable CLR_intlist_split(SqlString str) {
      return str.Value.Split((char[]) null,
                             System.StringSplitOptions.RemoveEmptyEntries);
   }

   public static void IntlistFillRow(object row, out int n) {
      n = System.Convert.ToInt32((string) row);
   }
}

Compile and Install

To compile this, open a command-prompt window and make sure that you have C:\Windows\Microsoft.NET\Framework\v4.0.30319 or corresponding in your path. Assuming that the name of the file is CLR_split.cs, the command is:

csc /target:library CLR_split.cs

This gives you CLR_split.dll. If your SQL Server is not on your local machine, you will have to copy the DLL to the server box, or make the DLL visible from the server in some way. Then run from a query window:

CREATE ASSEMBLY CLR_split FROM 'C:\somewhere\CLR_split.dll'
go
CREATE FUNCTION CLR_charlist_split(@list  nvarchar(MAX),
                                   @delim nchar(1) = N',')
   RETURNS TABLE (str nvarchar(4000))
   AS EXTERNAL NAME CLR_split.CLR_split.CLR_charlist_split
go
CREATE FUNCTION CLR_intlist_split(@list nvarchar(MAX))
   RETURNS TABLE (number int)
   AS EXTERNAL NAME CLR_split.CLR_split.CLR_intlist_split
go

(Note: it is also possible to deploy the functions from Visual Studio, but I'm not showing you that, because I don't know how to do it myself. Visual Studio leaves me in a maze, and at the same time I find the command-line very simple to use. What I have been told is that VS may require you to add extra attributes to the functions if you want to deploy the functions that way.)

You have now created the functions and can use them from T‑SQL. Here is an example for both:

CREATE PROCEDURE get_company_names_clr @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   CLR_charlist_split(@customers, DEFAULT) s ON C.CustomerID = s.str
go
EXEC get_company_names_clr 'ALFKI, BONAP, CACTU, FRANK'

CREATE PROCEDURE get_product_names_clr @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   CLR_intlist_split(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_clr '9 12 27 37'

As with iter_intlist_to_tbl, CLR_intlist_split takes a space-separated list of integers.

What's Going On?

If you have never worked with CLR table functions before, you may at this point wonder how this all works, and I will try to explain.

CREATE ASSEMBLY loads the DLL into SQL Server. Note that it does not merely save a pointer to the file; the DLL as such is stored in the database. Since CREATE ASSEMBLY operates from SQL Server, the file path refers to the drives on the server, not on your local machine. (If you are loading the assembly from a network share, it's better to specify the location by \\servername than by drive letter.) It is also possible to load an assembly as a hex-string.

The CREATE FUNCTION statements look just like the statements for creating multi-statement functions. That is, you specify the parameter list and the return table. But instead of a body, AS is followed by EXTERNAL NAME where you specify the CLR method to use. This is a three-part name where the first part is the assembly, the second part is a class within the assembly, and the last part is the name of the method itself. In this example, I'm using the same name for the assembly in SQL Server as I do for the class.

In the return table the SQL Server data types must match the CLR data types. Since the CLR only has Unicode strings, you can never use varchar in the return table, but you must use nvarchar. This means that when you work with a list of strings, you must always be careful to remember to convert the output to varchar when you join with varchar columns in tables, as I discussed in the section varchar vs. nvarchar.
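For instance, if you join the output of CLR_charlist_split against an indexed varchar column, you would convert the nvarchar output on the function side of the join; the table and column names below are made up for the illustration:

```sql
SELECT t.somecol
FROM   sometable t
JOIN   CLR_charlist_split(@list, DEFAULT) s
  ON   t.varcharcol = convert(varchar(40), s.str)
```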

There is one more small detail on the return table: For a multi-statement function you can specify that a column is nullable, you can define CHECK and DEFAULT constraints and define a PRIMARY KEY. This is not possible for CLR functions.

If we turn to the C# code, the table-valued function is implemented through two C# methods. The first method is the one that we point to in the CREATE FUNCTION statement. The second method is specified through the attribute that comes first in the definition of the CLR_charlist_split method. That is, this line:

[SqlFunction(FillRowMethodName="CharlistFillRow")]

This line specifies that the method is a table-valued function and points to the second method of the function. CLR_charlist_split is the entry point and is called once. The entry point must return a collection or an enumerator, and the CLR will call the method specified in FillRowMethodName once for every element in the collection/enumerator, and each invocation produces a row in the output table of the function.

So this is what happens when you call CLR_charlist_split from T‑SQL. The C# method calls the String method Split which splits the string into a collection over a delimiter. (For full details on Split, I refer you to the MSDN documentation.) Since you get a collection, you need do no more. The CLR calls CharlistFillRow for each element in the collection. And as I noted above, as soon as a row is produced, it can be consumed in the outer query, without waiting for the table function to complete.

What about parameters? As you may guess, the parameter list of the entry method must agree with the parameter list in the CREATE FUNCTION statement. The exact rules for mapping SQL data types to those of the CLR are beyond the scope of this text, please refer to Books Online for the full details.

The first parameter of the fill method (CharlistFillRow) is of the type object. This is the current element in the collection/​enumeration and to use it, you will need to cast it to the actual data type. The remaining parameters to the fill method are all output parameters, and they map to the columns in the output table in the CREATE FUNCTION statement.

One more thing calls for attention: the return type of the entry function. In this example it is IEnumerable, since Split returns a collection. The only other alternative is IEnumerator, which we will look at shortly.

Back on Track

After this excursion into CLR, let's get back to the topic of list-to-table functions. What are the characteristics of the list-to-table functions that use Split?

As you can see, we don't return the list position. As far as I know, you cannot get the list position this way, but I will have to admit that I have not dug into it. Overall, Split puts you pretty much into a straitjacket: you can only be as flexible as Split is. That includes specifying an alternate string delimiter, which can be multi-character, and you can specify that empty elements should not be returned (which I make use of in the function for integers). But that's it.

There is another thing that is problematic with Split. As I said, it splits up the entire string into a collection, and that collection takes up memory. Of course, if you send in an occasional string with 10 elements that is not an issue. But if you have a busy website with 10 000 concurrent users? SQL Server MVPs Paul Nielsen and Adam Machanic have both reported they have tried to use CLR functions based on Split with disastrous results. If SQL Server comes under memory pressure, the AppDomain may be unloaded, which may cause a CLR function based on Split to crash. Or at least the AppDomain will have to be reloaded, which takes a couple of seconds. (If you are not acquainted with the term AppDomain, I will have to admit that it is a bit of mumbo-jumbo to me as well, but it is something that hosts the CLR execution. I kindly refer you to Books Online and the .NET documentation for details.) But in short: methods based on Split can lead to scalability and stability issues on a busy system.

Rolling Our Own in the CLR

Instead of relying on Split, we can do the work ourselves. The advantage with this is that we win flexibility and we also avoid the stability issues with Split. Here is a C# file that is longer than the previous one:

using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class CLR_iter {

   private class stringiter : IEnumerator {
      string _str;
      char   delim;
      int    _start_ix;
      int    _end_ix;
      int    _listpos;

      public string str      { get { return this._str; } }
      public int    start_ix { get { return this._start_ix; } }
      public int    end_ix   { get { return this._end_ix; } }
      public int    listpos  { get { return this._listpos; } }

      public stringiter(SqlString str, SqlString delimiter) {
         this._str = str.IsNull ? "" : str.Value;
         this.delim = delimiter.IsNull ? '\0'
                                       : delimiter.Value.ToCharArray(0, 1)[0];
         Reset();
      }

      public bool MoveNext() {
         this._start_ix = this._end_ix + 1;
         if (delim == ' ') {
            while (this._start_ix < this._str.Length &&
                   this.str[this._start_ix] == ' ') {
               this._start_ix++;
            }
         }

         if (this._start_ix >= this._str.Length) {
            return false;
         }

         this._end_ix = this.str.IndexOf(this.delim, this._start_ix);
         this._listpos++;
         if (this.end_ix == -1) {
            this._end_ix = this._str.Length;
         }

         return true;
      }

      public Object Current {
         get { return this; }
      }

      public void Reset() {
         this._start_ix = -1;
         this._end_ix = -1;
         this._listpos = 0;
      }
   }

   [SqlFunction(FillRowMethodName="CharlistFillRow")]
   public static IEnumerator CLR_charlist_iter(SqlString str,
                                               SqlString delimiter) {
      return new stringiter(str, delimiter);
   }

   public static void CharlistFillRow(object obj, out int listpos,
                                      out string str) {
      stringiter iter = (stringiter) obj;
      listpos = iter.listpos;
      str = iter.str.Substring(iter.start_ix, iter.end_ix - iter.start_ix);
      str = str.Trim();
   }

   [SqlFunction(FillRowMethodName="IntlistFillRow")]
   public static IEnumerator CLR_intlist_iter(SqlString str,
                                              SqlString delimiter) {
      return new stringiter(str, delimiter);
   }

   public static void IntlistFillRow(object obj, out int listpos,
                                     out int number) {
      stringiter iter = (stringiter) obj;
      listpos = iter.listpos;
      string str = iter.str.Substring(iter.start_ix,
                                      iter.end_ix - iter.start_ix);
      number = System.Convert.ToInt32(str);
   }
}

The key is the internal class stringiter. First note the class declaration itself:

private class stringiter : IEnumerator

This means that the class implements the IEnumerator interface which is a requirement for a table-valued function. (That or IEnumerable.) Next follows internal class variables, and property methods to read these values from outside the class. Next piece of interest is the constructor:

public stringiter(SqlString str, SqlString delimiter) {

This method creates an instance of the stringiter class. This constructor is called from the entry-point method of the table-valued function.

Next follow MoveNext, Current and Reset. These are the methods that implement IEnumerator, and they must have precisely the names and signatures that you see above. (For more details on IEnumerator, see MSDN.) The interesting action goes on in MoveNext. It is here we look for the next list element, and it is here we determine whether we are at the end of the list. As long as MoveNext returns true, the fill method of the table function will be called. That is, MoveNext should not return false when it finds the last element, but the next time round.

What is interesting from a list-to-table perspective is that MoveNext handles space as the separator in a special way: multiple spaces are collapsed into one. This does not happen with other delimiters.

After the code for the stringiter class come two entry-point methods with accompanying fill methods, one for strings and one for integers. In difference to the previous example with Split, the entry-points here return IEnumerator, since we implement IEnumerator ourselves. But similar to the Split example, all the entry-points do is to create a stringiter object.

The fill methods, finally, extract the data from the current stringiter object. I grab hold of the list position, so that it can appear in the output table.

For completeness, here is the SQL declaration of the functions:

CREATE FUNCTION CLR_charlist_iter(@list  nvarchar(MAX),
                                  @delim nchar(1) = ',')
   RETURNS TABLE (listpos int, str nvarchar(4000))
   AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_charlist_iter
go
CREATE FUNCTION CLR_intlist_iter(@list  nvarchar(MAX),
                                 @delim nchar(1) = ' ')
   RETURNS TABLE (listpos int, number int)
   AS EXTERNAL NAME CLR_iter.CLR_iter.CLR_intlist_iter
go

At first glance, it may seem that these functions do not give you that much more than Split. You get the list position, and the intlist function has a delimiter parameter, but most of all you seem to get a lot more code. But there is more than meets the eye. The above gives you a framework to work with if you need to handle more complex list formats or some other functionality not supported by the functions as given. Here, I have used the same enumerator class for strings and integers, but the fancier you want to be, the more reason to use different classes.

What About Performance?

Overall, using a CLR function is one of the fastest methods around. Of the methods presented in this article, it is generally only the fixed-length method that we will look at later that can compete. To that come table-valued parameters, which I discuss in a separate article.

Since the CLR is fast, it's no wonder that there is an on-going battle in the blogosphere about finding the very fastest function. SQL Server MVP Adam Machanic presented what he thinks is the best so far in this blog post. Instead of using the String class and the IndexOf method, he loops over a Char array and looks for the delimiters himself.

However, when I have run my performance tests, the performance of the different CLR functions I have tested has been virtually identical. I tested the two functions above, Adam's and one more which takes fixed-length input rather than a list of values (more about this format later in the article). Only at extremely long input, 50 000 list elements, could I sense a small advantage for the CLR_iter solution above, with Adam's solution tending to be the slowest. But the difference is very small, maybe 20 milliseconds when the total execution time is half a second or more.

I should immediately qualify this. Adam's function handles multi-character delimiters, and mine doesn't. And to be able to handle multi-character delimiters Adam does the right thing. If you would take my functions above and just change the definition, you would also lose some performance directly. This is because when you pass IndexOf a Char parameter for the delimiter it is culture-insensitive. (Which I assume is .NET-speak for binary.) But if you instead pass the delimiter as a String – which you would need to for a multi-character delimiter – you will invoke an overload of IndexOf that is culture-sensitive, that is, you get the full Unicode overhead, which as I discussed above comes with a cost in performance.

There is nevertheless one performance flaw in Adam's function as posted: he only provides a function that returns strings, so if you have a list of integers, you need to apply convert in your T‑SQL code. When testing, I found that there was a small performance benefit in having the conversion done in the C# code rather than in the T‑SQL code. (This is not covered in the performance appendixes, because by the time I made that particular test, I had already compiled the results.)

As I have mentioned, in the later of my suites of tests, I included a multi-thread test to see how the various methods scale in a multi-user environment. I did not have the opportunity to run these tests on server-class hardware, and I have limited faith in the accuracy and the relevance of the results. However, the CLR methods emerged as clear winners: they were consistently among the methods that scaled the best. (Observe here that due to an issue in my test setup, I was not able to include table-valued parameters in the multithread tests.)

Interestingly enough, the functions based on Split performed well in my tests even when I ran with 100 parallel threads, and I did not see the issues that Adam and Paul have reported. I have a theory that these crashes were more likely to appear on 32-bit SQL Server, where the memory space beyond 2 GB was available only to the buffer cache, and everything else had to be shoehorned into the lower 2 GB. But that is mere speculation on my part. In any case, today 32-bit SQL Server is very much a thing of the past.

CLR Strict Security

Background

The original model when the CLR was introduced was that an assembly could be labelled as SAFE, EXTERNAL_ACCESS or UNSAFE. A feature in .NET known as Code Access Security (CAS) guaranteed that a safe assembly operated only in the .NET realm and therefore could not affect the SQL Server engine or its environment. To create a safe assembly, you only needed database permissions, whereas unsafe and external-access assemblies required blessing on server level in any of the ways described below.

CAS ceased to be a security boundary in .NET with the release of .NET 4 which has been in use in SQL Server since SQL 2012. Thus, since this release it has been theoretically possible that an attacker could create an assembly which passes the checks for a safe assembly, but nevertheless performs unsafe operations which could threaten server stability or permit the attacker to run code with sysadmin permission. Nevertheless, it was not until 2017 that the SQL Server team reacted and introduced CLR Strict Security. When this setting is in force, all assemblies are considered unsafe and require blessing on server level.

CLR Strict Security was backported to SQL 2012, SQL 2014 and SQL 2016 (see KB article 4018930 for an exact list of the builds in which it was added). In these versions, it is enabled by trace flag 6545 (which must be specified at startup; it cannot be controlled by DBCC TRACEON/TRACEOFF). On SQL 2017 there is a proper configuration option clr strict security, as I have already shown you. This setting is on by default (unless the instance was upgraded from a previous version of SQL Server).

Keep in mind that if you are on a lower version of SQL Server than SQL 2017, CLR Strict Security may not affect you today, but most likely you will sooner or later upgrade to SQL 2017 or a later version and find that CLR Strict Security is in force. Thus, you should keep this in mind if you start using the CLR.

Loading an Unsafe Assembly

To load an unsafe assembly (which includes safe assemblies when CLR Strict Security is in force), there are three possible ways:

1. Compute a SHA2_512 hash for the assembly and add this hash as trusted with the system procedure sp_add_trusted_assembly. This procedure is only available in SQL 2017 or later. One way to do this is to load the DLL into SQL Server with OPENROWSET(BULK) and then use the hashbytes function. I describe a procedure that performs this in the chapter Loading CLR Objects in the appendix to my article Packaging Permissions in Stored Procedures. (I should warn you that this chapter is by no means directed to the CLR beginner and also assumes that you have read the better part of the main article.) You can also compute the hash outside SQL Server.

2. Sign the assembly with an asymmetric key and then load that key into the master database. Then you create a login for that key and grant that login the permission UNSAFE ASSEMBLY. The easiest way to sign the assembly is when you compile it:

      SN -k keypair.snk
      csc /target:library /keyfile:keypair.snk CLR_iter.cs

   To create the asymmetric key and grant the permission, you run this in master:

      CREATE ASYMMETRIC KEY CLR_iter_key FROM FILE = 'C:\SomePath\keypair.snk'
         ENCRYPTION BY PASSWORD = 'VeEeeRRy $tr0gn P@s$wörd'
      CREATE LOGIN CLR_iter_key_login FROM ASYMMETRIC KEY CLR_iter_key
      GRANT UNSAFE ASSEMBLY TO CLR_iter_key_login

   Note that this login is a special type of login which cannot actually log in.

3. The database is marked as TRUSTWORTHY and the database owner has been granted the permission UNSAFE ASSEMBLY. (Which it has in the far too common case that sa owns the database.) This is a setting I strongly recommend against, unless you understand the exact implications of TRUSTWORTHY and you have concluded that they do not apply to your environment. Casual use of TRUSTWORTHY can permit users with permissions to create users or assemblies in a database to elevate themselves to sysadmin. I discuss this in more detail in my article Packaging Permissions in Stored Procedures.
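If you want to compute the hash outside SQL Server (the first option above), any environment with a SHA-512 implementation will do. Here is a minimal sketch of my own in Python; the file name is hypothetical, and the output is formatted as a binary literal of the kind you could paste into a call to sp_add_trusted_assembly:

```python
import hashlib

def assembly_hash(path):
    """Compute the SHA2_512 hash of a DLL, formatted as a 0x-literal."""
    with open(path, 'rb') as f:
        digest = hashlib.sha512(f.read()).hexdigest()
    return '0x' + digest.upper()

# Hypothetical usage:
# print(assembly_hash(r'C:\SomePath\CLR_iter.dll'))
```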

string_split – the Built-in Solution

SQL 2016 introduced a new built-in function, string_split, which saves you from writing your own list-to-table function. This is certainly convenient, not the least if you are in the unfortunate situation that you cannot add your own user-defined functions to the database, but have to work with what is there. However, string_split is not that simple to use, not the least if you want to avoid performance surprises. Generally, I would say the function feels half-baked to me.

Here is an example of how to use it for an integer list:

CREATE PROCEDURE get_product_names_builtin @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   string_split(@ids, ',') s ON P.ProductID = convert(int, s.value)
go
EXEC get_product_names_builtin '9, 12, 27, 37'

Because string_split always returns character data, you need to cast the return value to int (or some other numeric data type). It is true that some of the functions in this article are also of this kind, but given that numeric lists are so common, I think that when they added a built-in, they could have added a second one for numbers.

As you see from the example, you can specify a delimiter – in fact this is mandatory. Again, given how common comma-separated lists are, one would like to think that this could have been made the default. The input string can be of any varchar and nvarchar length, including MAX, so from that point of view, nothing is missing. On the other hand, the delimiter can only be char(1) or nchar(1). That is, there is no support for multi-character delimiters. (Nor does any of the other functions described in this article support multi-character delimiters, but you can easily modify them if you have the need. But you cannot modify string_split.)

Here is an example of using string_split to crack a list of strings:

CREATE PROCEDURE get_company_names_builtin @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   string_split(@customers, ',') s
      ON  C.CustomerID = convert(char(5), ltrim(s.value))
go
EXEC get_company_names_builtin 'ALFKI, BONAP, CACTU, FRANK'

You may be surprised by the appearance of ltrim and the explicit convert to char(5). The call to ltrim is necessary here, because string_split does not trim leading or trailing spaces, but goes blindly by the delimiter. So without ltrim I would not get BONAP, but BONA with a leading space. (I don't use rtrim here, since trailing spaces are ignored in string comparison in SQL anyway.) Many of the other functions in this article perform the trimming for you.

Why convert then? The return type of the column value is the same as the type of the input list. This choice is not illogical, since in theory the list could be a single element and there should be no truncation. But this means that if the input is (n)varchar(MAX), the type of value is also MAX, and as I discussed in the section MAX Types vs. Regular (n)varchar, this introduces an implicit conversion of the table column which impairs performance. Hence, we need to use convert to prevent this from happening. Since it is not very common that you want to extract elements as long as 4000 characters, I think it would have been a better choice to limit the return type to never be more than 8000 bytes, no matter the input.

Another shortcoming of string_split is that it returns this single column value. That is, you cannot get the list position. There are many situations where you don't need this, but you need it often enough to find this lamentable. What's the point of a built-in function that you can only use sometimes?

When it comes to performance, the optimizer estimates string_split to always return 50 elements. That is, it does not consider the length of the input string, nor is string_split subject to interleaved execution in SQL 2017 like multi-statement table-valued functions are. (See the section Inline, Multi-Statement and Temp Tables above for a further discussion.)

I have not run any performance tests with this function, so I cannot say how well it fares, but I would expect it to do well, since there is no T‑SQL overhead. However, it has a go-slower button pressed. To wit, the delimiter is handled according to the current collation, which means that with a Windows collation you drag in the full Unicode comparison rules and the overhead that comes with them. You can avoid this by forcing a binary collation yourself, but then you are likely to get a collation conflict with the column you are comparing to. So to get the best performance you end up with:

CREATE PROCEDURE get_company_names_builtin @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   string_split(@customers, ',' COLLATE Slovenian_BIN2) s
      ON  C.CustomerID = convert(char(5), ltrim(s.value)) COLLATE DATABASE_DEFAULT
go
EXEC get_company_names_builtin 'ALFKI, BONAP, CACTU, FRANK'

Note here that DATABASE_DEFAULT assumes that the C.CustomerID actually follows the database collation. That is, you must make sure that you use the collation of the column you are comparing to. If you force any other collation, any index on the column is dead and you may be in a performance disaster because of the scan. (The reason you would prefer DATABASE_DEFAULT over the actual collation is of course to avoid any hard-coding.)

By now the usage of string_split is so complex that the benefit of the function being built-in is almost lost.

Then again, you could wrap it in your own functions to overcome most of the problems I have discussed here:

CREATE FUNCTION string_split_wrapper (@list nvarchar(MAX))
   RETURNS @t TABLE (value nvarchar(4000) NOT NULL) AS
BEGIN
   INSERT @t (value)
      SELECT convert(nvarchar(4000), ltrim(rtrim(value)))
      FROM   string_split(@list, ',' COLLATE Slovenian_BIN2)
   RETURN
END
go
CREATE OR ALTER PROCEDURE get_company_names_wrapper @customers nvarchar(MAX) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   dbo.string_split_wrapper(@customers) s ON C.CustomerID = s.value
go
EXEC get_company_names_wrapper 'ALFKI, BONAP, CACTU, FRANK'

You would do a similar function for numeric lists. You then get the benefits of interleaved execution in SQL 2017, and functionally it is on par with many of the functions in this article, and I would expect it to perform better than most of them. You would still not be able to get the list position. If you are the brave sort of person you could add an IDENTITY column to the return table, or something like

row_number() OVER(ORDER BY (SELECT NULL))

but if it would seem to work, it would only be by chance, and there would be no guarantees.

XML and JSON

Rather than passing a comma-separated list, you can create an XML document to pass your list of values. Undeniably, this is more complex, so in most cases you may not find it worth the trouble as long as it's only a list of single values. Where XML really shines is when you need to insert many rows. You transform your data to an XML document in the client, and then you unpack it into your tables with the xml type methods. Until table-valued parameters made the scene in SQL 2008, XML was the best method to pass a set of data from an application.

If you are on SQL 2016 or later, you can instead use JSON, which I will discuss very briefly here.

Using XML for a List of Values

Let's go straight to the matter. Here are our get_company_names and get_product_names using XML.

CREATE PROCEDURE get_company_names_xml @customers xml AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   @customers.nodes('/Root/Customer') AS T(Item)
      ON  C.CustomerID = T.Item.value('@custid[1]', 'nchar(5)')
go
EXEC get_company_names_xml
   N'<Root><Customer custid="ALFKI"/>
          <Customer custid="BONAP"/>
          <Customer custid="CACTU"/>
          <Customer custid="FRANK"/>
     </Root>'
go
CREATE PROCEDURE get_product_names_xml @ids xml AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   @ids.nodes('/Root/Num') AS T(Item)
      ON  P.ProductID = T.Item.value('@num[1]', 'int')
go
EXEC get_product_names_xml
   N'<Root><Num num="9"/><Num num="12"/>
          <Num num="27"/><Num num="37"/></Root>'

The two xml type methods we use are nodes and value. nodes is a rowset function that returns a one-column table where each row is an XML fragment for the given path. That is, in the two examples, you get one row for each Customer or Num node. As with derived tables, you must specify an alias for the return table. You must also specify a name for the column that nodes produces. (You can name columns in derived tables in this way too, but it's not that common to do so.)

The sole operation you can perform on T.Item is to employ any of the four xml type methods: exists, query, nodes and value. Of interest to us here is value. This method extracts a single value from the XML fragment. value takes two parameters: The first is a node specification for a single element or attribute. In this case we want an attribute – custid and num respectively – which is why we specify it with @ in front. (Without the @ it would be an element specification.) As for the meaning of [1], I will come back to that. The second argument to value is the T‑SQL data type for the return value.

Creating the XML Document

Normally, you create the XML document client side using some suitable library for the purpose, for instance the XmlWriter class in .NET. You should never attempt to create XML documents by just concatenating strings, since many characters need to be encoded when you put them in XML. For instance, the ampersand character (&) is encoded as &amp; in XML.
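As an illustration of the principle (a sketch of my own in Python rather than .NET, not part of the original examples), here is how the customer document from the earlier example could be built, with the library performing the encoding for you:

```python
import xml.etree.ElementTree as ET

def customers_to_xml(custids):
    """Build a document like <Root><Customer custid="ALFKI"/>...</Root>.
    The library escapes characters such as & and < automatically."""
    root = ET.Element('Root')
    for custid in custids:
        ET.SubElement(root, 'Customer', custid=custid)
    return ET.tostring(root, encoding='unicode')

# A value containing an ampersand comes out as custid="A&amp;B":
# customers_to_xml(['ALFKI', 'A&B'])
```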

That said, for a list of numbers there are not really many things that can go wrong. So here is a procedure that receives a comma-separated list and transforms it into XML and then shreds the XML to get the values:

CREATE PROCEDURE get_product_names_xmlcsv @csv varchar(100) AS
   DECLARE @xml xml =
      '<Root><Num num="' +
      replace(convert(varchar(MAX), @csv),
              ',' COLLATE Slovenian_BIN2, '"/><Num num="') +
      '"/></Root>'
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   @xml.nodes('/Root/Num') AS T(Item)
      ON  P.ProductID = T.Item.value('@num[1]', 'int')
go
EXEC get_product_names_xmlcsv '9, 12, 27, 37'

At first glance, this may seem more cute than useful, but if you are in the dire situation that you cannot create functions in the database, this is not a bad solution at all. After all, it is quite compact. The reason I cast @csv to varchar(MAX) is that if @csv is regular varchar, so is the return type from replace. If the input is close to 8000 characters, truncation could occur.

But please don't ever do this for a list of strings – this is bound to end in tears the day when there are XML special characters in the data.

Performance

Just like with the CLR, the data from nodes streams into the rest of the query, so XML is also an inline method. But since parsing XML is more complex than parsing comma-separated lists, it's slower. In my tests, the execution times for XML are 40-60 % higher than for the CLR, but it is twice as fast as the iterative method. In fact, XML is the fastest method that does not require any preparations in the server: you don't have to activate the CLR and create a function, nor do you need to define a table type for a table-valued parameter.

You may guess that there is a higher performance cost for passing an XML document, and indeed there is. In my tests I measured this to be 20-25 ms for 10 000 list elements, to compare with 5-7 ms for most other methods. It's hardly a cause for alarm.

Note: these observations apply to the case where the XML document is passed from the client. I have not run any performance tests with the pattern in get_product_names_xmlcsv, since I added that example in the revision in 2016. But the only extra cost is the cost for replace, which I've tried to minimize by forcing a binary collation.

That said, it shall not be denied that there are certainly problems with performance with the xml type methods and XQuery in SQL Server. When the optimizer gets it right, performance is often very good. But if the optimizer gets it wrong, performance can be very bad. From one point of view, this is not entirely surprising: the optimizer has very little idea of what is in that XML document, and rather than using estimates based on data, it relies on standard assumptions. Unfortunately, one cannot escape the feeling that the optimization of XML queries is an unfinished chapter in SQL Server. Sometimes, seemingly meaningless changes can have a big impact on performance. Above I wrote the expression to extract a number from the XML document as:

T.Item.value('@num[1]', 'int')

But I did not say what [1] means. The answer is that it means – nothing! @num and @num[1] mean the same thing: retrieve the attribute num in the current context node. Unfortunately, SQL Server does not seem to understand this, and will generate different query plans depending on whether [1] is there or not. In many cases the plan with only @num will be a teeny-weeny bit more efficient, because it has fewer operators. But because of its misperception of what @num implies, the optimizer sometimes produces a really bad plan. This is less likely to happen if you add [1]. Thus, my recommendation is that you always add [1] to address attributes, although there is no logical reason why you should have to. (And don't be surprised if you find queries where removing [1] gives you a vast performance improvement. Query plans when XQuery is involved sometimes feel very much like a lottery.)

One practical conclusion of this is that you should be restrictive about entangling the shredding of the XML document with the rest of your data. It may be better to keep it simple: insert the data into a temp table and benefit from the fact that temp tables have statistics.

List Position

I thought for a long time that you cannot retrieve the list position from an XML document, as there is no xml type method that returns this information. Then SQL Server MVP Alejandro Mesa reminded me that you can try:

row_number() OVER(ORDER BY T.c)

That is, you order by the XML fragment produced by the nodes method. Digging into the query plan, and looking at the Sort operator, I find that it sorts by this expression:

[XML Reader with XPath filter].id

Looks good, doesn't it? But beware! This is not documented, and this could be something that works only by chance. (That would not be the first time in the history of SQL Server when it comes to ordering.) If you look at it from a logical point of view, what would ordering by an XML fragment mean? That is just nonsense! Furthermore, if you try:

SELECT ... ORDER BY T.c

you will be told that this is not legal. I suspect that it is accepted in the OVER() clause due to an oversight. Thus, my recommendation is that you stay away from numbering by the context node.

Thankfully, there is a fully supported alternative, and again credit goes to Alejandro Mesa for finding a clever solution. But to explain it, we need to dig a little deeper into XPath, XQuery and the xml type methods. We also need another device that I will introduce in the next chapter. So I will return to the numbering of XML documents in the section XML Revisited, later in this article.

Inserting Many Rows

I said that where XML is really good is when you need to insert many rows. So how would you do that? In fact you have already seen the basics. You use nodes and value. Here is an example where I unpack a document with orders and order details:

DECLARE @x xml
SELECT @x =
  N'<Orders>
      <Order OrderID="13000" CustomerID="ALFKI"
             OrderDate="2006-09-20Z" EmployeeID="2">
         <OrderDetails ProductID="76" Price="123" Qty = "10"/>
         <OrderDetails ProductID="16" Price="3.23" Qty = "20"/>
      </Order>
      <Order OrderID="13001" CustomerID="VINET"
             OrderDate="2006-09-20Z" EmployeeID="1">
         <OrderDetails ProductID="12" Price="12.23" Qty = "1"/>
      </Order>
    </Orders>'

SELECT OrderID    = T.Item.value('@OrderID[1]',    'int'),
       CustomerID = T.Item.value('@CustomerID[1]', 'nchar(5)'),
       OrderDate  = T.Item.value('@OrderDate[1]',  'datetime'),
       EmployeeId = T.Item.value('@EmployeeID[1]', 'smallint')
FROM   @x.nodes('/Orders/Order') AS T(Item)

SELECT OrderID   = A.Item.value('@OrderID[1]',   'int'),
       ProductID = B.Item.value('@ProductID[1]', 'smallint'),
       Price     = B.Item.value('@Price[1]',     'decimal(10,2)'),
       Qty       = B.Item.value('@Qty[1]',       'int')
FROM   @x.nodes('/Orders/Order') AS A(Item)
CROSS  APPLY A.Item.nodes('OrderDetails') AS B(Item)

As you see, the document can include data for several tables, and extracting the order-header information is straight-forward. We use nodes to get data from that level, and then we pick the attributes with value.

Extracting the details rows is a little more interesting, because beside the information about product, quantity and price, we also need to get the order ID which is in the node above. The way to approach this is to first run nodes on the Order level, and use the CROSS APPLY operator to run nodes on the details for each order. (Recall that CROSS APPLY is like a JOIN, but one that can take arguments from the left side, in contrast to a normal join.)

This latter may sound a little inefficient, so you may be tempted to try this instead:

SELECT OrderID   = T.Item.value('../@OrderID[1]', 'int'),
       ProductID = T.Item.value('@ProductID[1]',  'smallint'),
       Price     = T.Item.value('@Price[1]',      'decimal(10,2)'),
       Qty       = T.Item.value('@Qty[1]',        'int')
FROM   @x.nodes('/Orders/Order/OrderDetails') AS T(Item)

That is, you use the parent-axis notation to retrieve the order ID. But as I mentioned, XQuery optimizing is not always the best in SQL Server, and parent access is a sore area. When I have tested, the above has worked well, but only if I included that [1] which does not mean anything. With only ../@OrderID, I got a very bad query plan; in a test I ran, it took 10-30 seconds to extract 2155 details rows from 830 orders, whereas when written as above, execution time was subsecond. I have reported the poor performance in this feedback item. Microsoft's response is that they recommend against using parent-axis notation, in favour of CROSS APPLY. And even if I was able to get good performance when I added the [1], I cannot guarantee that you will get a good query plan. The solution with CROSS APPLY seems to be more stable, which is why you should stick with it.

Element-centric XML

What you have seen above is attribute-centric XML. You can also use element-centric XML, where the XML document for product IDs would look like this:

<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>

For the first parameter to value you could give a single period to denote the current element to extract the values:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
   ON  P.ProductID = T.Item.value('.', 'int')

That looks really simple, but as you might guess from the fact that I've put it in red, you should not use this. When this works well, you can count on an overhead of 30 % compared to the query for attribute-centric XML. But it can also work really badly, particularly if the SELECT is part of an INSERT statement. I have experienced that it could take four minutes to unpack a document with 10 000 numbers. This may have been a bug that has been fixed, as I am no longer able to repro the problem. Nevertheless, there is a way to get better and more predictable performance from element-centric XML that I learnt from SQL Server MVP Alejandro Mesa:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @ids.nodes('/Root/Num') AS T(Item)
   ON  P.ProductID = T.Item.value('(./text())[1]', 'int')

It sure is a strain on the eye, but this more complex expression has three advantages: 1) It is not subject to the bug that can cause bad query plans. 2) Execution time is on par with attribute-centric XML, or even somewhat better. 3) Empty elements are interpreted as NULL.

You may ask what that text() does, and if it has a real meaning, or if it is something equally silly, like that [1]. The answer is that text() is for real. Consider this little batch:

DECLARE @x xml =
   '<Root><Sub>One<Subsub>Two</Subsub>Three<Subsub>Four</Subsub></Sub></Root>'
SELECT T.c.value('.', 'nvarchar(20)') AS Dot,
       T.c.value('(./text())[1]', 'nvarchar(20)') AS Text
FROM   @x.nodes('Root/Sub') AS T(c)

The output is:

Dot                  Text
-------------------- --------------------
OneTwoThreeFour      One

That is, with '.' you get all text that is in the Sub node, including the Subsub nodes. With text(), which is not a function but said to be a node test, you only get the text outside the Subsub nodes. Furthermore, you are compelled to specify which of the texts you want; that is, the [] is compulsory. You can use [2] in place of [1], in which case you get back Three, the second data value in the Sub element. Certainly, this XML document is a bit wretched, and for most XML documents in this world there is no practical difference between a mere '.' and the complex '(./text())[1]', but SQL Server has to assume the worst about the XML document you feed it, and this is why using only '.' will always be slower.
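As an aside, the same distinction exists in other XML APIs. This little Python sketch (my own illustration, not something from SQL Server) mirrors the behaviour of '.' versus text() on the document above:

```python
import xml.etree.ElementTree as ET

doc = '<Root><Sub>One<Subsub>Two</Subsub>Three<Subsub>Four</Subsub></Sub></Root>'
sub = ET.fromstring(doc).find('Sub')

# All text in the node, including the Subsub nodes -- like '.':
dot = ''.join(sub.itertext())

# Only the first text chunk outside the subnodes -- like (./text())[1]:
first_text = sub.text

print(dot, first_text)   # OneTwoThreeFour One
```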

This example also displays the difference between attributes and elements. An attribute can only appear once within a node, and it cannot contain other elements or attributes, which is why attributes are quite a simpler affair. But it seems that SQL Server does not realise this: the query plans for attributes are very similar to those for element-centric XML. Presumably this explains why it may matter to add [1] to attributes.

Above I said that an advantage with text() is that empty elements are interpreted as NULL. In case you wonder what I mean, consider:

DECLARE @x xml
SELECT @x = '<Root><Value></Value></Root>'
SELECT T.c.value('.', 'int') AS "without",
       T.c.value('(./text())[1]', 'int') AS "with text"
FROM   @x.nodes('/Root/Value') AS T(c)

The output from this is:

without     with text
----------- -----------
0           NULL

That is, if an element is present, but has no actual text, you get a non-NULL value if you only use '.', which can yield unexpected results as demonstrated in this example, whereas with '(./text())[1]' you get back NULL, which is more likely to be what you want.

Inserting Many Rows from Element-centric Documents

In the examples with a list of values, the element-centric document is considerably simpler and shorter. But that is an exception. Normally, element-centric documents are far more verbose than the corresponding attribute-centric document. Here is an element-centric version of the order document above:

<Orders>
   <Order>
      <OrderID>13000</OrderID>
      <CustomerID>ALFKI</CustomerID>
      <OrderDate>2006-09-20Z</OrderDate>
      <EmployeeID>2</EmployeeID>
      <OrderDetails>
         <ProductID>76</ProductID>
         <Price>123</Price>
         <Qty>10</Qty>
      </OrderDetails>
      <OrderDetails>
         <ProductID>16</ProductID>
         <Price>3.23</Price>
         <Qty>20</Qty>
      </OrderDetails>
   </Order>
   <Order>
      <OrderID>13001</OrderID>
      <CustomerID>VINET</CustomerID>
      <OrderDate>2006-09-20Z</OrderDate>
      <EmployeeID>1</EmployeeID>
      <OrderDetails>
         <ProductID>12</ProductID>
         <Price>12.23</Price>
         <Qty>1</Qty>
      </OrderDetails>
   </Order>
</Orders>

Here is how you would unpack it:

SELECT OrderID    = T.Item.value('(OrderID/text())[1]',    'int'),
       CustomerID = T.Item.value('(CustomerID/text())[1]', 'nchar(5)'),
       OrderDate  = T.Item.value('(OrderDate/text())[1]',  'datetime'),
       EmployeeId = T.Item.value('(EmployeeID/text())[1]', 'smallint')
FROM   @x.nodes('Orders/Order') AS T(Item)

SELECT OrderID   = O.n.value('(OrderID/text())[1]',   'int'),
       ProductID = D.n.value('(ProductID/text())[1]', 'int'),
       Price     = D.n.value('(Price/text())[1]',     'decimal(10,2)'),
       Qty       = D.n.value('(Qty/text())[1]',       'int')
FROM   @x.nodes('/Orders/Order') AS O(n)
CROSS  APPLY O.n.nodes('OrderDetails') AS D(n)

In this case, too, you need to use text() to get the best performance. As with attribute-centric XML, the best way to get the order ID together with the details rows is to first run nodes on the Order level, and then descend to the details level for each order. Using parent-axis notation as in ../OrderID can easily yield very poor performance with Microsoft's unfinished XQuery implementation.

Typed XML

So far we have only looked at untyped XML, but it is possible to define an XML schema with the command CREATE XML SCHEMA COLLECTION. When you define a variable or column, you can say xml(mycollection) rather than just xml. SQL Server will then validate that the document adheres to that schema. This is known as typed XML. Using typed XML can have some positive impact on performance, since the schema informs the optimizer that some things cannot occur. For instance, this statement defines how the XML document with a list of values could look:

CREATE XML SCHEMA COLLECTION elemnum AS
'<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root">
      <xs:complexType>
         <xs:sequence>
            <xs:element maxOccurs="unbounded" name="Num" type="xs:int" />
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>'

(I'm afraid it's completely outside the scope of this article to explain this. Unfortunately Books Online does not really describe this either, but XSD is a standard defined by W3C.) Using this schema, we can write:

DECLARE @x xml(elemnum)
SELECT @x = N'<Root><Num>9</Num><Num>12</Num><Num>27</Num><Num>37</Num></Root>'

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   @x.nodes('/Root/Num') AS T(Item)
   ON  P.ProductID = T.Item.value('.', 'int')

You may note that this time I did not put the use of '.' in red. The XML schema states that a Num node includes one single number and that's all, so SQL Server does not need to consider the situation with text and subnodes interleaved, as in the example above. For this reason text() is not needed – in fact, text() is not even permitted with typed XML.

This performs well, and in some tests I was able to achieve significantly better performance than with untyped XML. But most of the time the schema appears to incur an overhead, and not the least the validation to pass data to a typed XML parameter is quite expensive: 130-170 ms for 10 000 integers in my tests. Or about the same time it takes for the fastest methods to process that many elements at all. (And when I used typed element-centric XML in an EXISTS query, the result was a disaster – more than two minutes to find 20 items in my test table. This beats all those inline functions by a wide margin.)

Thus, I would not consider using typed XML for a list of scalar values to be a serious option. What the schema buys you is protection against errors like misspelling an element or attribute name. But for a list of values you will probably find such errors quickly anyway. On the other hand, if you are using XML to pass more complex data sets around, using a schema can be worthwhile, not the least if different teams are involved.

OPENXML

Already in SQL 2000 you could use XML thanks to the OPENXML function. OPENXML is still around in SQL 2017, but about the only reason to use it would be that you still need to support SQL 2000. OPENXML is bulkier to use, and performance is nowhere near nodes. In fact, in my tests it's 30-50 % slower than the iterative method. One possible advantage with OPENXML, though, is that the query plan is simpler. Query plans for XQuery mix XML operators with relational operators, whereas OPENXML results in one single Remote Scan. This reduces the risk of the query-plan disasters you sometimes get with XQuery.

If you want to see an example of OPENXML, please refer to the SQL 2000 version of this article.

JSON

Before we leave this chapter, a little about JSON. Support for JSON was added in SQL 2016 on, as they say, popular request. There is no JSON data type, but there is an OPENJSON and a few more JSON functions as well as a FOR JSON operator.

Just like you can transform your comma-separated list of numbers to XML, you can transform it to JSON. There is one noticeable difference: it is a lot simpler. Here is an example:

CREATE PROCEDURE get_product_names_json @csv varchar(100) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   OPENJSON('[' + @csv + ']') j ON convert(int, j.value) = P.ProductID
go
EXEC get_product_names_json '9, 12, 27, 37'

All you need to do is wrap the list in brackets. Well, since the column value is always of the type nvarchar(MAX), you also need to cast it back to int.

I would not recommend doing this for lists of strings, as I assume that there could be accidents if there are characters special to JSON in the data, but I don't know JSON well enough myself to really say this with authority.
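To illustrate the point, here is a sketch of my own of how a list of strings could be handled. Each element must be wrapped in double quotes to form valid JSON, and this is exactly where the accidents could happen: any double quote, backslash or control character in the data would break the document or change its meaning.

-- Works only as long as the values contain no characters that are
-- special in JSON (double quotes, backslashes, control characters).
DECLARE @csv varchar(100) = 'Alice,Bob,Cesar'
SELECT j.value
FROM   OPENJSON('["' + replace(@csv, ',', '","') + '"]') j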

It is worth noting that the result set from OPENJSON includes a column key, which for a list like the one above returns the list position, starting at 0. Just beware that the data type of key is nvarchar(4000), not int, since with other JSON documents it can hold string data.
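A small sketch of how the key column can be used to retain the list position; note the convert, since key is of the type nvarchar(4000):

-- key is a reserved keyword, hence the brackets. Adding 1 turns the
-- zero-based position into a one-based one.
SELECT convert(int, j.[key]) + 1 AS listpos, convert(int, j.value) AS n
FROM   OPENJSON('[9, 12, 27, 37]') j
ORDER  BY listpos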

I have not tested OPENJSON for performance, but I note that the optimizer's row estimate appears to always be 50 rows.

Since a JSON document, just like XML, can include data for several fields and records, and these can be nested, it lends itself to inserting many rows. You probably would not create a JSON document out of another data source, not the least when you can use a table-valued parameter. On the other hand, if you already have the data as JSON, it makes sense to pass it to SQL Server as-is.
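As a sketch of the many-rows scenario, here is how an array of JSON objects can be shredded into typed columns with the WITH clause of OPENJSON. (The table MyOrders is hypothetical and only serves the example.)

-- Hypothetical target table for the example.
CREATE TABLE MyOrders (OrderID    int       NOT NULL PRIMARY KEY,
                       CustomerID nchar(5)  NOT NULL)

DECLARE @json nvarchar(MAX) =
   N'[{"OrderID": 1, "CustomerID": "ALFKI"},
      {"OrderID": 2, "CustomerID": "BERGS"}]'

-- The WITH clause maps the JSON properties to typed columns.
INSERT MyOrders (OrderID, CustomerID)
   SELECT OrderID, CustomerID
   FROM   OPENJSON(@json)
          WITH (OrderID int, CustomerID nchar(5))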

Using a Table of Numbers

The Concept

This is the fastest way to unpack a comma-separated list of numbers in pure T‑SQL. That is, this is the section for you who want a fast function, but feel that the CLR is not for you, nor do you want to change the format of the string.

The trick is to use an auxiliary table of numbers: a one-column table with numbers from 1 and up. Here is how you can create a table with numbers from 1 to 999 999:

CREATE TABLE Numbers (Number int NOT NULL PRIMARY KEY);
WITH digits (d) AS (
   SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5
   UNION  SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 0)
INSERT Numbers (Number)
   SELECT Number
   FROM   (SELECT i.d + ii.d * 10 + iii.d * 100 + iv.d * 1000 +
                  v.d * 10000 + vi.d * 100000 AS Number
           FROM   digits i
           CROSS  JOIN digits ii
           CROSS  JOIN digits iii
           CROSS  JOIN digits iv
           CROSS  JOIN digits v
           CROSS  JOIN digits vi) AS Numbers
   WHERE  Number > 0

First I set up a CTE to define a virtual table that has all digits from 0 to 9, and then I create a Cartesian product for as many powers of ten as I want numbers in my table.

Before I go on and show how to use this table to unpack a comma-separated list, I would like to point out that this sort of table has a much wider usage. Just to give you a taste of it, I will quickly show two examples. Say you have a long string where you want to count the frequency of the various characters. You can use Numbers to "loop" over the string to get the characters in tabular format:

DECLARE @str nvarchar(MAX)
SELECT @str = 'Insert your favourite novel here.'
SELECT substring(@str, Number, 1) AS ch, COUNT(*)
FROM   Numbers
WHERE  Number <= len(@str)
GROUP  BY substring(@str, Number, 1)
ORDER  BY ch

If this example is a little on the esoteric side, the next is definitely a common one. You need to return the number of orders per day for a period, say a month. If there are days without any orders, they should be included in the output, showing a zero for the count. To achieve this, you need something that spans the entire period for you, and then left join the period with the Orders table. This is exactly what Num